Opening Data in Kenya. My Method is to Hack.
Posted: August 9th, 2011 | Author: mikel | Filed under: kenya, tech
A techy cross-post from Brain Off
There’s good reason to join the excitement about Open Data in Kenya. As Tariq says on the World Bank blog:
Open data in Kenya is special: it comes at a time of national change; it’s got a head start on tools and expertise from the global open data community and it’s happening in a country where the information ecosystem is still maturing.
I’m proud that our work with Map Kibera has any relation to this at all. And it’s certainly due to the hard work of passionate people, in a tough environment, especially Dr. Bitange Ndemo (if you have the time, Dr. Ndemo’s talk at the World Bank is recommended).
Now that the launch has subsided, and I have a spare moment in the air from Tanzania, I want to look in depth at what data has been released on OpenDataKE and how, the means of working with and collaborating on the data, and how this resource can relate to other open data sets in Kenyan society. Now that the government has made a bold move, I think it’s the responsibility of the software development community and civil society to really step up and test out the data, and suggest how this can become a really vibrant and social resource. Again, Tariq says this succinctly:
the call for open data should go hand in hand with a call for better quality data: data that might be collected by official government agencies or in this age, by citizens themselves.
Transect across data
My “method” is to hack. I want to make an interesting simple visualization with some data from OpenDataKE, focusing on Nairobi, using openly available tools. Browsing data sets, the Population Density per Constituency, derived from the 2009 census, seemed promising. The difference in density across the urban landscape of Nairobi is extreme. For a sense of it, just look at the density of features in OpenStreetMap in the map of the slum of Mathare compared to nearby leafy Muthaiga. And to help the hack, the population density data set even has a handy location column.
Or maybe not. The usual practice in tabular data is to split the latitude and longitude into two columns, but here both values are formatted along with the unnecessary name of the province in which the constituency is located. Anyone who has had to work with data is used to little problems like this, and it’s easy enough (for a programmer) to write a quick script to clean this up. So I selected Export to CSV (side note, the other options presented by the platform seem hardly useful), filtered just the constituencies in Nairobi, and cleaned it up just by hand (I was too lazy to script this for just a handful of values).
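For anyone less lazy than me, the clean-up script could look something like the sketch below. It assumes each location cell holds the province name followed by a "(lat, lon)" pair; I haven't reproduced the exact export format here, so treat the pattern and the function name `split_location` as hypothetical and adjust them to what the CSV actually contains.

```python
import re

# Assumed cell layout: province name, then "(lat, lon)".
# The real export may be formatted differently.
PAIR = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


def split_location(cell):
    """Return (lat, lon) as floats, or None if no coordinate pair is found."""
    match = PAIR.search(cell)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))
```

Run over each row of the exported CSV, this gives two clean numeric columns and drops the province name; rows that return None are worth inspecting by hand.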
Gaps and Errors
I uploaded the CSV to GeoCommons, which has the facility to deal with many formats of data and easily layer together interactive maps, and was surprised to see that several points weren’t placed in Nairobi at all. Turns out there are several errors in the location column, at least in Nairobi, and possibly in the rest of the country (I didn’t check). I’d have to correct these by hand. My knowledge of the location and extent of the constituencies is limited, so I needed another source, and that is not something you can find on OpenDataKE. It took some searching until I found scanned maps of constituencies on the Mars Group site. An overview map of all the constituencies was missing, so I used the adjacent constituency names in order to place the mistaken ones.
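Misplaced points like these can also be caught before anything goes on a map. Here is a minimal sketch of a bounding-box check; the box is my own rough guess at the extent of Nairobi, not an official boundary, and `outside_bbox` is a hypothetical helper name.

```python
# Rough, assumed bounding box around Nairobi (not an official boundary).
NAIROBI_BBOX = {"min_lat": -1.45, "max_lat": -1.15,
                "min_lon": 36.65, "max_lon": 37.10}


def outside_bbox(rows, bbox=NAIROBI_BBOX):
    """Return the names of (name, lat, lon) rows that fall outside the box."""
    flagged = []
    for name, lat, lon in rows:
        inside = (bbox["min_lat"] <= lat <= bbox["max_lat"]
                  and bbox["min_lon"] <= lon <= bbox["max_lon"])
        if not inside:
            flagged.append(name)
    return flagged
```

A check like this only says that a point is suspicious (a flipped sign or swapped latitude and longitude are typical culprits); working out where it should go still needs a reference such as the scanned maps.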
This worked well, but I’m left with questions. Why isn’t constituency boundary data available on OpenDataKE? How did Mars Group get these maps? And now that I’ve gone to the bother of correcting this data set, how can I contribute the changes back, or at least alert the holders of the data to the errors? There is a nomination section on OpenDataKE, which was wonderfully active until July 9, and then went quiet (did Socrata’s support contract expire then?). Anyway, I’m hopeful these will start getting attention again, so I’ve submitted two requests (pending approval to post), one for constituency boundaries, and another for a way to correct the location column in the population density data set.
My second surprise was that when I made the annotation size relative to the population density, I didn’t see a big difference among the constituencies. The area where Kibera is located, Langata, is about the same density as Westlands, and both are less than CBD and Eastlands. What’s happening here is that constituencies aren’t aligned to uniform urban settlement patterns. Langata, the home constituency of the Prime Minister, includes both the slum of Kibera and the wealthy and sparse suburb of Karen. A more useful and telling metric would be population density per Ward, the sub-unit of constituency which does have fairly good alignment to settlement patterns. We know the census can be, and has been, aggregated to this level, because the census count of the population in Kibera was widely promoted.
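The averaging effect is easy to see with a bit of arithmetic. The numbers below are invented purely for illustration (they are not census figures, and the ward names are placeholders); the point is only that one dense unit and one sparse unit combine into a single unremarkable density.

```python
def density(population, area_km2):
    """People per square kilometre."""
    return population / area_km2


def combined_density(units):
    """Density of the merged area: total population over total area."""
    total_population = sum(pop for pop, _ in units.values())
    total_area = sum(area for _, area in units.values())
    return total_population / total_area


# Invented, illustrative numbers only; NOT census figures.
wards = {
    "dense_ward": (100_000, 2.0),    # (population, area in km2)
    "sparse_ward": (20_000, 58.0),
}

for name, (pop, area) in wards.items():
    print(name, round(density(pop, area)))
print("combined", round(combined_density(wards)))
```

With these made-up inputs the dense ward comes out at 50,000 people per km2 while the combined figure is 2,000, which is the kind of flattening the constituency-level data produces.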
So again I’ve nominated a data set, for the population density aggregated at ward level. And I’ve also made a request for meta-information on the methodology of the census in Kibera and other informal settlements. While the 170,000 figure is surely closer to reality than the wild 1 million figures of the past, by comparing that number with estimates derived from other methods there is a discrepancy; the others agree on an average closer to 250,000. Additionally, and admittedly anecdotally, many people in Kibera say they and their neighbors were never counted. Now this happens in any census, and it does not delegitimize the census, but in order to interpret data, openness on the methodology of data collection and analysis is also necessary.
The Civil Society of Data
Open government data exists in a wider ecosystem. Just a few months ago, Columbia University released amazing data sets of Nairobi, including high detail land use under open knowledge licenses. A truly beautiful and informative data set. Another place to find many a Kenyan civil society data set is Virtual Kenya. I thought the population density dataset would be interesting to layer with land use.
This data is distributed as Shapefiles, and I need tiles to use as a base map. This is the purpose of MapBox, a rapidly developing tool set to make it easy to build beautiful map tiles. I loaded the Shapefiles in my locally running TileMill, styled the landuse categories based on Columbia’s pdf using carto, assigned interaction, and exported as mbtiles. These were dropbox’d, and posted to TileStream, as this map.
Mouseover or click on the map to get more detail about each parcel. As a geek, I find this interaction technique really interesting: it’s entirely JavaScript and lightweight in the browser. It still has a few rough edges, but overall it’s a nice experience. There are limits (TileMill doesn’t work with CSV, or permit multiple interactive layers), but it’s a great work in progress. Thanks to DevSeed for the TileStream account, and Dane Springmeyer, who spent some time with me hacking and bug hunting the interaction features of mapnik.
Like the OpenDataKE data set, and actually all data sets, there are errors … there is no such thing as a perfect map. The Ethiopian Church, across from YaYa, is not indicated nor is its land zoned as “public use” as other church lands in Nairobi are. And the Sarakasi Dome, home of our yoga practice in Nairobi, is not shown as a unified plot at all. Now Columbia makes their contact information known on the site, and I’ve met them personally, so feedback here is direct over email, but I wonder from here … what is the method and intention to continually correct, update and discuss these data sets? Is one needed?
Of course, that is the primary approach of OpenStreetMap … geographic data in a wiki, that gets constantly examined, updated, and discussed, completely openly. OpenStreetMap can provide another overlay, so we can have some roads and points of reference for the final map. So on GeoCommons, I configured the tiles from the land use data on TileMill (this required some hidden configuration of the tile scheme), composited over semi-transparent OSM data (provided by GeoCommons through Acetate), and then finally, the population density points. This is the result for now of the data transect.
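I won't detail the hidden configuration here, but one common tile-scheme mismatch is worth knowing about: the MBTiles format numbers tile rows in the TMS scheme (row 0 at the bottom), while many web maps use the XYZ scheme (row 0 at the top). Assuming that is the mismatch in play, which I haven't verified for this setup, the conversion is a one-line flip of the row index. `flip_y` is a hypothetical helper name.

```python
def flip_y(zoom, y):
    """Convert a tile row between the TMS and XYZ schemes at a given zoom.

    There are 2**zoom rows at each zoom level, and the two schemes count
    them from opposite ends, so the same function converts in both directions.
    """
    return (2 ** zoom - 1) - y
```

When a tile layer shows up mirrored north to south, or not at all, against a base map, this flip is the first thing I'd check.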
I hope I can improve this. You’ll see that the OSM streets don’t overlay precisely with land use. This I believe, but haven’t confirmed, to be the result of a projection error in the Land Use data set. And an even better representation of the population density would have been a geo-join with area boundaries, had they been available. This would clearly show a thematic variation of population density. And of course, finer grained detail will be required to fulfill the original intention to show Nairobi’s vast differences in population density.
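If boundaries do become available, the join itself is simple. Below is a minimal sketch that attaches a density value to each boundary feature by matching names. It assumes GeoJSON-style features with a `name` property and table rows with `constituency` and `density` keys; all of those field names are hypothetical, as is the helper `join_density`.

```python
def normalise(name):
    """Lower-case and collapse whitespace so near-identical names match."""
    return " ".join(name.lower().split())


def join_density(features, density_rows):
    """Attach a density value to each boundary feature, matched by name.

    Returns (joined_features, unmatched_names). Field names are assumptions.
    """
    lookup = {normalise(row["constituency"]): row["density"]
              for row in density_rows}
    joined, unmatched = [], []
    for feature in features:
        name = feature["properties"]["name"]
        key = normalise(name)
        if key in lookup:
            props = dict(feature["properties"], density=lookup[key])
            joined.append(dict(feature, properties=props))
        else:
            unmatched.append(name)
    return joined, unmatched
```

The unmatched list matters as much as the joined one: spelling differences between data sets are exactly the kind of error that a name-based join surfaces.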
Where have we gone
Government data sets, authoritative civil society data sets, and completely crowd sourced data sets, layered together in a single map, revealing a little more about Nairobi, and about the data itself. Each is collected, distributed, and updated by different methods. In some ways, I feel OSM leads the wild edge here of what’s possible, and what we want: a truly social environment for data. Data without community is data dry and unimportant. Of course, I’m not saying OSM is the final repository for all data: OSM doesn’t deal with demographic and private data of a census, and the methods to authoritatively certify versions of OSM data are just starting. But this hasn’t stopped several kinds of OSM and government interaction already beyond the “traditional” import, with the likes of Portland and the USGS interacting with the OSM community.
The ultimate promise of all this OpenDataKE is not necessarily in the data itself, but in the deep and wide serving conversations openness triggers. My own personal metric for this will be when government officials from OpenDataKE and slum dwellers from Kibera and Mathare (and Mukuru) openly collaborate and work together. Can’t wait to see this happen. To get there, I challenge you too … get geeky with some data and write about it!