Scraping baseline data: How ScraperWiki helped collect 300,000 data points

There is a basic rule in the data-science community: 80 per cent of the time spent on data analysis goes into getting the data, and only 20 per cent into actually analysing it. From what I understand, data analysis in the humanitarian community is no different.

When OCHA started its Humanitarian Data Exchange (HDX) project in 2013, one of the problems it was trying to solve was the inefficiency in searching for baseline data, i.e. the data that tells you how things are going in a country before a crisis hits. This includes data such as GDP per capita, the crude mortality rate or the number of Internet users. The data tends to be from multiple sources, e.g. the World Bank or WHO, and is available through public websites.

This is where ScraperWiki comes in. We specialize in collecting data from the web and turning it into something useful for our clients. This is driven by our curiosity as to what the data might tell us and the challenge of seeing data “over there” that we want “over here, in this format”.

The HDX team came to us with a list of 150 commonly used baseline indicators and the links to the 20 or so websites where they could be found. Many of these indicators were national level and available yearly, going back for 20 or 30 years. The team imagined that if we could “scrape” all of this data so that it was retrievable in one place, it would save people time. This extra time could instead be used to find and share the operational data that is produced during a crisis response, but which is much harder to get.

For the uninitiated, web scraping involves the following steps:

  • Finding the web page on which the data resides, and then programming a computer to retrieve it. In other words, teaching the computer how to get the data. This can be non-trivial, since the data may be published by country or by year on separate pages.
  • Transforming the data into the required format. The data we scraped went into a database, so we needed to “normalize” country names, dates and values. For example, data for the United Kingdom might be labelled “UK”, “GB”, “British Isles”, “the United Kingdom” and so forth. But in the database, all the data relating to the same geographic entity needs to have the same label. Similarly with dates: 12 July 2013, July 12th 2013, 2013-07-12, 12/07/2013 and 07/12/2013 all refer to the same day.
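The normalization step above can be sketched in a few lines of Python. The alias table and date formats here are illustrative, not the actual mappings we used; a production pipeline would more likely key everything to ISO 3166 country codes:

```python
from datetime import datetime

# Hypothetical lookup mapping the labels a source might use to one
# canonical name (real pipelines tend to use ISO 3166 codes instead).
COUNTRY_ALIASES = {
    "UK": "United Kingdom",
    "GB": "United Kingdom",
    "the United Kingdom": "United Kingdom",
    "United Kingdom": "United Kingdom",
}

# Date layouts seen in the wild, tried in order; the unambiguous
# ISO format comes first.
DATE_FORMATS = ["%Y-%m-%d", "%d %B %Y", "%d/%m/%Y"]

def normalise_country(name: str) -> str:
    """Map a source's country label to the canonical database label."""
    cleaned = name.strip()
    return COUNTRY_ALIASES.get(cleaned, cleaned)

def normalise_date(text: str) -> str:
    """Parse any recognised date layout and emit ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date: {text!r}")
```

With this, `normalise_country("GB")` and `normalise_country("UK")` both yield `"United Kingdom"`, and `"12 July 2013"`, `"2013-07-12"` and `"12/07/2013"` all collapse to `"2013-07-12"`.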

We collected about 300,000 data points going back over 60 years from 19 sources (see visual below). The code that retrieves the data is scheduled to run every day, so that when new data appears on a source’s webpage, it is collected almost immediately.
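A daily run like this works best when it is idempotent, so that re-scraping a page whose data has not changed does no harm. One common way to get that property is to upsert each data point on a composite key. A minimal sketch, assuming a SQLite store with made-up column names (not our actual schema):

```python
import sqlite3

def save_points(conn, points):
    """Insert or update scraped (indicator, country, date, value) rows.

    The composite primary key makes a daily run idempotent: a row that
    already exists is simply replaced with the latest value.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS datapoint (
               indicator TEXT, country TEXT, date TEXT, value REAL,
               PRIMARY KEY (indicator, country, date))"""
    )
    conn.executemany(
        "INSERT OR REPLACE INTO datapoint VALUES (?, ?, ?, ?)",
        points,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
row = ("GDP per capita", "United Kingdom", "2013", 41787.0)
save_points(conn, [row])
save_points(conn, [row])  # second run: still exactly one row per key
```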

Even before we finished, HDX analysts were able to create basic visualizations showing how baseline data compares across countries. Here are a couple of examples highlighting the trend line for HDX pilot countries:

All of this data will soon be available through the HDX Repository. It will be presented by country with all indicators and by indicator for all countries in CSV and XLSX formats. If you would like to see the complete list of indicators, the HDX team has included a sample spreadsheet for Colombia here. Have we missed any indicators or sources? Please send feedback to hdx@un.org.

I hope that this data is useful to the humanitarian community. The HDX team was great to work with, and I’m looking forward to the next stage of the project.