Skip to content
David Megginson edited this page Jan 17, 2015 · 15 revisions

These recipes are part of the HXL cookbook. Each recipe describes a general problem related to data reporting, then shows how to use the HXL standard and the command-line tools to solve the problem.

Recipes

R.1. Find data outliers

Problem: data collected from multiple sources listing health-facility locations has inconsistent values in the #loctype (location type) column.

Recipe: use hxlcount (command) to find all the values used and determine which are the rarely-used outliers (representing possible typos or variant practices):

hxlcount --tags loctype \
  <dataset-in.csv >dataset-out.csv

Running this command on an actual dataset of health facilities in Liberia produced the following result (some rows omitted for space):

#loctype #x_count_num
Cinic 1
Clinic 500
Clinic (?) 1
Clinic? 2
ETC 2
Ebola Treatment Center 3
Hosp 7
Hospital 39

There are clearly some typos in this dataset (like "Cinic"), some undocumented conventions (like using "?" to represent a certainty level), and several sets of conventions, like "Hosp" or "Hospital" to represent an hospital.

R.2. Report on a donor's activities

Problem: an OCHA regional office is asked to provide a monthly report of how many aid activities are funded by DFID in each country/administrative-level-one/cluster combination.

Recipe: use hxlselect (command), hxlcount (command), and hxlrename (command) to filter and aggregate the regional 3W/4W data down to just a count of DFID activities in each #country/#adm1/#sector:

hxlselect --query funder=DFID <3w-data-in.csv | \
  hxlcount --tags country,adm1,sector | \
  hxlrename --rename x_count_num|activities_num \
  >dfid-report.csv
  1. The hxlselect command removes all rows from the dataset where the column tagged #funder does not have the value "DFID".
  2. The hxlcount command counts the number of times each unique combination of values in the #country, #adm1, and #sector columns appears, discarding all other columns, then adds a new column tagged #x_count_num with the total count for each combination.
  3. The hxlrename command changes the generic hashtag #x_count_num to the more-descriptive #activities_num.

The resulting output (entirely made up for this recipe) might start like this:

#country #adm1 #sector #activities_num
Mali Bamako Education 7
Mali Bamako WASH 12
Mali Tombouctou Protection 1

R.3. Generate a subset of data

Problem: produce a report on annual refugee movements from Syria to Lebanon, Jordan, and Turkey beginning in 2010.

Solution: use hxlselect (command) to filter UNHCR's largerpopulation-movements dataset:

hxlselect --query 'period_date>=2010' < popstats-in.csv | \
  hxlselect --query origin=SYR | \
  hxlselect --query country=LEB --query country=JOR --query country=TUR \
  >refugee-report-out.csv
  1. Filter the dataset to contain only records where the date is greater than or equal to 2010.
  2. Filter the result to remove every row where the #origin is not "SYR" (Syria).
  3. Filter the result to remove every row where the #country is not "LEB" (Lebanon), "JOR" (Jordan), or "TUR" (Turkey).

Here is the filtered result from a UNHCR dataset up to 2012:

#period_date #country #origin #refugee_num #reached_num
2010 JOR SYR 198 198
2010 LEB SYR 40 40
2010 TUR SYR 9 9
2011 JOR SYR 193 193
2011 LEB SYR 93 93
2011 TUR SYR 19 19
2012 JOR SYR 238798 118908
2012 LEB SYR 126755 126755
2012 TUR SYR 248466 248466

R.4. Analyse historical population movements

Problem: to satisfy a media request, the UN wants to publish an automatically-updating report listing the average, minimum, and maximum number of refugees from Syria hosted in Lebanon, Jordan, and Turkey annually from 2010 forward.

Solution: use hxlselect (command) to filter UNHCR's population-movements data as in the previous recipe, then hxlcount (command) to produce an aggregate report, and hxlcut (command) to remove unwanted columns:

hxlselect --query 'period_date>=2010' < popstats-in.csv | \
  hxlselect --query origin=SYR | \
  hxlselect --query country=LEB --query country=JOR --query country=TUR | \
  hxlcount --tags country --aggregate refugee_num | \
  hxlcut --exclude x_count_num,x_sum_num \
  >refugee-averages-out.csv

The output from this pipeline on a UNHCR dataset (up to 2012) is as follows:

#country #x_average_num #x_min_num #x_max_num
JOR 79729.66666666667 193.0 238798.0
LEB 42296.0 40.0 126755.0
TUR 82831.33333333333 9.0 248466.0

R.5. Alert for values past a threshold

TODO

R.6. Map activities for a specific cluster

TODO

Clone this wiki locally