-
Notifications
You must be signed in to change notification settings - Fork 11
Reporting recipes
These recipes are part of the HXL cookbook. Each recipe describes a general problem related to data reporting, then shows how to use the HXL standard and the command-line tools to solve the problem.
- R.1. Find data outliers
- R.2. Report on a donor's activities
- R.3. Generate a subset of data
- R.4. Analyse historical population movements
Problem: data collected from multiple sources listing health-facility locations has inconsistent values in the #loctype (location type) column.
Recipe: use hxlcount (command) to find all the values used and determine which are the rarely-used outliers (representing possible typos or variant practices):
hxlcount --tags loctype \
<dataset-in.csv >dataset-out.csv
Running this command on an actual dataset of health facilities in Liberia produced the following result (some rows omitted for space):
| #loctype | #x_count_num |
|---|---|
| Cinic | 1 |
| Clinic | 500 |
| Clinic (?) | 1 |
| Clinic? | 2 |
| ETC | 2 |
| Ebola Treatment Center | 3 |
| Hosp | 7 |
| Hospital | 39 |
There are clearly some typos in this dataset (like "Cinic"), some undocumented conventions (like using "?" to represent a certainty level), and several sets of conventions, like "Hosp" or "Hospital" to represent an hospital.
Problem: an OCHA regional office is asked to provide a monthly report of how many aid activities are funded by DFID in each country/administrative-level-one/cluster combination.
Recipe: use hxlselect (command), hxlcount (command), and hxlrename (command) to filter and aggregate the regional 3W/4W data down to just a count of DFID activities in each #country/#adm1/#sector:
hxlselect --query funder=DFID <3w-data-in.csv | \
hxlcount --tags country,adm1,sector | \
hxlrename --rename x_count_num|activities_num \
>dfid-report.csv
- The hxlselect command removes all rows from the dataset where the column tagged #funder does not have the value "DFID".
- The hxlcount command counts the number of times each unique combination of values in the #country, #adm1, and #sector columns appears, discarding all other columns, then adds a new column tagged #x_count_num with the total count for each combination.
- The hxlrename command changes the generic hashtag #x_count_num to the more-descriptive #activities_num.
The resulting output (entirely made up for this recipe) might start like this:
| #country | #adm1 | #sector | #activities_num |
|---|---|---|---|
| Mali | Bamako | Education | 7 |
| Mali | Bamako | WASH | 12 |
| Mali | Tombouctou | Protection | 1 |
Problem: produce a report on annual refugee movements from Syria to Lebanon, Jordan, and Turkey beginning in 2010.
Solution: use hxlselect (command) to filter UNHCR's largerpopulation-movements dataset:
hxlselect --query 'period_date>=2010' < popstats-in.csv | \
hxlselect --query origin=SYR | \
hxlselect --query country=LEB --query country=JOR --query country=TUR \
>refugee-report-out.csv
- Filter the dataset to contain only records where the date is greater than or equal to 2010.
- Filter the result to remove every row where the #origin is not "SYR" (Syria).
- Filter the result to remove every row where the #country is not "LEB" (Lebanon), "JOR" (Jordan), or "TUR" (Turkey).
Here is the filtered result from a UNHCR dataset up to 2012:
| #period_date | #country | #origin | #refugee_num | #reached_num |
|---|---|---|---|---|
| 2010 | JOR | SYR | 198 | 198 |
| 2010 | LEB | SYR | 40 | 40 |
| 2010 | TUR | SYR | 9 | 9 |
| 2011 | JOR | SYR | 193 | 193 |
| 2011 | LEB | SYR | 93 | 93 |
| 2011 | TUR | SYR | 19 | 19 |
| 2012 | JOR | SYR | 238798 | 118908 |
| 2012 | LEB | SYR | 126755 | 126755 |
| 2012 | TUR | SYR | 248466 | 248466 |
Problem: to satisfy a media request, the UN wants to publish an automatically-updating report listing the average, minimum, and maximum number of refugees from Syria hosted in Lebanon, Jordan, and Turkey annually from 2010 forward.
Solution: use hxlselect (command) to filter UNHCR's population-movements data as in the previous recipe, then hxlcount (command) to produce an aggregate report, and hxlcut (command) to remove unwanted columns:
hxlselect --query 'period_date>=2010' < popstats-in.csv | \
hxlselect --query origin=SYR | \
hxlselect --query country=LEB --query country=JOR --query country=TUR | \
hxlcount --tags country --aggregate refugee_num | \
hxlcut --exclude x_count_num,x_sum_num \
>refugee-averages-out.csv
The output from this pipeline on a UNHCR dataset (up to 2012) is as follows:
| #country | #x_average_num | #x_min_num | #x_max_num |
|---|---|---|---|
| JOR | 79729.66666666667 | 193.0 | 238798.0 |
| LEB | 42296.0 | 40.0 | 126755.0 |
| TUR | 82831.33333333333 | 9.0 | 248466.0 |
TODO
TODO
Standard: http://hxlstandard.org | Mailing list: hxlproject@googlegroups.com