Reporting recipes

These recipes are part of the HXL cookbook. Each recipe describes a general problem related to data reporting, then shows how to use the HXL standard and the command-line tools to solve the problem.

Recipes

R.1. Find data outliers
R.2. Report on a donor's activities
R.3. Generate a subset of data
R.4. Analyse historical population movements

R.1. Find data outliers

Problem: data collected from multiple sources listing health-facility locations has inconsistent values in the #loctype (location type) column.

Recipe: use hxlcount (command) to find all the values used and determine which are the rarely-used outliers (representing possible typos or variant practices):

hxlcount --tags loctype \
  <dataset-in.csv >dataset-out.csv

Running this command on an actual dataset of health facilities in Liberia produced the following result (some rows omitted for space):

#loctype	#x_count_num
Cinic	1
Clinic	500
Clinic (?)	1
Clinic?	2
ETC	2
Ebola Treatment Center	3
Hosp	7
Hospital	39

There are clearly some typos in this dataset (like "Cinic"), some undocumented conventions (like using "?" to represent a certainty level), and several sets of conventions, like "Hosp" or "Hospital" to represent an hospital.

R.2. Report on a donor's activities

Problem: an OCHA regional office is asked to provide a monthly report of how many aid activities are funded by DFID in each country/administrative-level-one/cluster combination.

Recipe: use hxlselect (command), hxlcount (command), and hxlrename (command) to filter and aggregate the regional 3W/4W data down to just a count of DFID activities in each #country/#adm1/#sector:

hxlselect --query funder=DFID <3w-data-in.csv | \
  hxlcount --tags country,adm1,sector | \
  hxlrename --rename x_count_num|activities_num \
  >dfid-report.csv

The hxlselect command removes all rows from the dataset where the column tagged #funder does not have the value "DFID".
The hxlcount command counts the number of times each unique combination of values in the #country, #adm1, and #sector columns appears, discarding all other columns, then adds a new column tagged #x_count_num with the total count for each combination.
The hxlrename command changes the generic hashtag #x_count_num to the more-descriptive #activities_num.

The resulting output (entirely made up for this recipe) might start like this:

#country	#adm1	#sector	#activities_num
Mali	Bamako	Education	7
Mali	Bamako	WASH	12
Mali	Tombouctou	Protection	1

R.3. Generate a subset of data

Problem: produce a report on annual refugee movements from Syria to Lebanon, Jordan, and Turkey beginning in 2010.

Solution: use hxlselect (command) to filter UNHCR's largerpopulation-movements dataset:

hxlselect --query 'period_date>=2010' < popstats-in.csv | \
  hxlselect --query origin=SYR | \
  hxlselect --query country=LEB --query country=JOR --query country=TUR \
  >refugee-report-out.csv

Filter the dataset to contain only records where the date is greater than or equal to 2010.
Filter the result to remove every row where the #origin is not "SYR" (Syria).
Filter the result to remove every row where the #country is not "LEB" (Lebanon), "JOR" (Jordan), or "TUR" (Turkey).

Here is the filtered result from a UNHCR dataset up to 2012:

#period_date	#country	#origin	#refugee_num	#reached_num
2010	JOR	SYR	198	198
2010	LEB	SYR	40	40
2010	TUR	SYR	9	9
2011	JOR	SYR	193	193
2011	LEB	SYR	93	93
2011	TUR	SYR	19	19
2012	JOR	SYR	238798	118908
2012	LEB	SYR	126755	126755
2012	TUR	SYR	248466	248466

R.4. Analyse historical population movements

Problem: to satisfy a media request, the UN wants to publish an automatically-updating report listing the average, minimum, and maximum number of refugees from Syria hosted in Lebanon, Jordan, and Turkey annually from 2010 forward.

Solution: use hxlselect (command) to filter UNHCR's population-movements data as in the previous recipe, then hxlcount (command) to produce an aggregate report, and hxlcut (command) to remove unwanted columns:

hxlselect --query 'period_date>=2010' < popstats-in.csv | \
  hxlselect --query origin=SYR | \
  hxlselect --query country=LEB --query country=JOR --query country=TUR | \
  hxlcount --tags country --aggregate refugee_num | \
  hxlcut --exclude x_count_num,x_sum_num \
  >refugee-averages-out.csv

The output from this pipeline on a UNHCR dataset (up to 2012) is as follows:

#country	#x_average_num	#x_min_num	#x_max_num
JOR	79729.66666666667	193.0	238798.0
LEB	42296.0	40.0	126755.0
TUR	82831.33333333333	9.0	248466.0

R.5. Alert for values past a threshold

TODO

R.6. Map activities for a specific cluster

TODO

Standard: http://hxlstandard.org | Mailing list: hxlproject@googlegroups.com

Home

For everyone
- Installation
- Command-line tools
For coders
Building blocks
HXL cookbook

Reporting recipes

Recipes

R.1. Find data outliers

R.2. Report on a donor's activities

R.3. Generate a subset of data

R.4. Analyse historical population movements

R.5. Alert for values past a threshold

R.6. Map activities for a specific cluster

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally