- 
                Notifications
    You must be signed in to change notification settings 
- Fork 11
Select rows filter
TODO: update to include Python and the HXL Proxy as well.
The hxlselect command-line tool creates a new copy of a HXL dataset including only the rows that match certain criteria. You can use this tool to create (for example) a dataset containing only information for a specific geographical region, sector, beneficiary group, etc.
This command is especially useful in pipelines: you can use it to select only certain rows for other commands to act on. If you provide multiple filters, they act as a logical OR; use hxlselect again in a pipeline to get a logical AND.
For example, to create a GeoJSON map layer of only rows related to the WASH sector, you can use a pipeline like
hxlselect --query sector=WASH my-data.csv | \
  hxl2geojson > map-layer.json
There are many more similar examples in the HXL cookbook.
usage: hxlselect [-h] [-q tag=value] [-v] [infile] [outfile]
Filter rows in a HXL dataset.
positional arguments:
  infile                HXL file to read (if omitted, use standard input).
  outfile               HXL file to write (if omitted, use standard output).
optional arguments:
  -h, --help            show this help message and exit
  -q tag=value, --query tag=value
                        expression for filtering (use multiple times for
                        logical OR): <hashtag><op><value>, where <op> is =,
                        !=, <, <=, >, >=, ~, or !~
  -v, --invert          Show only lines *not* matching criteria
Note that empty fields always fail a comparison. That means that --query adm1!=XXX will not match an empty field, only one that has an explicit value.
hxlselect supports the following comparison operators:
| = | The hashtag value is identical to the value provided. | 
|---|---|
| != | The hashtag value is not identical to the value provided. | 
| < | The hashtag value is less than the value provided (may be unpredictable for non-numeric values). | 
| <= | The hashtag value is less than or equal to the value provided (may be unpredictable for non-numeric values). | 
| > | The hashtag value is greater than the value provided (may be unpredictable for non-numeric values). | 
| >= | The hashtag value is greater than or equal to the value provided (may be unpredictable for non-numeric values). | 
| ~ | The hashtag value matches the [regular-expression pattern](https://docs.python.org/2/library/re.html#regular-expression-syntax) provided. | 
| !~ | The hashtag value does not match the regular-expression pattern provided. | 
Starting dataset:
| Implementing organisation | Cluster or sector | District | Beneficiaries | 
|---|---|---|---|
| #org | #sector | #adm1 | #targeted_num | 
| Org1 | Health | Coast | 10,000 | 
| Org1 | Education | Coast | 25,000 | 
| Org2 | Health | Mountains | 3,500 | 
Generate a new dataset, including only the rows where the #org is "Org1":
hxlselect --query org=Org1 \
  <dataset-in >dataset-out
Result:
| Implementing organisation | Cluster or sector | District | Beneficiaries | 
|---|---|---|---|
| #org | #sector | #adm1 | #targeted_num | 
| Org1 | Health | Coast | 10,000 | 
| Org1 | Education | Coast | 25,000 | 
Generate a new dataset where the rows are only those that don't contain a #sector equal to "Education":
hxlselect --invert --query sector=Education \
  <dataset-in.csv >dataset-out.csv
Result:
| Implementing organisation | Cluster or sector | District | Beneficiaries | 
|---|---|---|---|
| #org | #sector | #adm1 | #targeted_num | 
| Org1 | Health | Coast | 10,000 | 
| Org2 | Health | Mountains | 3,500 | 
Include only the rows where #target_num is greater than or equal to 10,000:
hxlselect --query 'targeted_num>=10000' \
  <dataset-in.csv >dataset-out.csv
| Implementing organisation | Cluster or sector | District | Beneficiaries | 
|---|---|---|---|
| #org | #sector | #adm1 | #targeted_num | 
| Org1 | Health | Coast | 10,000 | 
| Org1 | Education | Coast | 25,000 | 
Standard: http://hxlstandard.org | Mailing list: hxlproject@googlegroups.com