This repository provides a classification pipeline to categorize PDF pages from geological reports into document classes, with the goal of supporting document understanding and metadata extraction in the Assets platform. The solution can be used as a standalone API.
This classification helps to map individual pages in a document, which ultimately should facilitate the identification of borehole profiles and maps in PDFs to link between documents on Assets and boreprofiles on Boreholes.
The current API supports two endpoint versions: V1 with the latest changes (e.g., extended classes and a different response schema) and V0 for backwards compatibility.
Endpoints for V0:
- `/` - main document selection endpoint
- `/collect` - response collection

Endpoints for V1:
- `/v1` - main document selection endpoint
- `/v1/collect` - response collection
The request JSON body for all endpoints follows the same pattern: `{"file": "filename.pdf"}`
For each file, a response is compiled that classifies every page into one of the defined page classes.
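As a sketch, a request to one of the endpoints could be assembled as below. The base URL and port are assumptions; adjust them to your deployment. Only the standard library is used here for illustration.

```python
import json
import urllib.request

# Hypothetical base URL -- adjust host/port to your deployment.
BASE_URL = "http://localhost:8000"

def build_request(filename: str, endpoint: str = "/v1") -> urllib.request.Request:
    """Build a POST request carrying the {"file": ...} body described above."""
    body = json.dumps({"file": filename}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request("filename.pdf")
# The response would then be read with urllib.request.urlopen(req).
print(json.loads(req.data))  # {'file': 'filename.pdf'}
```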
Each page is categorized into one of the following:
- `Text` - Continuous text page.
- `Boreprofile` - Boreholes.
- `Maps` - Geological or topographic maps.
- `Title_Page` - Title pages of original reports.
- `Unknown` - Everything else.
Extended classes available in V1 are mapped to `Unknown` when running the V0 API version.
The V1 version extends the V0 class set. Each page is categorized into one of the following:
- `Text` - Continuous text page.
- `Boreprofile` - Boreholes.
- `Maps` - Geological or topographic maps.
- `TitlePage` - Title pages of original reports.
- `GeoProfile` - Geological cross-sections or longitudinal profiles.
- `Table` - Tabular numeric/textual data.
- `Diagram` - Scientific 2D graphs or plots.
- `Unknown` - Everything else.
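A minimal sketch of the V1-to-V0 fallback, assuming the renamed `TitlePage` corresponds to V0's `Title_Page` and every other extended class falls back to `Unknown` (the class names are taken from the lists above; the mapping function itself is an illustration, not the project's implementation):

```python
# V0 and V1 class sets as documented above.
V0_CLASSES = {"Text", "Boreprofile", "Maps", "Title_Page", "Unknown"}
V1_CLASSES = {
    "Text", "Boreprofile", "Maps", "TitlePage",
    "GeoProfile", "Table", "Diagram", "Unknown",
}

def to_v0_class(v1_class: str) -> str:
    """Map a V1 class name to its V0 equivalent."""
    # Assumption: TitlePage is the renamed Title_Page; extended classes
    # (GeoProfile, Table, Diagram) are not in V0 and fall back to Unknown.
    if v1_class == "TitlePage":
        return "Title_Page"
    return v1_class if v1_class in V0_CLASSES else "Unknown"

print(to_v0_class("GeoProfile"))  # Unknown
print(to_v0_class("Text"))        # Text
```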
Results are written to `data/prediction.json` (if `-w`/`--write_result` is set) or returned as a Python object.
{
"has_finished": true,
"data": [
{
"filename": "input.pdf",
"metadata": {
"page_count": 1,
"languages": [
"de"
]
},
"pages": [
{
"page": 1,
"classification": {
"Text": 0,
"Boreprofile": 1,
"Maps": 0,
"Title_Page": 0,
"Unknown": 0
},
"metadata": {
"language": "de",
"is_frontpage": false
}
}
]
}
]
}
V0 Notes:
- `filename`: The name of the processed PDF file.
- `metadata`: Metadata about the file.
- `pages`: List of dictionaries containing:
  - `page`: The page number (1-indexed).
  - `classification`: Classification of the current page:
    - 1: class was assigned to the page.
    - 0: class was not assigned.
  - `metadata`: Metadata about the current page.
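Since the V0 `classification` field is a one-hot dictionary, the assigned class can be recovered by picking the key whose value is 1. A small sketch using a page entry shaped like the example above:

```python
# Example page entry following the V0 response schema shown above.
page = {
    "page": 1,
    "classification": {
        "Text": 0,
        "Boreprofile": 1,
        "Maps": 0,
        "Title_Page": 0,
        "Unknown": 0,
    },
}

def assigned_class(page: dict) -> str:
    """Return the class assigned to a page (the key with value 1)."""
    return next(name for name, flag in page["classification"].items() if flag == 1)

print(assigned_class(page))  # Boreprofile
```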
{
"has_finished": true,
"data": [
{
"filename": "742_6.pdf",
"metadata": {
"page_count": 1,
"languages": [
"de"
]
},
"pages": [
{
"predicted_class": "Boreprofile",
"page_number": 1,
"page_metadata": {
"language": "de",
"is_frontpage": false
}
}
]
}
]
}
V1 Notes:
- `filename`: The name of the processed PDF file.
- `metadata`: Metadata about the file.
- `pages`: List of dictionaries containing:
  - `predicted_class`: The name of the predicted class (e.g. "Boreprofile"). All possible classes are listed above in the section "Classes".
  - `page_number`: The page number (1-indexed).
  - `page_metadata`: Metadata about the current page.
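With the V1 schema, collecting e.g. all borehole-profile pages across documents reduces to a simple filter over `data` and `pages`. A sketch over a trimmed-down response shaped like the example above:

```python
# Minimal V1 response excerpt following the schema above (metadata omitted).
response = {
    "has_finished": True,
    "data": [
        {
            "filename": "742_6.pdf",
            "pages": [
                {"predicted_class": "Boreprofile", "page_number": 1},
                {"predicted_class": "Text", "page_number": 2},
            ],
        }
    ],
}

def pages_with_class(response: dict, class_name: str) -> list[tuple[str, int]]:
    """Return (filename, page_number) pairs for pages with the given class."""
    return [
        (doc["filename"], page["page_number"])
        for doc in response["data"]
        for page in doc["pages"]
        if page["predicted_class"] == class_name
    ]

print(pages_with_class(response, "Boreprofile"))  # [('742_6.pdf', 1)]
```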
General Notes:
- The classifier supports batch input of multiple reports.
- Input must be preprocessed: PDFs should already have OCR.
- Classification is multi-class with a single label per page. Future updates may support multi-label classification.
Requirements: Python 3.10 (recommended), OCR'ed PDFs.
python -m venv venv
source venv/bin/activate
pip install .
For development, install optional tools with:
pip install '.[deep-learning,test,lint,experiment-tracking]'
Make sure you have `fasttext-predict` installed instead of `fasttext` (see 5. Setup FastText Language Detection).
cp .env.template .env
For development:
- Set `MLFLOW_TRACKING=True` in the `.env` file for experiment tracking.
- Option A: Download a pre-trained model from the S3 bucket `stijnvermeeren-assets-data`.
- Option B: Train your own model as described in Train your Model.
This project uses `fasttext-predict`, a lightweight, dependency-free wrapper exposing only the predict method. We use it because the original fastText project is archived. Download the fastText language identification model `lid.176.bin`:
mkdir -p models/FastText
curl -o models/FastText/lid.176.bin https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
Set in `.env`:
FASTTEXT_MODEL_PATH=models/FastText/lid.176.bin
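Once the model is downloaded, language detection might look like the sketch below. fastText returns labels such as `__label__de` that need stripping; the commented usage follows the `fasttext` module API that `fasttext-predict` re-implements, and is not executed here because it requires the downloaded model file:

```python
# fastText returns labels like ('__label__de',) together with probabilities.
def strip_label(label: str) -> str:
    """Convert a fastText label such as '__label__de' to a plain code 'de'."""
    return label.removeprefix("__label__")

# Usage with fasttext-predict (requires the model downloaded above):
#
#   import fasttext  # module name provided by the fasttext-predict package
#   model = fasttext.load_model("models/FastText/lid.176.bin")
#   labels, scores = model.predict("Dies ist ein Bohrprofil.")
#   strip_label(labels[0])  # e.g. 'de'

print(strip_label("__label__de"))  # de
```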
For development: Start MLflow UI:
mlflow ui
python main.py -i <input_path> -g <ground_truth_path> -c <classifier_name>
If no classifier is specified, the baseline classifier is used by default.
If the classifier is `layoutlmv3` or `treebased`, `--model_path` must be specified to locate the trained model.
| Classifier Name | Description |
|---|---|
| `baseline` | Default. Rule-based classifier using layout, keyword matching, and heuristics |
| `pixtral` | Uses Pixtral Large via Amazon Bedrock to classify PDF pages |
| `layoutlmv3` | Transformer model (pretrained or fine-tuned LayoutLMv3) |
| `treebased` | Feature-based model (RandomForest or XGBoost) |
Example
python main.py -i data/single_pages/ -g data/gt_single_pages.json -c baseline
To run classification using the Pixtral Large model, you must configure your AWS credentials:
- Ensure you have access to Amazon Bedrock and the Pixtral model.
- Set up your credentials:
  - AWS CLI:
    aws configure
  - Manually via config files. Create or edit `~/.aws/config`:
    [default]
    region=eu-central-1
    output=json
    and `~/.aws/credentials`:
    [default]
    aws_access_key_id=YOUR_ACCESS_KEY
    aws_secret_access_key=YOUR_SECRET_KEY
The dataset is stored in the S3 bucket stijnvermeeren-assets-data
, under the single_pages/
folder.
It contains categorized subfolders per class.
In addition, boreprofile data from the zurich
and geoquat/validation
folders used in the swissgeol-boreholes-dataextraction repository and stored in the S3 bucket stijnvermeeren-boreholes-data
can be classified and compared using existing ground truth.
- Single-page ground truths: `data/gt_single_pages.json`
- External evaluation sets:
  - Zurich: `data/gt_zurich.json`
  - GeoQuat: `data/gt_geoquat.json`
- `config/`: YAML configs (models, matching, prediction profiles)
- `data/`: Input data, predictions and ground truths
- `evaluation/`: Evaluation and metrics
- `models/`: Models (e.g. FastText, LayoutLMv3, TreeBased)
- `prompts/`: Pixtral prompts
- `src/`: Utility scripts and core logic
- `tests/`: Unit tests
- `main.py`: CLI entry point
- `api/`: API
Split data into train and validation set.
python scripts/split_data.py
# creates:
# data/single_pages_split/train/
# data/single_pages_split/val/
To train a LayoutLMv3 model, run:
python -m src.models.layoutlmv3.train \
    --config-file-path config/layoutlmv3_config.yaml \
    --out-directory models/layoutlmv3_output
# Optional argument:
#   --model-checkpoint models/layoutlmv3_pretrained_checkpoint
Arguments:
- `config_file_path`: Path to the YAML configuration file with model parameters and dataset paths.
- `out_directory`: Directory where the trained model will be saved.
- `model_checkpoint` (optional): Path to a pre-trained model checkpoint. If not provided, the model will be initialized from the Hugging Face hub based on the config.
The script supports freezing/unfreezing specific layers and uses the Hugging Face Trainer API under the hood.
To train a RandomForest or XGBoost classifier, use:
python -m src.models.treebased.train \
--config-file-path config/xgboost_config.yml \
--out-directory models/xgboost_model
- `config_file_path`: Path to the YAML config specifying hyperparameters and feature extraction settings.
- `out_directory`: Output path for the trained model.
If you're training an XGBoost model on macOS, you may encounter issues related to OpenMP. To resolve this, install the OpenMP library using Homebrew:
brew install libomp
We use pre-commit hooks to format our code in a unified way.
Pre-commit is included in the venv environment (installed as described above). After activating the environment, install the pre-commit hooks by running:
pre-commit install
This needs to be done only once.
After installation, pre-commit triggers hooks upon each `git commit -m ...` command. The hooks are applied to all files in the commit. A hook is simply a script specified in `.pre-commit-config.yaml`.
We use Ruff's pre-commit package for linting and formatting. It applies the same formatting as the VS Code Ruff extension would (v0.12.0).
If you want to skip the hooks, you can use `git commit -m "..." --no-verify`.
More information about pre-commit can be found here.