Page Classification for Geological Documents in Assets

Purpose

This repository provides a classification pipeline to categorize PDF pages from geological reports into document classes, with the goal of supporting document understanding and metadata extraction in the Assets platform. The solution can be used as a standalone API.

This classification maps individual pages within a document, which should ultimately facilitate identifying borehole profiles and maps in PDFs and linking documents on Assets to borehole profiles on Boreholes.

API endpoints

The API currently supports two endpoint versions: V1 with the latest changes (e.g., extended classes and a different response schema) and V0 for backwards compatibility.

Endpoints for V0:

  • / - main document selection endpoint
  • /collect - response collection

Endpoints for V1:

  • /v1 - main document selection endpoint
  • /v1/collect - response collection

The request JSON body structure for all the endpoints follows the same pattern: {"file": "filename.pdf"}
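
For example, sending a classification request from Python might look like this (a minimal sketch; it assumes a POST request against an instance running locally on port 8000, so adjust the base URL to your deployment):

import requests

BASE_URL = "http://localhost:8000"  # assumption: locally running instance

payload = {"file": "input.pdf"}

# V1 main document selection endpoint; use "/" instead of "/v1" for V0.
response = requests.post(f"{BASE_URL}/v1", json=payload)
response.raise_for_status()
print(response.json())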

Classes

For each file, a response is compiled that classifies each page into one of the defined page classes.

V0 version

Each page is categorized into one of the following:

  1. Text - Continuous text page.
  2. Boreprofile - Boreholes.
  3. Maps - Geological or topographic maps.
  4. Title_Page - Title pages of original reports.
  5. Unknown - Everything else.

Extended classes available in the V1 version are mapped to Unknown when running the V0 API version (see the mapping sketch after the V1 class list below).

V1 version

The V1 version extends the V0 classes, and each page is categorized into one of the following:

  1. Text - Continuous text page.
  2. Boreprofile - Boreholes.
  3. Maps - Geological or topographic maps.
  4. TitlePage - Title pages of original reports.
  5. GeoProfile - Geological cross-sections or longitudinal profiles.
  6. Table - Tabular numeric/textual data.
  7. Diagram - Scientific 2D graphs or plots.
  8. Unknown - Everything else.
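
The relationship between the two class sets described above can be sketched as a simple mapping (for illustration only; the API performs this mapping internally when the V0 endpoints are used):

V1_TO_V0 = {
    "Text": "Text",
    "Boreprofile": "Boreprofile",
    "Maps": "Maps",
    "TitlePage": "Title_Page",
    "GeoProfile": "Unknown",  # extended class, not available in V0
    "Table": "Unknown",       # extended class, not available in V0
    "Diagram": "Unknown",     # extended class, not available in V0
    "Unknown": "Unknown",
}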

Output Format

Results are written to data/prediction.json (when -w/--write_result is set) or returned as a Python object.

Example Output (v0)

{
	"has_finished": true,
	"data": [
		{
			"filename": "input.pdf",
			"metadata": {
				"page_count": 1,
				"languages": [
					"de"
				]
			},
			"pages": [
				{
					"page": 1,
					"classification": {
						"Text": 0,
						"Boreprofile": 1,
						"Maps": 0,
						"Title_Page": 0,
						"Unknown": 0
					},
					"metadata": {
						"language": "de",
						"is_frontpage": false
					}
				}
			]
		}
	]
}

V0 Notes:

  • filename: The name of the processed PDF file.
  • metadata: metadata about the file.
  • pages: list of dictionaries containing:
    • page: The page number (1-indexed).
    • classification: Classification of the current page:
      • 1: class was assigned to the page.
      • 0: class was not assigned.
    • metadata: metadata about the current page.
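
Since the V0 classification is one-hot, the assigned class of a page can be recovered with a small sketch like this (assuming exactly one class is set to 1 per page):

classification = {"Text": 0, "Boreprofile": 1, "Maps": 0, "Title_Page": 0, "Unknown": 0}

# Pick the class whose indicator is set.
assigned_class = next(name for name, flag in classification.items() if flag == 1)
print(assigned_class)  # Boreprofile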

Example Output (v1)

{
	"has_finished": true,
	"data": [
		{
			"filename": "742_6.pdf",
			"metadata": {
				"page_count": 1,
				"languages": [
					"de"
				]
			},
			"pages": [
				{
					"predicted_class": "Boreprofile",
					"page_number": 1,
					"page_metadata": {
						"language": "de",
						"is_frontpage": false
					}
				}
			]
		}
	]
}

V1 Notes:

  • filename: The name of the processed PDF file.
  • metadata: metadata about the file.
  • pages: list of dictionaries containing:
    • predicted_class: The class name of the predicted class (e.g. "Boreprofile"). All possible classes are listed above in the section "Classes".
    • page_number: The page number (1-indexed).
    • page_metadata: metadata about the current page.
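
As an illustration, a V1 prediction file written with -w/--write_result can be read and summarized like this (a minimal sketch using only the standard library):

import json

with open("data/prediction.json", encoding="utf-8") as f:
    result = json.load(f)

for document in result["data"]:
    print(document["filename"], document["metadata"]["languages"])
    for page in document["pages"]:
        print(f'  page {page["page_number"]}: {page["predicted_class"]}')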

General Notes:

  • The classifier supports batch input of multiple reports.
  • Input must be preprocessed: PDFs should already have OCR.
  • Classification is multi-class with a single label per page. Future updates may support multi-label classification.

Development quick start

Requirements: Python 3.10 (recommended), OCR'ed PDFs.

1. Create and activate a virtual environment

python -m venv venv
source venv/bin/activate

2. Install dependencies

pip install .

For development, install optional tools with:

pip install '.[deep-learning,test,lint,experiment-tracking]'

Make sure you have fasttext-predict installed instead of fasttext (see step 5, Setup FastText Language Detection).

3. Copy .env.template and specify your paths:

cp .env.template .env

For development:

  • Set MLFLOW_TRACKING=True in .env file for experiment tracking.

4. (Optional) Use a pre-trained model:

5. Setup FastText Language Detection

This project uses fasttext-predict, a lightweight, dependency-free wrapper that exposes only the predict method. We use it because the original FastText library is archived. Download the FastText language identification model lid.176.bin from the official distribution:

mkdir -p models/FastText
curl -o models/FastText/lid.176.bin https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

Set in .env:

 FASTTEXT_MODEL_PATH=models/FastText/lid.176.bin
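
With the model in place, language detection via fasttext-predict can be sketched as follows (a minimal example; the pipeline itself reads the path from FASTTEXT_MODEL_PATH in .env):

import os

import fasttext  # provided by the fasttext-predict package

model_path = os.getenv("FASTTEXT_MODEL_PATH", "models/FastText/lid.176.bin")
model = fasttext.load_model(model_path)

labels, scores = model.predict("Dies ist ein geologischer Bericht.")
print(labels[0], scores[0])  # e.g. __label__de with its confidence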

6. (Optional) Start the MLflow UI

For development, start the MLflow UI with:

mlflow ui

7. Run the classification:

python main.py -i <input_path> -g <ground_truth_path> -c <classifier_name> 

If no classifier is specified, the baseline classifier is used by default. If the classifier is layoutlmv3 or treebased, --model_path must be specified to locate the trained model.

Classifier Name   Description
baseline          Default. Rule-based classifier using layout, keyword matching, and heuristics
pixtral           Uses Pixtral Large via Amazon Bedrock to classify PDF pages
layoutlmv3        Transformer model (pretrained or fine-tuned LayoutLMv3)
treebased         Feature-based model (RandomForest or XGBoost)

Example

python main.py -i data/single_pages/ -g data/gt_single_pages.json -c baseline

AWS Setup for pixtral Classifier

To run classification using the Pixtral Large Model, you must configure your AWS credentials:

  1. Ensure you have access to Amazon Bedrock and the Pixtral model.

  2. Set up your credentials:

    1. AWS CLI

    aws configure

    2. Manually via config files

    Create or edit the following files: ~/.aws/config

    [default]
    region=eu-central-1
    output=json
    

    ~/.aws/credentials

    [default]
    aws_access_key_id=YOUR_ACCESS_KEY
    aws_secret_access_key=YOUR_SECRET_KEY
    

Data

The dataset is stored in the S3 bucket stijnvermeeren-assets-data, under the single_pages/ folder. It contains categorized subfolders per class. In addition, boreprofile data from the zurich and geoquat/validation folders used in the swissgeol-boreholes-dataextraction repository and stored in the S3 bucket stijnvermeeren-boreholes-data can be classified and compared using existing ground truth.
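
Access to the buckets requires suitable AWS credentials. A sketch for downloading the single_pages/ folder with boto3 (bucket name and prefix as mentioned above; the local target directory is an assumption):

import pathlib

import boto3

bucket = "stijnvermeeren-assets-data"
prefix = "single_pages/"
target = pathlib.Path("data/single_pages")

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):
            continue  # skip folder placeholder objects
        destination = target / pathlib.Path(key).relative_to(prefix)
        destination.parent.mkdir(parents=True, exist_ok=True)
        s3.download_file(bucket, key, str(destination))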

Ground Truth

  • Single-page ground truths: data/gt_single_pages.json
  • External evaluation sets:
    • Zurich: data/gt_zurich.json
    • GeoQuat: data/gt_geoquat.json

Repository Structure

  • config/: YAML configs (models, matching, prediction profiles)
  • data/: input data, predictions, and ground truths
  • evaluation/: Evaluation and metrics
  • models/: Models (e.g. FastText, LayoutLMv3, TreeBased)
  • prompts/: Pixtral prompts
  • src/: Utility scripts and core logic
  • tests/: Unit tests
  • main.py: CLI entry point
  • api/: API

Train your Model

Split data

Split data into train and validation set.

python scripts/split_data.py
# creates:
# data/single_pages_split/train/
# data/single_pages_split/val/

Train LayoutLMv3

To train a LayoutLMv3 model, run:

# The --model-checkpoint argument is optional.
python -m src.models.layoutlmv3.train \
    --config-file-path config/layoutlmv3_config.yaml \
    --out-directory models/layoutlmv3_output \
    --model-checkpoint models/layoutlmv3_pretrained_checkpoint

Arguments:

  • config_file_path: Path to the YAML configuration file with model parameters and dataset paths.
  • out_directory: Directory where the trained model will be saved.
  • model_checkpoint (optional): Path to a pre-trained model checkpoint. If not provided, the model will be initialized from the Hugging Face hub based on the config.

The script supports freezing/unfreezing specific layers and uses the Hugging Face Trainer API under the hood.
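
For illustration only (the actual layer selection is driven by the YAML configuration), freezing part of a LayoutLMv3 model before handing it to the Hugging Face Trainer could look like this:

from transformers import LayoutLMv3ForSequenceClassification

# Hypothetical sketch: load a base checkpoint and freeze the embedding layers.
model = LayoutLMv3ForSequenceClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=8
)
for name, parameter in model.named_parameters():
    if name.startswith("layoutlmv3.embeddings"):
        parameter.requires_grad = False  # excluded from gradient updates

# The partially frozen model is then passed to transformers.Trainer as usual.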

Train TreeBased (RandomForest or XGBoost)

To train a RandomForest or XGBoost classifier, use:

python -m src.models.treebased.train \
    --config-file-path config/xgboost_config.yml \
    --out-directory models/xgboost_model
  • config_file_path: Path to the YAML config specifying hyperparameters and feature extraction settings.
  • out_directory: Output path for the trained model.

If you're training an XGBoost model on macOS, you may encounter issues related to OpenMP. To resolve this, install the OpenMP library using Homebrew:

brew install libomp

Pre-Commit

We use pre-commit hooks to format our code in a unified way.

Pre-commit is included in the venv environment (installed as described above). After activating the environment, install the pre-commit hooks by running:

pre-commit install

This needs to be done only once.

After installing pre-commit, hooks are triggered on each git commit -m ... command and applied to all files in the commit. A hook is simply a script specified in .pre-commit-config.yaml.

We use Ruff's pre-commit package for linting and formatting. It applies the same formatting as the VS Code Ruff extension (v0.12.0).

If you want to skip the hooks, you can use git commit -m "..." --no-verify.

More information about pre-commit can be found here.
