This project focuses on the automated analysis and processing of anamnesis forms commonly used in medical documentation.
The goal is to extract printed content from PDF forms, evaluate checkboxes, and match results against a predefined catalog of statements.
⚠️ Due to the sensitive nature of the data and documents, the source code of this repository is not publicly available.
The pipeline consists of several stages:
- PDF Conversion: Convert PDF files into high-resolution images.
- Image Preprocessing: Binarize and deskew images to enhance OCR performance.
- Text Extraction (OCR): Use Tesseract to extract text and detect checkboxes.
- Checkbox Analysis: Recognize and classify "Yes"/"No" checkbox markings.
- Catalog Matching: Match extracted sentences to a reference statement catalog.
- Export: Output the structured results into CSV files.
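Since the source code is not public, the last two stages (catalog matching and export) can only be illustrated with a minimal sketch. It assumes that matching is based on edit distance, which the Levenshtein dependency suggests but the description does not confirm. To stay dependency-free, the sketch uses a pure-Python edit distance instead of that package. All function names, the 0.8 threshold, and the CSV column names are hypothetical and not taken from the project.

```python
import csv
import io


def edit_distance(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def match_to_catalog(sentence, catalog, threshold=0.8):
    """Return the closest catalog statement, or None if nothing is close enough."""
    if not catalog:
        return None
    normalized = sentence.strip().lower()
    best = max(catalog, key=lambda entry: similarity(normalized, entry.lower()))
    return best if similarity(normalized, best.lower()) >= threshold else None


def rows_to_csv(rows) -> str:
    """Serialize (ocr_sentence, catalog_statement, answer) rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["ocr_sentence", "catalog_statement", "answer"])  # assumed columns
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    catalog = ["Do you smoke?", "Do you have allergies?"]
    ocr_sentence = "Do you sm0ke?"  # typical OCR confusion of "o" and "0"
    statement = match_to_catalog(ocr_sentence, catalog)
    print(rows_to_csv([(ocr_sentence, statement, "No")]))
```

Normalizing by the longer string keeps the threshold comparable between short and long statements. A fixed absolute distance would be too strict for long sentences and too lenient for short ones.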
pip install pdf2image
pip install pytesseract
pip install opencv-python
pip install numpy
pip install Levenshtein
Poppler is required for PDF-to-image conversion.
Windows:
- Download Poppler and add its bin folder to your system PATH
macOS:
brew install poppler
Linux:
sudo apt install poppler-utils
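To verify the Poppler installation before running the pipeline, a small check like the following may help. It looks for pdftoppm, one of the command-line tools shipped with Poppler, on the PATH. The helper name poppler_available is hypothetical and not part of the project.

```python
import shutil


def poppler_available() -> bool:
    """Return True if Poppler's pdftoppm executable is found on PATH."""
    return shutil.which("pdftoppm") is not None


if __name__ == "__main__":
    if poppler_available():
        print("Poppler found:", shutil.which("pdftoppm"))
    else:
        print("Poppler not found; check that its bin folder is on PATH.")
```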
pdf_to_images(pdf_path)
binarize_image(image)
deskew_image(img_path)
get_sentences(img)
get_keywords(words, df)
get_checkboxes(...)
filter_checkboxes_for_outliers(...)
export_to_csv(data, file)
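The implementation of these functions is not published. As an illustration of what an outlier filter for checkbox candidates could look like, the sketch below drops detected boxes whose area deviates strongly from the median area. The function name drop_size_outliers, the (x, y, width, height) box layout, and the 50% tolerance are assumptions made for this example. They do not describe the actual filter_checkboxes_for_outliers.

```python
from statistics import median


def drop_size_outliers(boxes, tolerance=0.5):
    """Keep boxes whose area lies within a relative tolerance of the median area.

    boxes: list of (x, y, width, height) tuples (assumed layout).
    tolerance: allowed relative deviation from the median area (assumed value).
    """
    if not boxes:
        return []
    areas = [width * height for _, _, width, height in boxes]
    typical = median(areas)
    return [
        box
        for box, area in zip(boxes, areas)
        if abs(area - typical) <= tolerance * typical
    ]


if __name__ == "__main__":
    candidates = [
        (10, 10, 20, 20),   # plausible checkbox
        (10, 50, 21, 19),   # plausible checkbox
        (10, 90, 20, 21),   # plausible checkbox
        (200, 5, 120, 80),  # far too large, e.g. a logo or table cell
        (5, 5, 3, 3),       # far too small, e.g. scan noise
    ]
    print(drop_size_outliers(candidates))
```

The median is used instead of the mean because a single very large contour would otherwise shift the reference size.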
- OCR Accuracy depends on scan quality and font.
- Checkbox Detection may fail with very small or unclear boxes.
- Only PDF and JPG inputs are currently supported.
This project was developed as part of a university project at FHNW. The project presentation is attached as a PDF file and provides an overview of the key processes, features, and outcomes.