Page segmentation

Detecting text and images on heritage documents

Keywords: page segmentation, document layout analysis, text line detection

Approaches: convolutional neural networks, synthetic data

Tools: docExtractor, dhSegment

Example

Goals

Extract illustrations and text from heritage material, for purposes such as information retrieval.

Educational resources

docExtractor

docExtractor is a generic approach for extracting visual elements, such as text lines or illustrations, from historical documents. It can be used as an off-the-shelf system or fine-tuned on a specific dataset. It relies on a fast generator of rich synthetic documents for training and a fully convolutional network for extraction. See the docExtractor GitHub repository.
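To illustrate why synthetic documents are convenient for training a segmentation network, here is a minimal sketch that generates a toy page together with its per-pixel ground truth. It is not docExtractor's generator: the function `make_synthetic_page`, the three label constants, and all sizes and gray levels are assumptions chosen for illustration. The point it shows is that the image and its label map are produced in the same step, so no manual annotation is needed.

```python
import random

# Hypothetical class ids for this sketch (not taken from docExtractor).
BACKGROUND, TEXT_LINE, ILLUSTRATION = 0, 1, 2


def make_synthetic_page(height=40, width=30, n_lines=5, seed=0):
    """Return (page, labels): a toy grayscale page and its per-pixel label map."""
    rng = random.Random(seed)
    page = [[255] * width for _ in range(height)]           # white paper
    labels = [[BACKGROUND] * width for _ in range(height)]  # ground truth

    # One illustration block near the top of the page.
    ill_h = rng.randint(6, 10)
    ill_w = rng.randint(10, width - 4)
    left = rng.randint(2, width - ill_w - 2)
    for y in range(2, 2 + ill_h):
        for x in range(left, left + ill_w):
            page[y][x] = rng.randint(60, 200)  # mid-gray texture
            labels[y][x] = ILLUSTRATION

    # Text lines below the illustration, each two pixels tall.
    y = 2 + ill_h + 3
    for _ in range(n_lines):
        if y + 2 > height - 2:  # stop before running off the page
            break
        line_w = rng.randint(width // 2, width - 4)
        for dy in range(2):
            for x in range(2, 2 + line_w):
                page[y + dy][x] = rng.randint(0, 50)  # dark ink
                labels[y + dy][x] = TEXT_LINE
        y += 4
    return page, labels


page, labels = make_synthetic_page(seed=1)
```

A real generator would compose scanned backgrounds, fonts, and image crops rather than flat rectangles, but the pairing of each page with an automatically derived label map is the same idea.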

EnHerit (Enhancing Heritage Image Databases) project, ANR (2018-2022), LIGM Laboratoire d'Informatique Gaspard-Monge, France

dhSegment

dhSegment is a tool for historical document processing. Its generic approach makes it possible to segment regions and extract content from different types of documents. See the dhSegment GitHub repository.
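The step from a segmentation result to extracted content can be sketched as follows. This is a generic post-processing example and not dhSegment's API: it assumes the network output is a per-pixel probability map for one class, and the helpers `extract_regions` and `crop`, the threshold, and the minimum area are hypothetical. It thresholds the map, groups neighbouring pixels into connected regions, and returns one bounding box per region.

```python
from collections import deque


def extract_regions(prob_map, threshold=0.5, min_area=4):
    """Return inclusive boxes (top, left, bottom, right) of connected regions
    whose pixels all have probability >= threshold."""
    height = len(prob_map)
    width = len(prob_map[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if seen[y0][x0] or prob_map[y0][x0] < threshold:
                continue
            # Breadth-first flood fill over 4-connected neighbours.
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            top, left, bottom, right = y0, x0, y0, x0
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx]
                            and prob_map[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area >= min_area:  # drop tiny regions, likely noise
                boxes.append((top, left, bottom, right))
    return boxes


def crop(image, box):
    """Cut the boxed region out of a 2D image given as a list of rows."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]
```

Calling `extract_regions` on the probability map and then `crop` on the original page image with each returned box yields one sub-image per detected region. Production tools typically add smoothing and polygon fitting instead of plain axis-aligned boxes.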

DHLAB-EPFL, Switzerland

