
Page segmentation

CENL-AI-WG edited this page Feb 8, 2021 · 16 revisions


Detecting text and images in heritage documents

Keywords: page segmentation, document layout analysis, text line detection

Approaches: convolutional neural networks, synthetic data

Tools: docExtractor, dhSegment


Example

Goals

Extracting illustrations and text from heritage material, for example to support information retrieval.

Educational resources

docExtractor

docExtractor is a generic approach for extracting visual elements, such as text lines or illustrations, from historical documents. It can be used as an off-the-shelf system or fine-tuned on a specific dataset. It relies on a fast generator of rich synthetic documents for training and on a fully convolutional network for extraction. See the project's GitHub repository.
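To make the idea of training on synthetic documents concrete, here is a toy sketch of a page generator that returns both a per-pixel label grid and the matching text-line boxes, which is the kind of pair a segmentation network is trained on. It is not docExtractor's generator or API: the function name `make_synthetic_page`, the class constants, and the layout rules are all hypothetical, and a real generator would also render fonts, backgrounds, and degradations.

```python
import random

# Hypothetical class labels for this sketch (not docExtractor's own encoding).
BACKGROUND, TEXT_LINE, ILLUSTRATION = 0, 1, 2


def make_synthetic_page(width=64, height=96, seed=0):
    """Build a toy page: returns (label grid, list of text-line boxes).

    Boxes are (left, top, right, bottom) with right/bottom exclusive.
    """
    rng = random.Random(seed)  # seeded so each page is reproducible
    page = [[BACKGROUND] * width for _ in range(height)]
    margin = 4

    # One illustration: a rectangle near the top with random width and position.
    ill_h = height // 4
    ill_w = rng.randint(width // 3, width // 2)
    ill_left = rng.randint(margin, width - margin - ill_w)
    for y in range(margin, margin + ill_h):
        for x in range(ill_left, ill_left + ill_w):
            page[y][x] = ILLUSTRATION

    # Text lines: 3-pixel-high bands below the illustration, ragged right edge.
    boxes = []
    y = margin + ill_h + 4
    while y + 3 <= height - margin:
        right = rng.randint(width // 2, width - margin)
        for yy in range(y, y + 3):
            for x in range(margin, right):
                page[yy][x] = TEXT_LINE
        boxes.append((margin, y, right, y + 3))
        y += 6  # line height plus inter-line gap
    return page, boxes
```

Because the labels come for free with each generated page, varying `seed` yields as many annotated training samples as needed without manual annotation.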

EnHerit (Enhancing Heritage Image Databases) project, ANR (2018-2022), LIGM Laboratoire d'Informatique Gaspard-Monge, France

dhSegment

dhSegment is a tool for historical document processing. Its generic approach makes it possible to segment regions and extract content from different types of documents. See the project's GitHub repository.
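As an illustration of the "segment regions, then extract content" step, the sketch below turns a binary segmentation mask into bounding boxes using 4-connected component labeling. It assumes the segmentation stage has already produced a mask of 0s and 1s; the function `mask_to_boxes` and its `min_area` parameter are hypothetical and are not part of dhSegment's API.

```python
from collections import deque


def mask_to_boxes(mask, min_area=1):
    """Return (left, top, right, bottom) boxes for 4-connected foreground regions.

    `mask` is a list of rows of 0/1 values; right and bottom are exclusive.
    Regions with fewer than `min_area` pixels are dropped as noise.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            left, top, right, bottom = x0, y0, x0, y0
            area = 0
            while queue:
                x, y = queue.popleft()
                area += 1
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if area >= min_area:
                boxes.append((left, top, right + 1, bottom + 1))
    return boxes


# Usage: two separate regions in a small mask.
example_mask = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
example_boxes = mask_to_boxes(example_mask)
```

Each resulting box can then be used to crop the page image, so that the cropped region is passed on to later steps such as text recognition or illustration indexing.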

DHLAB-EPFL, Switzerland
