
Page segmentation

CENL-AI-WG edited this page Dec 9, 2020 · 16 revisions


Detecting text and images in heritage documents

Keywords: page segmentation, document layout analysis, text line detection

Approaches: convolutional neural networks, synthetic data

Tools: docExtractor


docExtractor is a generic approach for extracting visual elements, such as text lines or illustrations, from historical documents. It can be used as an off-the-shelf system or fine-tuned on a specific dataset. It relies on a fast generator of rich synthetic documents for training and on a fully convolutional network for extraction.
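To make the extraction step more concrete, here is a minimal sketch of how a per-pixel label map, such as a fully convolutional network might output, could be turned into bounding boxes for text and illustration regions. This is not docExtractor's code or API: the label ids, the `extract_regions` function, and the `min_area` parameter are all hypothetical names chosen for illustration, and the sketch uses plain 4-connected flood fill rather than whatever post-processing the tool itself applies.

```python
from collections import deque

# Hypothetical label ids; the label set used by docExtractor may differ.
BACKGROUND, ILLUSTRATION, TEXT = 0, 1, 2


def extract_regions(label_map, min_area=2):
    """Return (label, (top, left, bottom, right)) for each 4-connected region.

    label_map is a list of rows of integer labels, standing in for the
    per-pixel prediction of a segmentation network. Regions smaller than
    min_area pixels are dropped as noise.
    """
    height = len(label_map)
    width = len(label_map[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    regions = []
    for r in range(height):
        for c in range(width):
            label = label_map[r][c]
            if label == BACKGROUND or seen[r][c]:
                continue
            # Breadth-first flood fill over neighbouring pixels with the same label.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right, area = r, c, r, c, 0
            while queue:
                y, x = queue.popleft()
                area += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and label_map[ny][nx] == label):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area >= min_area:
                regions.append((label, (top, left, bottom, right)))
    return regions


# Toy 5x6 "page": one text line, one illustration, one stray text pixel.
toy = [
    [0, 0, 0, 0, 0, 0],
    [0, 2, 2, 2, 2, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 2],
    [0, 1, 1, 0, 0, 0],
]
print(extract_regions(toy))
```

On the toy map, the single stray text pixel falls below `min_area` and is discarded, while the text line and the illustration are each returned with an inclusive bounding box. A real pipeline would work on full-resolution network output and would likely need more careful filtering and line separation.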

Example

Goals

Extraction of illustrations and text from heritage material, for information retrieval and related uses.

Educational resources

See this GitHub repository.
