Page segmentation
Detecting text and images on heritage documents
Keywords: page segmentation, document layout analysis, text line detection
Approaches: convolutional neural networks, synthetic data
Tools: docExtractor, dhSegment

Extraction of illustrations and text from heritage material, for example for information retrieval.
docExtractor is a generic approach for extracting visual elements such as text lines or illustrations from historical documents. It can be used as an off-the-shelf system or fine-tuned on a specific dataset. It relies on a fast generator of rich synthetic documents for training and on a fully convolutional network for extraction. See its GitHub repository.
EnHerit (Enhancing Heritage Image Databases) project, ANR (2018-2022), LIGM Laboratoire d'Informatique Gaspard-Monge, France
dhSegment is a tool for historical document processing. Its generic approach makes it possible to segment regions and extract content from different types of documents. See its GitHub repository.
DHLAB-EPFL, Switzerland
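
A fully convolutional network, such as the one docExtractor relies on, outputs a per-pixel label map. That map still has to be turned into discrete elements (text lines, illustrations). The sketch below shows one simple way to do this, by grouping connected pixels of the same class into bounding boxes. It is only an illustration and not the actual post-processing of docExtractor or dhSegment: the function name `extract_boxes`, the toy mask, and the class encoding (0 = background, 1 = text line, 2 = illustration) are all assumptions made for this example.

```python
from collections import deque


def extract_boxes(mask, label):
    """Return (top, left, bottom, right) boxes of 4-connected regions of `label`.

    `mask` is a list of rows of integer class labels, standing in for the
    per-pixel prediction of a segmentation network (hypothetical encoding).
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if mask[r][c] != label or seen[r][c]:
                continue
            # Breadth-first search over the region that contains (r, c).
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and mask[ny][nx] == label):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Toy label map (assumed encoding): 0 = background, 1 = text line, 2 = illustration.
TOY_MASK = [
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
]

text_line_boxes = extract_boxes(TOY_MASK, 1)
illustration_boxes = extract_boxes(TOY_MASK, 2)
```

On this toy mask, the two separate runs of label 1 are returned as two text-line boxes, and the 2x2 block of label 2 as a single illustration box. Real pipelines usually add steps such as thresholding of class probabilities and filtering of very small regions before this grouping.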