pdf-title-page-splitter is a command line tool (with limited UI support) that splits a pdf based on identified title pages. The title pages are identified using a machine learning model. The tool supports both training a model and using a trained model to split pdf files.
There are a plethora of tools that allow splitting of pdf files, so why this tool?
Consider a pdf like this: Economic and Political Weekly - Volume 26. It has 1200 pages and 210.4 MB of data, which makes it difficult to read and handle. This pdf file actually contains multiple issues of volume 26 of Economic and Political Weekly. Ideally, it should be split into multiple pdf files, one for each issue of the volume.
But this still doesn't answer the original question: one can easily split a pdf using any other tool. However, if you have to do this for hundreds of such files, the task becomes daunting. This tool helps by training an ML model that identifies title pages within a given pdf and then splits the pdf into multiple pdfs, one per issue.
In the above specific example, the following is a title page:
Using pdf-title-page-splitter, we train a model to identify such title pages and split the pdf into multiple issues.
pdf-title-page-splitter requires Python 3 and the other dependencies listed in the requirements.txt file.
Before running pdf-title-page-splitter, the Python environment needs to be set up correctly. Here we create a Python virtual environment and install all the dependencies. The instructions are written for Linux, but they should be identical for any UNIX-like operating system.
Change to the folder/directory containing requirements.txt and run the following commands:
python -m venv venv
. venv/bin/activate
pip install -r requirements.txt
Creating the virtual environment and installing the dependencies is a one-time process. In subsequent runs you only need to activate the virtual environment:
. venv/bin/activate
To deactivate the virtual environment, run the deactivate command.
A model is trained using the create command. It supports the following command line options:
$ python pdf-title-page-splitter.py create -h
usage: pdf-title-page-splitter.py create [-h] [-s SAVE_PATH] [-p PARALLELISM] [--pdf pdf [title-pages ...]] files [files ...]
positional arguments:
files Pdf files to be used
options:
-h, --help show this help message and exit
-s, --save-path SAVE_PATH
Save path (default: model.pkl)
-p, --parallelism PARALLELISM
Number of parallel pages to process (default: number of cores)
--pdf pdf [title-pages ...]
Specify a file and comma separated title pages pair. Can be used multiple times.
The above command creates a model file, model.pkl, trained on the specified pdf files to predict title pages for an unseen pdf.
Consider the case where you have already identified the title pages of some pdfs. You can create a model like so:
python3 pdf-title-page-splitter.py \
create \
--save-path model.pkl \
--pdf 'first-pdf.pdf' 5 69 143 201 239 312 \
--pdf 'second-pdf.pdf' 2 45 100 189 234 301
Here page numbers 5, 69, 143, 201, 239 and 312 are title pages identified in first-pdf.pdf. Likewise for the other pdf.
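When title pages have been labelled for many pdfs, typing the --pdf groups by hand gets error-prone. The following is a minimal sketch of building the same create invocation from a dictionary. The helper name build_create_args and the labelled mapping are illustrative assumptions, not part of the tool; the page numbers are passed space-separated, as in the example above.

```python
def build_create_args(labelled, save_path="model.pkl",
                      script="pdf-title-page-splitter.py"):
    """Build the argument list for the create command.

    labelled maps a pdf path to its list of title page numbers
    (hypothetical input format, chosen for this sketch only).
    """
    args = ["python3", script, "create", "--save-path", save_path]
    for pdf, pages in labelled.items():
        # One --pdf group per file: the path followed by its title pages.
        args.append("--pdf")
        args.append(pdf)
        args.extend(str(page) for page in pages)
    return args


labelled = {
    "first-pdf.pdf": [5, 69, 143, 201, 239, 312],
    "second-pdf.pdf": [2, 45, 100, 189, 234, 301],
}
args = build_create_args(labelled)
print(" ".join(args))
```

The resulting list can be handed to subprocess.run(args, check=True) from the standard library to actually train the model.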
If instead you have a bunch of pdf files whose first page is the title page (essentially, files that you have already split), you can use the following command to create a model file model.pkl.
python3 pdf-title-page-splitter.py \
create \
--save-path model.pkl \
'first-pdf.pdf' \
'second-pdf.pdf' \
'third-pdf.pdf' \
'fourth-pdf.pdf'
The predict command predicts title pages for a bunch of pdfs, given a model file generated by the create command. The identified title pages are saved in JSON format (the default filename is titles.json) for subsequent processing.
$ python pdf-title-page-splitter.py predict -h
usage: pdf-title-page-splitter.py predict [-h] [-m MODEL_PATH] [-s SAVE_PATH] [-b BEGINNING_PAGE] [-e ENDING_PAGE]
[-p PARALLELISM]
pdf [pdf ...]
positional arguments:
pdf Pdf files to be used
options:
-h, --help show this help message and exit
-m, --model-path MODEL_PATH
Model file path (default: model.pkl)
-s, --save-path SAVE_PATH
Save path (default: titles.json)
-b, --beginning-page BEGINNING_PAGE
Starting page of pdf file (default=1)
-e, --ending-page ENDING_PAGE
Ending page of pdf file (default=last page)
-p, --parallelism PARALLELISM
Number of parallel pages to process (default: number of cores)
The following command identifies title pages in all pdf files in the /tmp directory and saves the result to my-titles.json.
python3 pdf-title-page-splitter.py \
predict \
--model-path model.pkl \
--save-path my-titles.json \
/tmp/*.pdf
A sample titles.json file generated could be:
{
    "Economic.And.Political.Volume.xviii.No27.pdf": [],
    "Economic And Political Weekly Vol.-xxii-no.49-inernetdli2015121078.pdf": [
        2,
        58,
        114,
        170
    ],
    "Economic And Political Weekly Vol-XXVIII -- Sachin Chaudhuri -- 1995 -- Economic And Political Weekly Vol-XXVIII -- aa94fb82ae80ea6297ec3a739a400d22 -- Anna\u2019s Archive.pdf": [
        5,
        77,
        149,
        261,
        325,
        389,
        495,
        555,
        623
    ]
}
Essentially, it contains the page numbers of the identified title pages. If no title page was identified for a file, its list is empty (for example, Economic.And.Political.Volume.xviii.No27.pdf above).
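Because the result is plain JSON, it is easy to inspect before moving on to review. Below is a small sketch that separates files with predicted title pages from files with none. The function name summarize_titles and the short file names are hypothetical, and the inline string stands in for the contents of a real titles.json.

```python
import json


def summarize_titles(titles):
    """Split a titles.json-style mapping into files with and without title pages."""
    found = {pdf: pages for pdf, pages in titles.items() if pages}
    missing = sorted(pdf for pdf, pages in titles.items() if not pages)
    return found, missing


# Stand-in for json.load(open("titles.json")); file names are made up.
sample = json.loads('{"a.pdf": [], "b.pdf": [2, 58, 114, 170]}')
found, missing = summarize_titles(sample)

for pdf, pages in found.items():
    print(f"{pdf}: {len(pages)} title pages, first at page {pages[0]}")
for pdf in missing:
    print(f"{pdf}: no title page identified")
```

Files that end up in the missing list are good candidates for a closer look during the review step.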
Once the title pages have been identified by pdf-title-page-splitter, the next step is to visually check whether the identified title pages are correct and to fix any mistakes.
The show command presents each title page to you, and you can accept, reject or substitute the given title page for each pdf file in titles.json. The show command itself is split into two subcommands: run and from.
The run subcommand combines the predict and show commands into a single step.
The from subcommand reads the titles.json file produced by the predict command, as described in the previous section, and presents the review UI described above.
Note: although this is not recommended, you can skip this step and go directly to splitting the pdfs.
Command line options for show command:
$ python pdf-title-page-splitter.py show -h
usage: pdf-title-page-splitter.py show [-h] {run,from} ...
positional arguments:
{run,from} Available sub commands
run Run predict and show pages
from Load saved data from file and show pages
options:
-h, --help show this help message and exit
The from subcommand supports the following command line options:
$ python pdf-title-page-splitter.py show from -h
usage: pdf-title-page-splitter.py show from [-h] [-l LOAD_FROM_FILE] [-s SAVE_PATH]
options:
-h, --help show this help message and exit
-l, --load-from-file LOAD_FROM_FILE
Load title pages and pdf from file (default: titles.json)
-s, --save-path SAVE_PATH
Save path (default: splits.json)
Here, the -l option loads the predicted title page data generated by the predict command.
The -s option specifies the file (splits.json by default) in which the title pages that remain after the user has accepted, rejected and/or substituted title pages are stored. This is the file used to split the pdf files. Note that splits.json has the same format as titles.json.
For each title page, the user is presented with a window showing that page. The user can take the following actions:
- -> (right arrow key): moves to next title page (current page is retained)
- <- (left arrow key): moves to previous title page (current page is retained)
- x: delete current title page (during split this will be treated as non title page)
- r: replace current title page (this will enter user into page replacement mode)
- n: moves to next pdf file (if there is no next file - you will be asked if you want to save changes)
- s: save and quit (all changes are saved into file specified using -s command line option)
- q: quit without saving (no changes are saved)
In replacement mode (entered with the r key above), the following actions are supported:
- -> (right arrow key): moves to next page
- <- (left arrow key): moves to previous page
- s: save current page as replacement page for the title page
- q: quit page replacement mode (the same title page is retained - you will be dropped to same title page - which can be retained, rejected or substituted again, if desired)
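The effect of a review session on one pdf can be thought of as a set of per-page decisions applied to the predicted list. The sketch below models only that outcome, not the UI or the tool's internals; apply_review and the decisions mapping are hypothetical names introduced for illustration.

```python
def apply_review(title_pages, decisions):
    """Apply review decisions to the predicted title pages of one pdf.

    decisions maps a predicted page number to "keep", "delete", or an
    integer replacement page. Pages without a decision are retained.
    """
    reviewed = []
    for page in title_pages:
        decision = decisions.get(page, "keep")
        if decision == "delete":
            continue  # rejected: treated as a non title page when splitting
        reviewed.append(page if decision == "keep" else int(decision))
    # Sort and de-duplicate so the result can be used directly for splitting.
    return sorted(set(reviewed))


predicted = [5, 77, 149, 261]
decisions = {77: "delete", 149: 151}
reviewed = apply_review(predicted, decisions)
print(reviewed)  # [5, 151, 261]
```

In this example page 77 is rejected, page 149 is substituted by page 151, and the remaining pages are kept unchanged.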
(Screenshots: Title Page, Wrong Title Page, Replacement Page)
As a final step, you can proceed to split the pdf files. Up to this point, no actual pdf files have been written.
The supported command line options are as follows:
$ python pdf-title-page-splitter.py split -h
usage: pdf-title-page-splitter.py split [-h] {run,from} ...
positional arguments:
{run,from} Available sub commands
run Run predict and split pages
from Load saved data from file and show pages
options:
-h, --help show this help message and exit
The run subcommand executes predict and then splits the files, without the show step.
The from subcommand splits the pdf files based on the splits.json file.
The from subcommand supports the following options:
$ python pdf-title-page-splitter.py split from -h
usage: pdf-title-page-splitter.py split from [-h] [-l LOAD_FROM_FILE] [--force] [--move-original-to MOVE_ORIGINAL_TO]
[--split-destination SPLIT_DESTINATION] [--noop] [--move-singles MOVE_SINGLES]
options:
-h, --help show this help message and exit
-l, --load-from-file LOAD_FROM_FILE
Load title pages and pdf from file (default: splits.json)
--force Force overwriting of split files (default skips file if it exists)
--move-original-to MOVE_ORIGINAL_TO
Post split move the file to specified diretory (default do not move)
--split-destination SPLIT_DESTINATION
Destination directory for split files (default same as source file)
--noop Make no actual changes (default make changes)
--move-singles MOVE_SINGLES
Move files that contain only single title page and that too as first page into the specified
directory (default is to not move)
python3 pdf-title-page-splitter.py split \
from \
--move-original-to splitted \
--split-destination splits \
--move-singles splits \
--load-from-file splits.json
After running the command, the output would look like this:
$ ls -1 splits
'Economic & Political Weekly June 16-23 1900: Vol 25 24-25-economicpoliticalweekly_june16231900_25_2425.pdf'
Economic.And.Political.Weekly.Vol-XXV.No-27_split_0000.pdf
Economic.And.Political.Weekly.Vol-XXV.No-27_split_0001.pdf
Economic.And.Political.Weekly.Vol-XXV.No-27_split_0002.pdf
Economic.And.Political.Weekly.Vol-XXV.No-27_split_0003.pdf
Economic.And.Political.Weekly.Vol-XXV.No-27_split_0004.pdf
Economic.And.Political.Weekly.Vol-XXV.No-27_split_0005.pdf
Economic.And.Political.Weekly.Vol-XXV.No-27_split_0006.pdf
Economic.And.Political.Weekly.Vol-XXV.No-27_split_0007.pdf
Economic.And.Political.Weekly.Vol-XXV.No-27_split_0008.pdf
Economic.And.Political.Weekly.Vol-XXV.No-27_split_0009.pdf
Economic.And.Political.Weekly.Vol-XXV.No-27_split_0010.pdf
Economic.And.Political.Weekly.Vol-XXV.No-27_split_0011.pdf
Economic.And.Political.Weekly.Vol-XXV.No-27_split_0012.pdf
Economic.And.Political.Weekly.Vol-XXV_split_0001.pdf
Economic.And.Political.Weekly.Vol-XXV_split_0002.pdf
Economic.And.Political.Weekly.Vol-XXV_split_0003.pdf
Economic.And.Political.Weekly.Vol-XXV_split_0004.pdf
Economic.And.Political.Weekly.Vol-XXV_split_0005.pdf
Economic.And.Political.Weekly.Vol-XXV_split_0006.pdf
Economic.And.Political.Weekly.Vol-XXV_split_0007.pdf
Economic.And.Political.Weekly.Vol-XXV_split_0008.pdf
Economic.And.Political.Weekly.Vol-XXV_split_0009.pdf
Economic.And.Political.Weekly.Vol-XXV_split_0010.pdf
Economic.And.Political.Weekly.Vol-XXV_split_0011.pdf
Economic.And.Political.Weekly.Vol-XXV_split_0012.pdf
Note that in the above output, files named _0000.pdf are created from page number 1 up to the first title page. If the first title page of a file is page number 1, there is no _0000.pdf file. Also, if a file has no splits defined, it is kept as is.
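The numbering rule can be illustrated with a short sketch that turns a list of title pages into page ranges and output names. This is not the tool's implementation. It assumes 1-based inclusive page numbers, that the _0000 file ends on the page just before the first title page, and that each later split runs up to the page before the next title page; split_ranges and split_name are hypothetical helpers.

```python
def split_ranges(title_pages, total_pages):
    """Return (split_index, first_page, last_page) tuples for one pdf."""
    pages = sorted(set(title_pages))
    if not pages:
        return []  # no splits defined: the file is kept as is
    ranges = []
    if pages[0] > 1:
        # Pages before the first title page go into split 0000.
        ranges.append((0, 1, pages[0] - 1))
    for i, start in enumerate(pages, start=1):
        # pages[i] is the next title page because i is offset by one.
        end = pages[i] - 1 if i < len(pages) else total_pages
        ranges.append((i, start, end))
    return ranges


def split_name(stem, index):
    """Name a split following the pattern seen in the listing above."""
    return f"{stem}_split_{index:04d}.pdf"


for index, first, last in split_ranges([5, 77, 149], total_pages=200):
    print(split_name("example", index), first, last)
```

With title pages 5, 77 and 149 in a 200-page pdf, this yields four ranges starting with example_split_0000.pdf for pages 1 to 4. If the first title page were page 1, the list would start at index 1 instead.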