multi-file-renamer renames multiple files using spaCy rule-based matching or trained models.
To give you a taste, here is what can be achieved with a single command line invocation:
| Original Name | New Name |
|---|---|
| 8139 Modern Review Volno-39(1926)-inernetdli2015467807.pdf | The_Modern_Review_,Volume_039,(1926).pdf |
| dli.bengal.10689.11376-THE MODERN REVIEW VOL.122(JULY-DECEMBER)1967-dlibengal1068911376.pdf | The_Modern_Review_,Volume_122,(1967),(July-December).pdf |
| in.ernet.dli.2015.114056-The Modern Review Vol Lxxi-inernetdli2015114056.pdf | The_Modern_Review_,Volume_071.pdf |
| Modern Review 1947-03: Vol 1 Iss 1 -modernreviewalcia_194703_1_1.pdf | The_Modern_Review_,Volume_001,No_1,(1947),(March).pdf |
| Modern Review Summer 1949: Vol 3 Iss 1 -modernreviewalcia_summer1949_3_1.pdf | The_Modern_Review_,Volume_003,No_1,(1949).pdf |
| THE MODERN REVIEW VOL.113(JANUARY-JUNE)1963-dlibengal1068916872.pdf | The_Modern_Review_,Volume_113,(1963),(January-June).pdf |
As can be seen above, multi-file-renamer is able to rename files with very different naming structures and conventions into a consistent naming pattern. Depending upon user input, multi-file-renamer extracts the relevant information from the original file name and uses it to construct a new file name.
For renaming files using rule-based matching, the user needs to supply spaCy match rules that can be used to extract the relevant information, input-output mapping rules and a template pattern.
For renaming files using a trained model, a model that identifies named entities is created first. The model is created by generating named entities for a relatively small set of file names. Once the model has been created, it can be used to predict new file names using user-supplied input-output mapping rules and a template pattern.
multi-file-renamer requires Python 3 and the other dependencies listed in the requirements.txt file.
Before running multi-file-renamer the Python environment needs to be set up correctly. Here we create a Python virtual environment and install all the dependencies. The instructions are given for Linux, but they should be identical for any UNIX-like operating system.
Change to the folder/directory containing multi-file-renamer (the one containing requirements.txt) and run the following commands:
```sh
python -m venv venv
. venv/bin/activate
pip install -r requirements.txt
```
Creating the virtual environment and installing the dependencies is a one-time process. In subsequent runs you just need to activate the virtual environment:
```sh
. venv/bin/activate
```
To deactivate the virtual environment run the command `deactivate`.
multi-file-renamer works in steps. As a first step you generate the new file names. Next, you verify that the names are correct (or even fix them by hand, if required) by checking the file_names.json file. Finally, you rename the files as the last step.
For generating the new file names there are two options:
- Using rule-based matching: new file names are extracted using predefined static rules specified in the file patterns.yaml.
- Using a trained model: a model is trained first, with rule-based matching used to generate the training data. Once the training data has been generated, the next step is to create a model, which can then be used any number of times to predict new file names.
The user needs to create a patterns.yaml file similar to the one in the samples directory. This file contains patterns, written in the spaCy match rules syntax, that are used to extract named entities. Additionally it contains input-output mapping rules that lead to the generation of the file name attribute dict. The file name attribute dict, together with the user-supplied template, is used to generate the new file names.
This is the flow in diagram form:
```mermaid
graph TD;
    A["**Original**<br><small>file name</small>"]-->C["Named Entities"];
    B["**patterns** from patterns.yaml file"]-->C;
    C-->F["file name<br>*attribute dict*"];
    E["**input**-**output**<br>mapping rules<br>from patterns.yaml"]-->F;
    F-->H;
    G["**template**<br>from command line"]-->H["**New**<br><small>file name</small>"];
```
More information about this is available in the Configuration section of this page.
In this step we create a file (by default file_names.json) that contains the mapping between old file names and new file names. The user supplies a patterns.yaml file that is used to identify named entities, together with input-output mapping rules that help extract the relevant data from the named entities. Additionally the user supplies a template which is used to generate the new file names. The template can contain Jinja code, allowing for conditional formatting, and placeholder values are substituted using the file name attribute dict.
```sh
python multi-file-renamer.py \
    extract \
    -l samples/patterns.yaml \
    --excludes in.ernet.dli.2015. \
    -s file_names.json \
    -m volume \
    -t "The_Modern_Review{% if volume is defined %}_,Volume_{{'%03d'|format(volume|int)}}{% endif %}{% if number is defined %},No_{{number}}{% endif %}{% if year is defined %},({{year}}){% endif %}{% if month is defined %},({{month}}){% endif %}.pdf" \
    file1.pdf file2.pdf directory
```
In the above command,
- `-l` specifies the path of the patterns.yaml file.
- `--excludes` specifies substrings that are part of the original file names but should be ignored as they would interfere with rule matching. In the above example, a file name like in.ernet.dli.2015.114056-The Modern Review Vol Lxxi-inernetdli2015114056.pdf would lead to 2015 being identified as the year (which is actually just the file scan year). Excluding the substring prevents that part of the file name from being processed.
- `-s` specifies the file where the original to new file name mapping should be stored.
- `-m` specifies attribute names which are considered mandatory; if they are not found, no new file name is generated at all.
- `-t` specifies the file name template used to generate the new file name. It supports the Jinja templating syntax.
- The last argument is the list of files or directories to be renamed. Note that if you provide directories they will be processed recursively.
Once the above command is executed, it will generate the file file_names.json.
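The file is simply a mapping from original to new file names. The exact structure may differ from this sketch (treat the snippet below as an illustration of the idea rather than the exact schema), but conceptually it contains entries like:

```json
{
  "8139 Modern Review Volno-39(1926)-inernetdli2015467807.pdf": "The_Modern_Review_,Volume_039,(1926).pdf",
  "in.ernet.dli.2015.114056-The Modern Review Vol Lxxi-inernetdli2015114056.pdf": "The_Modern_Review_,Volume_071.pdf"
}
```

You can inspect (and, if needed, hand-edit) this file before performing the actual rename.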
To help you understand, here is an original file name => file name attribute dict => new file name table for an example run with patterns.yaml:
| Original Name | File name attribute dict | New Name |
|---|---|---|
| 8139 Modern Review Volno-39(1926)-inernetdli2015467807.pdf | { "volume": "39", "year": "1926" } | The_Modern_Review_,Volume_039,(1926).pdf |
| dli.bengal.10689.11376-THE MODERN REVIEW VOL.122(JULY-DECEMBER)1967-dlibengal1068911376.pdf | { "volume": "122", "year": "1967", "month": "July-December" } | The_Modern_Review_,Volume_122,(1967),(July-December).pdf |
| in.ernet.dli.2015.114056-The Modern Review Vol Lxxi-inernetdli2015114056.pdf | { "volume": "71" } | The_Modern_Review_,Volume_071.pdf |
| Modern Review 1947-03: Vol 1 Iss 1 -modernreviewalcia_194703_1_1.pdf | { "volume": "1", "year": "1947", "month": "March", "number": "1" } | The_Modern_Review_,Volume_001,No_1,(1947),(March).pdf |
| Modern Review Summer 1949: Vol 3 Iss 1 -modernreviewalcia_summer1949_3_1.pdf | { "volume": "3", "year": "1949", "number": "1" } | The_Modern_Review_,Volume_003,No_1,(1949).pdf |
| THE MODERN REVIEW VOL.113(JANUARY-JUNE)1963-dlibengal1068916872.pdf | { "volume": "113", "year": "1963", "month": "January-June" } | The_Modern_Review_,Volume_113,(1963),(January-June).pdf |
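To make the template step concrete, here is a minimal, self-contained sketch using plain Jinja2 (this is not multi-file-renamer's own code) showing how a file name attribute dict from the table above is rendered through the template supplied with `-t`:

```python
# Minimal sketch: render a file name attribute dict through the Jinja template.
from jinja2 import Template

template = Template(
    "The_Modern_Review"
    "{% if volume is defined %}_,Volume_{{'%03d'|format(volume|int)}}{% endif %}"
    "{% if number is defined %},No_{{number}}{% endif %}"
    "{% if year is defined %},({{year}}){% endif %}"
    "{% if month is defined %},({{month}}){% endif %}.pdf"
)

attrs = {"volume": "39", "year": "1926"}  # file name attribute dict from the first row
print(template.render(**attrs))           # -> The_Modern_Review_,Volume_039,(1926).pdf
```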
The following diagram explains the entire process:
```mermaid
graph TD;
    subgraph "Training model"
        A["**Original**<br><small>file name</small>"]-->C["Named Entities"];
        B["**patterns** from patterns.yaml file"]-->C;
        C-->D["**Training data**<br>saved to file"];
    end
    subgraph "Predicting file names"
        D-->G["file name<br>*attribute dict*"];
        E["**unseen**<br><small>file name</small>"]-->G;
        F["**input**-**output**<br>mapping rules<br>from patterns.yaml"]-->G;
        G-->I["**New**<br><small>file name</small>"];
        H["**template**<br>from command line"]-->I;
    end
```
First, generate the training and test data:
```sh
python multi-file-renamer.py \
    generate \
    -l patterns.yaml \
    --excludes in.ernet.dli.2015. \
    --training-save-path train_data.spacy \
    --testing-save-path train_data_dev.spacy \
    file1.pdf file2.pdf directory
```
In the above command,
- `--training-save-path` specifies the path where the training data is saved.
- `--testing-save-path` specifies the path where the test data is saved.
Run the following commands to train the model and save it in the ./output directory:
```sh
python -m spacy init config ./config.cfg --lang en --pipeline ner
python -m spacy train ./config.cfg --output ./output --paths.train ./train_data.spacy --paths.dev ./train_data_dev.spacy
```
Once the model has been trained, it can be used to predict the new file names:
```sh
python multi-file-renamer.py \
    predict \
    --model output/model-best \
    -l patterns.yaml \
    --excludes in.ernet.dli.2015. \
    -m volume \
    -p "The_Modern_Review{% if volume is defined %}_,Volume_{{'%03d'|format(volume|int)}}{% endif %}{% if number is defined %},No_{{number}}{% endif %}{% if year is defined %},({{year}}){% endif %}{% if month is defined %},({{month}}){% endif %}.pdf" \
    file1.pdf file2.pdf directory
```
In the above command, `--model` specifies the location of the model. The other options are explained in the sections above.
```sh
python multi-file-renamer.py \
    rename from \
    -l file_names.json \
    -s restore_data.json
```
In the above command,
- `-l` specifies the path of the file containing the old to new file name mapping.
- `-s` specifies the path of the file that will contain the restoration data.
Note that renaming takes into account the existence of another file with the same name, and will append a suffix like -1 to make the new name unique.
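A minimal sketch of what that uniqueness check amounts to (illustrative only, assuming a simple numeric-suffix scheme; not the tool's actual code):

```python
# Illustrative only: pick a name that does not collide with an existing file.
from pathlib import Path

def unique_path(target: Path) -> Path:
    """Return target unchanged, or with a -1, -2, ... suffix if it already exists."""
    candidate, n = target, 1
    while candidate.exists():
        candidate = target.with_name(f"{target.stem}-{n}{target.suffix}")
        n += 1
    return candidate

print(unique_path(Path("The_Modern_Review_,Volume_001.pdf")))
```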
The patterns file is a YAML file. It has the following structure:
```yaml
{{match_entity_label}}:
  input: {{input_rules}}
  output: {{output_rules}}
  patterns: {{spacy_match_patterns}}
```
In the above, match_entity_label is the label assigned to the recognized entity using the pattern spacy_match_patterns during the NER (Named Entity Recognition) phase. spacy_match_patterns is a list of patterns as specified by the spaCy rule-based matching. A sample file is available here.
input_rules and output_rules specify what output should be produced from a given named entity.
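As a rough, hypothetical illustration only (the exact keys and pattern syntax should be checked against the sample file; the entry below is not taken from it), a single entry could look something like this, using the input and output rule types described in the tables that follow:

```yaml
# Hypothetical entry: match "vol 46" or "volume xvii" and emit a "volume" attribute.
volume:
  input:
    type: single                 # take the text of one token from the matched Span
    index: end                   # assumed: the token carrying the value
  output:
    type: single
    index: volume                # key in the file name attribute dict / template
    handler: convert_roman_nums  # convert roman numerals such as xvii to 17
  patterns:
    - - LOWER: {IN: ["vol", "volume"]}
      - TEXT: {REGEX: "^[0-9ivxlcdmIVXLCDM]+$"}
```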
| Input Type | Fields | Value and type | Description |
|---|---|---|---|
| single | index | an int, or keyword start, or keyword end | Returns the text at a specific index of the matched Span. Equivalent to span.doc[index].text, where:<br>- keyword start is the same as span.start<br>- keyword end is the same as span.end |
| all | N/A | N/A | Returns all the text of the matched Span. Equivalent to span.text |
| distinct | indexes | a list where each item is either an int, or keyword start, or keyword end | Returns a list of items, each of which is the text at a specific index of the matched Span. Equivalent to [span.doc[i].text for i in indexes], where:<br>- keyword start is the same as span.start<br>- keyword end is the same as span.end<br>- a positive number means an offset from span.start<br>- a negative number means an offset from span.end |
| multi | start<br>end | both can be either an int, or keyword start, or keyword end | Returns a list of items, each of which is the text at an index between start (inclusive) and end (exclusive) of the matched Span. Equivalent to [span.doc[i].text for i in range(start, end)], where:<br>- keyword start is the same as span.start<br>- keyword end is the same as span.end |
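The rules above refer to spaCy Span and Doc indexing. As a quick refresher (plain spaCy, independent of multi-file-renamer's own code):

```python
# Quick refresher on spaCy Span/Doc indexing used by the input rules above.
import spacy

nlp = spacy.blank("en")                 # a tokenizer-only pipeline is enough here
doc = nlp("modern review volume 46 1926")
span = doc[2:4]                         # the tokens "volume 46"

print(span.start, span.end)             # 2 4 (end is exclusive)
print(span.text)                        # "volume 46"  -> what an `all` rule returns
print(doc[span.start].text)             # "volume"     -> text at a specific index
print([doc[i].text for i in range(span.start, span.end)])  # ["volume", "46"]
```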
For the following input rule:
```yaml
type: single
index: end
```
the table below summarizes the behaviour:
| Input | Output |
|---|---|
| [ "volume", "46" ] | "46" |
| [ "vol", "xvii" ] | "xvii" |
For the following input rule:
```yaml
type: all
```
the table below summarizes the behaviour:
| Input | Output |
|---|---|
| [ "volume", "46" ] | "volume 46" |
| [ "vol", "xvii" ] | "vol xvii" |
For the following input rule:
```yaml
type: distinct
indexes:
  - start
  - end
```
the table below summarizes the behaviour:
| Input | Output |
|---|---|
| [ "apr", "to", "nov" ] | [ "apr", "nov" ] |
| [ "may", "jun" ] | [ "may", "jun" ] |
For the following input rule:
```yaml
type: multi
start: 1
end: end
```
the table below summarizes the behaviour:
| Input | Output |
|---|---|
| [ "a", "b", "c", "d" ] | [ "b", "c", "d" ] |
| [ "a", "b" ] | [ "b" ] |
| Output Type | Fields | Mandatory | Description |
|---|---|---|---|
| type | enum | always | One of:<br>- single<br>- multi<br>single means the result is a dict with a single key: value pair; multi means the result is a dict of multiple key: value pairs. |
| index | str | only if type is single | This is the key of the result dictionary. In other words, this is the name that can be referred to in the file name template. |
| outputs | list of output rules | only if type is multi | Each item is an output rule which is applied against the input. If the input is a Python list or tuple, each output rule is applied against the corresponding input item. If the input is a single text, each rule is applied against the whole input. |
| handler | enum | no | This selects a special handler function (already implemented in the Python code) that produces the output as per the supplied args. Currently supported handlers are:<br>- convert_roman_nums<br>- date<br>- joiner |
| args | dict | only if a handler is defined and it requires arguments | The arguments passed to the handler; see the handler descriptions below. |
Example when type is single
For the following output rule:
```yaml
type: single
index: year
```
the table below summarizes the behaviour:
| Input | Output |
|---|---|
| 2014 | { "year": "2014" } |
| 1990 | { "year": "1990" } |
Example when type is multi
For the following output rule:
```yaml
type: multi
outputs:
  - type: single
    index: year
    handler: date
    args:
      format: "%Y"
  - type: single
    index: month
    handler: date
    args:
      format: "%B"
```
the table below summarizes the behaviour:
| Input | Output |
|---|---|
| 2014-04 | { "year": "2014", "month": "April" } |
| 1990-2 | { "year": "1990", "month": "February" } |
Handlers are specialized functions that convert a given input into the desired output. A handler receives the input text and can optionally take additional arguments.
This handler converts input roman numerals, for example xvi, to their corresponding Hindu-Arabic (decimal) value, viz. 16. If the input is not a roman numeral, it is left unchanged.
Example
For the following output rule:
```yaml
type: single
index: volume
handler: convert_roman_nums
```
the table below summarizes the behaviour:
| Input | Output |
|---|---|
| 14 | { "volume": "14" } |
| cxiv | { "volume": "114" } |
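For intuition, the conversion is ordinary roman numeral arithmetic; below is a small stand-alone sketch (not the handler's actual implementation):

```python
# Illustrative roman-numeral conversion; non-roman input is returned unchanged.
ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

def roman_to_int(text: str) -> str:
    if not text or any(ch not in ROMAN for ch in text.lower()):
        return text                       # not a roman numeral: leave unchanged
    values = [ROMAN[ch] for ch in text.lower()]
    total = sum(-v if i + 1 < len(values) and v < values[i + 1] else v
                for i, v in enumerate(values))
    return str(total)

print(roman_to_int("xvii"))  # 17
print(roman_to_int("cxiv"))  # 114
print(roman_to_int("14"))    # "14" (left unchanged)
```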
This handler converts the input into a date format supported by strftime. This handler takes the following arguments:
| Argument | Type | Mandatory | Description |
|---|---|---|---|
| format | str | yes | This parameter specifies the date formatting to use. More specifically, it is the same as the format parameter of Python's strftime. |
Example
For the following output rule:
```yaml
type: single
index: month
handler: date
args:
  format: "%B"
```
the table below summarizes the behaviour:
| Input | Output |
|---|---|
| Dec | { "month": "December" } |
| december | { "month": "December" } |
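Conceptually this is date parsing followed by strftime formatting. One way to reproduce the behaviour above outside the tool (this sketch assumes python-dateutil, which multi-file-renamer may or may not use internally) is:

```python
# Illustrative only: parse loose month strings and reformat them with strftime.
from datetime import datetime
from dateutil import parser

DEFAULT = datetime(2000, 1, 1)  # anchor date so lone month names parse deterministically
for raw in ("Dec", "december"):
    print(parser.parse(raw, default=DEFAULT).strftime("%B"))  # December, December
```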
This handler converts an input which is a Python list or tuple into a single string, joining the items with the supplied separator. This handler takes the following arguments:
| Argument | Type | Mandatory | Description |
|---|---|---|---|
| separator | str | yes | The separator string used to join the items. |
| outputs | list of output rules | no | A list of output rules applied to each item of the input list before joining them together. See the example below. |
| exclusions | list of str | no | This is a list of exclusion values. These values will not be considered for output. |
Remember, the joiner handler requires the input to be a Python tuple or list.
Example
For the following output rule:
```yaml
type: single
index: number
handler: joiner
args:
  separator: "-"
  exclusions: [":", "to"]
```
the table below summarizes the behaviour:
| Input | Output |
|---|---|
[ "1", "to", "6" ] |
{ "number": "1-6" } |
[ "2", "5" ] |
{ "number": "2-5" } |
Another Example
For the following output rule:
```yaml
type: single
index: month
handler: joiner
args:
  separator: "-"
  outputs:
    - type: single
      handler: date
      args:
        format: "%B"
    - type: single
      handler: date
      args:
        format: "%B"
```
the table below summarizes the behaviour:
| Input | Output |
|---|---|
[ "jan", "apr" ] |
{ "month": "January-April" } |
[ "july", "october" ] |
{ "month": "July-October" } |
Note that in this case the joiner handler calls the date handler internally.

