A parser and analyzer for Wikipedia database dumps, written in Rust.
- Generate the graph linking each article to its sources (= parents)
- Find the shortest path between two Wikipedia articles using a BFS
- Keep the most important vertices using algorithms such as PageRank (a minimal sketch is shown after this list)
- Render the graph using Gephi
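For the vertex-importance step, a plain power-iteration PageRank over an adjacency list could look roughly like the sketch below. This is only an illustration: the function name, damping factor, and iteration count are assumptions and do not come from this repository's code.

```rust
/// Minimal PageRank sketch: `graph[v]` lists the vertices that `v` points to.
/// The damping factor and iteration count are arbitrary illustrative choices,
/// and dangling vertices are simply skipped rather than handled properly.
fn page_rank(graph: &[Vec<usize>], iterations: usize, damping: f64) -> Vec<f64> {
    let n = graph.len();
    let mut ranks = vec![1.0 / n as f64; n];
    for _ in 0..iterations {
        let mut next = vec![(1.0 - damping) / n as f64; n];
        for (v, targets) in graph.iter().enumerate() {
            if targets.is_empty() {
                continue;
            }
            // Spread the damped rank of `v` evenly over its outgoing edges.
            let share = damping * ranks[v] / targets.len() as f64;
            for &t in targets {
                next[t] += share;
            }
        }
        ranks = next;
    }
    ranks
}
```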
Clone the repository:
`git clone git@github.com:rreemmii-dev/Wikipedia-Database-Parser-Analyzer.git`
`cd Wikipedia-Database-Parser-Analyzer`
Ensure you have `cargo` installed.
- Download a Wikipedia dump. A list can be found here, and a download tutorial here. The latest dump I used is English Wikipedia, 2025-06-01 (uncompressed size of 103 GB, containing more than 7 million parsed articles).
- Extract the dump.
- Set the `WIKI_PATH` constant in both src/main.rs and src/simple_main.rs to the dump file path, relative to `Cargo.toml` (see the example below).
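For instance, the constant could look like the following; the exact declaration and the dump file name are illustrative only and may differ from the repository:

```rust
// In src/main.rs and src/simple_main.rs (illustrative file name).
// The path is resolved relative to Cargo.toml.
const WIKI_PATH: &str = "dumps/enwiki-20250601-pages-articles.xml";
```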
src/simple_main.rs does the following:
- Generates required databases and saves them in files (using `generate_databases`, see #Provided tools/Generate and load link databases for more information).
- Loads databases from the generated files.
- Executes a BFS to find the shortest path from the `Wikipedia` article to the `Rust_(programming_language)` article (a minimal BFS sketch is shown below).
- Filters the graph to keep only vertices having more than 1000 children or parents.
- Exports the filtered graph as a CSV file.
It can be run using `cargo run --release --bin Example`.
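As an illustration of the shortest-path step, here is a minimal BFS sketch over a `Vec<Vec<usize>>` adjacency list (the graph type described below); the function name and signature are hypothetical and may differ from the code in src/graph_utils/vec_graph_utils.rs.

```rust
use std::collections::VecDeque;

/// Hypothetical sketch: return one shortest path (as article IDs) from `start`
/// to `target` on a `Vec<Vec<usize>>` adjacency list, or `None` if unreachable.
fn bfs_shortest_path(graph: &[Vec<usize>], start: usize, target: usize) -> Option<Vec<usize>> {
    let mut predecessor = vec![usize::MAX; graph.len()]; // usize::MAX = not visited yet
    let mut queue = VecDeque::from([start]);
    predecessor[start] = start;
    while let Some(vertex) = queue.pop_front() {
        if vertex == target {
            // Rebuild the path by walking the predecessor chain backwards.
            let mut path = vec![target];
            while *path.last().unwrap() != start {
                path.push(predecessor[*path.last().unwrap()]);
            }
            path.reverse();
            return Some(path);
        }
        for &next in &graph[vertex] {
            if predecessor[next] == usize::MAX {
                predecessor[next] = vertex;
                queue.push_back(next);
            }
        }
    }
    None
}
```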
Run the src/main.rs file using `cargo run --release`.
`generate_databases` in src/database/generate_database.rs generates the following databases:
- `graph`: The graph with each article pointing towards its sources, stored as an adjacency list. Each article is represented by an ID.
- `name_from_id`: The article name corresponding to the article ID.
Databases are saved in files, so you can run `generate_databases` once and for all.
`load_graph` and `load_name_from_id_and_id_from_name` in src/database/load_database.rs load the two previous databases, plus the `id_from_name` database.
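To give an idea of how the three databases relate to each other, here is an illustrative sketch; the type aliases and the helper below are hypothetical, and the on-disk format is not shown.

```rust
use std::collections::HashMap;

// Illustrative types only; the actual definitions in the repository may differ.
type VecGraph = Vec<Vec<usize>>;          // graph[id] = IDs of the article's sources (parents)
type NameFromId = Vec<String>;            // name_from_id[id] = article name
type IdFromName = HashMap<String, usize>; // reverse lookup of name_from_id

/// Hypothetical helper: rebuild `id_from_name` from `name_from_id`.
fn build_id_from_name(name_from_id: &NameFromId) -> IdFromName {
    name_from_id
        .iter()
        .enumerate()
        .map(|(id, name)| (name.clone(), id))
        .collect()
}
```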
There are 2 graph types, both being adjacency lists:
- `VecGraph` (`Vec<Vec<usize>>`) for computations on the whole graph. See src/graph_utils/vec_graph_utils.rs.
- `HashmapGraph` (`HashMap<u32, HashSet<u32>>`) if some vertices are to be removed from the graph. See src/graph_utils/hashmap_graph_utils.rs.
The reason for having 2 graph types is that access to `Vec` elements is much faster than access to `HashMap` elements, but items cannot be removed from a `Vec` while leaving the other indexes unchanged.
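For example, a conversion from `VecGraph` to `HashmapGraph` followed by a vertex removal could look like this sketch; the helper names are hypothetical and may not match src/graph_utils/hashmap_graph_utils.rs.

```rust
use std::collections::{HashMap, HashSet};

type VecGraph = Vec<Vec<usize>>;
type HashmapGraph = HashMap<u32, HashSet<u32>>;

/// Turn the dense adjacency list into a hash-map graph, so vertices can later
/// be removed without shifting the IDs of the remaining ones.
fn to_hashmap_graph(graph: &VecGraph) -> HashmapGraph {
    graph
        .iter()
        .enumerate()
        .map(|(id, parents)| (id as u32, parents.iter().map(|&p| p as u32).collect()))
        .collect()
}

/// Remove a vertex together with every edge pointing to it.
fn remove_vertex(graph: &mut HashmapGraph, vertex: u32) {
    graph.remove(&vertex);
    for parents in graph.values_mut() {
        parents.remove(&vertex);
    }
}
```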
To open the graph in Gephi, it has to be exported as a CSV file using `export_as_csv` in src/graph_utils/hashmap_graph_utils.rs. The graph is then stored as an adjacency list. It is advised to use `#` as the CSV delimiter (Wikipedia article names cannot contain `#`).

It is not recommended to export a graph having more than 10,000 vertices. See https://gephi.org/users/requirements/.
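As a rough idea of the export format, the sketch below writes each vertex followed by its neighbors on one `#`-separated line; this is an assumption about the layout, and the real `export_as_csv` may differ.

```rust
use std::collections::{HashMap, HashSet};
use std::fs::File;
use std::io::{BufWriter, Result, Write};

/// Sketch of a Gephi-compatible adjacency-list export using `#` as delimiter.
fn export_as_csv(
    graph: &HashMap<u32, HashSet<u32>>,
    name_from_id: &[String],
    path: &str,
) -> Result<()> {
    let mut out = BufWriter::new(File::create(path)?);
    for (&vertex, neighbors) in graph {
        // One line per vertex: its name, then the names of its neighbors.
        write!(out, "{}", name_from_id[vertex as usize])?;
        for &neighbor in neighbors {
            write!(out, "#{}", name_from_id[neighbor as usize])?;
        }
        writeln!(out)?;
    }
    Ok(())
}
```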
Here is some documentation about Wikipedia articles:
- Links parsing (a simplified extraction sketch is shown after this list):
  - `[[PageName]]` links: https://en.wikipedia.org/wiki/Help:Link#Wikilinks_(internal_links)
  - `{{section link|PageName}}` links: https://en.wikipedia.org/wiki/Help:Link#Section_linking_(anchors) and https://en.wikipedia.org/wiki/Template:Section_link
  - `{{multi-section link|PageName}}` links: https://en.wikipedia.org/wiki/Template:Multi-section_link
  - `<nowiki>` tag: https://en.wikipedia.org/wiki/Help:Wikitext#Nowiki
  - `<syntaxhighlight>` tag: https://en.wikipedia.org/wiki/Template:Syntaxhighlight
- Article names:
  - Namespaces (e.g. article names such as "Category:XXX" or "Help:XXX"): https://en.wikipedia.org/wiki/Wikipedia:Namespace
  - Name restrictions: https://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_(technical_restrictions) and https://en.wikipedia.org/wiki/Wikipedia:Page_name#Technical_restrictions_and_limitations
- Appendix and footnote sections: https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Layout#Standard_appendices_and_footers and https://en.wikipedia.org/wiki/Help:Footnotes
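To give a feel for what link parsing involves, here is a deliberately simplified sketch that extracts `[[PageName]]` targets from wikitext and skips namespaced titles. It ignores `<nowiki>`, `<syntaxhighlight>`, templates, and the other cases listed above, and none of these helper names come from the repository.

```rust
/// Very simplified sketch: collect `[[PageName]]` targets from a wikitext string.
/// It does not handle `<nowiki>`, `<syntaxhighlight>`, templates, or nested markup.
fn extract_wikilinks(wikitext: &str) -> Vec<String> {
    let mut links = Vec::new();
    let mut rest = wikitext;
    while let Some(start) = rest.find("[[") {
        rest = &rest[start + 2..];
        let Some(end) = rest.find("]]") else { break };
        // Keep only the link target: drop the "|display text" and "#section" parts.
        let target = rest[..end]
            .split(|c: char| c == '|' || c == '#')
            .next()
            .unwrap_or("")
            .trim();
        // Crudely skip namespaced titles such as "Category:X" or "Help:X".
        if !target.is_empty() && !target.contains(':') {
            links.push(target.to_string());
        }
        rest = &rest[end + 2..];
    }
    links
}
```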
Distributed under the MIT License. See `LICENSE.md`.