Wikipedia-Database-Parser-Analyzer

A parser and analyzer for Wikipedia database dumps, written in Rust.

  • Generate the graph linking each article to its sources (= parents)
  • Find the shortest path between two Wikipedia articles using the BFS algorithm
  • Keep the most important vertices using algorithms such as PageRank
  • Render the graph using Gephi
Cluster dealing with computers in a 10,000-vertex graph:
[Image: cluster about computers]
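
To give a concrete idea of the ranking step mentioned in the list above, here is a minimal PageRank power-iteration sketch over an adjacency list. It only illustrates the general idea and is not the repository's implementation; every name in it is invented for the example:

use std::collections::HashMap;

// Simplified PageRank by power iteration on an adjacency list mapping each
// vertex to the vertices it links to. Dangling vertices are simply skipped.
fn pagerank(links: &HashMap<u32, Vec<u32>>, damping: f64, iterations: usize) -> HashMap<u32, f64> {
    let n = links.len() as f64;
    let mut rank: HashMap<u32, f64> = links.keys().map(|&v| (v, 1.0 / n)).collect();
    for _ in 0..iterations {
        // Start each round from the teleportation term (1 - d) / n.
        let mut next: HashMap<u32, f64> = links.keys().map(|&v| (v, (1.0 - damping) / n)).collect();
        for (vertex, targets) in links {
            if targets.is_empty() {
                continue; // dangling vertex: ignored in this simplified sketch
            }
            let share = damping * rank[vertex] / targets.len() as f64;
            for target in targets {
                if let Some(r) = next.get_mut(target) {
                    *r += share;
                }
            }
        }
        rank = next;
    }
    rank
}

A few dozen iterations with a damping factor around 0.85 is the usual starting point for this kind of sketch.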


Getting started

Installation

# Clone the repository:
git clone git@github.com:rreemmii-dev/Wikipedia-Database-Parser-Analyzer.git

cd Wikipedia-Database-Parser-Analyzer

Ensure you have cargo installed.

Wikipedia dump download

  1. Download a Wikipedia dump. A list can be found here, and a download tutorial here. The latest dump I used is English Wikipedia, 2025-06-01 (uncompressed size of 103 GB, containing more than 7 million parsed articles).
  2. Extract the dump.
  3. Set the WIKI_PATH constant in both src/main.rs and src/simple_main.rs to the dump file path, relative to Cargo.toml.
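
For reference, the constant might end up looking like this (the file name below is only a placeholder; use the path of the dump you actually extracted):

// In both src/main.rs and src/simple_main.rs; the path is relative to Cargo.toml
const WIKI_PATH: &str = "dumps/enwiki-20250601-pages-articles-multistream.xml";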

Run

Run the simple example file:

src/simple_main.rs does the following:

  1. Generates required databases and saves them in files (using generate_databases, see #Provided tools/Generate and load link databases for more information).
  2. Loads databases from generated files.
  3. Executes a BFS to find the shortest path from the Wikipedia article to the Rust_(programming_language) article (a simplified BFS sketch is shown below).
  4. Filters the graph to keep only vertices having more than 1,000 children or parents.
  5. Exports the filtered graph as a CSV file.

It can be run using:

cargo run --release --bin Example
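
The shortest-path step boils down to a standard BFS over the adjacency list. The sketch below is only a self-contained illustration of that idea on a HashMap-based graph; it does not use the crate's actual functions or signatures:

use std::collections::{HashMap, VecDeque};

// Minimal BFS returning the shortest path from `start` to `goal` over an
// adjacency list, or None if `goal` is unreachable from `start`.
fn shortest_path(graph: &HashMap<u32, Vec<u32>>, start: u32, goal: u32) -> Option<Vec<u32>> {
    let mut predecessor: HashMap<u32, u32> = HashMap::new();
    let mut queue = VecDeque::from([start]);
    predecessor.insert(start, start);
    while let Some(current) = queue.pop_front() {
        if current == goal {
            // Rebuild the path by walking predecessors back to `start`.
            let mut path = vec![current];
            let mut node = current;
            while node != start {
                node = predecessor[&node];
                path.push(node);
            }
            path.reverse();
            return Some(path);
        }
        for &next in graph.get(&current).into_iter().flatten() {
            if !predecessor.contains_key(&next) {
                predecessor.insert(next, current);
                queue.push_back(next);
            }
        }
    }
    None
}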

Run the src/main.rs file

cargo run --release

Provided tools

Generate and load link databases

generate_databases in src/database/generate_database.rs generates the following databases:

  • graph: The graph with each article pointing towards its sources, stored as an adjacency list. Each article is represented by an ID.
  • name_from_id: The article name corresponding to the article ID.

Databases are saved in files, so you can run generate_databases once and for all.

load_graph and load_name_from_id_and_id_from_name in src/database/load_database.rs load the two previous databases, plus the id_from_name database.
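
As a rough mental model of what these databases hold (the crate's concrete types may differ), you can picture them as:

use std::collections::HashMap;

// Illustrative shapes only; not the crate's actual type definitions.
type Graph = HashMap<u32, Vec<u32>>;    // article ID -> IDs of its sources (parents)
type NameFromId = HashMap<u32, String>; // article ID -> article name
type IdFromName = HashMap<String, u32>; // article name -> article ID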

Different graph types with built-in functions

There are two graph types, both adjacency lists: one backed by a Vec (indexed by article ID) and one backed by a HashMap.

The reason for having two graph types is that access to Vec elements is much faster than access to HashMap elements, but items cannot be removed from a Vec while leaving the other indices unchanged.
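
In terms of shape, the trade-off looks roughly like this (illustrative type aliases, not the crate's actual definitions):

use std::collections::HashMap;

// Vec-backed adjacency list: the article ID is the index, so lookups are plain
// array reads, but removing a vertex would shift every index after it.
type VecGraph = Vec<Vec<u32>>;

// HashMap-backed adjacency list: slower lookups, but vertices can be removed
// freely (e.g. when filtering the graph) without invalidating remaining IDs.
type HashMapGraph = HashMap<u32, Vec<u32>>;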

Render graphs using Gephi

To open the graph in Gephi, it has to be exported as a CSV file using export_as_csv in src/graph_utils/hashmap_graph_utils.rs.

It is then stored as an adjacency list. It is advised to use # as the CSV delimiter (Wikipedia article names cannot contain #).

It is not recommended to export graphs having more than 10,000 vertices. See https://gephi.org/users/requirements/.
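
For reference, a hand-rolled export along those lines could look like the sketch below. It mirrors the adjacency-list CSV format described above with # as the delimiter, but it is not the crate's export_as_csv:

use std::collections::HashMap;
use std::fs::File;
use std::io::{BufWriter, Result, Write};

// Write a graph as an adjacency-list CSV: one line per vertex, in the form
// "name#neighbour1#neighbour2#...", using '#' as the delimiter.
fn write_adjacency_csv(graph: &HashMap<String, Vec<String>>, path: &str) -> Result<()> {
    let mut out = BufWriter::new(File::create(path)?);
    for (vertex, neighbours) in graph {
        write!(out, "{}", vertex)?;
        for neighbour in neighbours {
            write!(out, "#{}", neighbour)?;
        }
        writeln!(out)?;
    }
    Ok(())
}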

Wikipedia documentation

Here is some documentation about Wikipedia articles:

License

Distributed under the MIT License. See LICENSE.md
