graph-massivizer/graph-sampling-distributed-demo

Distributed Graph Sampling with Apache Spark

This repository demonstrates a robust workflow for executing distributed graph sampling using Apache Spark. It's designed to process large graphs stored in Parquet format, making it suitable for benchmarking and research in distributed graph processing.


Setup

The project uses Docker to create a self-contained Spark cluster. The setup includes:

  • Spark Master: Manages the cluster.
  • Two Spark Workers: Execute distributed tasks.
  • Shared Data Volume: A persistent, shared directory for input and output data.

Any graph stored as two Parquet files (vertices.parquet and edges.parquet) can be placed in the ./data directory on your local machine; the shared volume makes it available to the Spark cluster.
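The actual compose file lives in this repository; as an illustration only, a cluster like the one described above is typically defined along these lines (the image name, ports, and volume paths below are assumptions, not the repository's exact configuration):

```yaml
services:
  spark-master:
    image: bitnami/spark:3.0        # assumed base image; the repo's build adds GraphFrames
    environment:
      - SPARK_MODE=master
    ports:
      - "8080:8080"                 # Spark master web UI
    volumes:
      - ./data:/opt/spark/data      # shared data volume

  spark-worker-1:
    image: bitnami/spark:3.0
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    volumes:
      - ./data:/opt/spark/data      # workers see the same data directory
```

A second worker block (spark-worker-2) would mirror spark-worker-1. The key design point is that every service mounts the same ./data directory, so files written by any executor appear on your local machine.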


Getting Started

Follow these steps to generate a sample graph, run the distributed sampling job, and view the results.

1. Build and Start the Cluster

First, build the Docker images and start the services. The --build flag ensures your custom image with GraphFrames is used.

docker-compose down # Stop and remove any previous containers
docker-compose up --build -d

2. Run the Graph Generation Script

This step creates a small test graph and stores it in the shared data volume.

docker exec spark-master spark-submit \
  --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 \
  /opt/bitnami/spark/sample_graph_creation.py

After execution, the graph data will be located in the ./data/my_test_graph directory on your local machine.
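GraphFrames expects the vertex table to carry an `id` column and the edge table `src`/`dst` columns referencing vertex ids. A minimal plain-Python sketch of that shape (the column names are the GraphFrames convention; the sample rows and the helper function are invented for illustration):

```python
# Vertices: each row needs an "id"; extra columns (e.g. "name") are optional.
vertices = [
    {"id": "a", "name": "Alice"},
    {"id": "b", "name": "Bob"},
    {"id": "c", "name": "Carol"},
]

# Edges: each row needs "src" and "dst" referencing vertex ids.
edges = [
    {"src": "a", "dst": "b"},
    {"src": "b", "dst": "c"},
]

def edges_are_consistent(vertices, edges):
    """Check that every edge endpoint refers to an existing vertex id."""
    ids = {v["id"] for v in vertices}
    return all(e["src"] in ids and e["dst"] in ids for e in edges)
```

If your own Parquet files follow this shape, they can be swapped in directly (see Customization below).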

3. Run the Distributed Sampling Job

This command submits the main application to the Spark cluster. It loads the graph, performs the sampling, and saves the resulting smaller graph to a separate directory.

docker exec spark-master spark-submit \
  --master spark://spark-master:7077 \
  --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 \
  /opt/bitnami/spark/graph_sampler_benchmark_winput.py \
  /opt/spark/data/my_test_graph

The sampled graph will be saved to ./data/my_test_graph_sampled.
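The exact sampling algorithm is implemented by the library; one common strategy it could correspond to is random node sampling with the induced edge set, sketched here in plain Python (the function and parameter names are illustrative, not the library's API):

```python
import random

def random_node_sample(vertices, edges, fraction, seed=0):
    """Keep roughly `fraction` of the vertices, then retain only the
    edges whose endpoints both survived (the induced subgraph)."""
    rng = random.Random(seed)
    kept = {v for v in vertices if rng.random() < fraction}
    kept_edges = [(s, d) for (s, d) in edges if s in kept and d in kept]
    return kept, kept_edges
```

On Spark, the same idea maps naturally to a filter on the vertex DataFrame followed by a semi-join on the edge DataFrame, which is what makes the approach scale to graphs that do not fit on one machine.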


Customization

To use your own graph, simply copy your vertices.parquet and edges.parquet files into a new directory inside the ./data folder and update the path in the spark-submit command. For example, if your graph is in ./data/my_custom_graph, your command would be:

docker exec spark-master spark-submit \
  --master spark://spark-master:7077 \
  --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 \
  /opt/bitnami/spark/graph_sampler_benchmark_winput.py \
  /opt/spark/data/my_custom_graph

About

A Spark cluster for distributed graph sampling. We demo the graph-sampling-distributed library.
