This repository demonstrates a robust workflow for executing distributed graph sampling using Apache Spark. It's designed to process large graphs stored in Parquet format, making it suitable for benchmarking and research in distributed graph processing.
The project uses Docker to create a self-contained Spark cluster. The setup includes:
- Spark Master: Manages the cluster.
- Two Spark Workers: Execute distributed tasks.
- Shared Data Volume: A persistent, shared directory for input and output data.
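A minimal `docker-compose.yml` matching this layout might look like the following sketch. The service names, the Bitnami-style `SPARK_MODE`/`SPARK_MASTER_URL` environment variables, and the `./data` mount point are assumptions based on the commands shown below; your actual file may differ:

```yaml
services:
  spark-master:
    build: .
    container_name: spark-master
    environment:
      - SPARK_MODE=master
    ports:
      - "7077:7077"   # Spark master RPC port
      - "8080:8080"   # Spark master web UI
    volumes:
      - ./data:/opt/spark/data   # shared data volume

  spark-worker-1:
    build: .
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    volumes:
      - ./data:/opt/spark/data

  spark-worker-2:
    build: .
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    volumes:
      - ./data:/opt/spark/data
```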
Any graph represented as two Parquet files (`vertices.parquet` and `edges.parquet`) can be placed in a directory under `./data` on your local machine, which is mounted into the Spark cluster.
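GraphFrames expects the vertices DataFrame to have an `id` column and the edges DataFrame to have `src` and `dst` columns referencing those ids. A pure-Python sketch of that schema contract (lists of dicts stand in for DataFrame rows; the real files are Parquet written by Spark):

```python
# Illustrates the schema GraphFrames requires:
# vertices carry a unique "id"; edges carry "src" and "dst"
# whose values must reference existing vertex ids.

def validate_graph(vertices, edges):
    """vertices/edges are lists of dicts standing in for DataFrame rows."""
    ids = {v["id"] for v in vertices}
    if len(ids) != len(vertices):
        raise ValueError("vertex ids must be unique")
    for e in edges:
        if e["src"] not in ids or e["dst"] not in ids:
            raise ValueError(f"dangling edge: {e}")
    return True

vertices = [{"id": "a", "name": "Alice"}, {"id": "b", "name": "Bob"}]
edges = [{"src": "a", "dst": "b", "relationship": "follows"}]
assert validate_graph(vertices, edges)
```

Any extra columns (names, weights, timestamps) are carried along untouched; only `id`, `src`, and `dst` are structurally required.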
Follow these steps to generate a sample graph, run the distributed sampling job, and view the results.
First, build the Docker images and start the services. The `--build` flag ensures your custom image with GraphFrames is used.
```bash
docker-compose down          # Stop and remove any previous containers
docker-compose up --build -d
```

Next, create a small test graph and store it in the shared data volume.
```bash
docker exec spark-master spark-submit \
  --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 \
  /opt/bitnami/spark/sample_graph_creation.py
```

After execution, the graph data will be located in the `./data/my_test_graph` directory on your local machine.
This command submits the main application to the Spark cluster. It loads the graph, performs the sampling, and saves the new, smaller graph to a new directory.
```bash
docker exec spark-master spark-submit \
  --master spark://spark-master:7077 \
  --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 \
  /opt/bitnami/spark/graph_sampler_benchmark_winput.py \
  /opt/spark/data/my_test_graph
```

The sampled graph will be saved to `./data/my_test_graph_sampled`.
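The core idea of the sampling step can be sketched in plain Python. The actual `graph_sampler_benchmark_winput.py` may use a different strategy; random-node sampling with an induced edge set is shown here as one common approach:

```python
import random

def sample_graph(vertices, edges, fraction=0.5, seed=42):
    """Random-node sampling: keep roughly `fraction` of the vertices,
    then keep only edges whose endpoints both survived (the induced
    subgraph). Deterministic for a given seed."""
    rng = random.Random(seed)
    kept = {v for v in vertices if rng.random() < fraction}
    kept_edges = [(s, d) for s, d in edges if s in kept and d in kept]
    return kept, kept_edges

vertices = list(range(100))
edges = [(i, (i + 1) % 100) for i in range(100)]  # a simple 100-node cycle
sampled_v, sampled_e = sample_graph(vertices, edges, fraction=0.3)
print(len(sampled_v), len(sampled_e))
```

In the distributed version the same idea applies per-partition: the vertex sample is broadcast (or joined) against the edges DataFrame so that only edges with both endpoints retained survive.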
To use your own graph, copy your `vertices.parquet` and `edges.parquet` files into a new directory inside the `./data` folder and update the path in the `spark-submit` command. For example, if your graph is in `./data/my_custom_graph`, the command would be:
```bash
docker exec spark-master spark-submit \
  --master spark://spark-master:7077 \
  --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 \
  /opt/bitnami/spark/graph_sampler_benchmark_winput.py \
  /opt/spark/data/my_custom_graph
```