This repository demonstrates a robust workflow for executing distributed graph sampling using Apache Spark. It's designed to process large graphs stored in Parquet format, making it suitable for benchmarking and research in distributed graph processing.
The project uses Docker to create a self-contained Spark cluster. The setup includes:
- Spark Master: Manages the cluster.
- Two Spark Workers: Execute distributed tasks.
- Shared Data Volume: A persistent, shared directory for input and output data.
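A minimal `docker-compose.yml` matching this layout might look like the following sketch. The service names, the Bitnami-style `SPARK_MODE`/`SPARK_MASTER_URL` environment variables, and the `./data` mount point are assumptions based on the commands shown below; your actual file may differ:

```yaml
services:
  spark-master:
    build: .
    container_name: spark-master
    environment:
      - SPARK_MODE=master
    ports:
      - "7077:7077"   # Spark master RPC port
      - "8080:8080"   # Spark master web UI
    volumes:
      - ./data:/opt/spark/data   # shared data volume

  spark-worker-1:
    build: .
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    volumes:
      - ./data:/opt/spark/data

  spark-worker-2:
    build: .
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    volumes:
      - ./data:/opt/spark/data
```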
Any graph represented as two Parquet files (`vertices.parquet` and `edges.parquet`) can be placed in a directory under `./data` on your local machine, which is mounted into the Spark cluster.
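GraphFrames expects the vertices DataFrame to have an `id` column and the edges DataFrame to have `src` and `dst` columns referencing those ids. A pure-Python sketch of that schema contract (lists of dicts stand in for DataFrame rows; the real files are Parquet written by Spark):

```python
# Illustrates the schema GraphFrames requires:
# vertices carry a unique "id"; edges carry "src" and "dst"
# whose values must reference existing vertex ids.

def validate_graph(vertices, edges):
    """vertices/edges are lists of dicts standing in for DataFrame rows."""
    ids = {v["id"] for v in vertices}
    if len(ids) != len(vertices):
        raise ValueError("vertex ids must be unique")
    for e in edges:
        if e["src"] not in ids or e["dst"] not in ids:
            raise ValueError(f"dangling edge: {e}")
    return True

vertices = [{"id": "a", "name": "Alice"}, {"id": "b", "name": "Bob"}]
edges = [{"src": "a", "dst": "b", "relationship": "follows"}]
assert validate_graph(vertices, edges)
```

Any extra columns (names, weights, timestamps) are carried along untouched; only `id`, `src`, and `dst` are structurally required.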
Follow these steps to generate a sample graph, run the distributed sampling job, and view the results.
First, build the Docker images and start the services. The `--build` flag ensures your custom image with GraphFrames is used.
```bash
docker-compose down          # Stop and remove any previous containers
docker-compose up --build -d
```

Next, create a small test graph and store it in the shared data volume.
```bash
docker exec spark-master spark-submit \
  --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 \
  /opt/bitnami/spark/sample_graph_creation.py
```

After execution, the graph data will be located in the `./data/my_test_graph` directory on your local machine.
This command submits the main application to the Spark cluster. It loads the graph, performs the sampling, and saves the new, smaller graph to a new directory.
```bash
docker exec spark-master spark-submit \
  --master spark://spark-master:7077 \
  --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 \
  /opt/bitnami/spark/graph_sampler_benchmark_winput.py \
  /opt/spark/data/my_test_graph
```

The sampled graph will be saved to `./data/my_test_graph_sampled`.
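The core idea of the sampling step can be sketched in plain Python. The actual `graph_sampler_benchmark_winput.py` may use a different strategy; random-node sampling with an induced edge set is shown here as one common approach:

```python
import random

def sample_graph(vertices, edges, fraction=0.5, seed=42):
    """Random-node sampling: keep roughly `fraction` of the vertices,
    then keep only edges whose endpoints both survived (the induced
    subgraph). Deterministic for a given seed."""
    rng = random.Random(seed)
    kept = {v for v in vertices if rng.random() < fraction}
    kept_edges = [(s, d) for s, d in edges if s in kept and d in kept]
    return kept, kept_edges

vertices = list(range(100))
edges = [(i, (i + 1) % 100) for i in range(100)]  # a simple 100-node cycle
sampled_v, sampled_e = sample_graph(vertices, edges, fraction=0.3)
print(len(sampled_v), len(sampled_e))
```

In the distributed version the same idea applies per-partition: the vertex sample is broadcast (or joined) against the edges DataFrame so that only edges with both endpoints retained survive.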
To use your own graph, copy your `vertices.parquet` and `edges.parquet` files into a new directory inside the `./data` folder and update the path in the `spark-submit` command. For example, if your graph is in `./data/my_custom_graph`, the command would be:
```bash
docker exec spark-master spark-submit \
  --master spark://spark-master:7077 \
  --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 \
  /opt/bitnami/spark/graph_sampler_benchmark_winput.py \
  /opt/spark/data/my_custom_graph
```