This project demonstrates Federated Learning (FL) with k-Means clustering using Scikit-learn. The aggregation follows the MiniBatch k-Means approach, where clients perform local training, and a central server aggregates their results to update global cluster centers.
Each FL round consists of:
- Local Training: clients initialize with the current global centers and train `MiniBatchKMeans` on their local data.
- Global Aggregation: the server collects cluster centers and counts from all clients, updates the global model, and redistributes it for the next round.

For initialization:
- Clients use k-means++ to propose initial cluster centers.
- The server aggregates these initial centers using a round of k-means to determine the global starting point.
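As a concrete illustration of the round described above, the sketch below shows one way a client update and the server's count-weighted aggregation of centers could look. The function names, array shapes, and exact weighting are assumptions for illustration only, not the project's actual NVFlare job code.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans


def local_round(X, global_centers):
    """Client side: start from the current global centers, fit MiniBatchKMeans
    on the local data, and return updated centers plus per-cluster sample counts."""
    n_clusters = global_centers.shape[0]
    km = MiniBatchKMeans(n_clusters=n_clusters, init=global_centers, n_init=1)
    km.fit(X)
    counts = np.bincount(km.labels_, minlength=n_clusters)
    return km.cluster_centers_, counts


def aggregate_centers(client_centers, client_counts):
    """Server side: count-weighted average of the clients' cluster centers."""
    centers = np.stack(client_centers)              # (n_clients, n_clusters, n_features)
    counts = np.stack(client_counts).astype(float)  # (n_clients, n_clusters)
    weights = counts / counts.sum(axis=0, keepdims=True)
    return np.einsum("ck,ckf->kf", weights, centers)  # (n_clusters, n_features)
```

In each round the server would call `aggregate_centers` on the tuples returned by the clients' `local_round` calls and broadcast the result back as the next global centers.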
DfAnalyzer is a library designed to capture provenance data, which includes:
- Prospective Provenance: the "recipe" of the trial, capturing configurations before execution.
- Retrospective Provenance: the execution record, tracking results such as data transformations and FL training steps.
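To make the two kinds of provenance concrete, the toy snippet below (independent of the dfa-lib-python API, which is not shown here) writes a prospective "recipe" before execution and appends a retrospective record afterwards; every file name and field is a placeholder.

```python
import json
import time

# Prospective provenance: the "recipe", written before the trial executes.
# (Placeholder fields; in this project the configuration is captured by dfa-lib-python.)
plan = {"n_clients": 2, "n_clusters": 2, "rounds": 5, "init": "k-means++"}
with open("prospective_plan.json", "w") as f:
    json.dump(plan, f, indent=2)

# Retrospective provenance: what actually happened, appended during/after execution.
record = {
    "client_id": "site-1",
    "round": 1,
    "silhouette_score": 0.42,  # placeholder value
    "finished_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
}
with open("retrospective_log.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```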
- Load the DfAnalyzer Docker image:
  ```
  docker pull nymeria0042/dfanalyzer
  ```
- Deploy the DfAnalyzer container:
  ```
  cd dfanalyzer && docker compose up dfanalyzer
  ```
- Ensure DfAnalyzer is running in the background before starting trials.
- Create and activate a virtualenv with Python 3.8 and install the requirements:
  ```
  virtualenv venv --python=3.8
  . venv/bin/activate
  pip install -r requirements.txt
  ```
- Install `dfa-lib-python`:
  ```
  cd dfanalyzer/dfa-lib/python && make install
  ```
- Run the prospective provenance script:
  ```
  python fed-clustering/utils/prospective_provenance.py
  ```
  This script is responsible for capturing and recording metadata about the design and configuration of the trials before the runs.
NVFlare is used to set up the federated learning infrastructure.
- Build the NVFlare image:
  ```
  docker build -t nvflare-service .
  ```
- From the `fed-clustering` folder, start a trial:
  ```
  source start_trial.sh
  ```
- If `versioning_control` is enabled in `utils/start_trial.py`, a new branch is created under the `trials/` folder, named with the user and the trial's start timestamp. A hash is generated and stored in `trial_info.json`, and a commit is created containing this hash. The hash can be used to query the provenance database for records related exclusively to that trial.
- If `versioning_control` is disabled, the trial runs in the current folder and branch without creating a new branch or commit. Nonetheless, a hash is still generated and stored in `trial_info.json`, enabling tracking and querying in the provenance database. A rough sketch of this hashing and branching step is shown right after these items.
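The exact behaviour is implemented in `utils/start_trial.py`; the snippet below is only a hypothetical illustration of the idea (user, start timestamp, hash, `trial_info.json`, optional branch and commit), not the project's actual code.

```python
import getpass
import hashlib
import json
import subprocess
from datetime import datetime
from pathlib import Path

# Hypothetical sketch of the versioning step; the real logic lives in
# utils/start_trial.py and may differ in names and details.
versioning_control = True  # illustrative flag

user = getpass.getuser()
started_at = datetime.now().strftime("%Y%m%d-%H%M%S")
trial_hash = hashlib.sha256(f"{user}-{started_at}".encode()).hexdigest()[:12]

# Store the hash so the trial can later be looked up in the provenance database.
trial_dir = Path("trials") / f"{user}-{started_at}" if versioning_control else Path(".")
trial_dir.mkdir(parents=True, exist_ok=True)
with open(trial_dir / "trial_info.json", "w") as f:
    json.dump({"user": user, "started_at": started_at, "trial_hash": trial_hash}, f, indent=2)

if versioning_control:
    # New branch named after the user and start timestamp, plus a commit holding the hash.
    subprocess.run(["git", "checkout", "-b", f"{user}-{started_at}"], check=True)
    subprocess.run(["git", "add", str(trial_dir / "trial_info.json")], check=True)
    subprocess.run(["git", "commit", "-m", f"Start trial {trial_hash}"], check=True)
```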
- Remember to reactivate the virtual environment, then prepare the data and the job configuration:
  ```
  source prepare_data.sh
  source prepare_job_config.sh
  ```
- Run:
  ```
  nvflare provision
  ```
  This creates a `workspace/fed_clustering` directory with the following structure:
  ```
  workspace
  └── fed_clustering
      ├── prod_01
      │   ├── admin@nvidia.com
      │   ├── server1
      │   ├── site-1
      │   ├── site-2
      │   ├── overseer
      │   └── compose.yaml
      └── resources
  ```
- Manually copy `workspace/fed_clustering/prod_00/.env` and `workspace/fed_clustering/prod_00/compose.yaml` to the new `workspace/fed_clustering/prod_01/` folder.
- Navigate to `prod_01` and launch the FL components:
  ```
  docker compose up
  ```
- Manually copy `jobs/sklearn_2_uniform` to `workspace/fed_clustering/prod_01/admin@nvidia.com/transfer/`.
- Create dataset folders inside the containers:
  ```
  docker exec -it site-1 mkdir -p /tmp/nvflare/dataset
  ```
- Copy the data to each client:
  ```
  docker cp /tmp/nvflare/dataset/des.csv site-1:/tmp/nvflare/dataset
  ```
- Repeat for all sites.
- Get the local IP address:
  ```
  hostname -I | awk '{print $1}'
  ```
- Add it to `/etc/hosts`:
  ```
  sudo vim /etc/hosts
  ```
  Add:
  ```
  {IP} server1 overseer
  ```
Inside `prod_01`, start the FL admin panel:
```
./admin@nvidia.com/startup/fl_admin.sh
```
Log in with:
```
admin@nvidia.com
```
Check the status of the server and clients, then submit the job:
```
check_status [server|client]
submit_job sklearn_kmeans_2_uniform
```
Connect to the provenance database (MonetDB) inside the DfAnalyzer container:
```
docker exec -it dfanalyzer mclient -u monetdb -d dataflow_analyzer
```
The default password is `monetdb`.

Then, we can submit queries, for example:
```
SELECT client_id, silhouette_score FROM iClientValidation WHERE trial_id = {hash_trial};
```
To save the results to files instead, run:
```
docker exec -it dfanalyzer mclient -u monetdb -d dataflow_analyzer -i save_results.sql
```
This creates a `results` folder inside the `dfanalyzer` directory with `.csv` files for each predefined provenance table.
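For programmatic access, the same query can also be issued from Python with the `pymonetdb` driver. This is only a sketch: it assumes MonetDB's default port (50000) is reachable from where the script runs and reuses the `iClientValidation` columns from the query above.

```python
import pymonetdb  # pip install pymonetdb, if not already available

# Connect to DfAnalyzer's MonetDB instance. Adjust hostname/port to match how
# the dfanalyzer container exposes MonetDB (the default port is 50000).
conn = pymonetdb.connect(
    username="monetdb",
    password="monetdb",
    hostname="localhost",
    database="dataflow_analyzer",
)
cur = conn.cursor()

trial_hash = "replace-with-the-hash-from-trial_info.json"
# Local, trusted query, so plain string formatting is enough for this sketch.
cur.execute(
    "SELECT client_id, silhouette_score "
    "FROM iClientValidation "
    f"WHERE trial_id = '{trial_hash}'"
)
for client_id, score in cur.fetchall():
    print(client_id, score)
conn.close()
```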
This project demonstrates federated k-Means clustering using NVFlare, Scikit-learn, and DfAnalyzer. Provenance data is captured throughout, ensuring transparency and reproducibility of FL trials.
This project builds upon NVIDIA FLARE.