This project demonstrates Federated Learning (FL) with k-Means clustering using Scikit-learn. The aggregation follows the MiniBatch k-Means approach, where clients perform local training, and a central server aggregates their results to update global cluster centers.
Each FL round consists of:
- Local Training: clients initialize with the current global centers and train `MiniBatchKMeans` on their local data.
- Global Aggregation: the server collects cluster centers and counts from all clients, updates the global model, and redistributes it for the next round.

For initialization:
- Clients use k-means++ to propose initial cluster centers.
- The server aggregates these initial centers using a round of k-means to determine the global starting point.
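As a concrete illustration of the round described above, the sketch below shows one way a client update and the server's count-weighted aggregation of centers could look. The function names, array shapes, and exact weighting are assumptions for illustration only, not the project's actual NVFlare job code.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans


def local_round(X, global_centers):
    """Client side: start from the current global centers, fit MiniBatchKMeans
    on the local data, and return updated centers plus per-cluster sample counts."""
    n_clusters = global_centers.shape[0]
    km = MiniBatchKMeans(n_clusters=n_clusters, init=global_centers, n_init=1)
    km.fit(X)
    counts = np.bincount(km.labels_, minlength=n_clusters)
    return km.cluster_centers_, counts


def aggregate_centers(client_centers, client_counts):
    """Server side: count-weighted average of the clients' cluster centers."""
    centers = np.stack(client_centers)              # (n_clients, n_clusters, n_features)
    counts = np.stack(client_counts).astype(float)  # (n_clients, n_clusters)
    weights = counts / counts.sum(axis=0, keepdims=True)
    return np.einsum("ck,ckf->kf", weights, centers)  # (n_clusters, n_features)
```

In each round the server would call `aggregate_centers` on the tuples returned by the clients' `local_round` calls and broadcast the result back as the next global centers.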
DfAnalyzer is a library designed to capture provenance data, which includes:
- Prospective Provenance: the "recipe" of the trial, capturing configurations before execution.
- Retrospective Provenance: the execution record, tracking results such as data transformations and FL training steps.
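To make the two kinds of provenance concrete, the toy snippet below (independent of the dfa-lib-python API, which is not shown here) writes a prospective "recipe" before execution and appends a retrospective record afterwards; every file name and field is a placeholder.

```python
import json
import time

# Prospective provenance: the "recipe", written before the trial executes.
# (Placeholder fields; in this project the configuration is captured by dfa-lib-python.)
plan = {"n_clients": 2, "n_clusters": 2, "rounds": 5, "init": "k-means++"}
with open("prospective_plan.json", "w") as f:
    json.dump(plan, f, indent=2)

# Retrospective provenance: what actually happened, appended during/after execution.
record = {
    "client_id": "site-1",
    "round": 1,
    "silhouette_score": 0.42,  # placeholder value
    "finished_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
}
with open("retrospective_log.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```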
- Load the DfAnalyzer Docker image:
  ```
  docker pull nymeria0042/dfanalyzer
  ```
- Deploy the DfAnalyzer container:
  ```
  cd dfanalyzer && docker compose up dfanalyzer
  ```
- Ensure DfAnalyzer is running in the background before starting trials.
- Create and activate a virtualenv with Python 3.8 and install the requirements:
  ```
  virtualenv venv --python=3.8
  . venv/bin/activate
  pip install -r requirements.txt
  ```
- Install `dfa-lib-python`:
  ```
  cd dfanalyzer/dfa-lib/python && make install
  ```
- Run the prospective provenance script:
  ```
  python fed-clustering/utils/prospective_provenance.py
  ```
  This script is responsible for capturing and recording metadata about the design and configuration of the trials before the runs.
NVFlare is used to set up the federated learning infrastructure.
- Build the NVFlare image:
  ```
  docker build -t nvflare-service .
  ```
- From the `fed-clustering` folder, start a trial:
  ```
  source start_trial.sh
  ```
- If `versioning_control` is enabled in `utils/start_trial.py`, a new branch is created under the `trials/` folder, named with the user and the trial's start timestamp. A hash is generated and stored in `trial_info.json`, and a commit is created containing this hash. The hash can be used to query the provenance database for records related exclusively to that trial.
- If `versioning_control` is disabled, the trial runs in the current folder and branch without creating a new branch or commit. Nonetheless, a hash is still generated and stored in `trial_info.json`, enabling tracking and querying in the provenance database. A rough sketch of this hashing and branching step is shown right after these items.
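The exact behaviour is implemented in `utils/start_trial.py`; the snippet below is only a hypothetical illustration of the idea (user, start timestamp, hash, `trial_info.json`, optional branch and commit), not the project's actual code.

```python
import getpass
import hashlib
import json
import subprocess
from datetime import datetime
from pathlib import Path

# Hypothetical sketch of the versioning step; the real logic lives in
# utils/start_trial.py and may differ in names and details.
versioning_control = True  # illustrative flag

user = getpass.getuser()
started_at = datetime.now().strftime("%Y%m%d-%H%M%S")
trial_hash = hashlib.sha256(f"{user}-{started_at}".encode()).hexdigest()[:12]

# Store the hash so the trial can later be looked up in the provenance database.
trial_dir = Path("trials") / f"{user}-{started_at}" if versioning_control else Path(".")
trial_dir.mkdir(parents=True, exist_ok=True)
with open(trial_dir / "trial_info.json", "w") as f:
    json.dump({"user": user, "started_at": started_at, "trial_hash": trial_hash}, f, indent=2)

if versioning_control:
    # New branch named after the user and start timestamp, plus a commit holding the hash.
    subprocess.run(["git", "checkout", "-b", f"{user}-{started_at}"], check=True)
    subprocess.run(["git", "add", str(trial_dir / "trial_info.json")], check=True)
    subprocess.run(["git", "commit", "-m", f"Start trial {trial_hash}"], check=True)
```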
- Remember to reactivate the virtual environment, then prepare the data and the job configuration:
  ```
  source prepare_data.sh
  source prepare_job_config.sh
  ```
- Run:
  ```
  nvflare provision
  ```
  This creates a `workspace/fed_clustering` directory with the following structure:
  ```
  workspace
  └── fed_clustering
      ├── prod_01
      │   ├── admin@nvidia.com
      │   ├── server1
      │   ├── site-1
      │   ├── site-2
      │   ├── overseer
      │   └── compose.yaml
      └── resources
  ```
- Manually copy `workspace/fed_clustering/prod_00/.env` and `workspace/fed_clustering/prod_00/compose.yaml` to the new `workspace/fed_clustering/prod_01/` folder.
- Navigate to `prod_01` and launch the FL components:
  ```
  docker compose up
  ```
- Manually copy `jobs/sklearn_2_uniform` to `workspace/fed_clustering/prod_01/admin@nvidia.com/transfer/`.
- Create dataset folders inside the containers:
  ```
  docker exec -it site-1 mkdir -p /tmp/nvflare/dataset
  ```
- Copy the data to each client:
  ```
  docker cp /tmp/nvflare/dataset/des.csv site-1:/tmp/nvflare/dataset
  ```
- Repeat for all sites.
- Get the local IP address:
  ```
  hostname -I | awk '{print $1}'
  ```
- Add it to `/etc/hosts`:
  ```
  sudo vim /etc/hosts
  ```
  Add:
  ```
  {IP} server1 overseer
  ```
Inside `prod_01`, start the FL admin panel:
```
./admin@nvidia.com/startup/fl_admin.sh
```
Log in with:
```
admin@nvidia.com
```
Check the status of the server and clients, then submit the job:
```
check_status [server|client]
submit_job sklearn_kmeans_2_uniform
```
Connect to the provenance database (MonetDB) inside the DfAnalyzer container:
```
docker exec -it dfanalyzer mclient -u monetdb -d dataflow_analyzer
```
The default password is `monetdb`.

Then, we can submit queries, for example:
```
SELECT client_id, silhouette_score FROM iClientValidation WHERE trial_id = {hash_trial};
```
To save the results to files instead, run:
```
docker exec -it dfanalyzer mclient -u monetdb -d dataflow_analyzer -i save_results.sql
```
This creates a `results` folder inside the `dfanalyzer` directory with `.csv` files for each predefined provenance table.
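For programmatic access, the same query can also be issued from Python with the `pymonetdb` driver. This is only a sketch: it assumes MonetDB's default port (50000) is reachable from where the script runs and reuses the `iClientValidation` columns from the query above.

```python
import pymonetdb  # pip install pymonetdb, if not already available

# Connect to DfAnalyzer's MonetDB instance. Adjust hostname/port to match how
# the dfanalyzer container exposes MonetDB (the default port is 50000).
conn = pymonetdb.connect(
    username="monetdb",
    password="monetdb",
    hostname="localhost",
    database="dataflow_analyzer",
)
cur = conn.cursor()

trial_hash = "replace-with-the-hash-from-trial_info.json"
# Local, trusted query, so plain string formatting is enough for this sketch.
cur.execute(
    "SELECT client_id, silhouette_score "
    "FROM iClientValidation "
    f"WHERE trial_id = '{trial_hash}'"
)
for client_id, score in cur.fetchall():
    print(client_id, score)
conn.close()
```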
This project demonstrates federated k-Means clustering using NVFlare, Scikit-learn, and DfAnalyzer. Provenance data is captured throughout, ensuring transparency and reproducibility of FL trials.
This project builds upon NVIDIA FLARE.