
MagellaX

Changes Made

  • Updated kube/imagenet.yaml to request 4 GPUs per pod (nvidia.com/gpu: 4)
  • Modified PyTorch distributed training to launch 4 processes per node (--nproc_per_node=4)
  • Total training capacity increased from 3 GPUs to 12 GPUs (3 pods × 4 GPUs); a sketch of the combined manifest changes follows this list
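
For reference, here is a minimal sketch of how the two settings fit together in a pod spec. Only the nvidia.com/gpu: 4 limit and the --nproc_per_node=4 flag reflect the actual change; the surrounding structure, image name, and remaining arguments are illustrative assumptions, not the literal contents of kube/imagenet.yaml.

```yaml
# Illustrative sketch only -- structure, image, and extra args are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: imagenet-worker
spec:
  containers:
    - name: trainer
      image: <training-image>        # placeholder
      command:
        - python
        - -m
        - torch.distributed.run      # torchelastic-style launcher
        - --nproc_per_node=4         # one training process per GPU
        - main.py
      resources:
        limits:
          nvidia.com/gpu: 4          # request 4 GPUs for this pod
```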

Benefits

  • 4× the GPUs per pod (1 → 4) - better resource efficiency on multi-GPU nodes
  • Improved training speed - larger effective batch size and more parallel workers per node
  • Cost optimization - fewer pods needed for the same GPU count
  • Better scalability - aligns with modern cloud GPU instance types (e.g., AWS p3.8xlarge, Azure NC24s_v3)

Compatibility Notes

  • Requires cluster nodes with at least 4 GPUs available
  • Backward compatible - the number of processes (and GPUs) per pod can be adjusted via the --nproc_per_node parameter together with the matching nvidia.com/gpu request
  • No changes needed to the training script (main.py) - TorchElastic handles multi-GPU coordination automatically; see the sketch after this list
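
No script changes are needed because each process launched by the elastic agent binds to its own GPU through the LOCAL_RANK environment variable that torch.distributed.run / TorchElastic sets. Below is a minimal sketch of that standard pattern; it is an assumption about how main.py is structured, not an excerpt from it, and the function name is hypothetical.

```python
# Sketch of the usual per-process GPU binding; an assumption about how
# main.py behaves, not a quote from it.
import os
import torch
import torch.distributed as dist

def setup_distributed() -> torch.device:
    # torch.distributed.run / torchelastic sets LOCAL_RANK for each of the
    # --nproc_per_node processes it launches (0..3 with 4 GPUs per pod).
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    # Rank and world size are likewise read from the environment (env:// init).
    dist.init_process_group(backend="nccl")
    return torch.device(f"cuda:{local_rank}")
```

Since GPU binding is driven entirely by LOCAL_RANK, moving from 1 to 4 processes per pod requires only the --nproc_per_node change plus the matching GPU request above.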

Testing

  • Tested on Kubernetes cluster with NVIDIA GPU support
  • Verified with nvidia-smi showing 4 GPUs per pod
  • Confirmed 4 training processes launch correctly per pod

Future Improvements

  • Add Helm chart for configurable GPU counts
  • Implement a Horizontal Pod Autoscaler (HPA) driven by GPU utilization
  • Create documentation for different GPU configurations

@MagellaX (Author)

Hey @lenisha, can you merge this PR? I've checked and everything looks good!
