
MagellaX

Changes Made

  • Updated kube/imagenet.yaml to request 4 GPUs per pod (nvidia.com/gpu: 4)
  • Modified PyTorch distributed training to launch 4 processes per node (--nproc_per_node=4)
  • Total training capacity increased from 3 GPUs to 12 GPUs (3 pods × 4 GPUs); a sketch of the combined manifest changes follows this list
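
For reference, here is a minimal sketch of how the two settings fit together in a pod spec. Only the nvidia.com/gpu: 4 limit and the --nproc_per_node=4 flag reflect the actual change; the surrounding structure, image name, and remaining arguments are illustrative assumptions, not the literal contents of kube/imagenet.yaml.

```yaml
# Illustrative sketch only -- structure, image, and extra args are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: imagenet-worker
spec:
  containers:
    - name: trainer
      image: <training-image>        # placeholder
      command:
        - python
        - -m
        - torch.distributed.run      # torchelastic-style launcher
        - --nproc_per_node=4         # one training process per GPU
        - main.py
      resources:
        limits:
          nvidia.com/gpu: 4          # request 4 GPUs for this pod
```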

Benefits

  • 4× the GPUs per pod (1 → 4) - better resource efficiency on multi-GPU nodes
  • Improved training speed - larger effective batch size and more parallel workers per node
  • Cost optimization - fewer pods needed for the same GPU count
  • Better scalability - aligns with modern cloud GPU instance types (e.g., AWS p3.8xlarge, Azure NC24s_v3)

Compatibility Notes

  • Requires cluster nodes with at least 4 GPUs available
  • Backward compatible - the number of processes (and GPUs) per pod can be adjusted via the --nproc_per_node parameter together with the matching nvidia.com/gpu request
  • No changes needed to the training script (main.py) - TorchElastic handles multi-GPU coordination automatically; see the sketch after this list
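
No script changes are needed because each process launched by the elastic agent binds to its own GPU through the LOCAL_RANK environment variable that torch.distributed.run / TorchElastic sets. Below is a minimal sketch of that standard pattern; it is an assumption about how main.py is structured, not an excerpt from it, and the function name is hypothetical.

```python
# Sketch of the usual per-process GPU binding; an assumption about how
# main.py behaves, not a quote from it.
import os
import torch
import torch.distributed as dist

def setup_distributed() -> torch.device:
    # torch.distributed.run / torchelastic sets LOCAL_RANK for each of the
    # --nproc_per_node processes it launches (0..3 with 4 GPUs per pod).
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    # Rank and world size are likewise read from the environment (env:// init).
    dist.init_process_group(backend="nccl")
    return torch.device(f"cuda:{local_rank}")
```

Since GPU binding is driven entirely by LOCAL_RANK, moving from 1 to 4 processes per pod requires only the --nproc_per_node change plus the matching GPU request above.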

Testing

  • Tested on Kubernetes cluster with NVIDIA GPU support
  • Verified with nvidia-smi showing 4 GPUs per pod
  • Confirmed 4 training processes launch correctly per pod

Future Improvements

  • Add Helm chart for configurable GPU counts
  • Implement a Horizontal Pod Autoscaler (HPA) driven by GPU utilization
  • Create documentation for different GPU configurations

@MagellaX (Author)

Hey @lenisha, can you merge this PR? I've checked and everything looks good!
