🚀 Automatic 4x training speedup for PyTorch models!
- 4x Training Speedup: validated 4.06x on an NVIDIA T4
- Zero Configuration: Automatic hardware detection and optimization
- Production Ready: Full checkpointing and inference support
- Energy Efficient: 36% reduction in training energy consumption
- Universal: Works with any PyTorch model
```bash
pip install pytorch-autotune
```
```python
import torch
import torchvision.models as models
from pytorch_autotune import quick_optimize

# Any PyTorch model
model = models.resnet50()

# One line to optimize!
model, optimizer, scaler = quick_optimize(model)

# Now train with 4x speedup!
# (num_epochs and train_loader defined as usual)
criterion = torch.nn.CrossEntropyLoss()
for epoch in range(num_epochs):
    for data, target in train_loader:
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad(set_to_none=True)

        # Mixed precision training
        with torch.amp.autocast('cuda'):
            output = model(data)
            loss = criterion(output, target)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
```
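Checkpointing (the "production ready" bullet above) follows the standard PyTorch pattern, since `quick_optimize` returns a regular optimizer and a `GradScaler`-style scaler. A minimal sketch, with an arbitrary filename:

```python
# Sketch: persist everything needed to resume mixed-precision training
checkpoint = {
    'model': model.state_dict(),        # note: torch.compile wraps the model,
                                        # so keys may carry an '_orig_mod.' prefix
    'optimizer': optimizer.state_dict(),
    'scaler': scaler.state_dict(),
    'epoch': epoch,
}
torch.save(checkpoint, 'checkpoint.pt')

# To resume: rebuild the model, re-run quick_optimize, then restore state
checkpoint = torch.load('checkpoint.pt')
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
scaler.load_state_dict(checkpoint['scaler'])
```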
```python
from pytorch_autotune import AutoTune

# Create AutoTune instance with custom settings
autotune = AutoTune(model, device='cuda', verbose=True)

# Custom optimization
model, optimizer, scaler = autotune.optimize(
    optimizer_name='AdamW',
    learning_rate=0.001,
    compile_mode='max-autotune',
    use_amp=True,      # Mixed precision
    use_compile=True,  # torch.compile
    use_fused=True,    # Fused optimizer
)

# Benchmark to measure throughput
results = autotune.benchmark(sample_data, iterations=100)
print(f"Throughput: {results['throughput']:.1f} iter/sec")
```
Tested on an NVIDIA Tesla T4 GPU with PyTorch 2.7.1 (times in seconds):

| Model | Dataset | Baseline (s) | AutoTune (s) | Speedup | Accuracy |
|---|---|---|---|---|---|
| ResNet-18 | CIFAR-10 | 12.04 | 2.96 | 4.06x | +4.7% |
| ResNet-50 | ImageNet | 45.2 | 11.3 | 4.0x | Maintained |
| EfficientNet-B0 | CIFAR-10 | 30.2 | 17.5 | 1.73x | +0.8% |
| Vision Transformer | CIFAR-100 | 55.8 | 19.4 | 2.87x | +1.2% |
| Configuration | Energy (J) | Time (s) | Energy Savings |
|---|---|---|---|
| Baseline | 324 | 4.7 | - |
| AutoTune | 208 | 3.1 | 35.8% |
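For context, the cumulative GPU energy counter exposed through NVML (Volta-generation GPUs and newer) is one way to take such measurements; `run_training()` below is a stand-in for your training loop:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# NVML reports cumulative energy in millijoules since driver load
start_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
run_training()  # placeholder for your training loop
end_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)

print(f"Energy used: {(end_mj - start_mj) / 1000:.0f} J")
pynvml.nvmlShutdown()
```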
AutoTune automatically detects your hardware and applies optimal combinations of the techniques below (a sketch of the underlying PyTorch calls follows the list):

- **Mixed Precision Training (AMP)**
  - FP16 on T4/V100
  - BF16 on A100/H100
  - Automatic loss scaling
- **torch.compile() Optimization**
  - Graph compilation for faster execution
  - Automatic kernel fusion
  - Hardware-specific optimizations
- **Fused Optimizers**
  - Single-kernel optimizer updates
  - Reduced memory traffic
  - Better GPU utilization
- **Hardware-Specific Settings**
  - TF32 for Ampere GPUs
  - Channels-last memory format for CNNs
  - Optimal batch size detection
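For intuition, here is a minimal, hypothetical sketch of this kind of hardware-aware setup. It is not AutoTune's actual implementation, just the standard PyTorch calls behind each optimization:

```python
import torch

def sketch_hardware_setup(model):
    """Illustrative only: pick an AMP dtype and enable hardware-specific settings."""
    major, _ = torch.cuda.get_device_capability()

    # BF16 is supported on Ampere (SM 8.x) and newer; fall back to FP16 elsewhere
    amp_dtype = torch.bfloat16 if major >= 8 else torch.float16

    # TF32 matmuls are only available on Ampere and newer GPUs
    torch.backends.cuda.matmul.allow_tf32 = major >= 8
    torch.backends.cudnn.allow_tf32 = major >= 8

    # Channels-last memory format tends to help convolutional models
    model = model.cuda().to(memory_format=torch.channels_last)

    # Graph compilation (PyTorch 2.0+)
    model = torch.compile(model)

    # Fused optimizer: one kernel per step instead of one per parameter
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, fused=True)
    return model, optimizer, amp_dtype
```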
| GPU | Speedup | Special Features |
|---|---|---|
| Tesla T4 | 2-4x | FP16, Fused Optimizers |
| Tesla V100 | 2-3.5x | FP16, Tensor Cores |
| A100 | 3-4.5x | BF16, TF32, Tensor Cores |
| RTX 3090/4090 | 2.5-4x | FP16, TF32 |
| H100 | 3.5-5x | FP8, BF16, TF32 |
```python
AutoTune(model, device='cuda', batch_size=None, verbose=True)
```

Parameters:

- `model`: PyTorch model to optimize
- `device`: Device to use (`'cuda'` or `'cpu'`)
- `batch_size`: Optional batch size for auto-detection
- `verbose`: Print optimization details
```python
model, optimizer, scaler = autotune.optimize(
    optimizer_name='AdamW',
    learning_rate=0.001,
    compile_mode='default',
    use_amp=None,            # Auto-detect
    use_compile=None,        # Auto-detect
    use_fused=None,          # Auto-detect
    use_channels_last=None,  # Auto-detect
)
```
```python
model, optimizer, scaler = quick_optimize(model, **kwargs)
```
One-line optimization with automatic settings.
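The `**kwargs` in the signature suggests extra keyword arguments are forwarded to the underlying `optimize()` call; assuming so, a customized one-liner might look like:

```python
# Hypothetical usage, assuming kwargs are forwarded to optimize()
model, optimizer, scaler = quick_optimize(
    model,
    optimizer_name='AdamW',
    learning_rate=3e-4,
)
```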
- Use Latest PyTorch: version 2.0+ for torch.compile support
- Batch Size: let AutoTune detect the optimal batch size
- Learning Rate: scaled with batch size (AutoTune handles this)
- First Epoch: will be slower due to compilation
- Memory: use `optimizer.zero_grad(set_to_none=True)` to free gradient memory
```python
import torchvision.models as models
from pytorch_autotune import quick_optimize

# ResNet for ImageNet
model = models.resnet50(weights='IMAGENET1K_V1')  # 'pretrained=True' is deprecated
model, optimizer, scaler = quick_optimize(model)
# Result: 4x speedup

# EfficientNet for CIFAR
model = models.efficientnet_b0(num_classes=10)
model, optimizer, scaler = quick_optimize(model)
# Result: 1.7x speedup
```
```python
from transformers import AutoModel
from pytorch_autotune import AutoTune

# BERT model
model = AutoModel.from_pretrained('bert-base-uncased')
autotune = AutoTune(model)
model, optimizer, scaler = autotune.optimize()
# Result: 2.5x speedup
```
**Slow first epoch**

Solution: This is normal; torch.compile needs to compile the graph. Subsequent epochs will be fast.

**Out of memory**

Solution: AutoTune may increase memory usage slightly. Reduce batch size by 10-20%.

**Training instability (e.g., NaN loss)**

Solution: Use gradient clipping and adjust the learning rate:

```python
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```
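One caveat when combining clipping with the returned `GradScaler`: gradients stay scaled until `scaler.unscale_()` is called, so unscale before clipping. A minimal sketch of the adjusted training step:

```python
scaler.scale(loss).backward()

# Unscale first so clipping sees true gradient magnitudes
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

scaler.step(optimizer)
scaler.update()
```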
**No speedup**

Solution: Ensure you're using:
- a GPU (not CPU)
- PyTorch 2.0+
- a compute-intensive model (not memory-bound)
If you use PyTorch AutoTune in your research, please cite:
```bibtex
@software{pytorch_autotune_2024,
  title   = {PyTorch AutoTune: Automatic 4x Training Speedup},
  author  = {Shrivastava, Chinmay},
  year    = {2024},
  url     = {https://github.com/JonSnow1807/pytorch-autotune},
  version = {1.0.1}
}
```
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- Support for distributed training (DDP)
- Automatic learning rate scheduling
- Support for quantization (INT8)
- Integration with HuggingFace Trainer
- Custom CUDA kernels for specific operations
- Support for Apple Silicon (MPS)
Chinmay Shrivastava
- GitHub: @JonSnow1807
- Email: cshrivastava2000@gmail.com
- LinkedIn: Connect with me
This project is licensed under the MIT License - see the LICENSE file for details.
- PyTorch team for torch.compile and AMP
- NVIDIA for mixed precision training research
- The open-source community for feedback and contributions
Made with ❤️ by Chinmay Shrivastava
If this project helped you, please consider giving it a ⭐!