Preliminary code release for our paper "A Minimalist Optimizer Design for LLM Pretraining", by Athanasios Glentis, Jiaxiang Li, Andi Han and Mingyi Hong.
We introduce our proposed optimizer, SCALE, detailed in Algorithm 1. The design of SCALE is motivated by empirical insights from our experiments, which highlighted the importance of stabilizing updates in the last layer and controlling gradient scale via column-wise normalization. Accordingly, SCALE integrates two components:
- column-wise normalization of gradients
- first-order momentum restricted to the last layer
Despite its minimalist design, SCALE is highly effective, outperforming baselines such as Adam and achieving performance competitive with state-of-the-art Stable-SPAM and Muon, while using substantially less memory.
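To make the two components concrete, here is a minimal PyTorch-style sketch of a SCALE-like update for a single 2-D weight matrix. The specific choices below (column-wise L2 normalization with an epsilon, heavy-ball momentum, and the order in which the two steps are applied) are illustrative assumptions, not the exact recipe; see Algorithm 1 in the paper and `mem_eff_pt/pt_scale/scale_optimizer.py` for the precise update.

```python
# Minimal sketch of a SCALE-like step (illustrative only; see
# mem_eff_pt/pt_scale/scale_optimizer.py for the actual implementation).
# Assumptions: column-wise L2 normalization with an epsilon, and heavy-ball
# momentum kept only for the last layer.
import torch

def scale_like_step(param, grad, lr, momentum=0.9, eps=1e-8,
                    is_last_layer=False, momentum_buf=None):
    """Apply one update to a 2-D weight matrix `param` given its gradient `grad`."""
    if is_last_layer:
        # First-order momentum is maintained only for the last layer,
        # so a single extra state buffer is stored for the whole model.
        if momentum_buf is None:
            momentum_buf = torch.zeros_like(grad)
        momentum_buf.mul_(momentum).add_(grad)
        update = momentum_buf
    else:
        update = grad
    # Column-wise normalization: rescale each column of the update to unit L2 norm.
    col_norms = update.norm(dim=0, keepdim=True).clamp_min(eps)
    param.add_(update / col_norms, alpha=-lr)
    return momentum_buf
```

Because momentum is stored only for the last layer, per-parameter optimizer state is essentially eliminated for the rest of the network, which is where the memory savings relative to Adam come from.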
Pareto frontier: perplexity vs. memory consumption for several state-of-the-art optimizers. SCALE lies on the Pareto frontier, achieving the best memory efficiency while maintaining (near) state-of-the-art perplexity.
Numerical Results: Below we include our numerical results for pretraining LLaMA models on the C4 dataset. We report evaluation perplexity as our metric (and model+optimizer memory in parentheses).

| Model Size | 60M | 130M | 350M | 1B |
|---|---|---|---|---|
| Tokens | 1.4B | 2.6B | 7.8B | 13.1B |
| Adam | 30.05 (0.35G) | 23.13 (0.81G) | 18.77 (2.21G) | 16.52 (8.04G) |
| Stable-SPAM | 28.77 (0.35G) | 22.20 (0.81G) | 16.80 (2.21G) | 13.97 (8.04G) |
| Muon | 28.86 (0.23G) | 22.20 (0.54G) | 16.70 (1.47G) | 14.18 (5.36G) |
| GaLore | 34.58 (0.28G) | 25.31 (0.61G) | 19.37 (1.59G) | 15.57 (4.76G) |
| Fira | 30.34 (0.28G) | 22.96 (0.61G) | 16.82 (1.59G) | 15.10 (4.76G) |
| SWAN | 30.00 (0.25G) | 22.83 (0.46G) | 17.14 (1.00G) | 14.42 (3.20G) |
| APOLLO | 30.94 (0.28G) | 22.93 (0.61G) | 16.75 (1.59G) | 14.20 (4.76G) |
| APOLLO-Mini | 31.85 (0.25G) | 23.63 (0.46G) | 17.11 (1.00G) | 14.13 (3.20G) |
| SCALE (ours) | 30.81 (0.15G) | 22.57 (0.32G) | 16.32 (0.80G) | 14.25 (2.81G) |

We also report results from a larger-scale, longer pretraining run, comparing against the APOLLO variants. We show evaluation perplexity at intermediate training steps, with model+optimizer memory in parentheses next to each method.

| Steps | 40K | 80K | 120K | 150K |
|---|---|---|---|---|
| Tokens | 5.2B | 10.5B | 15.7B | 19.7B |
| APOLLO (16.14G) | 17.55 | 14.39 | 13.23 | 13.02 |
| APOLLO-Mini (14.53G) | 18.03 | 14.60 | 13.32 | 13.09 |
| SCALE (ours) (13.74G) | 17.99 | 14.57 | 12.86 | 12.59 |
The SCALE optimizer code can be found in `mem_eff_pt/pt_scale/scale_optimizer.py`. The repository provides the scripts used in our experiments in the `scripts/` directory.
Example script (`350m_scale.sh`):

```bash
torchrun --standalone --nproc_per_node 4 torchrun_main_DDP.py \
    --model_name 350m_scale \
    --model_config configs/llama_350m.json \
    --optimizer scale \
    --lr 1e-3 \
    --momentum 0.9 \
    --weight_decay 0.0 \
    --batch_size 128 \
    --total_batch_size 512 \
    --num_training_steps 60000 \
    --warmup_steps 6000 \
    --dtype bfloat16 \
    --eval_every 1000 \
    --save_every 99999 \
    --seed 42 \
    --scheduler cosine \
    --dataset_path /path/to/c4/en
```

If you find this work useful for your research, please cite our paper:
```bibtex
@article{glentis2025minimalist,
  title={A Minimalist Optimizer Design for LLM Pretraining},
  author={Glentis, Athanasios and Li, Jiaxiang and Han, Andi and Hong, Mingyi},
  journal={arXiv preprint arXiv:2506.16659},
  year={2025}
}
```
