Official implementation for the paper: Improving LLM Unlearning Robustness via Random Perturbations
Create the environment:
conda create -n llmu-robustness python=3.10  # Python version is an assumption; match the repo's requirements
conda activate llmu-robustness
pip install -r requirements.txt
We use EleutherAI's lm-evaluation-harness for evaluation; install it from source:
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
Download the required datasets from the WMDP repository and place them in the data/ directory.
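If the corpora are hosted on the Hugging Face Hub, they can also be fetched programmatically. A minimal sketch, assuming the cais/wmdp-corpora dataset id used by the WMDP project (verify the exact id; note that some corpora are gated and require accepting the terms on the Hub first):

# Sketch: download the WMDP unlearning corpora into data/.
# The dataset id "cais/wmdp-corpora" is an assumption taken from the WMDP project.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="cais/wmdp-corpora",
    repo_type="dataset",
    local_dir="data/",
)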
Run the unlearning process using one of the supported methods, for example RMU or SimNPO:
python -m baselines.rm.rmu.unlearn \
--model_name_or_path "HuggingFaceH4/zephyr-7b-beta" \
--max_num_batches 500 \
--alpha "1200,1200" \
--steering_coeffs "6.5,6.5" \
--seed 42 \
--batch_size 4 \
--nu 0.0 \
--layer_id 7 \
--layer_ids "5,6,7";
python -m baselines.po.simnpo.unlearn \
--model_name_or_path "HuggingFaceH4/zephyr-7b-beta" \
--max_num_batches 500 \
--alpha "20,50" \
--retain_loss_fn "kl" \
--seed 42 \
--beta 0.1 \
--nu 0.0 \
--batch_size 4 \
--target_layers "7" \
--layer_ids "5,6,7";
To perform a grid search over the unlearning methods, run the provided scripts:
for RM in rmu adaptive_rmu rsv; do bash scripts/rm/$RM.sh; done
for PO in dpo npo simnpo; do bash scripts/po/$PO.sh; done
Trained models are saved under checkpoints/.
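Assuming the training scripts save in the standard Hugging Face format, a checkpoint can be loaded with transformers for inspection or further use. A minimal sketch (the path below is hypothetical; substitute the directory written under checkpoints/):

# Sketch: load an unlearned checkpoint.
# "checkpoints/rmu/zephyr-7b-beta" is a hypothetical path.
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "checkpoints/rmu/zephyr-7b-beta"
model = AutoModelForCausalLM.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)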
Evaluate a trained checkpoint on MMLU and WMDP with lm-eval (set $CHECK_POINT to the saved model directory):
lm-eval --model hf \
--model_args pretrained=$CHECK_POINT \
--tasks mmlu,wmdp \
--batch_size 16
Successful unlearning should lower WMDP accuracy while leaving MMLU largely intact.
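The harness can also be driven from Python. A minimal sketch using lm_eval.simple_evaluate that mirrors the CLI call above (the checkpoint path is a hypothetical placeholder):

# Sketch: run the same evaluation through the lm-evaluation-harness Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=checkpoints/rmu/zephyr-7b-beta",  # hypothetical path
    tasks=["mmlu", "wmdp"],
    batch_size=16,
)
print(results["results"])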