Reference implementation of our partial model collapse unlearning method proposed in the preprint:
Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs
Yan Scholten, Sophie Xhonneux, Leo Schwinn*, Stephan Günnemann*
TL;DR: Partial Model Collapse (PMC) enables effective LLM unlearning. By inducing model collapse partially for specific questions, PMC selectively erases information in a targeted way while preserving overall model utility.
[ Project page | PDF ]
Existing unlearning methods for large language models (LLMs) incorporate the private information they aim to remove into their unlearning objectives. We contend that this not only risks further exposure of sensitive data but also fundamentally contradicts the principle of minimizing its use.
We frame unlearning as an alignment task and introduce a novel perspective inspired by recent findings that training generative models on their own outputs can induce distribution collapse, effectively erasing information from the model. Our central insight is that we can leverage model collapse for machine unlearning: Rather than optimizing the model against answers we aim to unlearn, we finetune it on answers generated by the model itself. Since these answers are already likely under the model’s own distribution, this approach allows the model to diverge naturally from its original generations, facilitating targeted unlearning without compromising model utility.
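To make the core idea concrete, here is a minimal, self-contained sketch of one collapse step: sample answers from the model's own distribution for a forget question and finetune on those self-generated answers rather than on the private ground-truth answer. This is an illustration, not the repository's implementation; the model name, question, and hyperparameters are placeholders.

```python
# Minimal sketch of one collapse step (illustrative only, not the repo's code).
# A full implementation additionally masks prompt tokens in the loss, scores
# samples with a reward, and mixes in a retain loss to preserve utility.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "microsoft/phi-1_5"  # placeholder; in practice, a finetuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

forget_question = "Question: Where does the author live?\nAnswer:"  # placeholder
inputs = tokenizer(forget_question, return_tensors="pt").to(device)

# 1) Sample several candidate answers from the model's own distribution.
model.eval()
with torch.no_grad():
    samples = model.generate(
        **inputs,
        do_sample=True,
        top_p=0.95,
        num_return_sequences=4,  # analogous to a num_samples-style setting
        max_new_tokens=32,
    )

# 2) Finetune on the self-generated answers with the standard LM loss; iterating
#    this lets the model drift away from its original answer for this question.
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
optimizer.zero_grad()
for sample in samples:
    ids = sample.unsqueeze(0)
    loss = model(input_ids=ids, labels=ids).loss / len(samples)
    loss.backward()
optimizer.step()
```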
This repository provides code to perform PMC-unlearning for LLMs as described in our recent preprint.
Disclaimer: This repository is part of ongoing research efforts; the code, hyperparameters and empirical results provided are preliminary and remain subject to revisions. We will provide additional supplemental material for reproducing results from our preprint at a later time. Feedback is greatly appreciated.
PMC-unlearning performs collapse-based unlearning, that is, it drives the model's answers to forget questions away from those of the original model while preserving model utility. Note that the desired model behavior after unlearning is inherently application-dependent: what constitutes an acceptable response may vary across use cases. In practice, we observe that PMC-unlearning frequently converges toward response patterns that fall into two broad categories: (i) hallucinations, or (ii) generic refusals that indicate the absence of knowledge. Examples of the latter include:
I don't have any information available.
To be honest, I couldn't find any information.
There is no public information.
This information is not available at this time.
Specific details are not available.
Interestingly, such refusal-style behaviors emerge despite the fact that the reward function does not explicitly model them. Effectively enforcing such responses via the reward function directly is nontrivial, as it would require a semantic notion of acceptable refusal behavior rather than simple lexical overlap.
To address this, we propose PMC-alignment, which uses an auxiliary loss that semantically biases the model toward desirable refusal responses. Concretely, for forget questions, the model is finetuned on randomly sampled answers from a set of desirable responses (e.g., "I don't have any information available."). Formally, the unlearning objective becomes:
$$\mathcal{L} \;=\; \mathcal{L}_{\text{retain}} \;+\; \lambda_{\text{unlearning}}\,\mathcal{L}_{\text{collapse}} \;+\; \mathcal{L}_{\text{align}},$$

where $\mathcal{L}_{\text{retain}}$ is the standard language-modeling loss on the retain set, $\mathcal{L}_{\text{collapse}}$ is the loss on responses sampled from the model itself for forget questions, $\mathcal{L}_{\text{align}}$ is the loss on randomly sampled desirable refusal responses for forget questions, and $\lambda_{\text{unlearning}}$ trades off the retain and collapse terms.
Intuitively, the collapse loss enforces unlearning by ensuring divergence, while the alignment loss provides a semantic anchor that guides the output distribution toward desirable answers. Note that the alignment loss alone is typically not enough: applying it in isolation can reduce both unlearning effectiveness and model utility, since it may push the model toward low-likelihood responses under its current (conditional) distribution. By combining the two, PMC-alignment gradually increases the likelihood of acceptable refusal responses until they are sampled and then reinforced by the collapse loss.
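As a rough illustration of how these terms can be combined in code, the sketch below assembles the objective from per-term language-modeling losses. The helper names, the label masking, and the exact weighting are illustrative assumptions and may differ from the repository's implementation; `lambda_unlearning` mirrors the hyperparameter described further down.

```python
# Hedged sketch of assembling the combined objective; helper names, the exact
# weighting, and the -100 label masking are illustrative assumptions.
import random
import torch

def lm_loss(model, tokenizer, prompt: str, target: str) -> torch.Tensor:
    """Causal-LM loss of `target` given `prompt`, with prompt tokens masked out."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + " " + target, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt positions in the loss
    return model(input_ids=full_ids, labels=labels).loss

def pmc_alignment_loss(model, tokenizer, forget_q, self_generated_answer,
                       refusal_templates, retain_q, retain_a,
                       lambda_unlearning=1.0):
    # Collapse loss: finetune on an answer the model generated itself.
    collapse = lm_loss(model, tokenizer, forget_q, self_generated_answer)
    # Alignment loss: finetune on a randomly sampled desirable refusal response.
    align = lm_loss(model, tokenizer, forget_q, random.choice(refusal_templates))
    # Retain loss: standard loss on retain data to preserve utility.
    retain = lm_loss(model, tokenizer, retain_q, retain_a)
    return retain + lambda_unlearning * collapse + align
```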
Notably, this dynamic proves highly effective in practice: PMC-alignment reliably converges to producing refusal-style answers semantically similar to the desirable responses (like "I don't have any information available") for forget and paraphrased-forget questions, while maintaining high utility in general.
The following table presents preliminary empirical results for the models obtained using the configurations available in this repository. Currently supported models are Phi-1.5 and Llama-3.2-*-Instruct.
| Models | Method | Unlearn quality | Utility | Runtime (H100) |
|---|---|---|---|---|
| Phi-1.5 | Vanilla model | 58.23% | 64.0% | |
| | Finetuned model | 38.3% | 70.0% | |
| | PMC-unlearning | 95.6% | 69.0% | 40 min |
| | PMC-alignment | 94.76% | 71.0% | 26 min |
| Llama-3.2-3B-Instruct | Vanilla model | 74.68% | 71.0% | |
| | Finetuned model | 35.42% | 91% | |
| | PMC-unlearning | 99.15% | 84.0% | 30 min |
| | PMC-alignment | 98.38% | 85.0% | 35 min |
You can find more results in our preprint.
The following hyperparameters are central for optimizing the trade-off between unlearning quality, model utility, and computational efficiency:
- `num_epochs`: Number of unlearning epochs.
- `num_samples`: Number of candidate responses sampled for each forget question.
- `lambda_unlearning`: Trade-off parameter balancing the retain loss and the collapse loss (the loss on the sampled synthetic responses).
- `min_len`: Synthetic responses shorter than this minimum length are penalized in the reward function (see the sketch below).
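For intuition, the snippet below shows one plausible way `min_len` can act as a length penalty on sampled responses; the actual reward used by the code lives in the repository and may be more involved.

```python
# Illustrative only: how min_len could penalize degenerate short samples inside
# a reward over synthetic responses. The repository's reward (see
# unlearning/pmc.py) may be defined differently.
def length_penalized_reward(candidate: str, base_reward: float, min_len: int) -> float:
    if len(candidate.split()) < min_len:
        return base_reward - 1.0  # penalize too-short responses
    return base_reward
```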
First, finetune models on the ground-truth data. Then execute either PMC-unlearning or PMC-alignment.
1. Finetuning on full dataset
```bash
cd finetuning
python3 main.py -m -cd=configs -cn=phi
python3 main.py -m -cd=configs -cn=llama3
```
This will finetune vanilla models on the full dataset and store the resulting models in `models/finetuned/`.
2.1 PMC-unlearning
```bash
cd unlearning
python3 main.py -m -cd=configs -cn=PMC-unlearn-phi
python3 main.py -m -cd=configs -cn=PMC-unlearn-llama3
```
This will apply PMC-unlearning to the finetuned models and store the resulting models in `models/unlearned/`.
2.2 PMC-alignment
```bash
cd unlearning
python3 main.py -m -cd=configs -cn=PMC-align-phi
python3 main.py -m -cd=configs -cn=PMC-align-llama3
```
This will apply PMC-alignment to the finetuned models and store the resulting models in `models/aligned/`.
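After unlearning, a quick way to inspect the resulting behavior is to load a produced checkpoint and prompt it with a forget question. The checkpoint path and prompt below are placeholders; adjust them to whatever your run writes under `models/unlearned/` or `models/aligned/`.

```python
# Sanity check: prompt an unlearned/aligned checkpoint with a forget question.
# The checkpoint path and prompt are placeholders; adjust them to your run.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "models/unlearned/<your-run>"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)

prompt = "Question: <a forget question from your dataset>\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```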
Instructions for installing dependencies and configuring the environment before running the code:
```bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
Additionally, set `HUGGINGFACE_LOGIN_TOKEN` in each environment file (`.env`).
This code was tested with Python 3.11.9, pip 24.0, PyTorch 2.3.1+cu118, and CUDA 11.8 on an NVIDIA H100 GPU.
Please cite our paper if you use this code in your own work:
```bibtex
@misc{scholten2025modelcollapse,
      title={Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs},
      author={Yan Scholten and Sophie Xhonneux and Leo Schwinn and Stephan Günnemann},
      year={2025},
      eprint={2507.04219},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.04219},
}
```
This codebase builds upon the TOFU unlearning repository, adapted to demonstrate the effectiveness of our approach. The core principles proposed in our paper are implemented in `unlearning/unlearning_trainer.py` and `unlearning/pmc.py`. Note that we consider unlearning as an alignment task and follow a different evaluation approach. We believe our evaluation represents an important first step toward evaluating collapse-based machine unlearning and invite the community to assess our approach under further aspects.
For questions and feedback please contact:
Yan Scholten, Technical University of Munich
Sophie Xhonneux, Mila, Université de Montréal
Leo Schwinn, Technical University of Munich
Stephan Günnemann, Technical University of Munich
The code by Yan Scholten, Sophie Xhonneux, Leo Schwinn and Stephan Günnemann is licensed under the MIT license.