Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs

Reference implementation of our partial model collapse unlearning method proposed in the preprint:

Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs
Yan Scholten, Sophie Xhonneux, Leo Schwinn*, Stephan Günnemann*

TL;DR: Partial Model Collapse (PMC) enables effective LLM unlearning. By inducing model collapse partially for specific questions, PMC selectively erases information in a targeted way while preserving overall model utility.

[ Project page | PDF ]

Overview

Existing unlearning methods for large language models (LLMs) incorporate the private information they aim to remove into their unlearning objectives. We contend that this not only risks further exposure of sensitive data but also fundamentally contradicts the principle of minimizing its use.

We frame unlearning as an alignment task and introduce a novel perspective inspired by recent findings that training generative models on their own outputs can induce distribution collapse, effectively erasing information from the model. Our central insight is that we can leverage model collapse for machine unlearning: Rather than optimizing the model against answers we aim to unlearn, we finetune it on answers generated by the model itself. Since these answers are already likely under the model’s own distribution, this approach allows the model to diverge naturally from its original generations, facilitating targeted unlearning without compromising model utility.
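To make the central idea concrete, below is a minimal, heavily simplified sketch of collapse-based unlearning, assuming a Hugging Face causal LM. It is not the repository's training loop (see unlearning/unlearning_trainer.py and unlearning/pmc.py for the actual implementation); the forget question and the `reward` function are hypothetical placeholders.

```python
# Conceptual sketch only: sample the model's own answers to a forget question
# and fine-tune on a high-reward subset, so the output distribution drifts
# away from the original answer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/phi-1_5"  # illustrative; the repo also supports Llama-3.2 models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

forget_question = "Where does the author live?"  # hypothetical forget question
inputs = tokenizer(forget_question, return_tensors="pt")

# 1) Sample several candidate answers from the model itself
#    (cf. the num_samples hyperparameter below).
model.eval()
with torch.no_grad():
    samples = model.generate(
        **inputs,
        do_sample=True,
        num_return_sequences=4,
        max_new_tokens=32,
        pad_token_id=tokenizer.eos_token_id,
    )

def reward(answer: str) -> float:
    """Hypothetical placeholder reward: prefer answers that do not
    reproduce the ground-truth answer we aim to unlearn."""
    return float("ground-truth answer" not in answer)

# 2) Fine-tune on high-reward samples with a standard LM loss.
model.train()
for sequence in samples:
    answer = tokenizer.decode(sequence, skip_special_tokens=True)
    if reward(answer) > 0.5:
        batch = tokenizer(answer, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
optimizer.step()
optimizer.zero_grad()
```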

This repository provides code to perform PMC-unlearning for LLMs as described in our recent preprint.

Disclaimer: This repository is part of ongoing research efforts; the code, hyperparameters and empirical results provided are preliminary and remain subject to revisions. We will provide additional supplemental material for reproducing results from our preprint at a later time. Feedback is greatly appreciated.

Partial model collapse with alignment

PMC-unlearning performs collapse-based unlearning, that is, driving the model's answers for forget questions away from those of the original model while preserving model utility. Note that the desired model behavior after unlearning is inherently application-dependent: what constitutes an acceptable response may vary across use cases. In practice, we observe that PMC-unlearning frequently converges toward response patterns that fall into two broad categories: (i) hallucinations, or (ii) generic refusals that indicate the absence of knowledge. Examples of the latter include:

I don't have any information available.
To be honest, I couldn't find any information.
There is no public information.
This information is not available at this time.
Specific details are not available.

Interestingly, such refusal-style behaviors emerge despite the fact that the reward function does not explicitly model them. Effectively enforcing such responses via the reward function directly is nontrivial, as it would require a semantic notion of acceptable refusal behavior rather than simple lexical overlap.

To address this, we propose PMC-alignment, which uses an auxiliary loss that semantically biases the model toward desirable refusal responses. Concretely, for forget questions, the model is finetuned on randomly sampled answers from a set of desirable responses (e.g., "I don't have any information available."). Formally, the unlearning objective becomes:

$$ \mathcal{L}_{collapse} + (1-\gamma) \mathcal{L}_{alignment}$$

where $\mathcal{L}_{collapse}$ denotes the collapse loss driving the model away from the answers to unlearn, and $\mathcal{L}_{alignment}$ denotes the alignment term encouraging convergence toward responses semantically similar to those in the set of desirable responses. The discount factor $\gamma$ corresponds to the average reward score of the batch, nullifying the alignment term once the model has collapsed to unlearned responses.

Intuitively, the collapse loss enforces unlearning by ensuring divergence, while the alignment loss provides a semantic anchor that guides the output distribution toward desirable answers. Note that the alignment loss alone is typically not enough: applying it in isolation can reduce both unlearning effectiveness and model utility, since it may push the model toward low-likelihood responses under its current (conditional) distribution. By combining the two, PMC-alignment gradually increases the likelihood of acceptable refusal responses until they are sampled and then reinforced by the collapse loss.
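As a concrete illustration of how the discount factor acts, here is a small sketch of the combined objective; the tensors and values are hypothetical placeholders, not the repository's exact implementation.

```python
# Conceptual sketch of the PMC-alignment objective: the alignment term is
# discounted by the batch-average reward gamma, so it vanishes once the
# model has collapsed to acceptable responses.
import torch

def pmc_alignment_loss(collapse_loss: torch.Tensor,
                       alignment_loss: torch.Tensor,
                       batch_rewards: torch.Tensor) -> torch.Tensor:
    gamma = batch_rewards.mean()  # average reward score of the batch
    return collapse_loss + (1.0 - gamma) * alignment_loss

# Example: high rewards (model already collapsed) nearly switch off the alignment term.
collapse_loss = torch.tensor(0.8)
alignment_loss = torch.tensor(1.2)  # LM loss on sampled refusal templates
print(pmc_alignment_loss(collapse_loss, alignment_loss, torch.tensor([0.9, 1.0, 0.95])))
```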

Notably, this dynamic proves highly effective in practice: PMC-alignment reliably converges to producing refusal-style answers semantically similar to the desirable responses (such as "I don't have any information available.") for forget and paraphrased forget questions, while maintaining high overall model utility.

Empirical results

The following table presents preliminary empirical results for the models obtained using the configurations available in this repository. Currently supported models are Phi-1.5 and Llama-3.2-*-Instruct.

| Models | Method | Unlearn quality ($\uparrow$) | Utility ($\uparrow$) | Runtime (H100) |
|---|---|---|---|---|
| Phi-1.5 | Vanilla model | 58.23% | 64.0% | |
| | Finetuned model | 38.3% $\pm$ 0.12 | 70.0% $\pm$ 0.98 | |
| | PMC-unlearning | 95.6% $\pm$ 0.57 | 69.0% $\pm$ 1.34 | 40 min $\pm$ 2 |
| | PMC-alignment | 94.76% $\pm$ 0.2 | 71.0% $\pm$ 0.86 | 26 min $\pm$ 2 |
| Llama-3.2-3B-Instruct | Vanilla model | 74.68% | 71.0% | |
| | Finetuned model | 35.42% $\pm$ 0.23 | 91% $\pm$ 0.73 | |
| | PMC-unlearning | 99.15% $\pm$ 0.3 | 84.0% $\pm$ 2.72 | 30 min $\pm$ 1 |
| | PMC-alignment | 98.38% $\pm$ 0.64 | 85.0% $\pm$ 2.51 | 35 min $\pm$ 15 |

You can find more results in our preprint.

Hyperparameters

The following hyperparameters are central for optimizing the trade-off between unlearning quality, model utility, and computational efficiency (an illustrative configuration sketch follows the list):

  • num_epochs: Number of unlearning epochs.
  • num_samples: Number of candidate responses sampled for each forget question.
  • lambda_unlearning: Trade-off parameter balancing the retain loss and the collapse loss (the loss on the sampled synthetic responses).
  • min_len: Minimal response length; sampled synthetic responses shorter than this are penalized by the reward function.
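An illustrative sketch of these hyperparameters is shown below. The values are placeholders, not the tuned settings from our experiments; see the config files referenced in the usage instructions for the actual values.

```python
# Illustrative hyperparameter values only (placeholders, not tuned settings).
pmc_config = {
    "num_epochs": 5,           # number of unlearning epochs
    "num_samples": 4,          # candidate responses sampled per forget question
    "lambda_unlearning": 1.0,  # trade-off between retain loss and collapse loss
    "min_len": 10,             # responses shorter than this are penalized
}
```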

Usage instructions

First, finetune the models on the ground-truth data. Then run either PMC-unlearning or PMC-alignment.

1. Finetuning on full dataset

cd finetuning
python3 main.py -m -cd=configs -cn=phi
python3 main.py -m -cd=configs -cn=llama3

This will finetune the vanilla models on the full dataset and store the resulting models in models/finetuned/.

2.1 PMC-unlearning

cd unlearning
python3 main.py -m -cd=configs -cn=PMC-unlearn-phi
python3 main.py -m -cd=configs -cn=PMC-unlearn-llama3

This will apply PMC-unlearning to the finetuned models and store the resulting models in models/unlearned/.

2.2 PMC-alignment

cd unlearning
python3 main.py -m -cd=configs -cn=PMC-align-phi
python3 main.py -m -cd=configs -cn=PMC-align-llama3

This will apply PMC-alignment to the finetuned models and store the resulting models in models/aligned/.

Installation

Install the dependencies and set up the environment before running the code:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Additionally, set HUGGINGFACE_LOGIN_TOKEN in each environment.env.

This code was tested with Python 3.11.9, pip 24.0, PyTorch 2.3.1+cu118, and CUDA 11.8 on an NVIDIA H100 GPU.

Cite

Please cite our paper if you use this code in your own work:

@misc{scholten2025modelcollapse,
      title={Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs}, 
      author={Yan Scholten and Sophie Xhonneux and Leo Schwinn and Stephan Günnemann},
      year={2025},
      eprint={2507.04219},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.04219}, 
}

Acknowledgements

This codebase builds upon the TOFU unlearning repository, adapted to demonstrate the effectiveness of our approach. The core principles proposed in our paper are implemented in unlearning/unlearning_trainer.py and unlearning/pmc.py. Note that we consider unlearning an alignment task and follow a different evaluation approach. We believe our evaluation represents an important first step toward evaluating collapse-based machine unlearning and invite the community to assess our approach from further perspectives.

Contact

For questions and feedback, please contact:

Yan Scholten, Technical University of Munich
Sophie Xhonneux, Mila, Université de Montréal
Leo Schwinn, Technical University of Munich
Stephan Günnemann, Technical University of Munich

License

The code by Yan Scholten, Sophie Xhonneux, Leo Schwinn and Stephan Günnemann is licensed under the MIT license.
