[Caution] This repository is intended to handle, though does not host, jailbreak prompts, which often contain malicious, unsafe, or inappropriate text.
We train SAEs on Prompt-Guard-86M, using the great dictionary_learning package. The same methodology can be applied on any Huggingface-compatible classifier.
A guide on how to use this repo is provided in reproducing.MD.
Model weights are included in this repository via Git LFS.
The preprint is available here.
Feel free to propose changes, do PRs or raise issues.
This project was conducted as coursework at ETH, with supervision from Prof. Dr. Elliott Ash and David Zollikofer. Many thanks also to Samuel Marks, Adam Karvonen, and Aaron Mueller for writing the dictionary learning package.
If you'd like to cite this work, we recommend
@misc{finke2025training,
      title={Autoencoders for a Harmfulness Text Classifier},
      url={https://github.com/lennart-finke/classifier-interp},
      author={Finke, Lennart and Zollikofer, David}, year={2025}
}