lennart-finke/classifier-interp

Training Sparse Autoencoders on Prompt-Guard

[Caution] This repository is intended to handle, though does not host, jailbreak prompts, which often contain malicious, unsafe, or inappropriate text.

We train sparse autoencoders (SAEs) on Prompt-Guard-86M, using the great dictionary_learning package. The same methodology can be applied to any Hugging Face-compatible classifier. A guide to using this repo is provided in reproducing.MD.
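For readers new to the technique, here is a minimal sketch of what training a sparse autoencoder on classifier activations involves. It is not this repository's training code and does not use the dictionary_learning API; the actual workflow is described in reproducing.MD. Everything here is an illustrative assumption: the class and function names, the ReLU encoder with an L1 penalty, the hyperparameters, and the random tensor that stands in for hidden states extracted from the classifier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    """One-hidden-layer autoencoder with a ReLU feature layer (hypothetical helper)."""

    def __init__(self, activation_dim: int, dict_size: int):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, dict_size)
        self.decoder = nn.Linear(dict_size, activation_dim)

    def forward(self, x):
        features = F.relu(self.encoder(x))  # non-negative feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features


def sae_loss(x, reconstruction, features, l1_coeff):
    """Reconstruction error plus an L1 penalty that pushes features toward sparsity."""
    mse = F.mse_loss(reconstruction, x)
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity


def train_sae(activations, dict_size, l1_coeff=1e-3, steps=300, lr=3e-3, batch_size=64):
    """Fit an SAE to a (num_samples, activation_dim) tensor of activations."""
    torch.manual_seed(0)  # fixed seed so the sketch is repeatable
    sae = SparseAutoencoder(activations.shape[-1], dict_size)
    optimizer = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        idx = torch.randint(0, activations.shape[0], (batch_size,))
        batch = activations[idx]
        reconstruction, features = sae(batch)
        loss = sae_loss(batch, reconstruction, features, l1_coeff)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return sae


# Assumption: random data stands in for hidden states collected from the classifier.
torch.manual_seed(0)
fake_activations = torch.randn(512, 32)

sae = train_sae(fake_activations, dict_size=128)
with torch.no_grad():
    reconstruction, features = sae(fake_activations)
```

In a real run, `fake_activations` would be replaced by activations collected from a chosen layer of the classifier on the prompt dataset, and the dictionary size and sparsity coefficient would be tuned rather than fixed as above.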

Model weights are included in this repository via Git LFS.

Preprint

The preprint is available here.

Contributing

Feel free to propose changes, open pull requests, or raise issues.

Thanks

This project was conducted as coursework at ETH, with supervision from Prof. Dr. Elliott Ash and David Zollikofer. Many thanks also to Samuel Marks, Adam Karvonen, and Aaron Mueller for writing the dictionary learning package.

Citation

If you'd like to cite this work, we recommend:

@misc{finke2025training,
      title={Autoencoders for a Harmfulness Text Classifier},
      url={https://github.com/lennart-finke/classifier-interp},
      author={Finke, Lennart and Zollikofer, David},
      year={2025}
}