Training Sparse Autoencoders on Prompt-Guard

[Caution] This repository is intended to handle, though does not host, jailbreak prompts, which often contain malicious, unsafe, or inappropriate text.

We train SAEs on Prompt-Guard-86M, using the great dictionary_learning package. The same methodology can be applied on any Huggingface-compatible classifier. A guide on how to use this repo is provided in reproducing.MD.

Model weights are included in this repository via Git LFS.

Preprint

The preprint is available here.

Contributing

Feel free to propose changes, do PRs or raise issues.

Thanks

This project was conducted as coursework at ETH, with supervision from Prof. Dr. Elliott Ash and David Zollikofer. Many thanks also to Samuel Marks, Adam Karvonen, and Aaron Mueller for writing the dictionary learning package.

Citation

If you'd like to cite this work, we recommend

@misc{finke2025training,
      title={Autoencoders for a Harmfulness Text Classifier},
      url={https://github.com/lennart-finke/classifier-interp},
      author={Finke, Lennart and Zollikofer, David}, year={2025}
}

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
figures		figures
sae-test		sae-test
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
cluster.py		cluster.py
combined_classifier.py		combined_classifier.py
evaluate_classifier_on_datasets.py		evaluate_classifier_on_datasets.py
paper.pdf		paper.pdf
plot_wandb_loss.py		plot_wandb_loss.py
prompt_dataset.py		prompt_dataset.py
reproducing.MD		reproducing.MD
requirements.txt		requirements.txt
sae_utils.py		sae_utils.py
test_activation_after_transformation.py		test_activation_after_transformation.py
test_logistic_regression.py		test_logistic_regression.py
test_visual.py		test_visual.py
train_sae.py		train_sae.py
uncertain_samples.ipynb		uncertain_samples.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Training Sparse Autoencoders on Prompt-Guard

Preprint

Contributing

Thanks

Citation

About

Uh oh!

Languages

lennart-finke/classifier-interp

Folders and files

Latest commit

History

Repository files navigation

Training Sparse Autoencoders on Prompt-Guard

Preprint

Contributing

Thanks

Citation

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Languages