

Also, you might want to look into the Heretic project, which aims to remove safeguards from local models; those safeguards may be similar to what's in the larger versions. Figuring out the phrases it uses to test the safeguards might yield decent results.
From my other comment, it looks like this dataset contains various strings that trigger refusals: https://huggingface.co/datasets/mlabonne/harmful_behaviors
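As a rough sketch of what "testing the safeguards" could look like: feed prompts from a dataset like that one to a model and check whether the response reads as a refusal. The marker list and function below are hypothetical illustrations, not Heretic's actual heuristic.

```python
# Minimal sketch of a refusal check on model responses.
# REFUSAL_MARKERS is a hypothetical list for illustration only;
# real projects tune this against observed model behavior.
REFUSAL_MARKERS = [
    "i can't",
    "i cannot",
    "i'm sorry",
    "as an ai",
]

def looks_like_refusal(response: str) -> bool:
    """Crude heuristic: does the response open with a known refusal phrase?"""
    text = response.strip().lower()
    return any(text.startswith(marker) for marker in REFUSAL_MARKERS)

# Prompts would come from a dataset such as mlabonne/harmful_behaviors;
# here we just demonstrate the check on canned responses.
print(looks_like_refusal("I cannot help with that request."))  # True
print(looks_like_refusal("Sure, here is an overview."))        # False
```

A prefix match like this is deliberately simple; refusals phrased mid-response or in other wording would slip through, which is why phrase discovery matters.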