A Case Study with the StrongREJECT Benchmark
When we began studying jailbreak evaluations, we found a fascinating paper claiming that you could jailbreak frontier LLMs simply by translating forbidden prompts into obscure languages. Excited by this result, we attempted to reproduce it and found something unexpected.