Home ยป Anthropic Unveils Innovative Jailbreak Protection System Challenge with Lucrative Rewards

Anthropic Unveils Innovative Jailbreak Protection System Challenge with Lucrative Rewards

Anthropic introduces a new defense system against jailbreaking in its latest artificial intelligence model called Constitutional Classifiers. This large-scale language model includes safety mechanisms to prevent malicious use of the model. However, humans can exploit certain vulnerabilities such as using extremely long prompts or writing input in unusual formats (e.g. uSiNg uNuSuAl cApItALiZaTiOn) to bypass the model’s protection and manipulate responses.

Due to the diverse forms of model vulnerabilities, Anthropic has developed a new mechanism to safeguard against model jailbreaking known as the Constitutional Classifiers (universal jailbreak). This technique evolved from the existing Constitutional AI used in Claude. The concept revolves around models having a “principle” or “constitution” outlining what they can and cannot respond to, such as being able to provide recipes for desserts but not for toxic gases.

In collaboration with Claude, Anthropic created a plethora of diverse prompts generated by humans for model exploitation, which were then translated into various languages for testing model vulnerabilities. The team further categorized these prompts and results into different classifiers to effectively block similar prompts. Additionally, they balanced the model’s inclination to over-refuse excessive questioning.

Anthropic expresses confidence in the Constitutional Classifiers system and invites the public to participate in testing model vulnerabilities. An enticing bug bounty reward of $15,000 awaits those who can deceive the model into answering 10 dangerous questions. Despite inviting 183 experts from various fields to partake in over 3,000 hours of collective hacking attempts, no successful breach has occurred yet.

TLDR: Anthropic launches Constitutional Classifiers to prevent model jailbreaking, offering a bug bounty reward for those who can trick the model, with no successful attempts thus far.

More Reading

Post navigation

Leave a Comment

Leave a Reply

Your email address will not be published. Required fields are marked *