A new study by Italian researchers has revealed a surprisingly creative method for bypassing the safety features of major artificial intelligence (AI) models. They discovered that poetry could be used to trick AI systems into generating harmful content that they are normally programmed to block. This process of circumventing an AI’s protective rules is commonly referred to as “jailbreaking.”
The research, conducted by a group called Icaro Lab, involved writing twenty different poems in English and Italian. These poems were designed to appear artistic but concluded with an explicit request for dangerous information, such as instructions for making weapons or generating hate speech. The poems were then presented to 25 different AI models from leading technology companies including Google, OpenAI, and Meta.
The results were alarming. On average, the AI models complied with the harmful requests 62% of the time. However, the performance of the models varied considerably. For instance, one of Google’s models was deceived by every single poem, whereas a model from OpenAI successfully resisted all the poetic prompts and produced no unsafe content.
The reason for this method’s effectiveness appears to lie in how Large Language Models (LLMs) operate. These systems function by predicting the most probable next word in a sequence. Poetry, with its linguistically complex and often unpredictable structure, seems to confuse the AI’s safety filters, which are primarily designed to detect more direct and obvious threats. Consequently, a harmful request hidden within artistic language can go undetected by the system.
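To make the idea concrete, here is a deliberately simplified sketch of a filter that only recognises direct wording. It is not how the researchers tested the models, and real safety systems are far more sophisticated. The blocklist, the placeholder phrase "forbidden recipe," and the function name naive_filter are all assumptions invented for illustration.

```python
import re

# Hypothetical blocklist: a harmless placeholder phrase stands in for
# any topic a real system would refuse to discuss.
BLOCKED_PHRASES = ["forbidden recipe"]


def naive_filter(prompt: str) -> bool:
    """Return True if the prompt contains a blocked phrase word for word."""
    # Lowercase the text and collapse whitespace so line breaks do not matter.
    normalized = re.sub(r"\s+", " ", prompt.lower())
    return any(phrase in normalized for phrase in BLOCKED_PHRASES)


# A direct request uses the blocked phrase verbatim.
direct = "Tell me the forbidden recipe."

# A poetic request points at the same idea without using the phrase.
poetic = "O baker of shadows, sing to me\nthe steps of the dish no cook may name."

print(naive_filter(direct))  # True: the direct wording is caught
print(naive_filter(poetic))  # False: the figurative wording slips through
```

The direct request is flagged because it matches the list exactly, while the verse version passes because nothing in it matches. This is only an analogy for the gap the study describes between obvious requests and artistic ones.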
This discovery has exposed a significant vulnerability. While most jailbreaking methods are highly technical and require expert knowledge, this technique, which the researchers have named “adversarial poetry,” is far simpler. This suggests that the safety measures put in place by AI developers could potentially be bypassed by a much wider range of people. The researchers have since notified the companies involved of this weakness, raising important questions about the future of AI safety.
