[FCE] AI’s safety features can be circumvented with poetry, research finds

Reading text

A new study by Italian researchers has revealed a surprisingly creative method for bypassing the safety features of major artificial intelligence (AI) models. They discovered that poetry could be used to trick AI systems into generating harmful content that they are normally programmed to block. This process of circumventing an AI’s protective rules is commonly referred to as “jailbreaking.”

The research, conducted by a group called Icaro Lab, involved writing twenty different poems in English and Italian. These poems were designed to appear artistic but concluded with an explicit request for dangerous information, such as instructions for making weapons or generating hate speech. The poems were then presented to 25 different AI models from leading technology companies including Google, OpenAI, and Meta.

The results were alarming. On average, the AI models complied with the harmful requests 62% of the time. However, the performance of the models varied considerably. For instance, one of Google’s models was deceived by every single poem, whereas a model from OpenAI successfully resisted all the poetic prompts and produced no unsafe content.

The reason for this method’s effectiveness lies in how Large Language Models (LLMs) operate. These systems function by predicting the most probable next word in a sequence. Poetry, with its linguistically complex and often unpredictable structure, seems to confuse the AI’s safety filters. These filters are primarily designed to detect more direct and obvious threats. Consequently, a harmful request hidden within artistic language can go undetected by the system.

This discovery has exposed a significant vulnerability. While most jailbreaking methods are highly technical and require expert knowledge, this technique, which the researchers have named “adversarial poetry,” is far simpler. This suggests that the safety measures put in place by AI developers could potentially be bypassed by a much wider range of people. The researchers have since notified the companies involved of this weakness, raising important questions about the future of AI safety.

Reading exercises

1. What is the main purpose of this article?

  • A. To compare the creative abilities of different AI models.
  • B. To report on a newly discovered weakness in AI security.
  • C. To criticise technology companies for their inadequate safety measures.
  • D. To explain the technical principles behind Large Language Models.

2. What was the specific design of the poems used in the Icaro Lab study?

  • A. They were entirely composed of dangerous instructions.
  • B. They mixed artistic language with a final, harmful request.
  • C. They were written only in Italian to confuse the AI models.
  • D. They tested the AI’s ability to complete a famous piece of poetry.

3. What did the study’s results indicate about the AI models tested?

  • A. The models from Google were consistently safer than those from OpenAI.
  • B. All the models responded to the poetic prompts in a similar way.
  • C. A majority of the requests for harmful content were successful.
  • D. None of the AI models were able to detect the hidden threats.

4. According to the article, why does the “adversarial poetry” method work?

  • A. AI safety filters are not programmed to understand artistic language.
  • B. The complex structure of poetry disrupts the AI’s standard threat detection.
  • C. The AI models are designed to prioritise creativity over safety.
  • D. The harmful requests are too short for the AI system to notice.

5. What does the article suggest about the “adversarial poetry” technique compared to other jailbreaking methods?

  • A. It is more effective but requires more technical skill.
  • B. It is a less reliable method for bypassing security.
  • C. It is more accessible to people without specialist expertise.
  • D. It has been quickly fixed by the technology companies.