Nov 21, 2025 15:00:00

Research shows that 'poetry' is effective for attacking large language models

Poetry is a literary art form that conveys not only the surface meaning of words but also their feel, rhythm, and aesthetic qualities. A study published on the preprint server arXiv shows that phrasing requests as poetry, which can be difficult to interpret, makes attacks on large language models more likely to succeed.

[2511.15304] Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
https://arxiv.org/abs/2511.15304

Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
https://arxiv.org/html/2511.15304v2

In his book 'The Republic,' Plato advocated the 'banishment of poets,' arguing that poets who publish poor-quality works without having studied philosophy or acquired knowledge should be expelled. In his view, an influx of creative works that merely stimulate emotion and pleasure would corrupt a healthy state of mind and lead to the ruin of groups and nations.

Because modern social systems increasingly rely on large language models, a research team from Sapienza University of Rome in Italy investigated whether poetry could be used to attack them.

The research team hypothesized that poetic expression could function as a general-purpose jailbreak operator: by rewriting harmful instructions as poetry, an attacker could circumvent the constraints implemented to prevent harmful behavior.

To test this hypothesis, the research team ran experiments on large language models from Google, OpenAI, Anthropic, DeepSeek, Qwen, Mistral AI, Meta, xAI, and Moonshot AI.

In the experiment, 1,200 harmful prompts from the benchmark of MLCommons, an organization that measures the safety and accuracy of AI technology, were converted into poems using a standardized meta-prompt. The team then compared the attack success rate of the original prompts with that of the converted prompts.

For safety reasons, the research team did not disclose in the paper exactly how they converted the text into poetry, but they did say that they embedded instructions related to specific scenarios through metaphors, imagery, and narrative framing rather than stating them directly.

The results showed that the poetic prompts had a significantly higher attack success rate than the baseline prompts. Across all prompts, the baseline attack success rate averaged 8.08%, while the prompts converted into poetry had a success rate of 43.7%. Furthermore, a set of 20 hand-crafted poems was reported to achieve an average success rate of 62%.
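To put the two average success rates reported above side by side, the short Python sketch below computes the gap in percentage points and as a ratio. It is only an illustration: the function names `attack_success_rate` and `compare` are hypothetical, and it assumes that an attack success rate is simply the share of prompts whose response was judged unsafe, which may differ from how the paper scored its results.

```python
def attack_success_rate(outcomes):
    """Share of prompts judged as a successful attack.

    `outcomes` is a list of booleans (True = judged unsafe).
    Assumption for illustration; not the paper's scoring code.
    """
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


def compare(baseline_asr, poetic_asr):
    """Return (increase in percentage points, ratio poetic/baseline)."""
    increase_pp = (poetic_asr - baseline_asr) * 100
    ratio = poetic_asr / baseline_asr
    return increase_pp, ratio


# Average rates as quoted in the article.
baseline_asr = 0.0808
poetic_asr = 0.437

increase_pp, ratio = compare(baseline_asr, poetic_asr)
print(f"Increase: {increase_pp:.2f} percentage points")
print(f"Ratio: {ratio:.1f}x the baseline")
```

With the figures quoted in the article, this works out to an increase of roughly 35.6 percentage points, or about 5.4 times the baseline rate.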

'These findings demonstrate that stylistic diversity alone can circumvent modern safety mechanisms, suggesting fundamental limitations in current alignment methods and assessment protocols,' the researchers said.

Nov 21, 2025 15:00:00 in AI, Security, Posted by log1h_ik