Here's what happens when you run the AI benchmark 'Draw a Pelican on a Bicycle' on Llama 3.3 70B or GPT-4.1

There are many benchmarks for measuring AI performance, but one of the more unusual is 'Draw a Pelican on a Bicycle,' devised by engineer Simon Willison. In a keynote at the AI Engineer World's Fair in June 2025, Willison presented his latest 'Pelican on a Bicycle' results.
The last six months in LLMs, illustrated by pelicans on bicycles
The results of the 'Pelican on a Bicycle' benchmark that Willison compiled in December 2024 are covered in the following article.
When you try the benchmark to draw 'Pelican on a bicycle' in SVG format on GPT-4o or Google Gemini, it looks like this - GIGAZINE

Six months later, Willison presented new data in his keynote at the AI Engineer World's Fair in San Francisco. The first new result came from Amazon's Nova, released in November 2024. Nova comes in three models, of which Nova-micro is the cheapest model that Willison tracks. Unfortunately, it does not seem to be good at drawing pelicans.

'Llama 3.3 70B,' released in December 2024, is the final model in Meta's Llama 3 series. Meta claims that it performs on par with its largest model, 'Llama 3.1 405B.' On this benchmark, however, the gap is significant: Llama 3.1 405B drew something that looked like a bicycle, while Llama 3.3 70B produced something that was neither a bicycle nor a pelican.

DeepSeek released a new model at Christmas. Trained at a reported cost of $5.5 million (about 800 million yen), it drew a bird and a bicycle, although the bird was not a pelican.

DeepSeek-R1, released in early 2025, drew the pelican better still. Its bicycle is recognizable as a bicycle at a glance.
In February 2025, Anthropic released 'Claude 3.7 Sonnet,' whose 'pelican on a bicycle' can only be described as 'stunning.'

Here is what each model in OpenAI's GPT-4.1 family drew. The bicycles drawn by the nano and mini models are somewhat misshapen.

And most recently, in May 2025, the results look like this. 'Claude Sonnet 4' drew a bird that could be either a pelican or a duck riding a bicycle. And 'gemini-2.5-pro-preview-05-06' succeeded in drawing an unmistakable pelican.

'I started my benchmark as a joke, but it's actually starting to become a bit useful,' Willison said. 'Unless the big AI labs catch on, I think the Pelican benchmark will continue to be useful for a while.'
in Note, Posted by logc_nt