A website where GPT-4o-mini, Claude 3.7 Sonnet, and DeepSeek-R1 play werewolf games has been released. What is the strongest werewolf AI?



In recent years, many AI companies have released large language models (LLMs) capable of human-like conversation. The results of having these models play a Werewolf-style game, in which conversation is crucial, have been made public, revealing how each model performs.

LLM Mafia Game Competition
https://mafia.opennumbers.xyz/

AI bots now play Mafia with each other on public website, and almost all of them are terrible at it | Tom's Hardware
https://www.tomshardware.com/tech-industry/artificial-intelligence/ai-bots-can-now-play-mafia-with-each-other-and-almost-all-of-them-are-terrible-at-it

Developer Guzus had large language models such as 'claude-3.7-sonnet', 'deepseek-chat', and 'llama-3.3-70b-instruct' play the Werewolf-style game 'Mafia' against each other in eight-player games. Each player is assigned one of three roles, 'villager', 'doctor', or 'mafia'; a game has five villagers, one doctor, and two mafia members. The game proceeds in turns, each representing one day, and on each turn the players try to work out who the mafia members are and expel a suspect. As each turn passes, the mafia side can kill one villager, and the doctor can protect one player of their choice from the mafia side's attack. The villagers win if they expel all the mafia members, and the mafia side wins if it kills all the villagers.
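The rules above can be sketched as a small simulation. This is only an illustration, not the code from Guzus's repository: random choices stand in for the LLM dialogue, and several details are assumptions made here (the day vote happens before the night kill, ties are broken at random, the mafia never target their own members). The names `ROLES`, `winner`, and `play_game` are hypothetical.

```python
import random
from collections import Counter

# 8 players: 2 mafia, 1 doctor, 5 villagers, as described in the article.
ROLES = ["mafia"] * 2 + ["doctor"] + ["villager"] * 5


def winner(alive):
    """Return 'villagers', 'mafia', or None if the game is still running."""
    mafia_left = sum(1 for role in alive.values() if role == "mafia")
    if mafia_left == 0:
        return "villagers"  # every mafia member has been expelled
    if mafia_left == len(alive):
        return "mafia"  # no non-mafia players remain
    return None


def play_game(rng):
    """Simulate one game with players who choose at random (no LLM involved)."""
    roles = ROLES[:]
    rng.shuffle(roles)
    alive = dict(enumerate(roles))  # player id -> role

    while True:
        # Day (assumed to come first): each living player votes for someone
        # else, and the player with the most votes is expelled.
        votes = Counter(
            rng.choice([p for p in alive if p != voter]) for voter in alive
        )
        top = max(votes.values())
        expelled = rng.choice(sorted(p for p, n in votes.items() if n == top))
        del alive[expelled]
        result = winner(alive)
        if result:
            return result

        # Night: the mafia pick a non-mafia target; the doctor, if still
        # alive, protects one player, which cancels the kill on a match.
        target = rng.choice([p for p, role in alive.items() if role != "mafia"])
        doctor_alive = "doctor" in alive.values()
        protected = rng.choice(list(alive)) if doctor_alive else None
        if target != protected:
            del alive[target]
        result = winner(alive)
        if result:
            return result


if __name__ == "__main__":
    rng = random.Random(0)
    print(Counter(play_game(rng) for _ in range(1000)))
```

Running the script prints how often each side wins when nobody reasons at all, which gives a rough baseline against which persuasive or deceptive play could be compared.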

Because the game inherently involves players who deceive and players who are deceived, dialogue is very important. Guzus poses the question, 'Which AI is the best Mafia player?'




Below are the results for each large language model. Claude 3.7 Sonnet in extended mode performed best, with the highest overall win rate (57.78%) and a 100% win rate when playing on the mafia side.

Model | Number of plays | Overall win rate | Mafia win rate | Villager win rate | Doctor win rate
claude-3.7-sonnet (extended mode) | 45 | 57.78% | 100.00% | 37.04% | 50.00%
deepseek-chat | 56 | 50.00% | 88.24% | 31.03% | 40.00%
claude-3.7-sonnet (standard mode) | 54 | 46.30% | 92.86% | 32.35% | 16.67%
claude-3.5-sonnet | 47 | 44.68% | 90.00% | 36.67% | 14.29%
llama-3.3-70b-instruct | 65 | 44.62% | 72.73% | 30.00% | 30.77%
mistral-small-24b-instruct-2501 | 65 | 44.62% | 80.00% | 30.30% | 25.00%
gpt-4o-mini | 71 | 42.25% | 82.61% | 27.50% | 0.00%
gemini-flash-1.5-8b | 68 | 41.18% | 82.35% | 22.50% | 45.45%
gemini-2.0-flash-001 | 72 | 40.28% | 80.00% | 31.91% | 20.00%
gemini-2.0-flash-lite-001 | 71 | 39.44% | 77.78% | 29.55% | 11.11%
gpt-4o | 49 | 38.78% | 90.00% | 24.24% | 33.33%
llama-3.1-70b-instruct | 55 | 38.18% | 66.67% | 26.47% | 33.33%
minimax-01 | 59 | 37.29% | 56.25% | 35.14% | 0.00%
deepseek-r1 | 22 | 36.36% | 62.50% | 23.08% | 0.00%
gemini-flash-1.5 | 73 | 35.62% | 66.67% | 25.00% | 12.50%
hermes-3-llama-3.1-405b | 57 | 35.09% | 60.00% | 20.00% | 57.14%
l3-euryale-70b | 25 | 32.00% | 66.67% | 25.00% | 50.00%
mythomax-l2-13b | 61 | 31.15% | 45.45% | 28.21% | 27.27%
deepseek-r1-distill-llama-70b | 51 | 29.41% | 57.14% | 10.71% | 44.44%
wizardlm-2-8x22b | 65 | 26.15% | 41.67% | 23.40% | 16.67%
mistral-nemo | 17 | 17.65% | 40.00% | 10.00% | 0.00%
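As a rough illustration of how figures like these can be derived, the sketch below computes overall and per-role win rates from per-game records. It assumes each rate is simply wins divided by games played in that role; the article does not describe the site's exact calculation, so this is an assumption. The function name `win_rates` and the sample records are hypothetical and are not taken from the leaderboard.

```python
from collections import defaultdict


def win_rates(records):
    """Compute win rates in percent.

    records: iterable of (model, role, won) tuples, one per game played.
    Returns a dict keyed by (model, "overall") and (model, role).
    """
    plays = defaultdict(int)
    wins = defaultdict(int)
    for model, role, won in records:
        # Count each game once overall and once under the role played.
        for key in ((model, "overall"), (model, role)):
            plays[key] += 1
            wins[key] += int(won)
    return {key: 100.0 * wins[key] / plays[key] for key in plays}


# Hypothetical records for an imaginary model (not real leaderboard data).
sample = [
    ("example-model", "mafia", True),
    ("example-model", "mafia", True),
    ("example-model", "villager", False),
    ("example-model", "doctor", True),
]
rates = win_rates(sample)
print(f"{rates[('example-model', 'overall')]:.2f}%")  # 75.00%
```

Under the same assumption, an overall win rate of 57.78% over 45 plays is consistent with 26 wins, since 26 / 45 is about 57.78%. Rates for roles a model played only a handful of times, such as doctor, rest on very few games and should be read with caution.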


Guzus has also published the AI players' dialogue logs for each game.

Looking ahead, Guzus plans to add Mafia games that pit humans against large language models, expand to other games such as poker, add the ability to watch ongoing games in real time, and add more roles.




The source code for running Mafia games with large language models is available on GitHub.

GitHub - guzus/llm-mafia-game: Which LLM is the best mafia game player?
https://github.com/guzus/llm-mafia-game

in Software, Posted by log1r_ut