Mar 10, 2025 14:00:00

A website where GPT-4o-mini, Claude 3.7 Sonnet, DeepSeek-R1, and other models play Werewolf against each other has launched. Which AI is the strongest Werewolf player?

In recent years, many AI companies have released large language models capable of human-like conversation. The results of having these models play a Werewolf-style game, in which conversation is crucial, have now been made public, revealing how each model performs.

LLM Mafia Game Competition
https://mafia.opennumbers.xyz/

AI bots now play Mafia with each other on public website, and almost all of them are terrible at it | Tom's Hardware
https://www.tomshardware.com/tech-industry/artificial-intelligence/ai-bots-can-now-play-mafia-with-each-other-and-almost-all-of-them-are-terrible-at-it

Developer Guzus had large language models such as 'claude-3.7-sonnet', 'deepseek-chat', and 'llama-3.3-70b-instruct' play 'Mafia', a Werewolf-style game for eight players, against one another. Each player is assigned one of three roles, 'villager', 'doctor', or 'mafia'; a game has five villagers, one doctor, and two mafia members. The game proceeds in turns, with one in-game day per turn, and each turn the players must work out who the mafia members are and expel a suspect. As each turn passes, the mafia side can kill one villager, and the doctor can protect one player of their choice from the mafia's attack. The villagers win if they expel all the mafia members, and the mafia side wins if it kills all the villagers.
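
The rules above can be sketched as a small simulation. This is only an illustrative sketch, not the code behind the website: the players here choose at random instead of debating, the function names `winner` and `play_game` are hypothetical, and several details are assumptions the article does not spell out (the day's expulsion happens before the night's kill, the doctor counts toward the village side, and the mafia win only once no non-mafia player is left).

```python
import random

# Role counts as described in the article: eight players in total.
ROLES = ["villager"] * 5 + ["doctor"] + ["mafia"] * 2


def winner(alive):
    """Return "village", "mafia", or None while the game is undecided."""
    mafia_left = sum(1 for role in alive.values() if role == "mafia")
    town_left = len(alive) - mafia_left
    if mafia_left == 0:
        return "village"  # every mafia member has been expelled
    if town_left == 0:
        return "mafia"  # assumption: mafia win once no non-mafia player remains
    return None


def play_game(rng):
    """Play one game with purely random decisions and return the winning side."""
    roles = ROLES[:]
    rng.shuffle(roles)
    alive = dict(enumerate(roles))  # player id -> role

    while True:
        # Day: one living player is expelled (random stand-in for the debate).
        del alive[rng.choice(sorted(alive))]
        result = winner(alive)
        if result:
            return result

        # Night: the mafia pick a non-mafia target; the doctor, if still
        # alive, protects one living player (possibly themselves).
        targets = sorted(p for p, role in alive.items() if role != "mafia")
        target = rng.choice(targets)
        doctor_alive = "doctor" in alive.values()
        protected = rng.choice(sorted(alive)) if doctor_alive else None
        if target != protected:
            del alive[target]
        result = winner(alive)
        if result:
            return result


rng = random.Random(0)  # fixed seed so repeated runs give the same games
results = [play_game(rng) for _ in range(1000)]
print("mafia win share with random play:", results.count("mafia") / len(results))
```

Replacing the random choices with model-generated accusations and decisions is where the actual competition happens; the loop and the win check stay the same.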

Because of the nature of the game, some players deceive and others are deceived, so dialogue is very important. Guzus posed the question, 'Which AI is the best Mafia player?'

Which AI is the best mafia (werewolf) game player?

You can see the whole script of LLMs playing mafia games.

They deceive, debate, and kill each other to win.

link below pic.twitter.com/vfR47nLrrY
— guzus (@uncanny_guzus) March 3, 2025

Below are the results for each large language model. The top performer was Claude 3.7 Sonnet in extended mode, which had the highest overall win rate at 57.78% and won 100% of its games on the mafia side.

Model	Games played	Overall win rate	Mafia win rate	Villager win rate	Doctor win rate
claude-3.7-sonnet (extended mode)	45	57.78%	100.00%	37.04%	50.00%
deepseek-chat	56	50.00%	88.24%	31.03%	40.00%
claude-3.7-sonnet (standard mode)	54	46.30%	92.86%	32.35%	16.67%
claude-3.5-sonnet	47	44.68%	90.00%	36.67%	14.29%
llama-3.3-70b-instruct	65	44.62%	72.73%	30.00%	30.77%
mistral-small-24b-instruct-2501	65	44.62%	80.00%	30.30%	25.00%
gpt-4o-mini	71	42.25%	82.61%	27.50%	0.00%
gemini-flash-1.5-8b	68	41.18%	82.35%	22.50%	45.45%
gemini-2.0-flash-001	72	40.28%	80.00%	31.91%	20.00%
gemini-2.0-flash-lite-001	71	39.44%	77.78%	29.55%	11.11%
gpt-4o	49	38.78%	90.00%	24.24%	33.33%
llama-3.1-70b-instruct	55	38.18%	66.67%	26.47%	33.33%
minimax-01	59	37.29%	56.25%	35.14%	0.00%
deepseek-r1	22	36.36%	62.50%	23.08%	0.00%
gemini-flash-1.5	73	35.62%	66.67%	25.00%	12.50%
hermes-3-llama-3.1-405b	57	35.09%	60.00%	20.00%	57.14%
l3-euryale-70b	25	32.00%	66.67%	25.00%	50.00%
mythomax-l2-13b	61	31.15%	45.45%	28.21%	27.27%
deepseek-r1-distill-llama-70b	51	29.41%	57.14%	10.71%	44.44%
wizardlm-2-8x22b	65	26.15%	41.67%	23.40%	16.67%
mistral-nemo	17	17.65%	40.00%	10.00%	0.00%
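
To get a feel for how many games stand behind each percentage, the win counts can be recovered from the table. The sketch below assumes that the overall win rate is simply wins divided by games played, which the article does not state explicitly; the helper name `implied_wins` is hypothetical, and only three rows are copied from the table for brevity.

```python
# (model, games played, overall win rate in percent), copied from the table above
ROWS = [
    ("claude-3.7-sonnet (extended mode)", 45, 57.78),
    ("deepseek-r1", 22, 36.36),
    ("mistral-nemo", 17, 17.65),
]


def implied_wins(plays, win_rate_pct):
    """Recover the whole-number win count, assuming rate = wins / plays."""
    return round(plays * win_rate_pct / 100)


for model, plays, pct in ROWS:
    wins = implied_wins(plays, pct)
    # Recompute the percentage from the integer count as a consistency check.
    print(f"{model}: {wins} wins in {plays} games ({100 * wins / plays:.2f}%)")
```

Under that assumption, the percentages for models with few games rest on a handful of wins, so small differences between neighboring rows should be read with some caution.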

Guzus has also published the AIs' dialogue history for each game.

Looking ahead, Guzus plans to add a mode in which humans play Mafia against large language models, expand to other games such as poker, add real-time viewing of ongoing games, and add more roles.

github repository revealing soon.

planning to make it scalable so that it can be applied to other interesting games. could be developed to generate a movie script someday
— guzus (@uncanny_guzus) March 3, 2025

The source code for running Mafia with large language models is available on GitHub.

GitHub - guzus/llm-mafia-game: Which LLM is the best mafia game player?
https://github.com/guzus/llm-mafia-game

Mar 10, 2025 14:00:00 in Software, Posted by log1r_ut