Suspicions of benchmark fraud emerge over Meta's AI model 'Llama 4'; Meta flatly denies them as 'unfounded'

Meta's benchmarks for its new AI models are a bit misleading | TechCrunch
https://techcrunch.com/2025/04/06/metas-benchmarks-for-its-new-ai-models-are-a-bit-misleading/

Meta exec denies the company artificially boosted Llama 4's benchmark scores | TechCrunch
https://techcrunch.com/2025/04/07/meta-exec-denies-the-company-artificially-boosted-llama-4s-benchmark-scores/
Meta defends Llama 4 release against 'reports of mixed quality,' blames bugs | VentureBeat
https://venturebeat.com/ai/meta-defends-llama-4-release-against-reports-of-mixed-quality-blames-bugs/
Llama 4: Did Meta just push the panic button?
https://www.interconnects.ai/p/llama-4
Meta's 'Llama 4,' announced on April 5, 2025, is a natively multimodal model designed from the start to handle multiple information formats, such as text, images, and video, in an integrated manner. It also adopts a Mixture-of-Experts (MoE) architecture, which activates only the specialized sub-networks best suited to each input, called 'experts,' maintaining high performance while avoiding wasted compute. In addition, a new position-embedding scheme called 'iRoPE' (interleaved RoPE, a variant of Rotary Position Embeddings) is used to mitigate accuracy degradation when processing very long contexts.
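The core idea of MoE routing described above, where a router picks a few experts per token so compute stays low even as total parameters grow, can be sketched as follows. This is a minimal illustration under assumed dimensions and a simple top-k softmax gate, not Meta's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_forward(x, gate_w, experts, k=2):
    """Route one token vector x to its top-k experts by gate score.

    x       : (d,) token representation
    gate_w  : (d, n_experts) router weights
    experts : list of n_experts weight matrices, each (d, d)

    Only the k selected experts actually run, so per-token compute
    is roughly constant regardless of the total number of experts.
    """
    scores = x @ gate_w                 # (n_experts,) router logits
    top = np.argsort(scores)[-k:]       # indices of the k highest-scoring experts
    # Softmax over the selected experts' scores only
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()
    # Weighted sum of the chosen experts' outputs
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

# Toy sizes for illustration (real models use far larger dimensions)
d, n_experts = 8, 4
gate_w = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
x = rng.normal(size=d)
y = moe_forward(x, gate_w, experts, k=2)
print(y.shape)  # (8,)
```

With k=2 out of 4 experts here, half the expert weights are untouched for this token; production MoE models apply the same idea with many more experts and added load-balancing losses to keep routing spread out.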
In particular, Llama 4 Scout and Llama 4 Maverick, each with 17 billion active parameters, are reported to match the accuracy of competing models while using fewer computing resources: Scout against the likes of Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1, and Maverick against GPT-4o and DeepSeek-V3.
Meta releases next-generation multimodal AI 'Llama 4', boasts high performance comparable to competing models by adopting MoE architecture - GIGAZINE

On the other hand, there are suspicions that the models were trained on benchmark test sets to inflate their scores, and that the version of Llama 4 evaluated on the AI evaluation platform LM Arena differs from the one publicly released.
"Serious issues in Llama 4 training. I Have Submitted My Resignation to GenAI"
by u/rrryougi in LocalLLaMA
AI researcher and author Andriy Burkov criticized Llama 4 Scout, which claims to support a very long context window of 10 million tokens, saying that sending it more than 256,000 tokens produces very low-quality output.
I will save you reading time about Llama 4.
— Andriy Burkov (@burkov) April 5, 2025
The declared 10M context is virtual because no model was trained on prompts longer than 256k tokens. This means that if you send more than 256k tokens to it, you will get low-quality output most of the time.
And even if your problem…
In addition, users on the message board site Reddit reported that when Llama 4 was given the coding task of 'simulating a ball bouncing inside a rotating heptagon,' it performed worse than DeepSeek-V3.
'It's a big problem that Meta doesn't publish the models it uses to create its marketing pitches,' said Nathan Lambert, a former Meta researcher and senior research scientist at the Allen Institute for Artificial Intelligence.
Ahmad Al-Dahle, vice president of generative AI at Meta, said: 'We've heard claims that Llama 4 was trained on test sets, but this is completely unfounded. We believe the variations in quality that some users have reported stem from implementations that still need to be stabilized.'
We're glad to start getting Llama 4 in all your hands. We're already hearing lots of great results people are getting with these models.
— Ahmad Al-Dahle (@Ahmad_Al_Dahle) April 7, 2025
That said, we're also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were…
Al-Dahle also claimed that 'some users are confusing Llama 4 Maverick with Llama 4 Scout across the various cloud providers hosting the model,' and added that because the models were released as soon as they were ready, it will take a few days for public implementations to be dialed in, and that Meta will continue to work on Llama 4 by fixing bugs and onboarding partners.
in Software, Posted by log1r_ut