Introducing 'Qwen3-Next,' a cost-effective AI model that runs more than 10 times faster than previous models while achieving equal or better performance.



An AI research team at Alibaba, the major Chinese technology company, has developed a low-cost, high-performance AI model called 'Qwen3-Next' and released it for free. Qwen3-Next was trained at less than one-tenth the cost of previous models, and its inference runs more than ten times faster when the input contains a large number of tokens. Despite this, its performance is reportedly equal to or better than previous models, and in some tests it even outperforms Google's Gemini-2.5-Flash-Thinking.

Qwen3-Next: Towards Ultimate Training & Inference Efficiency

https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-list

The base model of Qwen3-Next, 'Qwen3-Next-80B-A3B-Base,' uses a Mixture of Experts (MoE) architecture, which combines multiple expert sub-networks. Unlike conventional Transformer designs, it interleaves Gated DeltaNet layers (a linear-attention mechanism) and Gated Attention layers in a 3:1 ratio, achieving both high performance and low training cost.
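To make the 3:1 interleaving concrete, here is a minimal PyTorch sketch of how such a hybrid layer stack could be wired up. The GatedDeltaNetLayer and GatedAttentionLayer classes below are simplified placeholders invented for illustration, not Qwen's actual implementation; only the interleaving pattern reflects the article.

# Minimal sketch of a 3:1 hybrid stack. The two layer classes are
# simplified placeholders, not Qwen's real Gated DeltaNet / Gated Attention.
import torch
import torch.nn as nn

class GatedDeltaNetLayer(nn.Module):
    """Placeholder standing in for a linear-attention (Gated DeltaNet) block."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
    def forward(self, x):
        return x + torch.tanh(self.proj(x))  # stand-in computation

class GatedAttentionLayer(nn.Module):
    """Placeholder standing in for a gated full self-attention block."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(dim, dim)
    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return x + out * torch.sigmoid(self.gate(x))

class HybridStack(nn.Module):
    """Every 4th layer is full attention; the other three are DeltaNet (3:1)."""
    def __init__(self, dim=1024, num_layers=48):
        super().__init__()
        self.layers = nn.ModuleList(
            GatedAttentionLayer(dim) if (i + 1) % 4 == 0 else GatedDeltaNetLayer(dim)
            for i in range(num_layers)
        )
    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

x = torch.randn(2, 16, 1024)
print(HybridStack()(x).shape)  # torch.Size([2, 16, 1024])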



Qwen3-Next-80B-A3B-Base can be trained 10.7 times faster than Qwen3-32B, and yet its benchmark scores are higher than Qwen3-32B's.



Qwen3-Next-80B-A3B-Base is an 80-billion-parameter model containing 512 expert sub-networks, but only about 3 billion parameters are active during actual inference. Its processing speed is dramatically improved over previous models: for a 32,000-token input, the phase up to the first output token (the prefill phase) is 10.6 times faster than Qwen3-32B, and the subsequent token generation (the decode phase) is 10 times faster.
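The following generic sparse-MoE routing sketch illustrates why only a small fraction of the parameters is active per token: a router picks a handful of the 512 experts for each token, and the remaining experts never run. The layer sizes and the top_k value here are illustrative assumptions, not Qwen's published configuration.

# Generic top-k mixture-of-experts routing sketch. 512 experts matches the
# article; dim, hidden, and top_k are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, dim=64, hidden=128, num_experts=512, top_k=10):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (tokens, dim)
        scores = self.router(x)                          # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only the selected experts execute, so most parameters stay inactive.
        for t in range(x.size(0)):
            for w, e in zip(weights[t], idx[t]):
                out[t] += w * self.experts[int(e)](x[t])
        return out

moe = SparseMoE()
tokens = torch.randn(4, 64)
print(moe(tokens).shape)  # torch.Size([4, 64])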



Below is a graph comparing the time to output the first token across input lengths for three models: 'Qwen3-32B (white)', 'Qwen3-30B-A3B (green)', and 'Qwen3-Next-80B-A3B-Base (purple)'. Qwen3-Next-80B-A3B-Base's advantage does not shrink even as the number of input tokens grows.



The decode phase shows the same pattern: generation remains fast even when the input is long.
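As a reference for how these two phases are typically measured, here is a rough timing sketch using the Hugging Face transformers library: the latency of the first generated token is dominated by prefill, and the per-token rate afterwards reflects decode throughput. The tiny placeholder model is only there so the sketch runs anywhere; it is not one of the models discussed in the article.

# Rough prefill/decode timing split for any Hugging Face causal LM.
# "gpt2" is a small placeholder so the sketch runs on modest hardware.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; swap in any causal LM you can load
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

prompt = "word " * 500  # longish input so prefill time is visible
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    t0 = time.perf_counter()
    model.generate(**inputs, max_new_tokens=1)        # prefill + 1 token
    t_first = time.perf_counter() - t0                # ~ time to first token

    t0 = time.perf_counter()
    model.generate(**inputs, max_new_tokens=65)       # prefill + 65 tokens
    t_total = time.perf_counter() - t0

decode_tps = 64 / (t_total - t_first)  # steady-state tokens/sec estimate
print(f"time to first token: {t_first:.2f}s, decode: {decode_tps:.1f} tok/s")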



The table below summarizes results across various benchmarks for Qwen3-Next-80B-A3B-Base. Despite its low training cost and high speed, the model achieved scores equal to or higher than those of older models.



The development team fine-tuned Qwen3-Next-80B-A3B-Base to create 'Qwen3-Next-80B-A3B-Instruct' and 'Qwen3-Next-80B-A3B-Thinking,' both released for free at the link below.

Qwen3-Next - a Qwen Collection

https://huggingface.co/collections/Qwen/qwen3-next-68c25fd6838e585db8eeea9d
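As a minimal usage sketch, the released Instruct model can be loaded with the Hugging Face transformers library as shown below. Note that Qwen3-Next may require a recent transformers version, the 80-billion-parameter weights need substantial GPU memory or offloading, and the prompt is just an example.

# Minimal sketch: load the released Instruct model and generate a reply.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-Next-80B-A3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # spread the 80B weights across available devices
)

messages = [{"role": "user", "content": "Briefly explain Mixture of Experts."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))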



Below is a graph showing the performance of Qwen3-Next-80B-A3B-Instruct (red).



The reasoning model, Qwen3-Next-80B-A3B-Thinking (red), outperforms older models in all tests, and according to the development team it even beats Google's Gemini-2.5-Flash-Thinking on several benchmarks.



The development team is committed to improving the architecture for the upcoming Qwen 3.5 release.

Posted in AI, Software by log1o_hf