Alibaba releases 'Qwen3-Omni,' an AI model capable of real-time voice conversation, and 'Qwen3-VL,' an image recognition AI model with performance equivalent to GPT-5, as well as a large number of other language models and image editing models.



Qwen, the AI research team at Alibaba, a major China-based technology company, announced Qwen3-Omni, an AI model capable of responding in real time in natural language, on September 22, 2025. In the short period between September 22 and 24, the team also announced a number of other AI models, including Qwen3-VL, Qwen3-TTS, Qwen-Image-Edit-2509, Qwen3-LiveTranslate-Flash, and Qwen3-Max.

Qwen
https://qwen.ai/home

◆Qwen3-Omni
Qwen3-Omni is an AI model that can process text, images, audio, and video and respond in real time. In addition to responding in both text and audio, it offers strong multilingual capabilities: it can understand text in 119 languages, understand speech in 19 languages, and generate speech in 10 languages.

Qwen3-Omni: Natively Omni-Modal Foundation Models!
https://qwen.ai/blog?id=fdfbaf2907a36b7659a470c77fb135e381302028&from=research.research-list



Users can talk to Qwen3-Omni about what the camera is showing. You can see an example of Qwen3-Omni in action in the video below.

Qwen3-Omni: Natively Omni-Modal Foundation Models! - YouTube


The Qwen team has published benchmark results for Qwen3-Omni-Flash and Qwen3-Omni-30B-A3B, with Qwen3-Omni-Flash achieving scores equal to or better than GPT-4o and Gemini-2.5-Flash.



Qwen3-Omni-30B-A3B also outscores GPT-4o in most tests.



Each model of Qwen3-Omni is available at the following links:

Qwen3-Omni - a Qwen Collection
https://huggingface.co/collections/Qwen/qwen3-omni-68d100a86cd0906843ceccbe

◆Qwen3-VL
Qwen3-VL is a vision-language model with advanced image recognition capabilities that can understand the content of photos, app screenshots, and similar images. It also supports OCR in 32 languages.

Qwen3-VL: Sharper Vision, Deeper Thought, Broader Action
https://qwen.ai/blog?id=99f0335c4ad9ff6153e517418d48535ab6d8afef&from=research.latest-advancements-list



The benchmark results for Qwen3-VL-235B-A22B-Instruct are as follows. Although it is an open model, it outperforms Gemini-2.5-Pro and GPT-5 in many tests.



The reasoning model Qwen3-VL-235B-A22B-Thinking also outperformed Gemini-2.5-Pro and GPT-5.



As an example of how it works, the Qwen team presents 'accurate recognition of the names of Demon Slayer characters.'



The model data for Qwen3-VL is available at the following link:

Qwen3-VL - a Qwen Collection
https://huggingface.co/collections/Qwen/qwen3-vl-68d2a7c1b8a8afce4ebd2dbe

◆Qwen3-TTS
Qwen3-TTS is a speech generation model that supports 10 languages, including Japanese. Qwen3-TTS can translate input speech into other languages while preserving emotional expression. You can see an example of how it works, including translation into Japanese, in the video below.

Qwen3-TTS: Multi-timbre & Multi-lingual & Multi-dialect Speech Synthesis. - YouTube
https://www.youtube.com/watch?v=MC6s4TLwX0A

◆Qwen-Image-Edit-2509
Qwen-Image-Edit-2509 is an updated version of the image editing AI model 'Qwen-Image-Edit' with improved consistency when editing faces and products. Editing examples using Qwen-Image-Edit-2509 can be seen at the following link.

Qwen-Image-Edit-2509: Multi-Image Support, Improved Consistency
https://qwen.ai/blog?id=1675c295dc29dd31073e5b3f72876e9d684e41c6&from=research.research-list



◆Qwen3-LiveTranslate-Flash
Qwen3-LiveTranslate-Flash is a real-time speech interpretation model that supports 18 languages, including Japanese. It accepts not only speech but also visual cues such as lip movements and gestures as input, which improves speech recognition accuracy.

Qwen3‑LiveTranslate: Real‑Time Multimodal Interpretation — See It, Hear It, Speak It!
https://qwen.ai/blog?id=b2de6ae8555599bf3b87eec55a285cdf496b78e4&from=research.latest-advancements-list



In benchmark tests conducted by the Qwen team, Qwen3-LiveTranslate-Flash achieved higher scores than Gemini-2.5-Flash and GPT-4o-Audio-Preview.



◆Qwen3-Max
Qwen3-Max is the top-tier model in the Qwen3 series.

Qwen3-Max: Just Scale it
https://qwen.ai/blog?id=241398b9cd6353de490b0f82806c7848c5d2777d&from=research.latest-advancements-list



In 'Text Arena,' where humans rate the text generation of AI models in blind comparisons without knowing which model produced each response, Qwen3-Max ranked third, beating GPT-5-Chat. Qwen3-Max is currently available on Qwen Chat and is expected to be publicly available soon.



in Software, Video, Posted by log1o_hf