Alibaba's development team announces 'Qwen3-ASR-Flash,' a highly accurate automatic transcription AI model that supports 11 languages, including Japanese

The development team behind Alibaba's large language model 'Qwen' has announced 'Qwen3-ASR-Flash,' an automatic speech recognition model.
🎙️ Meet Qwen3-ASR — the all-in-one speech recognition model!
✅ High-accuracy EN/CN + 9 more languages: ar, de, en, es, fr, it, ja, ko, pt, ru, zh
✅ Auto language detection
✅ Songs? Raps? Voice with BGM? No problem. <8% WER
✅ Works in noise, low quality, far-field
✅ Custom… pic.twitter.com/eE1ucgYpVX
— Qwen (@Alibaba_Qwen) September 8, 2025
Qwen3 ASR: Hear clearly, transcribe smartly.
https://qwen.ai/blog?id=41e4c0f6175f9b004a03a07e42343eaaf48329e7&from=research.latest-advancements-list
Qwen3-ASR-Flash is a high-performance speech recognition service based on Qwen3-Omni and trained on a large amount of multimodal data, including tens of millions of hours of automatic speech recognition data.
According to the team, Qwen3-ASR-Flash recognizes speech well even in difficult conditions such as complex background noise or singing over music, and it handles multiple accents. It can also customize its recognition results based on a prompt entered by the user.
Qwen3-ASR-Flash supports 11 languages: Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, and Arabic. It also supports regional varieties, including British and American English and, for Chinese, Sichuanese, Minnan, and Cantonese in addition to Mandarin.
The graph below compares the automatic speech recognition error rates of 'Qwen3-ASR-Flash,' 'Gemini 2.5 Pro,' 'GPT-4o Transcribe,' 'Paraformer-v2,' and 'Doubao-ASR.' Qwen3-ASR-Flash is represented by the purple bars, which show low error rates across a wide range of audio: Chinese, Chinese Accent (Chinese dialects), English, Multilingual (multiple languages including Japanese), Entities (Chinese and English benchmarks), Lyrics (Chinese songs), Fullsong (full Chinese and English songs), AccentHard (audio with strong accents or noise), and LongMix (audio with a mixture of multiple languages).
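For readers unfamiliar with the metric, the sketch below shows how a word error rate (WER) like the one quoted in the announcement is commonly computed: the edit distance between the reference transcript and the recognized text, divided by the length of the reference. This is the standard textbook definition, not the Qwen team's evaluation code. The article does not say how the benchmark scores were normalized or tokenized, so the function names and the 'char' option (often used for languages written without spaces, such as Japanese and Chinese) are assumptions for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, r in enumerate(ref, 1):
        curr = [i]  # distance to an empty hypothesis
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """Error rate = edit distance / number of reference tokens.

    unit="word" splits on whitespace (WER); unit="char" compares
    individual characters with spaces removed (illustrative assumption).
    """
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    wer = error_rate("the cat sat on the mat", "the cat sat on a mat")
    print(f"{wer:.3f}")  # prints 0.167
```

Because insertions count as errors, the rate can exceed 100% when the recognizer outputs far more tokens than the reference contains.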

A demo version of Qwen3-ASR-Flash has also been released on Hugging Face, so I tried using it to transcribe some audio.
Qwen3 ASR Demo - a Hugging Face Space by Qwen
Drag and drop the Japanese audio file into the 'Upload Audio' field on the left side of the screen.

Select 'Auto Detect' in the language selection field and then click 'Start Recognition.'

Although the speech contained music and noise in the background, the words were recognized with high accuracy. However, there were some inaccuracies, such as 'Of course it's red' becoming 'Roman red.'
The API for Qwen3-ASR-Flash is available on Alibaba Cloud Model Studio.
Alibaba Cloud Model Studio Console
https://bailian.console.alibabacloud.com/?tab=doc#/doc/?type=model&url=2979031
in Software, Web Service, Posted by log1h_ik