Alibaba's development team announces 'Qwen3-ASR-Flash,' a highly accurate automatic transcription AI model that supports 11 languages, including Japanese



The development team behind Alibaba's large-scale language model '

Qwen ' has announced a new speech recognition AI called ' Qwen3-ASR-Flash .' Qwen-ASR-Flash supports 11 languages, including Japanese, and is said to be able to transcribe even songs with sound or audio mixed with background noise with high accuracy.



Qwen3 ASR: Hear clearly, transcribe smartly.
https://qwen.ai/blog?id=41e4c0f6175f9b004a03a07e42343eaaf48329e7&from=research.latest-advancements-list

Qwen-ASR-Flash is a high-performance speech recognition service based on Qwen3-Omni, built using a large amount of multimodal data, including tens of millions of hours of automatic speech recognition data.

Qwen-ASR-Flash is said to perform well in speech recognition, even with complex background noise and loud singing voices, and supports 11 languages and multiple accents. It also provides customized speech recognition results based on user-entered prompts.



Qwen-ASR-Flash supports 11 languages: Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean, and Arabic. It also supports various regional accents, including British English, American English, and Sichuan, Minnan, and Cantonese, in addition to Mandarin.

The graph below shows the error rates of automatic speech recognition for 'Qwen-ASR-Flash,' 'Gemini 2.5 Pro,' 'GPT-4 Transcribe,' 'Paraformer-v2,' and 'Doubao-ASR.' Qwen-ASR-Flash's performance is represented by purple bars, and it shows that it achieves low error rates across a wide range of audio, including Chinese, Chinese Accent (Chinese dialects), English, Multilingual (multiple languages including Japanese), Entities (Chinese and English benchmarks), Lyrics (Chinese songs), Fullsong (full Chinese and English songs), AccentHard (audio with strong accents or noise), and LongMix (audio with a mixture of multiple languages).



A demo version of Qwen-ASR-Flash was also released on Hugging Face, so I tried using it to transcribe text.

Qwen3 ASR Demo - a Hugging Face Space by Qwen

https://huggingface.co/spaces/Qwen/Qwen3-ASR-Demo

Drag and drop the Japanese audio file into the 'Upload Audio' field on the left side of the screen.



Select 'Auto Detect' in the language selection field and then click 'Start Recognition.'



Although the speech contained music and noise in the background, the words were recognized with high accuracy. However, there were some inaccuracies, such as 'Of course it's red' becoming 'Roman red.'



The API for Qwen-ASR-Flash is available on Alibaba Cloud Model Studio.

Alibaba Cloud Model Studio Console
https://bailian.console.alibabacloud.com/?tab=doc#/doc/?type=model&url=2979031

in Software,   Web Service, Posted by log1h_ik