Alibaba open-sources R1-Omni, an AI model that can read human emotions, and makes it available for download



Alibaba Group Holding has unveiled R1-Omni, an AI model that is said to read human emotions by analyzing movement and sound.

GitHub - HumanMLLM/R1-Omni

https://github.com/HumanMLLM/R1-Omni

Alibaba Releases AI Model That Reads Emotions to Take On OpenAI - Bloomberg
https://www.bloomberg.com/news/articles/2025-03-12/alibaba-releases-emotional-intelligence-model-to-rival-chatgpt

Alibaba's R1-Omni AI Model Expands the Frontier of Emotion Recognition - WinBuzzer
https://winbuzzer.com/2025/03/12/alibabas-r1-omni-ai-model-expands-the-frontier-of-emotion-recognition-xcxwbn/

R1-Omni is based on the open-source model 'HumanOmni-0.5B', created by Alibaba's Chief Researcher Jiaxing Zhao. Alibaba has open-sourced four models: HumanOmni-0.5B, the pre-trained model 'EMER-SFT', 'MAFW-DFEW-SFT' (a model trained on the MAFW and DFEW datasets), and the final model R1-Omni. The output of each model was compared on the same videos.

Below is the first demonstration video released by Alibaba.



The output of each model is as follows. The answer the developers expected was 'Anger.'

◆HumanOmni-0.5B: Fear.
◆EMER-SFT: In the video, a man in a brown jacket stands in front of a vivid mural. He frowns and has his mouth open to express dissatisfaction. His facial expressions and vocal characteristics suggest that he is experiencing intense emotional turmoil. The audio content includes the words 'fear' and 'lowering his voice,' suggesting that he is facing an emergency or expressing strong emotions during an emotionally tense moment. Overall, the analysis indicates that the man is experiencing a deep sense of fear and anxiety. Answer: 'fear.'
◆MAFW-DFEW-SFT: Anger.
◆R1-Omni: In the video, a man in a brown jacket stands in front of a vibrant mural. Under the brown jacket he wears a pink shirt and has dark, curly hair. His facial expression is complex, with large eyes, a slightly open mouth, raised eyebrows, and furrowed brows, revealing surprise and anger. Voice recognition technology reveals that his voice includes words such as 'you,' 'keep your voice down,' and 'nervous,' indicating strong emotion and agitation. Overall, he exhibits an emotional state of confusion, anger, and arousal. Answer: 'Anger.'

The second video is below.



The output of each model is as follows. The answer the developers expected was 'Happy.'

◆HumanOmni-0.5B: Happy
◆EMER-SFT: In the video, a woman is seen indoors in what appears to be a personal space or study room. She is wearing a red headband, a beige blazer over a white top, and her hair is tied back neatly. Her facial expression shows slight confusion and surprise, as if she has encountered something unexpected. As time passes, her emotions seem to calm down, but the specific reason for this is not explained. In the audio, there is a long silence between 'but' and 'it was,' indicating hesitation and anxiety. The subtitle reads, 'But it was interesting.' This sentence is likely expressing the person's confusion and surprise. From the video clues, the woman's slightly confused and surprised facial expression, as well as her overall appearance and facial expression, we can infer that this sentence expresses a hesitation or uncertain tone. Furthermore, the audio clue indicating a long silence between 'but' and 'it was' also indicates hesitation and uncertainty. Therefore, this sentence shows the woman's interest and curiosity about something, but also a certain hesitation and uncertainty. Answer: 'Surprise'
◆MAFW-DFEW-SFT: Happy
◆R1-Omni: In the video, the opening scene shows a woman smiling with her eyes slightly closed and her mouth slowly open. Her expression seems somewhat amused, which may indicate that she is experiencing some fun or enjoyable situation at that moment. In the audio, the sentences flow seamlessly, with no gaps between each sentence, and the tone is light and airy. Combined with the text content, it feels like the character is in a very happy and positive emotional state. The subtitle reads, 'It was fun.' This sentence likely expresses the woman's satisfaction or curiosity about something or someone. From the video clues of the woman's laughter and joy, and the audio clues of a continuous flowing, light and airy tone, it can be inferred that this sentence expresses the woman's interest or curiosity about something. The joy in this sentence is consistent with an overall sense of happiness and positivity. Answer: 'Happy.'

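As the outputs show, Alibaba has its models write out their reasoning before giving an answer. The models are trained with a technique called 'reinforcement learning with verifiable rewards (RLVR),' in which the reward is computed by automatically checking the output against rules rather than by a learned judge. This is said to allow for finer adjustments than simply checking the answers and correcting errors.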
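To make the idea of a verifiable reward concrete, here is a minimal Python sketch. It is not Alibaba's implementation. It assumes, in the style of DeepSeek-R1, that the model wraps its reasoning in <think> tags and its final label in <answer> tags, and that the reward is the sum of a format check and an answer check. The tag layout and the function names are illustrative assumptions, not details taken from the article.

```python
import re

# Assumed output layout (not stated in the article): reasoning inside
# <think>...</think>, followed by the final label inside <answer>...</answer>.
OUTPUT_PATTERN = re.compile(
    r"^\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$",
    re.DOTALL,
)


def normalize_label(text: str) -> str:
    """Lowercase a label and drop surrounding whitespace, quotes and periods."""
    return text.strip().strip("'\".").lower()


def format_reward(output: str) -> float:
    """1.0 if the output follows the assumed tag layout, else 0.0."""
    return 1.0 if OUTPUT_PATTERN.match(output) else 0.0


def accuracy_reward(output: str, expected: str) -> float:
    """1.0 if the extracted answer equals the expected emotion label, else 0.0."""
    match = OUTPUT_PATTERN.match(output)
    if match is None:
        return 0.0
    predicted = normalize_label(match.group(2))
    return 1.0 if predicted == normalize_label(expected) else 0.0


def verifiable_reward(output: str, expected: str) -> float:
    """Hypothetical combined reward: answer check plus format check."""
    return accuracy_reward(output, expected) + format_reward(output)


if __name__ == "__main__":
    good = "<think>He frowns and raises his voice.</think><answer>Anger.</answer>"
    wrong = "<think>He looks startled.</think><answer>Fear</answer>"
    print(verifiable_reward(good, "anger"))
    print(verifiable_reward(wrong, "anger"))
```

Because both checks are plain rules, the reward can be recomputed and audited for any output, which is what makes it 'verifiable' in this sketch.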

Incidentally, it is unclear whether the 'R1' in R1-Omni is a reference to the recently released 'DeepSeek-R1'. However, R1-Omni is said to have made training more efficient by using 'Group Relative Policy Optimization (GRPO),' the reinforcement learning method that was also used in DeepSeek-R1.
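The core idea of GRPO is to score each sampled response relative to the other responses generated for the same input, which removes the need for a separate value model. The Python sketch below shows only that group-relative scoring step, under stated assumptions: the function name and the example rewards are hypothetical, and using the population standard deviation with a small epsilon is one common choice rather than a detail confirmed for R1-Omni.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so a
    response is rated only in comparison with its siblings.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four responses sampled for one video.
    rewards = [2.0, 1.0, 0.0, 1.0]
    for reward, advantage in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Responses that beat the group average get a positive advantage and are reinforced, while below-average responses are discouraged. If every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.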

in Software, Posted by log1p_kr