Feb 05, 2025 10:40:00

In the 'final test of humanity,' where the highest answer accuracy was about 9%, OpenAI's Deep Research achieved over 26%

This article, originally posted in Japanese on 10:40 Feb 05, 2025, may contains some machine-translated parts.
If you would like to suggest a corrected translation, please click here.

It has been revealed that OpenAI's AI agent '

Deep research ' has already achieved a high score of 26.6% on the 'Humanity's Last Exam ,' an evaluation test that quantifies AI performance and is said to be the 'most difficult to date.' This is a 183% increase in the highest score in less than 10 days since the test was released.

OpenAI's Deep Research smashes records for the world's hardest AI exam, with ChatGPT o3-mini and DeepSeek left in its wake | TechRadar
https://www.techradar.com/computing/artificial-intelligence/openais-deep-research-smashes-records-for-the-worlds-hardest-ai-exam-with-chatgpt-o3-mini-and-deepseek-left-in-its-wake

'The Last Test of Humanity' is a benchmark packed with specialized problems from a wide range of fields, including mathematics, humanities, and natural sciences, and is a selection of questions posed by university professors and famous mathematicians. In the pre-release test, it was reported that even OpenAI's latest model at the time, ' o1 ,' which had high inference capabilities, was only able to score 8.3%.

The most difficult AI test ever, 'The Last Test for Humanity,' has been released, consisting of 3,000 multiple-choice and short-answer questions - GIGAZINE

However, since the last human test was made public, many models have begun to record scores higher than the above. For example, DeepSeek's inference model ' R1 ', which has performance comparable to 'o1', achieved an accuracy of 9.4%, while OpenAI's inference model ' o3-mini ', released about two weeks after the release of 'R1', achieved an accuracy of 10.5%, and 'o3-mini-high', which has a higher inference level, achieved an accuracy of 13%.

Following this, a new AI agent from OpenAI called 'Deep research' came in second with a score of 26.6%, beating out the competition.

Deep research is an AI agent that searches for information on the internet and makes inferences, which sets it apart from existing chat services that only process information that has been pre-programmed.

OpenAI announces that ChatGPT will have a 'Deep research' feature that allows it to collect online information - GIGAZINE

The score of 26.6% was announced by OpenAI, but technology media TechRadar points out that it is a bit unfair because it is comparing AI with and without search functionality. However, it is surprising that a higher score than existing models was recorded in an evaluation test shortly after its release, and it shows how quickly AI is advancing.

On the other hand, if new evaluation tests are solved too quickly, there is a risk that the original purpose of quantitatively evaluating AI performance will not be achieved. Creating evaluation tests is costly, and if a test that was released at such a high cost and billed as 'the last one for humanity' is easily passed, the gap between AI performance and the difficulty of the evaluation test may widen further.

Regarding this point, the news site TIME points out that 'the speed at which evaluation tests are being created has not kept up with AI advances,' and adds, 'Effective evaluation tests remain difficult to design, costly and underfunded compared to their importance in identifying dangerous capabilities early. With leading labs releasing high-performance models every few months, the need for new tests to evaluate the models' capabilities has never been greater.'

AI models are getting smarter and smarter, so testing methods can't keep up - GIGAZINE

Related Posts:

Feb 05, 2025 10:40:00 in Software, Posted by log1p_kr

Archives

Categories: Note; Headline; Review; Coverage; Interview; Tasting; Mobile; Software; Web Service; Web Application; Hardware; Vehicle; Science; Creature; Video; Movie; Manga; Anime; Game; Design; Art; Food; Security; Notice; Pick Up; Column; Free Member

Search

<	7, 2025					>
Sun	Mon	Tue	Wed	Thu	Fri	Sat
29	30	1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31	1	2

<	7, 2025					>
Sun	Mon	Tue	Wed	Thu	Fri	Sat
29	30	1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31	1	2

<	7, 2025					>
Sun	Mon	Tue	Wed	Thu	Fri	Sat
29	30	1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31	1	2

<	7, 2025					>
Sun	Mon	Tue	Wed	Thu	Fri	Sat
29	30	1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31	1	2