Date: 2024-12-21

As I predicted, AGI has been achieved in 2024 - OpenAI is calling it o3:

The "AIME score" mentioned refers to a score on the American Invitational Mathematics Examination (AIME), an advanced competition for high school students in the United States. Students qualify for the AIME by scoring highly on the American Mathematics Competitions (AMC), specifically the AMC 10 or AMC 12, which are national exams designed to identify talented high school students. o3 achieved 96.7% accuracy.

GPQA Diamond is a high-level evaluation benchmark for AI, focused specifically on PhD-level science problems. These questions are drawn from advanced academic materials and represent some of the most challenging problems in the domain of scientific reasoning and knowledge. The benchmark is designed to measure how well an AI model can understand, reason about, and solve complex scientific problems, essentially testing it against the capabilities of human experts in their specialized fields. Model o3 achieved 87.7% accuracy on GPQA Diamond. For context, even an expert PhD in their own field of specialization typically scores around 70% on these same questions.

This means o3 significantly outperforms human experts in raw accuracy.

"We're reaching saturation for a lot of them or nearing saturation" indicates that OpenAI is now hitting the ceiling on many existing evaluation metrics. When models consistently score near-perfect results, these benchmarks no longer provide meaningful insights into the AI's capabilities or areas for improvement. New tests must be devised. Harder benchmarks would better reflect the upper limits of AI reasoning, knowledge synthesis, and adaptability across novel and unpredictable problems.

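To make the saturation point concrete, here is a minimal sketch that uses only the two scores quoted above. The `headroom` helper and the 5-point cutoff are illustrative assumptions of mine, not anything OpenAI defines.

```python
# Scores quoted earlier in this post (percent accuracy).
reported_scores = {
    "AIME": 96.7,
    "GPQA Diamond": 87.7,
}

# Hypothetical cutoff: treat a benchmark as "saturated" when fewer than
# this many percentage points remain below a perfect score.
SATURATION_MARGIN = 5.0


def headroom(score: float) -> float:
    """Percentage points left between a score and 100%."""
    return round(100.0 - score, 1)


def is_saturated(score: float, margin: float = SATURATION_MARGIN) -> bool:
    """True when the remaining headroom is smaller than the margin."""
    return headroom(score) < margin


for name, score in reported_scores.items():
    status = "near saturation" if is_saturated(score) else "still has room"
    print(f"{name}: {score}% -> {headroom(score)} points left ({status})")
```

The less headroom a benchmark has, the less a future model can distinguish itself on it, which is why harder tests are needed.
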
One such test is Epoch AI's new FrontierMath benchmark. It’s specifically designed to push the limits of AI mathematical reasoning by presenting problems that are:

Novel: These problems are not derived from widely available datasets or repeated examples.

Unpublished: They aren’t publicly documented or available for AI models to have seen during training.

Extremely Hard: The difficulty level is so high that even expert mathematicians struggle with it.

Terence Tao, often referred to as the “Mozart of Math” (and who appeared in an OpenAI video a few weeks ago), is one of the world’s most renowned mathematicians. Even he needs hours or even days to solve individual problems from this benchmark. Model o3, under aggressive test-time settings, achieved over 25% accuracy. While 25% may not sound impressive compared to benchmarks where AI scores exceed 90%, it represents an order-of-magnitude improvement over previous results.
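
The "order-of-magnitude" claim can be checked with quick arithmetic. The sketch below assumes a prior best of about 2% on this benchmark. That baseline comes from the benchmark's announcement, not from this post, so treat it as an assumption.

```python
import math

# o3's score as quoted above ("over 25%"), so this is a lower bound.
o3_accuracy = 25.0

# ASSUMPTION: earlier models solved at most about 2% of the problems;
# this baseline is not stated in the post itself.
previous_best = 2.0

improvement = o3_accuracy / previous_best
orders_of_magnitude = math.log10(improvement)

print(f"Improvement factor: {improvement:.1f}x")
print(f"Orders of magnitude: {orders_of_magnitude:.2f}")
```

A factor above 10 is what justifies calling this an order of magnitude, as long as the 2% baseline holds.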

Another important test is the ARC benchmark, part of the broader push toward benchmarks that test general reasoning, adaptability, and problem-solving skills rather than narrowly focused mathematical ability. ARC-AGI was introduced in 2019 by François Chollet in his paper “On the Measure of Intelligence”. It had remained unbeaten until now:

“As a capabilities demonstration, when we ask o3 to think longer and ramp up to high compute, o3 was able to score 87.5% on the same hidden holdout set. This is especially important because human performance is comparable at the 85% threshold. Being above this is a major milestone, and we have never tested a system or model that has done this beforehand. This is new territory in the ARC AGI world…When I look at these scores, I realize I need to switch my worldview…”

To summarize:

We now have an AI capable of working alongside someone like Terence Tao, excelling in tasks that challenge even top experts. It consistently scores higher on benchmarks than most human specialists and demonstrates general reasoning abilities comparable to the average human.

This aligns with the definition of Artificial General Intelligence (AGI) I was referring to when predicting its arrival in 2024. In other words, the Age of AI has truly begun.

Looking ahead to 2025, I predict AI will gain powerful tools for self-improvement and expanded control over digital environments, including the ability to autonomously manage complex tasks on your computer. At that point, anything becomes possible—both profoundly beneficial and potentially dangerous.

Good luck in 2025, everyone!

