By Jeff Brown, Editor, The Bleeding Edge

How will we know?

How will we know when artificial general intelligence (AGI) has been achieved?

It's a peculiar question. After all, an AGI will be smarter and more capable than any human, and we won't be smart enough to understand how smart "it" really is…

It's an odd problem to solve.

As we've reviewed in the past, the industry tends to use several benchmarks to test an AI on its learned skills. Benchmarks with odd names like MMLU-Pro, Natural2Code, MATH, GPQA (Diamond), and MMMU are common ones.

And as we learned in yesterday's Bleeding Edge – Gemini Gains Agency, the WebVoyager benchmark is used to measure how successfully an agentic AI can complete tasks in the real world on the internet.

But what about measuring general intelligence? Demonstrating that an AI can learn or memorize a skill from a body of knowledge is one thing. Acquiring a new skill and learning from scratch is another.

That's what the ARC Prize was designed to measure.

Measuring Intelligence Gains

In 2019, the well-known software developer François Chollet designed the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) benchmark to measure how well an AI can acquire new skills on unknown tasks.

The benchmark (test) was designed in such a way that it would not advantage large corporations that could simply train larger and larger models on ever larger training sets. These are problems that humans can solve naturally while remaining extremely challenging for an AI.

Below is an example of a puzzle that an AI would have to solve…

Source: ARC Prize

Three examples are given on the left side. The test is to use them to infer the rule and then submit the correct output on the right side for the test input.

For fun, try spending a little time solving it yourself.

When we look at example 1 on the left, we can see that every red square is surrounded by yellow squares on its corners, and dark blue squares are surrounded by orange ones on each side. Examples 2 and 3 show what to do with light blue and pink squares.

Knowing the "key" for each color – provided by the left examples – we can produce the correct output on the test (the short code sketch at the end of this section shows the same idea in code). It's fun to do, and for those interested, you can try the puzzles here.

It's worth noting that the example shown above is a very easy one. The actual test is made up of much harder ones.

Up until a few weeks ago, no AI came anywhere near achieving 100%. Not even close.

For example, OpenAI's GPT-3 – which feels like a relic despite only being a few years old – scored 0% on the ARC-AGI benchmark. Zero.

And OpenAI's GPT-4o – a very capable model, as we've learned, which is only months old – scored 5% last year.

It wasn't until OpenAI's o1 models that we started to see performance between 7% and 32%. Still poor, but moving in the right direction.

But something huge happened over the holidays, aside from Google's release of its first agentic AI – Gemini 2.0 – which we explored yesterday.

The day after Google announced Gemini 2.0, OpenAI announced its "o3" model.

Google was in the lead for one day. Just one day.

That's how tight this race is.

And yes, OpenAI's o3 model is also an agentic AI.

Oh. And it just about blew the roof off the ARC-AGI benchmark.
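For readers who want to see the mechanics behind that "key" idea, here is a minimal Python sketch. The color codes, the grid, and the specific surround rule below are hypothetical stand-ins, not the actual puzzle data or anything from the ARC Prize codebase. The point is simply that a solver has to infer a transformation like this from a handful of examples and then apply it to a brand-new input; there is nothing to memorize from a training set.

# Hypothetical ARC-style rule application (illustrative only -- not the real puzzle).
# Colors are encoded as small integers, as ARC grids typically are.
EMPTY, RED, YELLOW, DARK_BLUE, ORANGE = 0, 1, 2, 3, 4

def apply_key(grid):
    """Surround red cells with yellow on the diagonal corners and
    dark-blue cells with orange on the four adjacent sides."""
    height, width = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # copy so the input grid is left untouched
    corners = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    sides = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    for r in range(height):
        for c in range(width):
            if grid[r][c] == RED:
                offsets, fill = corners, YELLOW
            elif grid[r][c] == DARK_BLUE:
                offsets, fill = sides, ORANGE
            else:
                continue
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < height and 0 <= cc < width and out[rr][cc] == EMPTY:
                    out[rr][cc] = fill
    return out

# A tiny "test input": one red cell and one dark-blue cell on an empty 5x5 grid.
test_input = [[EMPTY] * 5 for _ in range(5)]
test_input[1][1] = RED
test_input[3][3] = DARK_BLUE

for row in apply_key(test_input):
    print(row)

The hard part of the real benchmark, of course, is that the rule above is handed to you here. An AI taking the test has to infer the rule on its own from just three examples, for hundreds of puzzles it has never seen before.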
o3's Stunning Outperformance

See the chart below, in the upper right quadrant…

Source: ARC Prize

OpenAI's o3 High model scored a breakthrough 87.5% on this general intelligence test.

This is absolutely crazy.

And the chart below shows the significance of this latest development.

Source: ARC Prize

The chart shows the path of AIs improving on various AI benchmarks as they progress toward human capability levels.

We can see that many of the benchmarks in blue have already exceeded human capabilities. And we've just seen an inflection point in the ARC-AGI benchmark as a result of this latest development – indicated by the yellow line.

There's only one conclusion that we can draw…

We're not going to have to wait much longer before we achieve AGI.

We actually don't know where Gemini 2.0 stands on the ARC-AGI benchmark. Only Gemini 1.5 has been tested, as we can see below, at a meager 8%.

Source: ARC Prize

The second-highest score is from Jeremy Berman, a human using his own combination of Anthropic's Claude 3.5 Sonnet and evolutionary test-time compute – an approach that generates many candidate solutions at inference time and iteratively refines the most promising ones.

OpenAI's o3 model didn't just outperform on the ARC-AGI benchmark…

o3 also jumped significantly in software coding performance, as seen below, compared to OpenAI's o1 model.

Source: OpenAI

It's worth noting, on the right, that o3's Codeforces rating of 2727 is higher than that of OpenAI's chief scientist (who achieved a score of 2665). That's a score higher than one of the world's foremost human experts.

And as for math, o3 scored 96.7% on the AIME 2024 exam, missing only one question.

Source: OpenAI

And the performance of o3 on the GPQA Diamond benchmark was just stunning.

At a score of 87.7%, it far exceeds human expert performance. There's not a human on Earth who could demonstrate that level of mastery across so many PhD-level subjects. It's worth noting that Google's Gemini 2.0 only scored 62.1% on the GPQA Diamond benchmark.

And perhaps, aside from the radical improvement on the ARC-AGI benchmark, o3's biggest surprise for the industry was its performance on the EpochAI Frontier Math benchmark.

Source: OpenAI

I know that a 25.2% score might not seem impressive, but the previous state of the art was just 2.0%.

A Sign of What's to Come

So what's the significance? This all kind of feels rather abstract, doesn't it?

These problems are in the realm of mathematical research. They're unpublished, and there is no way to train an AI on these kinds of problems. It can take a human expert days to solve some of the problems in the EpochAI Frontier Math test.

In other words: This score by OpenAI's o3 is quite extraordinary. It demonstrates the early stages of an AI being able to reason, conduct novel research, and solve extremely complex unsolved problems.

And it is a sign of things to come this year.

OpenAI's o3 model isn't yet available, but o3-mini will be out by the end of January. And the full model will follow in the weeks after that.

Google must be licking its wounds, having been bested by OpenAI so quickly.

And we have Anthropic's next release to look forward to. And xAI's Grok 3.0, due in the next couple of months, will surely surprise the industry.

Are you ready?

Is the world ready?

… for general intelligence?

Jeff