TL;DR (Key Takeaways)
- ARC-AGI-2 is a new intelligence test designed to evaluate how well AI models can reason like humans.
- The test consists of visual pattern puzzles that require flexible, abstract thinking.
- Humans solved ~60% of the tasks with minimal effort. Most AI models scored 1% or less.
- Even the most advanced AI models (e.g., successors to GPT-4) failed nearly all of the tasks.
- AIs struggle with:
  - Understanding symbolic meaning and context
  - Combining multiple reasoning rules at once
  - Adapting to new or changing situations
- This shows today’s AI is still far from human-level general intelligence (AGI).
- It highlights the limitations of current AI systems and the need for new architectures to reach true reasoning capabilities.
- Real-world implication: AI may fail when asked to solve novel or unfamiliar problems, while humans can adapt quickly.
A new intelligence test for AI – called ARC-AGI-2 – reveals a striking gap between human and machine reasoning. Developed by the ARC Prize Foundation (co-founded by AI researcher François Chollet), this benchmark presents visual puzzles that any ordinary person can solve, but that stump even cutting-edge AI models. The results are a reality check on AI’s capabilities: humans significantly outperform the latest AI systems on these tasks, highlighting how far today’s AI still has to go to reach human-like reasoning.
What Is ARC-AGI-2?
ARC-AGI-2 is essentially an “IQ test” for AI. It consists of puzzle-like tasks using grids of colored squares, where the AI (or human) must infer a hidden pattern or rule and produce the correct output. Each task provides a few examples of input-output pairs (like mini before-and-after puzzles) and then asks the solver to generate the missing answer for a new input. The catch: the puzzles are novel and not things the AI would have seen during training – they’re designed to test adaptive reasoning rather than memorized knowledge. Humans find these puzzles intuitive enough (relying on our general pattern-recognition and reasoning skills), but for AI models, they pose a serious challenge.
Example of an ARC-AGI-2 puzzle. The AI is shown a few examples of colored-grid transformations (left) and must figure out the rule to produce the correct output for a new input (right, with the question mark). Humans can reason through puzzles like this, but current AIs struggle to generalize the pattern.
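For readers who want to see the data layout, here is a minimal sketch in Python of what an ARC-style task looks like and how a candidate rule is checked, assuming the JSON grid format used by the public ARC datasets. The task itself and its “swap the columns” rule are invented for illustration and are far simpler than anything in ARC-AGI-2.

```python
import json

# A hypothetical ARC-style task in the JSON layout the public ARC datasets use:
# a few "train" input/output grid pairs, plus a "test" input whose output the
# solver must produce. Grids are lists of rows of small integers, each integer
# standing for a colour.
task_json = """
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 3]]}
  ]
}
"""

def solve(grid):
    """Toy 'solver' for this made-up task: the hidden rule here is simply
    to swap the two columns. Real ARC-AGI-2 rules are far less obvious."""
    return [list(reversed(row)) for row in grid]

task = json.loads(task_json)

# Check the candidate rule against every training pair before trusting it.
assert all(solve(pair["input"]) == pair["output"] for pair in task["train"])

# Apply the rule to the test input to produce the answer grid.
print(solve(task["test"][0]["input"]))
```

The point is the structure, not the puzzle: a handful of demonstration pairs, one hidden rule, and a test grid whose output must be produced exactly.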
Human vs. AI: Puzzle-Solving Performance
In evaluations, people vastly outperformed AI models on ARC-AGI-2. In a controlled test with over 400 participants, the average person solved about 60% of the puzzles correctly. Every single task in the benchmark was solvable by humans – in fact, each puzzle was solved by at least two people in two attempts or fewer. This confirms that the challenges are not “impossible” – they’re well aligned with human reasoning abilities.
By contrast, AI models barely made a dent in the test. Most advanced models got only around 1% of the questions right. Even highly sophisticated AI systems – including those from top AI labs – failed almost all the tasks. For example, OpenAI’s o3 reasoning model (an advanced prototype that uses step-by-step reasoning and search) scored roughly 4% on ARC-AGI-2. That same model had achieved about 75% on the earlier ARC test (ARC-AGI-1) by using massive computing power, but ARC-AGI-2 broke its strategy, dropping it to single-digit performance. Many other well-known AI systems – including powerful large language models such as GPT-4’s successors and Google’s Gemini – essentially failed outright, with pure text-based AIs scoring 0–1% on these puzzles.
To put it simply, no current AI comes close to human-level performance on this benchmark. An average adult can solve far more of these problems than even the best AI, even when the AI is allowed multiple attempts. One striking detail: the top-performing AI used an estimated $200 worth of computing power per task and still got only a few percent correct. The human brain, by comparison, solves 60% of them on coffee and snacks. This underscores how inefficient and brittle AI’s reasoning can be compared to human cognition.
Why Do AIs Struggle With These Tasks?
If these puzzles are “easy” for humans (at least for some people), why do AI models flounder? It turns out the tasks require a kind of flexible thinking and abstraction that machines aren’t yet good at. Researchers observed several specific reasoning challenges in ARC-AGI-2 where AI falls short:
- Understanding Symbols in Context: AI systems often fail to grasp that a shape or color can represent something beyond just a pattern. For instance, a puzzle might require recognizing that a configuration of blocks means “tree” and should be treated differently from just matching colors. Current AI models tend to see only raw patterns (symmetries, rotations, etc.) and miss the deeper meaning that humans instantly assign.
- Combining Multiple Rules: Humans are used to juggling several rules or conditions at once (“if it’s red and large, move it left, unless there’s a blue square present, then do X”). ARC-AGI-2 puzzles often have more than one rule interacting, which is a nightmare for current AI reasoning. AI models do okay when there’s only a single simple rule, but when a puzzle requires applying two or three rules together, they get confused or apply only one of them correctly (a toy sketch after this list illustrates the idea).
- Adapting to Context Changes: Many puzzles require using a rule in one situation and a different rule in another, depending on context. For example, a puzzle might say “in the small grid, do X, but in the big grid, do Y.” Humans notice the context switch and adjust their approach. AI systems, however, tend to fixate on one pattern they’ve detected and apply it blindly everywhere.
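To make “combining rules” and “context switching” concrete, here is a purely hypothetical sketch. The rules below are invented for illustration (and far simpler than real ARC-AGI-2 tasks), but they show how two conditions can interact, so that a solver which latches onto only one of them gets the other cases wrong.

```python
def transform(grid):
    """Hypothetical two-rule transformation, invented for illustration.

    Rule 1 (context switch): small grids (fewer than 4 rows) are flipped
    vertically; larger grids are flipped horizontally.
    Rule 2 (interacting condition): if the colour 5 appears anywhere,
    every 5 is additionally recoloured to 9 after the flip.
    A solver must notice *both* rules and know which applies when.
    """
    if len(grid) < 4:
        out = grid[::-1]                   # flip the order of rows (vertical flip)
    else:
        out = [row[::-1] for row in grid]  # flip each row (horizontal flip)
    if any(5 in row for row in out):
        out = [[9 if cell == 5 else cell for cell in row] for row in out]
    return out

# A small grid triggers the vertical flip; the 5 is also recoloured to 9.
print(transform([[5, 0], [0, 1]]))   # -> [[0, 1], [9, 0]]

# A larger grid triggers the horizontal flip instead; no 5, so no recolouring.
print(transform([[1, 0, 0], [0, 2, 0], [0, 0, 3], [4, 0, 0]]))
```

A model that only picks up the flip rule, or only the recolouring rule, or applies the small-grid behaviour to every grid, will produce confidently wrong answers – which is roughly what the ARC-AGI-2 results show happening at scale.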
In sum, ARC-AGI-2 deliberately probes these aspects of reasoning. The puzzles require the solver to truly understand and adapt – to see the “why” behind a pattern, to manage several moving parts, and to know when to flex a rule. These are things we humans learn in childhood and use effortlessly in novel situations. Current AI, on the other hand, mostly learns from vast amounts of data and statistical patterns; it struggles with on-the-fly reasoning that hasn’t been pre-programmed or seen during training.
What Do These Results Tell Us About AI vs. Human Intelligence?
The stark outcome of ARC-AGI-2 sends a clear message: AI today is still far from human-like general intelligence. Yes, AI has made incredible strides – it can translate languages, write code, recognize images, and even beat world champions at games like Go. In many specific, narrow domains, AI systems are superhuman (for example, they calculate faster, memorize more, and never get tired). However, those successes are specialized skills. They don’t add up to the kind of versatile, adaptable intelligence humans have.
The ARC-AGI-2 “human-AI gap” highlights what’s missing: the ability to learn new problems quickly and efficiently. In other words, an AI can be a genius at one thing and clueless outside its comfort zone – whereas humans can usually pick up new tasks or switch contexts with little instruction.
Crucially, these findings address the buzzword “AGI” (Artificial General Intelligence) – the idea of an AI that can understand or learn any intellectual task a human can. The ARC-AGI-2 benchmark was designed as a reality check for AGI claims. The fact that there are plenty of problems in ARC-AGI-2 that are trivial for humans but baffle the best AIs is strong evidence that we do not have human-like AI yet. As the ARC Prize team puts it, as long as we can easily find tasks that any person on the street can solve but even the smartest AI cannot, true general intelligence hasn’t been achieved.
Beyond the Lab: Implications and Real-World Challenges
Why does this matter beyond a specific set of pixel puzzles? It matters because it bears on how much we can trust AI’s capabilities in the real world. In life and business, we often face problems that are new, that don’t look exactly like anything we’ve seen before. Humans deal with such novel situations all the time by adapting prior knowledge. The ARC-AGI-2 results suggest that if an AI is confronted with a genuinely unfamiliar problem or a scenario that isn’t in its training data, it may struggle or fail where a person would succeed.
In safety-critical applications like self-driving cars or medical diagnosis, the inability to handle corner cases – the unusual, unexpected situations – is a serious concern. ARC-AGI-2 is essentially a collection of “corner cases” for AI reasoning, and today’s models are flunking it.
The benchmark also carries a lesson in how we measure progress in AI. It’s not just about getting a high score on some test; it’s about how that score was achieved. Brute-forcing a solution with enormous compute, or getting lucky on familiar examples, isn’t the same as genuinely understanding the problem and solving it efficiently. Intelligence, as the ARC team emphasizes, includes an element of efficiency – doing a lot with a little, as our brains do.
The fact that a human can, with relatively minimal effort, solve many of these puzzles while an AI needs billions of operations and still fails tells us there’s a qualitative difference in how we reason versus how AIs currently “think.” It also highlights a limitation: simply scaling up AI models (more data, more parameters, more compute) might not automatically bridge this gap. New strategies and architectures may be needed for AI to approach human-like cognitive flexibility.
On a hopeful note, benchmarks like ARC-AGI-2 guide researchers toward those missing pieces. By pinpointing where AI falls short (for instance, understanding context or combining rules), scientists and engineers can focus on developing new techniques to overcome these barriers. It’s a reminder that despite the hype, AI isn’t an all-powerful, human-replacing brain just yet – but also an invitation to innovate.
Bottom Line
ARC-AGI-2 provides a refreshingly human-centric report card for AI. On these abstract reasoning puzzles that most people handle with common sense and a bit of creativity, today’s AI systems are still failing almost outright. This contrast in cognitive performance – humans at 60% vs. AIs at ~1% – underscores that we’ve not yet replicated the general problem-solving ability of the human mind in machines.
It reminds us that human intelligence is more than just data-crunching: it’s adaptable, context-aware, and efficient in ways that machines have yet to achieve. For the general public, the takeaway is both reassuring and motivating. Reassuring, because it means AI isn’t close to matching the full breadth of human intellect – your ability to reason through new problems is still uniquely yours. Motivating, because it shows where the frontier of AI research lies.
As AI continues to advance, tests like ARC-AGI-2 will keep us honest about what these systems can really do, and will push the development of AI that can not only process information, but truly reason about it like a human would.