IQ of AI Revealed: Claude 4 Opus vs. ChatGPT O3 in a 2025 Showdown

AI Model IQ Performance Data

As of 9th June 2025

Source: trackingai.org

Model | Offline IQ | Mensa Norway IQ
OpenAI o3 | 117 | 133
Claude-4 Opus | 116 | 119
Gemini 2.5 Pro Exp. | 116 | 124
OpenAI o4 mini | 102 | 126
Claude-4 Sonnet | 101 | 118
Grok-3 Think | 91 | 111
DeepSeek R1 | 88 | 105
Llama 4 Maverick | 82 | 105

By an engineer who still keeps a Rubik’s Cube on the desk, just to remind the machines who solved it first.

Why Everyone Suddenly Cares About the IQ of AI

Open your favorite tech feed and you’ll see it everywhere: “ChatGPT O3 hits 133 on a Mensa test!” or “Claude 4 Opus battles for the Highest AI IQ!” In a world overflowing with benchmarks, why does this single number drive so many headlines? Because the phrase IQ of AI translates nerdy progress into kitchen-table language. Tell your aunt an LLM reached 93.4% on the MMLU benchmark and her eyes glaze over. Mention it has a higher IQ than Einstein and she’ll text the family group chat in seconds.


The notion is imperfect, of course. Human IQ tests were never built for silicon minds. Yet the “IQ of AI” has become the internet’s favorite yardstick—fast, flashy, and fiercely competitive. Gemini 2.5 Pro landed a strong 124, but it’s not the headline fight. In early 2025 that competition narrowed to the two most-hyped gladiators we are comparing today: Anthropic’s Claude 4 Opus and OpenAI’s ChatGPT O3. Both claimed super-human pattern-smashing skills. Both wanted the crown. Only one could wear it—at least until next quarter’s upgrade.

Human Puzzles, Machine Minds: How We Test the IQ of AI

Human and robotic hands solving a Rubik’s Cube, symbolizing the IQ of AI puzzle test.

Traditional IQ exams are timed gauntlets of matrix puzzles, verbal analogies, and spatial riddles scaled so 100 represents average human performance. To squeeze a transformer into that format, researchers get creative:

  1. Verbalize everything
    A matrix puzzle that humans read visually becomes a long, meticulous paragraph describing every line, dot, and rotation. The result looks like a geometry teacher’s grocery list.
  2. Retry refusals
    LLMs sometimes balk, so testers ask each question up to ten times, taking the final answer. That stops one quirky refusal from trashing an entire score.
  3. Map raw scores to norms
    If a bot nails 25 of 35 Norway Mensa matrices, statisticians translate that into something like 120 IQ—comfortably smarter than the average grad student.
  4. Fight data leakage
    Published puzzles might lurk in a model’s training set. Modern testers therefore build “offline” suites, hand-crafted and never posted online, to keep the contest honest.

This pipeline, repeated across labs, gives us the AI IQ test numbers splashed across social feeds. It’s not perfect, yet it remains the simplest way to rank minds both carbon and silicon. A rough code sketch of steps 2 and 3 follows below.
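For concreteness, here is a minimal Python sketch of those two steps: re-asking on refusals and mapping a raw matrix score onto the IQ scale. The norming constants are illustrative placeholders rather than the actual Mensa Norway tables, and ask_model stands in for whatever API call a tester actually wires up.

```python
# Minimal sketch of steps 2 and 3 above. The norming constants are
# illustrative placeholders, NOT the real Mensa Norway tables, and
# ask_model stands in for whatever API call a tester actually uses.

def ask_with_retries(ask_model, question, max_tries=10):
    """Step 2: re-ask up to ten times so one refusal doesn't sink the score."""
    answer = None
    for _ in range(max_tries):
        answer = ask_model(question)
        if answer is not None:      # None stands in for a refusal
            break
    return answer                   # the final answer is the one that counts


def raw_to_iq(correct, human_mean=19.0, human_sd=4.5):
    """Step 3: map a raw matrix score onto the IQ scale (mean 100, SD 15)."""
    z = (correct - human_mean) / human_sd    # z-score against human norms
    return round(100 + 15 * z)


# With these placeholder norms, 25 of 35 correct lands at about 120,
# the ballpark figure quoted in step 3.
print(raw_to_iq(25))    # -> 120
```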

The Meteoric Rise of Matrix Fever

Matrix puzzles—those 3 × 3 grids where one cell is missing—started as psychologist John Raven’s cunning trick for measuring fluid reasoning. In the AI era, they became kryptonite and catnip at once. Early in 2024 GPT-4 solved just 13 of 35 items. Six months later, Claude 3 leaped ahead with roughly 19 correct. By New Year’s Day 2025, ChatGPT O3 had barrel-rolled past 30. Each leap sent shockwaves through the AI IQ ranking charts.


Why matrices? They’re content-agnostic. You don’t need to know who wrote “Moby-Dick,” only how shapes transform. That makes them a pure stadium for pattern recognition—something modern language models increasingly treat as breakfast.

Head-to-Head: Raw Scores, No Excuses

IQ of AI Comparison Table
ModelMensa Norway (IQ)Offline Matrix Suite (IQ)
ChatGPT O3≈ 133≈ 117
Claude 4 Opus≈ 119≈ 116
Source: Tracking AI
Data as of: 9th June 2025


ChatGPT O3 takes the public Mensa crown with a blistering 133—solidly “genius” by human norms. Claude 4 Opus nearly closes the gap on the private offline set, scoring 116 to O3’s 117. Both dwarf earlier generations. GPT-3.5, for instance, lounged around an IQ of 64, roughly the level where a tricky IKEA manual becomes advanced literature.
These figures fuel every AI IQ chart on LinkedIn. One viral graphic stacked colorful bars: O3 at 133, Claude 4 Opus at 119, Gemini 2.5 Pro around 124, and so on. Viewers could almost hear sports commentators shouting play-by-play over matrix grids.

Are We Measuring Skill or Memorization?

Two robots comparing online vs offline matrix stats, reflecting the IQ of AI reasoning debate.

Critics line up fast. The Mensa practice test, they note, lives online. A trillion-parameter model may have swallowed it whole during training. That possibility keeps evaluators awake at night. It’s why the “offline” test exists and why upcoming studies promise even stricter quarantine: puzzles minted in secret, delivered through APIs that never leak, maybe even graded on-the-fly by other models to stay fresh.


Even with clean data, one worry remains: time. Humans toil against a ticking clock. LLMs breeze through at silicon warp speed unless testers throttle token output. When clocks vanish, so do certain cognitive pressures—the very pressures IQ was meant to capture. Does the luxury of an unlimited time budget inflate the IQ of AI? Probably. Yet until we agree on universal, timed protocols, the leaderboard stands.

Beyond the Scoreboard: What the IQ of AI Misses

Ask any neuroscientist and you’ll hear the same refrain: intelligence isn’t a single dial. It’s a symphony. The IQ of AI spotlights one instrument—abstract pattern recognition—while entire sections stay silent.

  • Long-horizon planning: A chess engine can see five moves deep, but can it manage a year-long research project? LLMs sometimes unravel after step eight of a recipe.
  • Commonsense sanity: GPT-4 might calculate orbital trajectories yet claim cats lay eggs. High matrix IQ does not immunize against absurd hallucinations.
  • Ethical judgment: No puzzle asks whether you should deploy a deepfake. Societal wisdom hides outside the test booklet.

Modern benchmark suites try to fill the gaps. MMLU covers 57 academic fields. BBH explores tricky reasoning. BBEH turns the screws tighter still; state-of-the-art models flunk nine out of ten of its questions. That striking contrast explains why researchers treat the IQ of LLMs as an appetizer, never the full meal.

2025 Leaderboard: The Latest AI IQ Ranking

  • ChatGPT O3 – 133 on Mensa, top of every mainstream AI IQ ranking.
  • OpenAI O4-mini – 126, proving size isn’t everything.
  • Gemini 2.5 Pro – 124, Google’s moment in the spotlight.
  • Claude 4 Opus – 119, a poetic middle child punching above its weight.
  • Claude 4 Sonnet – 118, quietly plotting an orchestral comeback.
  • Other multimodal models – Often below 70 on matrices, yet better at recognizing dogs in sunglasses than any text-only peer.

The Latest AI IQ ranking swings quarterly as labs patch weights and enlarge context windows. Today’s genius could be tomorrow’s mid-carder. Just ask GPT-3, once hailed as world-changing, now relegated to writing dad jokes on request.

Spotlight on the Contenders

ChatGPT O3: The Sprinter


O3 is the Usain Bolt of matrix puzzles. It sees a 3 × 3 grid, decodes the transformations, and spits out the answer before you finish pouring coffee. That blistering pace feeds the mythos around the IQ of ChatGPT O3. Yet its lower offline score hints at over-studying published puzzles. Outside the syllabus, it still dazzles, just not with fireworks.
Strengths: lightning-fast deduction, broad knowledge, eloquent explanations.
Weaknesses: occasional hallucinations, overconfidence, polite but stubborn refusals when prompts bend policy.

Claude 4 Opus: The Marathoner


Opus answers slower but often elaborates, revealing thought chains that read like annotated code. The IQ of Claude 4 Opus trails O3’s on Mensa, but its near-parity on the offline suite shows genuine reasoning heft when faced with unseen twists.
Strengths: structured chain-of-thought, gentle alignment, fewer blunt hallucinations.
Weaknesses: verbose wanderings, conservative refusal rates, sometimes lags on highly technical math.
Together they demonstrate that the IQ of AI isn’t destiny. Raw matrix speed is one story; workload stamina is another.

The Philosophy of a Number

Glowing AI IQ scoreboard showing 140 as a spotlighted metric, symbolizing the IQ of AI fascination.

Why does an integer matter so much? Because we humans crave scoreboards. From Elo ratings in chess to batting averages in baseball, numbers tame complexity. The AI IQ 2025 narrative simplifies a sprawling landscape into a leader, a contender, and the rest. Venture capital decks love it. Recruiters love it. Journalists love it. Yet as François Chollet reminds us, intelligence transcends any single metric. Real progress appears when models reason through messy realities, not just curated brainteasers.
In other words, the IQ of AI is a headline, not a final verdict. Still, headlines steer markets and research budgets. Expect the next funding round to flaunt whichever lab posts 140 first.

Limitations Piled High—Yet Progress Marches On

Data leakage we discussed. Time constraints too. Add trainer interference: when researchers see low scores, they fine-tune on similar puzzles. Next evaluation looks great—until you notice the training logs. Finally, there’s saturation. Give current models another year and they may solve every public matrix. Then what? We raise the bar. That cycle mirrors human testing history: the SAT had to be “re-centered” in 1995 because average scores drifted up.
The fix won’t be one harder IQ test. It will be a mosaic: dynamic problems, adversarial scenarios, maybe even embodied challenges where a robot has to manipulate the physical world. Future scoreboards will include “ARC-AGI percentile,” “BBEH delta,” and yes, an updated AI IQ chart because tradition dies hard.

Practical Takeaways for Builders and Users

If you craft apps on top of these giants, what do the numbers mean for you?

  1. Choose by task, not IQ. Need rapid summarization? O3 shines. Want careful legal drafting? Opus might be safer.
  2. Anticipate evolution. The Latest AI IQ ranking is a snapshot. API behavior will shift after the next model refresh. Monitor as you would any external dependency.
  3. Don’t outsource judgment. A tool that aces matrices may still fabricate citations. Layer validation on top, whether human review or automatic fact-checks.
  4. Benchmark locally. Your domain might involve medical imaging or tax law—worlds away from puzzle grids. Run tailored evaluations (a minimal harness sketch follows this list). The IQ of AI you see online may not predict domain performance.
  5. Stay curious. Every leap invites new creativity. When GPT-3 arrived, folks built poetry bots. GPT-4 birthed coding copilots. GPT-5 or Claude 5 may unlock entire categories we can’t predict. Track the trend lines, not just the current peak.
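To make takeaway 4 concrete, here is a tiny, vendor-agnostic evaluation harness. Everything in it is illustrative: call_model is a placeholder you wire to whichever SDK or API you actually use, and the sample cases plus the naive substring-grading rule stand in for a real, domain-specific test set.

```python
# Minimal sketch of a local, domain-specific evaluation harness.
# call_model is a placeholder for your own API wrapper; the cases and the
# grading rule below are illustrative, not a real benchmark.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str
    expected: str  # an answer fragment you would accept


def run_eval(call_model: Callable[[str], str], cases: list[Case]) -> float:
    """Return the fraction of cases the model answers acceptably."""
    passed = 0
    for case in cases:
        reply = call_model(case.prompt)
        # Naive grading: case-insensitive substring match. Swap in a
        # domain-specific checker (regex, numeric tolerance, human review).
        if case.expected.lower() in reply.lower():
            passed += 1
    return passed / len(cases)


if __name__ == "__main__":
    # Stub model for demonstration; replace with a real API call.
    def stub_model(prompt: str) -> str:
        return "42"

    suite = [
        Case(prompt="What is 6 x 7?", expected="42"),
        Case(prompt="Name the capital of France.", expected="Paris"),
    ]
    print(f"Pass rate: {run_eval(stub_model, suite):.0%}")  # 50% with the stub
```

Swap the stub for your production model client, grow the suite from real user queries, and re-run it after every model refresh; that single habit covers takeaways 2 through 4.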

What the Future Holds for AI Reasoning Tests

Expect three shifts:

  • Interactive Benchmarks: Static Q&A is fading. Interactive, multi-step tasks test planning and adaptation. Think of an LLM negotiating a simulated business deal or managing a virtual lab for a week.
  • Cross-Modal Reasoning: The next frontier merges language with vision, audio, and action. When a model watches a Rube Goldberg machine and forecasts which marble finishes first, we will need new ways to score it.
  • Ethical and Alignment Metrics: Society cares not only about cleverness but also about reliability. Look for “alignment scores” sitting next to IQ on product pages within two years.

Final Verdict—For Now

So, who wins the IQ of AI duel? On the classic Mensa Norway exam, ChatGPT O3 wears the laurel wreath with an estimated 133. Claude 4 Opus answers by nearly matching it on a tougher offline course, showing its muscle is real, not just memorized. Yet the spread there, roughly one wrong puzzle, would vanish if either model learned an extra zebra-stripe transformation tonight.


A more honest headline might read: “Two Transformers Hit Genius Territory, Trade Punches, Humans Cheer.” That’s the real story. We taught circuits to pass tests once reserved for prodigies. We’ll soon watch them tackle deeper challenges—ethical judgment, lifelong autonomy, creativity on par with grandmasters.
Until then, the IQ of AI remains a thrilling, if narrow, scoreboard. ChatGPT O3 currently stands on the podium’s top step. Claude 4 Opus bows gracefully beside it. The clock is already ticking toward the next match, and in this sport the season never ends.


Stay tuned. The matrices are only getting trickier, and the contenders only sharper. Meanwhile, I’ll be here with my Rubik’s Cube, happy to hand over the crown once a model solves it blindfolded… using only text.

Azmat — Founder of Binary Verse AI | Tech Explorer and Observer of the Machine Mind Revolution

Looking for the smartest AI models ranked by real benchmarks? Explore our AI IQ Test 2025 results to see how top models compare. For questions or feedback, feel free to contact us or explore our website.

What is the IQ of ChatGPT o3?

ChatGPT o3’s Mensa Norway IQ is reported at roughly 133 on the 35-item matrix test. Its score on a similar offline puzzle set is about 117. Both figures put ChatGPT o3 well above average human intelligence. This was the highest IQ recorded for any AI in early 2025.

What’s the highest IQ achieved by an AI?

To date, the highest known AI IQ on standard tests is around 133, achieved by ChatGPT o3 on the Norway Mensa quiz. (Note: some speculative reports have cited even higher numbers, but those often come from non-standard scaling or scoring. Official test-based results top out in the low-to-mid 130s.)

Can AI take real IQ tests?

AI can attempt IQ tests by answering questions, but the situation is not the same as a human test-taker’s. Language models can be given the questions one by one and will answer, but this does not simulate a supervised testing environment. Moreover, because IQ tests are normed on humans, an AI’s score may not translate meaningfully to an equivalent human IQ. Experts point out that AIs don’t take “IQ tests” under timed, proctored conditions and often rely on memorization or pattern recognition. In fact, one analysis warns that Mensa-style tests “cannot provide a realistic IQ score for AI” due to these differences. So while AI can answer IQ test items, using those answers as a measure of true “AI IQ” should be done very cautiously.

Is IQ a Meaningful Measure for AI?

In conclusion, the IQ of an AI is a useful but limited metric. On the one hand, it provides a simple story – recent models like ChatGPT o3 and Claude 4 Opus can solve complex puzzles that stump most humans. On the other hand, it captures only a slice of what intelligence might mean for machines. IQ tests emphasize logical-pattern problems, whereas AI capabilities also include language understanding, factual recall, creativity, learning new tasks, and more.
Going forward, researchers will rely on a suite of benchmarks to evaluate AI cognition. Standardized tests like ARC, MMLU, and various code/math challenges already play that role. For example, in 2025 the BIG-Bench Hard (BBH) benchmark was found to be nearly solved by top models, leading to the new BIG-Bench Extra Hard (BBEH) with much tougher tasks. These benchmarks test reasoning in different ways – from multi-step math to common-sense logic – and they complement any IQ-style test.
Ultimately, intelligence is multi-dimensional. IQ tests (Mensa-style puzzles) are one dimension. Others will include creative problem solving, “real-world” reasoning, and even physical embodiment tests. Trends like ARC-AGI indicate that the community is thinking broadly about such measures. But regardless of the metric, what’s clear is that AI reasoning has come a long way – and it will be fascinating to see how future benchmarks (like ARC, MMLU, BBH, and whatever comes next) continue to chart AI progress in the years ahead.

What is the IQ of Claude 4 Opus?

Current estimates put Claude 4 Opus at roughly 119 on the Mensa Norway test and about 116 on an equivalent offline test. These are lower than ChatGPT o3’s scores. Claude 4 Opus, while extremely capable, scores in the high 110s by the metrics used so far.
