Last updated: December 16, 2025
AI IQ Leaderboard: Mensa vs. Offline
| Model | Mensa IQ | Offline IQ |
|---|---|---|
| OpenAI GPT 5.2 Thinking | 141 | 127 |
| Gemini 3 Pro Preview | 141 | 127 |
| Grok-4 Expert Mode | 137 | 125 |
| OpenAI GPT 5.2 Pro | 145 | 123 |
| OpenAI GPT 5.2 | 126 | 120 |
| Claude-4.5 Opus | 124 | 120 |
| Kimi K2 Thinking | 124 | 116 |
| Gemini 3 Pro Preview (Vision) | 123 | 114 |
| Claude-4.5 Sonnet | 123 | 114 |
| OpenAI GPT 5.1 Pro (Vision) | 117 | 109 |
| OpenAI GPT 5.1 Thinking (Vision) | 97 | 99 |
| Perplexity | 97 | 97 |
| Qwen 3 Thinking | 87 | 97 |
| Claude-4.5 Sonnet (Vision) | 95 | 92 |
| Manus | 111 | 92 |
| Gemini 2.5 Flash Thinking | 99 | 89 |
| DeepSeek R1 | 109 | 89 |
| Mistral Medium 3.1 | 95 | 89 |
| DeepSeek V3 | 103 | 89 |
| OpenAI GPT 5.1 (Vision) | 78 | 82 |
| Claude-4.5 Opus (Vision) | 91 | 81 |
| Llama 4 Maverick | 107 | 80 |
| GPT-4o | 109 | 79 |
| Bing Copilot | 93 | 77 |
| Grok-4 Expert Mode (Vision) | 96 | 74 |
| GPT-4o (Vision) | 64 | 73 |
| Grok-4.1 Beta | 97 | 72 |
| Llama 4 Maverick (Vision) | 65 | 67 |
If you’re here for Gemini 3 Pro, you’re not alone. Interest is spiking, but our AI IQ leaderboard still has GPT-5.2 Pro above the genius bar on the public Mensa Norway set, with scores shown side by side against a leak-resistant offline set. That pairing lets readers compare best-case familiarity with harder, unpublished items and judge real progress instead of headlines.
I still remember the first time I tried to explain a transformer model to a room full of undergraduates. Someone at the back raised a hand and asked, “Is this thing smarter than us yet?” I fumbled through a reply about pattern recognition and data scale, watched a few eyes glaze over, and realized the real question was how we decide what “smart” even means for an AI IQ Test.
Two years later, the world is still trying to answer that question, and the leaderboard just flipped again. On the public Mensa Norway set, OpenAI GPT-5.2 Pro now sits at the top of the test’s scale, while OpenAI GPT-5.2 Thinking and Gemini 3 Pro Preview both clear the 140+ “genius” line. On the leak-resistant offline set, GPT-5.2 Thinking and Gemini 3 Pro Preview are currently tied at the top, with Grok-4 Expert Mode close behind.
On paper, these numbers catapult large language models (LLMs) into the intellectual stratosphere, well above the median human score of 100 and nudging the border of Mensa elite, sparking debates about which model now holds the highest AI IQ. Twitter threads cheer, skeptics groan, and somewhere in between sits a confused reader wondering whether to celebrate, fear, or simply ignore the news.
This essay is an attempt to sort the signal from the noise in our AI IQ Test era. I’ll walk through what human IQ actually measures, how researchers retrofit those tests for silicon brains, what the 2025 leaderboard really shows, and, just as important, what it hides. Along the way I’ll sprinkle in a few war stories from the lab, some philosophical detours on what “AI IQ” even means, and a plea to keep our humility intact while machines crank out puzzle solutions at superhuman speed.
1. A Brief History of Chasing the AI IQ Test Number

Psychologists have been quantifying intelligence for over a century, ever since Alfred Binet’s school placement experiments morphed into the modern Intelligence Quotient. Set the population mean at 100 and the standard deviation at 15, and voilà: a tidy bell curve that sorts people into neat percentiles. The number gained cultural power because it was simple, standardized, and, in many contexts, predictive of real-world outcomes like academic success.
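To make that scale concrete, here is a minimal Python sketch, using only the standard library, that converts an IQ score into an approximate population percentile under the usual mean-100, SD-15 normal model. The `iq_to_percentile` helper is purely illustrative.

```python
from statistics import NormalDist

# Standard IQ norming: population mean 100, standard deviation 15.
IQ_SCALE = NormalDist(mu=100, sigma=15)

def iq_to_percentile(iq: float) -> float:
    """Approximate percentage of the population scoring below `iq`."""
    return 100 * IQ_SCALE.cdf(iq)

if __name__ == "__main__":
    for score in (100, 127, 141, 145):
        print(f"IQ {score} ~ {iq_to_percentile(score):.1f}th percentile")
```

Under that model, 140 lands near the 99.6th percentile, which is roughly why it doubles as the “genius” threshold on the chart.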
Computers joined the IQ race only recently. Early neural nets in the 1990s wouldn’t have scored above chance on most IQ subtests. Even GPT-3, which dazzled us in 2020 with fluent prose, landed somewhere in the high 70s when brave researchers fed it Raven’s matrices in an informal AI IQ test, placing it in “bright seventh grader” territory and hardly threatening human pride.
Fast forward to the latest results. The leaderboard has shifted again: GPT-5.2 Pro now tops the public Mensa Norway set, while GPT-5.2 Thinking and Gemini 3 Pro Preview share the lead on the offline IQ test designed to avoid data leakage, with Grok-4 Expert Mode close behind. Offline testing reduces training-data contamination and prompt leakage, so it often reshuffles ranks versus public Mensa items. This rapid change is a jarring slope even for those of us who expect exponential curves.
But before we hail silicon Einsteins, we need to untangle two intertwined but distinct threads in AI IQ Test research: (a) what IQ captures in humans, and (b) how faithfully the AI IQ Test version mirrors that construct.
2. What IQ Means for Flesh and Blood Thinkers in an AI IQ Test World
Ask five psychologists for an IQ definition and you’ll get six footnotes, but the consensus boils down to general cognitive ability, the fabled g factor. Classic batteries such as the Wechsler Adult Intelligence Scale (WAIS-IV) spread that umbrella over four pillars:
- Verbal Comprehension – vocabulary depth, analogies, general knowledge.
- Perceptual Reasoning – spatial puzzles, pattern completion, visual-spatial logic.
- Working Memory – digit span, arithmetic under pressure.
- Processing Speed – how fast you can chew through routine symbol matching.
Scores are normed on massive, demographically balanced samples: tens of thousands of volunteers spanning ages, education levels, and cultures. Every decade or so the publishers re-norm the tests because societies slowly get better at test-taking (the famed Flynn effect).
A critical property of a human IQ test is that it stretches the brain in several directions at once. Try Raven’s matrices and you’ll feel the clock tick, your occipital cortex juggling shapes while your prefrontal cortex tracks rule candidates.
Switch to verbal analogies and you recruit entirely different neural circuits plus a lifetime of reading. The composite score therefore whispers something about how flexibly you think across domains, not just within one, and that cross-domain flexibility is exactly what an AI IQ test tries to emulate.
3. Translating the Exam for LLMs in AI IQ Test Context

Large language models don’t have retinas, fingers, or stress hormones. They inhabit a leisurely universe where a thirty-second time limit is irrelevant and long-term memory can be simulated by a trillion training tokens. To shoehorn them into a human IQ framework, researchers resort to verbalized versions of otherwise visual puzzles.
Take the Mensa Norway matrix: a 3×3 grid of abstract shapes with one square missing. For an LLM the image becomes a textual description—“Top row: a black arrow pointing right, then two arrows stacked, then a question mark”—followed by eight candidate answers spelled out in similar prose.
The model picks the letter of the best match. The conversion step is already a minefield for the prompt engineer: which words you choose, how you order details, whether you mention color first or orientation first, all of it can bump accuracy by several percentage points.
Prompt engineering turns into prompt alchemy. And since the questions (at least in the public Mensa set) float freely on the open web, a model may have ingested them verbatim during pretraining, essentially taking the test with an answer key taped under the desk.
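To make the verbalization step concrete, here is a minimal sketch of how such a prompt might be assembled. The puzzle description, option texts, and `build_prompt` helper are hypothetical illustrations, not the actual items or prompts used by Mensa Norway or TrackingAI.

```python
# Hypothetical sketch: turning a visual matrix puzzle into a text-only prompt.

PUZZLE = (
    "Top row: a black arrow pointing right, then two arrows stacked, "
    "then a question mark marking the missing tile."
)

CHOICES = {
    "A": "a single black arrow pointing right",
    "B": "two black arrows stacked vertically",
    "C": "an empty white square",
    # ...remaining candidates spelled out in similar prose
}

def build_prompt(puzzle: str, choices: dict[str, str]) -> str:
    """Assemble one verbalized matrix question with lettered answer options."""
    options = "\n".join(f"{letter}) {text}" for letter, text in choices.items())
    return (
        "You are taking a matrix-reasoning test.\n"
        f"{puzzle}\n"
        "Which option completes the pattern? Reply with a single letter.\n"
        f"{options}"
    )

if __name__ == "__main__":
    # Send this string to the model API of your choice.
    print(build_prompt(PUZZLE, CHOICES))
```

Every choice embedded in a template like this, from adjective order to how the options are lettered, is exactly the kind of detail that can nudge the final score.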
To compensate, platforms such as TrackingAI craft offline variants: fresh puzzles never published online, served from air-gapped servers. When o3’s score drops from 136 on the public set to 116 offline, the 20-point haircut smells a lot like data leakage.
Still, even the leak-proof scores hover well above most humans, proving that AI IQ Test performance is no mere parlor trick.
4. Anatomy of an AI IQ Test Score
Why, exactly, do language models ace what used to stump them in AI IQ Test trials?
- Scale, Scale, Scale – o3 reportedly trains on orders of magnitude more tokens than GPT-4, with an expanded context window that lets it hold an entire puzzle conversation in “working memory.”
- Mixed-Modality Embeddings – Even if inference is text-only, many models pretrain on image–text pairs, seeding a latent visual faculty that later helps decode verbalized diagrams.
- Self-Consistency Sampling – Instead of answering once, the model rolls the dice 64 times, then votes on its own outputs (a minimal voting sketch follows this list).
- Chain-of-Thought Fine-Tuning – Researchers now encourage models to “show their work.” Paradoxically, forcing an LLM to spell out step-by-step logic improves the final answer and exposes faulty jumps we can later debug.
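To illustrate the self-consistency trick from the list above, the sketch below samples the same question many times and takes a majority vote over the answers. The `sample_answer` stub is a toy stand-in for a temperature-sampled model call, not any vendor’s API.

```python
from collections import Counter
import random

def sample_answer(prompt: str) -> str:
    """Toy stand-in for one temperature-sampled model call returning a letter."""
    # Pretend the model hits the right answer ("C") about 40% of the time.
    return "C" if random.random() < 0.4 else random.choice("ABDEFGH")

def self_consistency(prompt: str, n_samples: int = 64) -> str:
    """Sample the model n times and return the most common answer (majority vote)."""
    votes = Counter(sample_answer(prompt) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    random.seed(0)
    print(self_consistency("verbalized matrix puzzle goes here"))
```

Even a model that is right only 40% of the time per sample usually lands on the correct letter after voting, which is why this trick buys real points.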
Those tricks embody an R&D arms race in AI IQ Test innovation. Each new technique buys a handful of points; stack enough and you vault past the human median in the next AI IQ Test.
5. Cracks in the AI IQ Test Mirror

Yet IQ inflation has its dark corners in AI IQ Test results. Here are a few the hype cycle politely sidesteps:
- Prompt Sensitivity – I once swapped a single adjective, “slanted” for “angled,” in a matrix prompt and watched a model’s answer flip from correct to wrong. Humans are far more robust to that kind of noise (a small perturbation probe is sketched after this list).
- Metamemory vs. Understanding – LLMs sometimes describe the pattern better than they apply it. They can chatter about symmetry yet miss that the missing tile must be blank, not striped.
- One-Shot Brilliance, Multi-Step Fragility – On Humanity’s Last Exam, which strings several reasoning hops together, top models still limp below 20% accuracy. Humans soar over 70%.
- No Stakes – A machine never gets test anxiety, hunger pangs, or sweaty palms. Remove those stressors from humans and their scores jump, too.
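The prompt-sensitivity issue in the first bullet is easy to probe yourself. The sketch below reruns one question under small near-synonym swaps and reports whether the answer stays put; the `ask_model` stub here is a deliberately brittle toy, standing in for whichever model API you actually call.

```python
# Hypothetical robustness probe: swap near-synonyms in the prompt and check
# whether the model's answer flips. Replace `ask_model` with a real API call.

PERTURBATIONS = [
    ("slanted", "angled"),
    ("black", "dark"),
    ("stacked", "piled"),
]

def ask_model(prompt: str) -> str:
    # Toy stand-in: deterministic within a run but wording-sensitive,
    # mimicking a brittle model that keys on surface phrasing.
    return "ABCDEFGH"[hash(prompt) % 8]

def robustness_report(base_prompt: str) -> dict[str, bool]:
    """For each word swap, report whether the answer matches the unperturbed run."""
    baseline = ask_model(base_prompt)
    return {
        f"{old}->{new}": ask_model(base_prompt.replace(old, new)) == baseline
        for old, new in PERTURBATIONS
    }

if __name__ == "__main__":
    prompt = ("Top row: a slanted black arrow pointing right, "
              "then two arrows stacked, then a question mark.")
    for swap, stable in robustness_report(prompt).items():
        print(f"{swap}: {'stable' if stable else 'flipped'}")
```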
While IQ tests are a fascinating proxy, benchmarks like Humanity’s Last Exam (HLE) provide a deeper look at complex reasoning. For a full breakdown of Grok-4’s record-setting performance on that test, see our Deep Dive Analysis of Grok 4 and its results on Humanity’s Last Exam.
6. Alternative Yardsticks Beyond a Single AI IQ Test
Because of those blind spots, the community is frantically building broader evaluation suites beyond a simple AI IQ Test metric:
- ARC AGI: 400 hand-crafted abstraction puzzles. o3 nails roughly 3% of them; the average 10-year-old human hits 60%.
- MATH 500 & AIME: Competition-level algebra and geometry. DeepSeek R1 impresses with 76% accuracy, evidence that narrow fine-tuning pays off.
- SWE Bench & GitHub Bugs: Can a model patch real-world code? Scores linger around 50%, enough to excite CTOs but still miles from professional reliability.
- Ethical Twins & Value Alignment: Prototype tests ask models to rank moral dilemmas. Results swing wildly with prompt phrasing, indicating shaky meta-ethics.
7. Humans vs. Transformers: Apples and AI IQ Test Oranges
- Architecture – Our brains combine spike-timed neurons, chemistry, and plasticity honed by evolution. Transformers juggle linear algebra on GPU wafers. They may converge on similar outputs, yet the journey there differs profoundly.
- Embodiment – A toddler learns “gravity” by dropping cereal on the floor; a model knows it only through textual snippets, a gap no verbalized test can bridge. One has muscle memory and scraped knees; the other compresses patterns in high-dimensional space.
- Energy Budget – The human brain consumes about 20 W. Training a frontier LLM devours megawatt-hours. Efficiency is its own metric of intelligence—or at least survivability on a warming planet, which an AI IQ Test doesn’t measure.
Because of those chasms, IQ parity doesn’t imply cognitive parity. If anything, it underscores how narrow tests can be hacked by alien architectures.
8. The Practical Outlook for an AI IQ Test–Driven World
I field this question weekly from product teams deciding whether to integrate the latest model into their products. My answer is a cautious yes, but:
- Yes, because higher puzzle competence often translates to crisper coding assistance, tighter logical reasoning, and fewer embarrassing math slips in your customer chatbot.
- But, because any decision with stakes—medical, financial, legal—still demands a human in the loop until we have evaluations that capture nuance, context, and moral reasoning beyond the AI IQ Test.
I like to frame IQ as “potential bandwidth.” It tells you the maximum data rate the channel can handle, not whether the message is truthful or safe. Your job is to wrap that channel in fail-safes, audits, and domain knowledge before you deploy.
9. Where Testing Goes Next in the AI IQ Test Cycle
- Standardized Prompts – The community is coalescing around fixed, open-sourced prompt suites to curb cherry-picking.
- Multi-Modal Exams – Future tests will mix text, images, audio, maybe even robotics simulators, nudging models closer to the sensory buffet humans enjoy.
- Longitudinal Evaluations – Instead of a single snapshot, platforms like TrackingAI plan to probe the same model monthly, watching for “conceptual drift” as it undergoes post-deployment fine-tunes.
- Alignment Leaderboards – Imagine a public scoreboard where models compete not for raw IQ but for harm reduction and truthfulness scores, far beyond a simple AI IQ Test. We’re early, but prototypes exist.
If we succeed, IQ will become just one cell in a sprawling spreadsheet—useful, but no longer in the limelight of AI IQ Test discourse.
10. Conclusion: Humility in the AI IQ Test Age of 145
Intelligence, in the richest sense, isn’t a single axis. It’s the braid of curiosity, empathy, street smarts, moral courage, creative spark, and, yes, raw reasoning speed, which an IQ test only approximates. An LLM that slots puzzle pieces faster than 98% of humans has certainly achieved something historic. But in my classroom, when that same student also helps a peer, questions an assumption, and owns a mistake, that’s when I nod and think, “There’s genius.”
As AI marches up the IQ ladder, we should applaud the craftsmanship and absorb the lessons, then widen the lens. Ask not just how bright the circuits glow, but where that light is pointed and whose face it illuminates. The future will be shaped by that broader definition of intelligence, one a single AI IQ test cannot capture.
Until then, enjoy the leaderboard. Just keep a saltshaker handy for those glittering numbers. They are map references, not the landscape itself.
Azmat — Founder of BinaryVerse AI | Tech Explorer and Observer of the Machine Mind Revolution
For questions or feedback, feel free to contact us or explore our About Us page.
1) What is the average human IQ and where does the genius range start?
The human IQ scale is centered at 100, which represents the average score for the general population. Most modern IQ tests use a standard deviation of 15. Scores around 85 to 115 fall in the typical range, which covers the majority of people. Many publishers describe 140 and above as the start of the “genius” range. In our chart you will see a green line at 100 to mark the human mean and a purple line at 140 to mark that genius threshold. These reference lines help readers compare AI model scores to the human scale at a glance.
2) Which AI has the highest IQ right now?
It depends on the test source. On the public Mensa Norway test, GPT-5.2 Pro currently reaches the top of the scale, while GPT-5.2 Thinking and Gemini 3 Pro Preview also clear the 140+ line. On the offline, leak-resistant set, GPT-5.2 Thinking and Gemini 3 Pro Preview are currently tied at the top, with several models clustered close behind.
3) Why do Mensa and offline AI IQ scores differ?
Public Mensa puzzles are widely shared online. Frontier models are trained on large internet corpora, so overlap between training data and those items can inflate scores. Offline testing reduces this problem by using new, unpublished puzzles, fixed prompt wording, and controlled scoring. That is why a model like GPT-5.2 Pro can exceed 140 on the public Mensa set, yet score in the low 120s on our offline set. Both numbers are useful. Mensa shows best-case performance on familiar formats, while offline scores are a better proxy for generalization.
4) Did GPT-5.2 Pro really cross the genius bar?
Yes, on the public Mensa Norway test, GPT-5.2 Thinking and Gemini 3 Pro Preview clear 140+, and GPT-5.2 Pro reaches the top of the test’s reported scale. On the offline set, the top scores are lower, which is the point of running both side by side.
5) Which models are in the Top 10 chart and how did you choose them?
The Top 10 list is sorted by the most recent offline scores. The current set includes GPT-5.2 Thinking, Gemini 3 Pro Preview, Grok-4 Expert Mode, GPT-5.2 Pro, GPT-5.2, Claude-4.5 Opus, Kimi K2 Thinking, Gemini 3 Pro Preview (Vision), Claude-4.5 Sonnet, and GPT-5.1 Pro (Vision).
6) What do the green and purple lines on the chart mean?
The green vertical line marks 100, the human average IQ. Any bar that extends to the right of that line is above the human mean on that scale. The purple vertical line marks 140, a commonly used genius threshold. Only a few Mensa results cross that line, and none of the current offline results do. These reference lines anchor the visual so readers can interpret scores quickly without memorizing the scale. They also prevent confusion when different test sources produce different numbers.
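For readers who want to reproduce that visual, here is a minimal matplotlib sketch, not our production chart code, that draws a horizontal bar chart with the two reference lines; the scores are a subset of the offline column above.

```python
import matplotlib.pyplot as plt

# A few offline scores from the table above (illustrative subset).
models = ["GPT 5.2 Thinking", "Gemini 3 Pro Preview", "Grok-4 Expert Mode",
          "GPT 5.2 Pro", "Claude-4.5 Opus"]
offline = [127, 127, 125, 123, 120]

fig, ax = plt.subplots(figsize=(8, 3))
ax.barh(models, offline, color="steelblue")
ax.axvline(100, color="green", linestyle="--", label="Human mean (100)")
ax.axvline(140, color="purple", linestyle="--", label="Genius threshold (140)")
ax.invert_yaxis()  # highest score on top
ax.set_xlabel("Offline IQ score")
ax.legend(loc="lower right")
plt.tight_layout()
plt.show()
```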
7) Are AI IQ scores a good predictor of coding or math performance?
AI IQ scores correlate with performance on abstract puzzles and pattern reasoning. That often helps with tasks like stepwise logic, careful arithmetic, and short algorithm sketches. However, production coding involves long-context planning, tool use, tests, and security concerns that IQ tests do not measure. If you care about software work, check task benchmarks such as competitive programming evaluations or code-fix leaderboards in addition to AI IQ rankings. Treat IQ as one input to model selection, not the only metric.
8) Can a single prompt or testing method change the score?
Yes. Prompt wording, answer format, and time or retry rules can shift accuracy. That is why our offline method uses standardized prompts and fixed scoring, and why we publish the test source next to each bar. If you see a large jump that only appears on one site or one day, look for differences in prompt phrasing or in how many chances the model received to answer. Stable methods produce stable rankings, which is the goal of showing Mensa and offline side by side.
9) Which source should I trust more, Mensa or offline?
Use them together. The Mensa Norway test shows how models handle a familiar public format, which can be influenced by training overlaps. Our offline set uses unpublished items and fixed prompts to reduce leakage, so it is a better view of generalization. If the two disagree, treat the offline number as the truer baseline and the Mensa number as a best case. That is why we show both bars and keep the chart date stamped.
10) How do you verify scores before adding them to the leaderboard?
We only publish figures we can reproduce or that come from documented public runs. For Mensa, that means consistent scoring on the same item pool and answer format. For offline, that means fixed prompts, unpublished items, and a controlled scoring script. If a model provider posts a one-off claim that we cannot replicate, we label it as provisional or leave it out until confirmed. This keeps the Top 10 stable and useful for readers who compare models over time.
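As a rough picture of what a controlled scoring script can look like, here is a minimal sketch assuming a fixed answer key and one response per item. The raw-score-to-IQ mapping is a simplified linear stand-in and the calibration constants are hypothetical, not our actual norming procedure.

```python
# Minimal scoring sketch: compare model answers to a fixed key, then map the
# raw total onto an IQ-style scale. The mapping below is a simplified stand-in.

ANSWER_KEY = {"q1": "C", "q2": "A", "q3": "F"}  # unpublished items in practice

def raw_score(responses: dict[str, str]) -> int:
    """Count exact matches against the answer key (no partial credit, no retries)."""
    return sum(responses.get(q, "").strip().upper() == a for q, a in ANSWER_KEY.items())

def to_iq(raw: int, mean_raw: float, sd_raw: float) -> float:
    """Toy normal-curve mapping: z-score the raw total, then rescale to mean 100, SD 15."""
    return 100 + 15 * (raw - mean_raw) / sd_raw

if __name__ == "__main__":
    responses = {"q1": "c", "q2": "B", "q3": "F"}
    raw = raw_score(responses)
    # mean_raw and sd_raw would come from a human norming sample; values here are made up.
    print(f"raw={raw}, IQ~{to_iq(raw, mean_raw=1.8, sd_raw=0.7):.0f}")
```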
