An A.I. Tool's Intelligence Does Not Exist

The benchmark leaderboards tell a seductive story: Model A scores 87% on MMLU, Model B hits 92%, so Model B must be "smarter." We've built an entire evaluation apparatus on this premise—that we can measure the intelligence of a tool in isolation, assign it a number, and use that number to predict performance in the real world. This is wrong. Worse, it's a category error that systematically misleads us about where intelligence actually resides in AI systems.

A tool's intelligence does not exist as a standalone property. What we measure when we benchmark an AI model is not the model's capability—it's the performance of a composite system in which human and machine cognition are inextricably fused. The prompter's skill, domain knowledge, and ability to iterate matter as much as the model's parameters. Strip away the human element and you haven't isolated the AI's intelligence; you've created an artificial testing scenario that tells you almost nothing about real-world performance.

This isn't a minor methodological quibble. It's a fundamental attribution error that pervades how we evaluate, compare, and deploy AI systems. When a system succeeds, we credit the model. When it fails, we blame the model. But both attributions ignore the distributed nature of the intelligence doing the work.

Intelligence as Distributed Cognition

The theoretical foundation for this argument comes from cognitive science, not computer science. In 1998, philosophers Andy Clark and David Chalmers proposed the "extended mind thesis"—the idea that cognitive processes extend beyond the brain into tools and artifacts in the environment. Your notepad isn't just storing information for your brain; it's part of your cognitive system. The calculator doesn't assist your mathematical thinking; it is part of how you think mathematically in that moment.

AI systems take this further. When you interact with a language model, you're not using a passive tool that either works or doesn't. You're entering into what researchers call a "participation framework"—a joint activity where composition and interpretation happen across both agents. The model generates based on your context; you refine based on its output; meaning emerges from the interaction, not from either component alone.

Lucy Suchman demonstrated this in her foundational work on situated action at Xerox PARC. She showed that intelligent behavior doesn't come from executing pre-formed plans stored in heads or machines. It emerges from ongoing interaction with a "swarm of contingencies" in the environment. Suchman studied people trying to use an "intelligent" photocopier and found that communication breakdowns occurred not because users were incompetent or the system was poorly designed in isolation, but because human action and machine response were fundamentally situated: dependent on context that couldn't be fully specified in advance.

This matters for AI evaluation because it means performance cannot be attributed to the model alone. The "intelligence" doing the task is distributed across the human-AI system, including the prompter's knowledge of what to ask, how to frame it, what context to provide, and how to interpret the output.

The Prompt Engineering Gap

The empirical evidence for this is overwhelming. Researchers at Wharton found that changing only the tone of a prompt (saying "please" versus "I order you") can shift performance by up to 60 percentage points in either direction on individual questions. That variance averages out across full datasets, but it reveals something crucial: the same model, facing the same question, produces radically different results based on how you talk to it.

Other studies show even starker effects. A medical AI system's accuracy jumped from 80.1% to 99.6% when researchers applied proper prompt engineering techniques. The AI did not improve; it was the same model throughout. The entire 19.5-point gain came from how the human user structured the interaction. The model's "intelligence," as measured by accuracy, moved by nearly 20 percentage points on user skill alone.

This creates a problem for benchmarks that most researchers have simply ignored. When you report that Model X achieves 89% accuracy on Task Y, that number includes massive hidden variance based on how the prompt was constructed. Different prompt templates for the same task can produce standard deviations of 0.28 in accuracy—meaning the "same" model might score anywhere from 25% to 90% depending on how you ask.
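One way to make that hidden variance visible is to report a distribution over prompt templates instead of a single score. The following is a minimal sketch, not a published protocol: the function name `prompt_sensitivity`, the template labels, and the accuracy values are all hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def prompt_sensitivity(scores_by_template: dict[str, float]) -> dict[str, float]:
    """Summarize how one model's accuracy varies across prompt templates."""
    scores = list(scores_by_template.values())
    return {
        "mean": mean(scores),
        "std": pstdev(scores),  # population std over the templates tested
        "min": min(scores),
        "max": max(scores),
        "spread": max(scores) - min(scores),
    }


# Hypothetical accuracies for ONE model on ONE task (illustrative, not measured).
example_scores = {
    "terse": 0.50,
    "polite": 0.70,
    "few_shot": 0.90,
}

summary = prompt_sensitivity(example_scores)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Reporting the minimum, maximum, and standard deviation alongside the mean lets a reader see whether a headline number is stable or is one draw from a wide range.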

Recent work shows much of this variance comes from heuristic evaluation methods that miss semantically correct answers phrased differently than expected. But even correcting for evaluation problems, substantial prompt-dependent variation remains. Google researchers showed that this happens because large language models perform a kind of "learning without training"—they adapt their internal behavior based on the prompt context, essentially fine-tuning themselves at inference time based on what you show them.

This means prompt quality functions as a proxy for user intelligence. A skilled prompter who understands the task domain, knows how to provide relevant examples, and can frame requests clearly will get dramatically better results from the same model than a novice user. The benchmark score conflates these two sources of capability.

Benchmarks Measure the Wrong Thing

Most AI benchmarks suffer from catastrophic construct validity failures—they don't actually measure what they claim to measure. A comprehensive review of 445 benchmarks from major AI conferences found that "almost all articles have weaknesses in at least one area" of construct validity.

Construct validity is a concept from measurement theory. It asks: does your test actually measure the abstract concept you claim it measures? If you want to measure "reasoning ability" but your test can be passed by memorization, you have low construct validity. Your high scores are irrelevant or misleading.

The review found systematic problems. When benchmarks provided definitions for what they were measuring, 47.8% used "contested" definitions: concepts with "many possible definitions or no clear definition at all." The example they give is "harmlessness," a key safety metric. If two models score differently on a harmlessness benchmark, it may only reflect two different arbitrary definitions of harmlessness, not a genuine difference in safety.

Only 16% of benchmarks used statistical uncertainty estimates or tests to compare model results. Without this, you can't know if a 2% lead for Model A over Model B represents a real capability difference or random chance. Yet enterprise procurement decisions worth hundreds of millions of dollars get made based on these numbers.

Data contamination makes this worse. Studies found contamination levels ranging from 1% to 45% across benchmarks. When benchmark questions appear in training data, models aren't demonstrating reasoning—they're recalling answers. Large models benefit more from this contamination than small ones. The performance you're measuring is partly memory, but you're attributing it to capability.

The most sophisticated contamination analysis to date, using a method called ConTAM, found that contamination impact has been systematically underestimated. The longest matching n-gram detection method revealed much higher performance gains from contamination than previous approaches caught. For large models, eliminating truly contaminated examples can drop benchmark scores substantially—meaning the reported capabilities were inflated by exposure to test data during training.
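To make the idea of longest-match detection concrete, here is a minimal sketch of the general technique. It is not the ConTAM implementation: the whitespace tokenization, the function names, and the normalization by question length are assumptions made for illustration.

```python
def longest_shared_ngram(benchmark_text: str, corpus_text: str) -> int:
    """Length, in tokens, of the longest contiguous token run found in both texts."""
    a = benchmark_text.lower().split()  # assumption: naive whitespace tokens
    b = corpus_text.lower().split()
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous token pair.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def contamination_score(benchmark_text: str, corpus_text: str) -> float:
    """Fraction of the benchmark item covered by its longest shared run."""
    n_tokens = len(benchmark_text.split())
    if n_tokens == 0:
        return 0.0
    return longest_shared_ngram(benchmark_text, corpus_text) / n_tokens


question = "What is the capital of France"
leaked = "Quiz answers: what is the capital of France ? Paris."
unrelated = "The weather in France is mild"

leaked_score = contamination_score(question, leaked)
unrelated_score = contamination_score(question, unrelated)
print(leaked_score, unrelated_score)
```

A real pipeline would scan a full training corpus with an index, not a pairwise loop, and would need a threshold for what counts as contaminated. Both are left out here.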

The Human-AI Performance Paradox

When researchers actually measure human-AI systems in realistic tasks, the results contradict everything the benchmarks predict. A meta-analysis of 106 experimental studies on human-AI collaboration found that on average, human-AI combinations performed significantly worse than the best of humans or AI alone (effect size g = -0.23).

This wasn't a small sample quirk. The researchers at MIT Sloan analyzed 370 separate effect sizes across diverse domains. The consistent finding: combining humans and AI often degrades performance compared to just using whichever one is better at the task.

The pattern is even more specific. For decision-making tasks—choosing between finite options—the combined human-AI systems showed significantly negative performance (g = -0.27). But for content creation tasks—generating text, images, or other artifacts—combinations showed significant performance gains. The meta-analysis also found that when AI alone outperformed humans alone, adding humans into the loop caused substantial performance losses. When humans outperformed AI alone, adding AI helped.
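For readers unfamiliar with the g values quoted above, Hedges' g is a standardized mean difference with a small-sample correction. The sketch below uses the standard textbook formula. The scores are hypothetical, and the meta-analysis's own estimation procedure may differ in detail.

```python
from math import sqrt
from statistics import mean, stdev


def hedges_g(group_a: list[float], group_b: list[float]) -> float:
    """Standardized mean difference (a minus b) with small-sample correction."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_sd = sqrt(
        ((n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2)
        / (n_a + n_b - 2)
    )
    cohens_d = (mean(group_a) - mean(group_b)) / pooled_sd
    correction = 1 - 3 / (4 * (n_a + n_b) - 9)  # usual approximation of the bias factor
    return correction * cohens_d


# Hypothetical task scores (illustrative only, not data from any study).
human_ai_team = [0.70, 0.72, 0.68, 0.74, 0.66]
best_alone = [0.75, 0.77, 0.73, 0.79, 0.71]

g = hedges_g(human_ai_team, best_alone)
print(f"g = {g:.3f}")
```

With this sign convention, a negative g means the human-AI combination scored below the better of the two solo baselines, which is the direction of the average effect reported above.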

This demolishes the naive story that "AI augmentation" automatically makes humans better at tasks. Whether collaboration helps or hurts depends entirely on the task structure, the relative performance baseline, and how the interaction is designed. But current benchmarks measure none of this. They test the AI in isolation, then implicitly assume that performance transfers to real-world collaborative contexts. It doesn't.

Other studies show similar effects. Research on AI delegation, where an AI decides which task instances to handle itself versus passing to humans, found that task performance and satisfaction improved when delegation was handled well, but the mechanism was human self-efficacy, not AI capability. The AI created conditions where humans performed better, but the performance gain came from the human side of the system.

The Attribution Error

There's a term from social psychology that explains what's happening here: the fundamental attribution error. This is our tendency to explain other people's behavior by their character while underweighting their circumstances, even as we readily excuse our own behavior by circumstance. When someone cuts us off in traffic, they're an asshole. When we cut someone off, we were running late for an important meeting.

Writer Stephen Turner coined an extension: the "AI attribution error." When an AI produces something useful, we attribute it to our skill in model choice, prompting, and steering. When it produces garbage, we attribute it to the model being "a useless lying machine that can't follow directions." Both attributions ignore the joint system that produced the outcome.

The same pattern appears at the organizational level. When companies deploy AI successfully, press releases credit "visionary technology adoption." When systems fail or hallucinate, the narrative becomes "the model drifted" or "AI limitations." The technology gets blamed, not the integration approach, the training quality, the user skill, or the evaluation process.

This attribution error is baked into how we design benchmarks. We want to measure "the AI's intelligence," so we control for human factors, standardize prompts, and eliminate variance. Then we report a number that represents the AI's capability. But you can't measure a distributed system by measuring one component in isolation and calling it done. That's like measuring the horsepower of a car engine and claiming you've measured how well the car drives. The engine matters, but so do the transmission, tires, suspension, driver skill, road conditions, and weather.

When we benchmark AI systems, we're usually measuring something more like: "How well does this model perform when prompted by an expert researcher who has iterated on prompts, knows the task domain well, and has optimized for benchmark performance?" That number tells you very little about how the model will perform when prompted by a novice user in a different domain who doesn't know how to structure requests effectively.

Sociotechnical Reality

The broader context makes this worse. AI systems don't operate in sterile laboratory conditions—they're deployed in complex sociotechnical systems where performance depends on organizational culture, workflow integration, user training, data quality, and institutional constraints.

Studies of AI deployment in healthcare show this clearly. A clinical decision support system might achieve 95% accuracy in benchmark testing, but when deployed in a real emergency department, performance depends on whether nurses trust it, whether it integrates into existing workflows, whether users understand its limitations, whether the training data matched the patient population, and whether staff have time to engage with its outputs during crisis conditions. The "work-as-imagined" by AI developers rarely matches "work-as-done" by frontline users.

This is why ecological validity—how well findings from controlled tests translate to real-world settings—matters so much for AI evaluation. A benchmark with perfect internal validity but no ecological validity tells you something precise about nothing that matters. Yet most AI benchmarks have been designed with little consideration for whether performance in the test environment predicts performance in deployment contexts.

The sociotechnical perspective reveals another problem with attributing intelligence to the AI component. When a diagnostic AI "catches" a disease a doctor missed, did the AI do that, or did the human-AI system do it? The AI flagged something. The human decided whether to investigate further. The workflow determined whether that flag was noticed. The training determined whether the human understood the AI's reasoning. The organizational culture determined whether the human felt empowered to override the AI or trust it. Intelligence is distributed across all those elements.

The Measurement Problem

What does this mean for how we should evaluate AI systems? The field of measurement science—metrology—offers some answers. Valid measurement requires clearly defining what you're measuring, establishing uncertainty estimates, using appropriate baselines, and ensuring construct validity.

NIST's research on AI measurement emphasizes that construct validity is context-dependent. Whether a benchmark accurately measures "mathematical reasoning" depends not just on the test itself but on what claims you want to make about the system. A test might have validity for ranking models relative to each other but no validity for predicting real-world math problem-solving ability.

The key insight from metrology is that measurement validity depends on the system being measured, not just the instrument. If you want to measure how well AI systems perform tasks, you need to measure the system that actually performs tasks—which includes the human user, the interface, the workflow, the data, and the model. Measuring just the model and extrapolating to system performance is like measuring the temperature of an oven's heating element and claiming you've measured how well it bakes bread.

This also requires thinking clearly about uncertainty. All measurements involve uncertainty, but many benchmark results present point estimates with no error bars. When Model A scores 87.3% and Model B scores 87.1%, that difference is almost certainly not statistically significant given realistic sample sizes and variance. Yet leaderboards rank them as if the ordering is certain.
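The 87.3% versus 87.1% comparison can be checked with a standard two-proportion z-test. This is a minimal sketch under stated assumptions: the test-set size of 1,000 items per model is hypothetical, and the two sets of answers are treated as independent samples. If both models answer the same questions, a paired test such as McNemar's would be more appropriate.

```python
from math import erfc, sqrt


def two_proportion_z_test(
    acc_a: float, acc_b: float, n_a: int, n_b: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for a difference between two accuracies."""
    pooled = (acc_a * n_a + acc_b * n_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (acc_a - acc_b) / std_err
    p_value = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Assumption: each model was scored on 1,000 benchmark items.
z, p_value = two_proportion_z_test(0.873, 0.871, 1000, 1000)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```

Under these assumptions the p-value is far above the conventional 0.05 threshold. By the same formula, a 0.2-point gap at this accuracy level would need on the order of a couple hundred thousand items per model before it reached significance.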

Baselines matter too. Comparing AI to AI tells you which model performs better on specific tasks, but it doesn't tell you whether either one is useful. You need human baselines—how well do expert humans perform? How well do novice humans perform? How well do humans perform with different levels of AI assistance? Without these baselines, you can't interpret whether an 89% accuracy score represents superhuman performance or whether a five-year-old could do better.

Where Intelligence Actually Lives

If intelligence doesn't live "in" the AI model, where does it live? The answer is that intelligence is a property of systems, not components. It emerges from the interaction between human cognition, artificial processing, environmental structure, and social context.

This is the insight from distributed cognition theory. When a ship's crew navigates, the intelligence isn't in any one person's head—it's distributed across the navigator plotting the course, the helmsman steering, the lookout watching for hazards, the charts providing reference, and the instruments providing data. The system navigates. The components enable navigation, but intelligence is a systems-level property.

The same applies to human-AI systems. When you use a language model to draft an email, debug code, or analyze data, the intelligence doing that work is distributed across your understanding of the goal, your ability to frame the request, the model's pattern matching, your evaluation of outputs, your domain knowledge that catches hallucinations, and your iterative refinement of the interaction. You're not using the AI's intelligence as a tool. You're creating a temporary cognitive system that thinks differently than either you or the AI could alone.

This perspective explains why prompt engineering matters so much. Prompts aren't just instructions to a machine—they're the interface between two cognitive systems. A good prompt creates conditions for productive distributed cognition. A bad prompt creates a system that thrashes between mismatched representations of the task.

It also explains the paradox that AI can both deskill humans and enhance their capabilities. Cognitive offloading—using external tools to reduce mental effort—can either support skill development (when tools scaffold learning) or degrade it (when tools substitute for thinking). Whether AI enhances or undermines human intelligence depends entirely on how the interaction is structured, not on the AI's capabilities in isolation.

Rethinking Evaluation

The implications for evaluation are profound. If you want to know how well an AI system will perform in practice, you cannot measure the AI in isolation. You need to measure the human-AI system under realistic conditions with representative users.

This means evaluations should include:

User skill variance. Test the same model with novice, intermediate, and expert users. Report performance distributions, not just means. Acknowledge that the same "85% accuracy" might represent perfect performance by experts and 70% by novices.

Prompt sensitivity analysis. For any benchmark, show how performance varies across different prompt formulations. If a model's score ranges from 60% to 90% depending on prompt wording, that variance is information, not noise.

Ecological validity testing. Deploy systems in realistic conditions with real users performing actual work. Compare benchmark performance to field performance. Report the gap.

Human baselines and collaboration modes. Show how the AI compares to humans alone, how humans perform with AI assistance, and how the human-AI system compares to the best of either alone. These are different questions with different answers.

Task structure specification. Acknowledge that performance depends on whether tasks involve decision-making, content creation, open-ended reasoning, or constrained classification. Report separately for different task types.

Attribution transparency. Stop claiming benchmarks measure "AI capability" or "model intelligence." They measure performance of a specific testing protocol that includes data, prompts, evaluation methods, and models. Be honest about what's actually being measured.
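Two of the items above, user skill variance and collaboration modes, reduce to simple reporting once the data exists. The sketch below is illustrative only: the function names, the record layout, and every score are hypothetical, and a real evaluation would also need uncertainty estimates per group.

```python
from statistics import mean


def summarize_by_group(records: list[tuple[str, float]]) -> dict[str, dict[str, float]]:
    """Group (label, score) records and report a small distribution per label."""
    grouped: dict[str, list[float]] = {}
    for label, score in records:
        grouped.setdefault(label, []).append(score)
    return {
        label: {
            "n": len(scores),
            "mean": mean(scores),
            "min": min(scores),
            "max": max(scores),
        }
        for label, scores in grouped.items()
    }


def synergy_gap(human_alone: float, ai_alone: float, human_with_ai: float) -> float:
    """Combined score minus the better solo baseline; negative means the team lost."""
    return human_with_ai - max(human_alone, ai_alone)


# Hypothetical per-user accuracies with the SAME model (illustrative only).
records = [
    ("novice", 0.70), ("novice", 0.70),
    ("expert", 1.00), ("expert", 1.00),
]
by_skill = summarize_by_group(records)
overall = mean(score for _, score in records)
gap = synergy_gap(human_alone=0.80, ai_alone=0.90, human_with_ai=0.85)

print(by_skill)
print(f"overall mean = {overall:.2f}, synergy gap = {gap:+.2f}")
```

In this toy data the overall mean is 85%, yet no user actually scored 85%: experts scored 100% and novices 70%. The synergy gap shows a team that beats the human alone but still trails the AI alone, which is the kind of result a single accuracy figure hides.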

None of this is easy. It's far simpler to run a model through a standardized benchmark and report a number. But simple measurements that mislead are worse than complex measurements that inform. Right now, we're optimizing for simplicity at the cost of validity.

The Conceptual Shift

The deeper issue is conceptual. We've inherited a framework for thinking about intelligence that treats it as a property that individual entities possess in varying amounts. This framework works reasonably well for comparing individual humans on specific tasks—though even there, context matters enormously. But it breaks completely when applied to human-AI systems.

Tools don't have intelligence. Systems demonstrate intelligent behavior. A calculator doesn't "do math"—the human-calculator system solves mathematical problems more efficiently than the human alone could. The intelligence isn't in the calculator; it's in knowing when to use it, how to use it, what problems it solves, and what its limitations are.

Language models are more complex than calculators, but the principle holds. When GPT-4 "writes" an essay, it's not demonstrating intelligence in the way a human writer does. It's serving as part of a distributed cognitive system in which the human provides goals, evaluates outputs, catches errors, refines directions, and makes judgments about quality. The system produces the essay. Attributing authorship solely to either the human or the AI misses what actually happened.

This doesn't mean AI capabilities don't matter. Model quality makes an enormous difference to system performance. But "model quality" is not the same as "model intelligence," and benchmark scores measure at best the former while claiming to measure the latter. A high-quality model integrated poorly performs worse than a mediocre model integrated well. The benchmark won't tell you that.

The Path Forward

We need evaluation frameworks that measure what actually matters: how well do human-AI systems perform meaningful tasks in realistic conditions? This requires drawing on insights from human factors, sociotechnical systems theory, measurement science, and distributed cognition—not just machine learning benchmarks.

Some researchers are moving in this direction. Work on human-AI collaboration metrics looks at task handoff efficiency, human override rates, and collaborative decision quality rather than just AI accuracy in isolation. Studies on AI delegation examine how systems perform when AI decides which tasks to handle versus pass to humans. Research on prompt evaluation focuses on measuring the interaction quality, not just the output.

But these remain minority approaches. The dominant paradigm still treats AI evaluation as a question of measuring model capabilities through standardized benchmarks. This paradigm assumes intelligence lives in the model, that better benchmark scores predict better real-world performance, and that human factors are noise to be controlled rather than essential components of system capability.

Those assumptions are wrong, and the cost of acting on them is high. Organizations deploy systems based on benchmark performance and are surprised when real-world results differ dramatically. Researchers optimize for metrics that don't predict user satisfaction or task success. Safety evaluations test models in isolation and miss failure modes that emerge from human-AI interaction patterns. Billions of dollars flow toward benchmark improvements that don't translate to meaningful capability gains.

The alternative is to acknowledge what cognitive science has known for decades: intelligence is not a substance that tools contain. It's a property of systems adapting to tasks. The intelligence of a human-AI system emerges from the interaction, not from either component. Until we build evaluation methods around that reality, our benchmarks will continue to measure the wrong thing while claiming precision about nothing that matters.
