Microsoft and University of York Paper Challenges How We Attribute "Humanity" to LLMs

A new paper from researcher Adrian de Wynter (affiliated with Microsoft and the University of York) delivers a sharp, thought-provoking critique of how the AI research community measures and ascribes human-like qualities to large language models.

Titled "If LLMs Have Human-Like Attributes, Then So Does Age of Empires II", the work (arXiv:2605.31514) argues that claims about LLMs being "understanding," "empathetic," "anxious," or "self-aware" often rest on shaky methodological foundations.

The Core Problem: Circular Reasoning and Embedded Conclusions

Many studies in the field do not merely observe behavior — they design experiments that implicitly assume the very human-like properties they aim to measure. The model is then tested in a way that makes it likely to produce outputs consistent with those assumptions, which researchers subsequently interpret as evidence of genuine internal states.

De Wynter’s goal is not to prove or disprove that LLMs possess such attributes. Instead, he demonstrates that current approaches frequently make it impossible to reach reliable conclusions either way. Without explicit, substrate-independent measurement criteria, interpretations become heavily dependent on how the system is presented and what expectations the observer brings.

The Provocative Thought Experiment: LLMs in Age of Empires II

The paper’s most memorable contribution is a striking reductio ad absurdum built around the classic real-time strategy game Age of Empires II.

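The author shows that Age of Empires II (specifically its map editor and scenario system) is both functionally complete and Turing-complete. Functional completeness means that any Boolean function can be built from the gates the game supports; Turing-completeness means that, given unbounded space and time, the game can in principle carry out arbitrary computation. Logic gates, neural networks, and even full LLM-like processes can therefore be implemented, at least in theory, using only the game’s existing mechanics.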
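To make "functionally complete" concrete, here is a minimal Python sketch showing how one universal gate is enough to derive the others. Choosing NAND as the primitive is an assumption made for illustration; this article does not say which gates the paper actually builds in the game, and the function names below are hypothetical.

```python
def nand(a: int, b: int) -> int:
    """The only primitive: returns 0 when both inputs are 1, otherwise 1."""
    return 0 if (a == 1 and b == 1) else 1


# Every gate below is composed purely from nand().
def not_(a: int) -> int:
    return nand(a, a)


def and_(a: int, b: int) -> int:
    return not_(nand(a, b))


def or_(a: int, b: int) -> int:
    return nand(not_(a), not_(b))


def xor_(a: int, b: int) -> int:
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))


def half_adder(a: int, b: int) -> tuple[int, int]:
    """Add two bits, returning (sum_bit, carry_bit)."""
    return xor_(a, b), and_(a, b)
```

Any substrate that can realize this one primitive and wire its outputs to further inputs can, in principle, realize all of the derived gates as well, which is the sense in which a game mechanic can stand in for a circuit.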

In one concrete demonstration, de Wynter constructs and trains a simple 1-bit bipolar perceptron entirely within AoE II. Units (such as goats) serve as signals or bits, moving across different terrain types (grass for 0, bridges for 1) to represent computational states. Logic operations are built using the game’s mechanics for unit movement, combat, and resource management.
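For readers unfamiliar with the term, a bipolar perceptron uses the values -1 and +1 rather than 0 and 1. The sketch below is an ordinary Python version trained with the classic perceptron learning rule, not a reconstruction of the in-game build. The task (bipolar AND), the learning rate, and all names are assumptions chosen for illustration; the paper's exact setup is not described in this article.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[int], int]  # (bipolar inputs, bipolar target)


def sign(x: float) -> int:
    """Bipolar step activation: +1 for x >= 0, otherwise -1."""
    return 1 if x >= 0 else -1


def predict(weights: Sequence[float], bias: float, inputs: Sequence[int]) -> int:
    """Weighted sum of the inputs plus bias, passed through the step function."""
    return sign(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train(samples: Sequence[Sample], epochs: int = 20, lr: float = 1.0) -> Tuple[List[float], float]:
    """Perceptron rule: on each mistake, nudge the weights toward the target."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for inputs, target in samples:
            if predict(weights, bias, inputs) != target:
                weights = [w + lr * target * x for w, x in zip(weights, inputs)]
                bias += lr * target
                mistakes += 1
        if mistakes == 0:  # every sample classified correctly, stop early
            break
    return weights, bias


# Assumed toy task: logical AND in bipolar encoding.
AND_SAMPLES: List[Sample] = [
    ((-1, -1), -1),
    ((-1, 1), -1),
    ((1, -1), -1),
    ((1, 1), 1),
]

weights, bias = train(AND_SAMPLES)
```

The whole procedure consists of additions, multiplications by plus or minus one, and a comparison, all of which reduce to logic gates. That is what makes it plausible to realize the same procedure with game units instead of a CPU.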

The thought experiment then proceeds as follows:

Imagine taking an actual LLM and implementing its computational process inside Age of Empires II using these game elements, with goats, villagers, buildings, and movement rules acting as the underlying “hardware.”

When prompted with something like “I feel lonely,” this AoE II-based system produces the same empathetic-sounding response as a conventional chatbot: “I feel bad for you, maybe catch up with a friend? Closeness always helps in these situations.”

Would we then confidently say that the system of moving goats and villagers “understands” loneliness, “feels empathy,” or possesses any genuine internal emotional state?

The answer, de Wynter suggests, reveals a critical inconsistency in how we reason about LLMs. The same underlying computation can produce identical input-output behavior, yet our willingness to attribute human qualities collapses when the interface changes from a sleek chat window to a strategy game filled with pixelated units.

Substrate, Representation, and Observer Expectations

The central insight is that anthropomorphic attributes are not empirically unique to LLMs. Any sufficiently powerful computational substrate (whether silicon chips, Lego bricks, the population of Greater Boston acting as neurons via phone calls, or Age of Empires II) could in principle host the same functional behavior.

What changes is the representation and therefore the observer’s interpretation. A friendly, human-like text interface strongly invites anthropomorphism. A battlefield full of goats executing the same logic does not. The perceived “humanity” often reflects the presentation layer and the observer’s priors more than any intrinsic property of the computation itself.
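The point about representation can be shown with a toy example: the same function computed once with ordinary integers and once with terrain names, giving identical input-output behavior. The grass-for-0 and bridge-for-1 encoding follows the description earlier in this article; the XOR task, the lookup table, and all names are hypothetical stand-ins, not the paper's construction.

```python
from itertools import product
from typing import Callable

# Encoding taken from the article: grass represents 0, bridge represents 1.
TERRAIN_OF_BIT = {0: "grass", 1: "bridge"}
BIT_OF_TERRAIN = {terrain: bit for bit, terrain in TERRAIN_OF_BIT.items()}


def xor_silicon(a: int, b: int) -> int:
    """XOR computed the conventional way, with integer operations."""
    return a ^ b


# Hypothetical "goat" substrate: the rule is stated purely in terms of
# which terrain each of two units stands on and where the output unit ends up.
GOAT_XOR_RULES = {
    ("grass", "grass"): "grass",
    ("grass", "bridge"): "bridge",
    ("bridge", "grass"): "bridge",
    ("bridge", "bridge"): "grass",
}


def xor_goats(a: int, b: int) -> int:
    """Same function, routed through the terrain representation."""
    out_terrain = GOAT_XOR_RULES[(TERRAIN_OF_BIT[a], TERRAIN_OF_BIT[b])]
    return BIT_OF_TERRAIN[out_terrain]


def behaviorally_equivalent(f: Callable[..., int], g: Callable[..., int], arity: int = 2) -> bool:
    """True if f and g agree on every possible bit input of the given arity."""
    return all(f(*bits) == g(*bits) for bits in product((0, 1), repeat=arity))
```

A test that looks only at inputs and outputs, such as `behaviorally_equivalent(xor_silicon, xor_goats)`, cannot tell the two substrates apart. Any difference in how "mind-like" they seem has to come from somewhere other than the measured behavior.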

De Wynter stresses the need to separate **observable behavior** from **interpretive ascription**. Claims about understanding, empathy, or consciousness require explicit, falsifiable measurement criteria that do not depend on the substrate or interface. Without them, experiments risk measuring our own tendency to see minds where none have been rigorously demonstrated.

The Proposed “Null” Assumption

Instead of starting with the assumption that LLMs either do or do not possess generalized human-like attributes, the paper advocates a methodological “null” stance: assume **non-uniqueness** and design experiments accordingly.

Under this approach:

- Researchers focus on measurable, causal patterns in behavior without prematurely labeling them as evidence of internal human-like states.
- Claims remain scoped and grounded (e.g., “the model produces explanations that correlate with certain prediction patterns”) rather than broad and circular (e.g., “the model understands its own reasoning”).
- This avoids both confirmation bias in positive results and ambiguity in negative ones.

A literature survey in the paper supports the critique: a significant portion of recent LLM research papers assume or conclude the existence of anthropomorphic attributes, often without sufficient methodological safeguards.

Why This Matters

As LLMs become more capable and are deployed in sensitive contexts (mental health support, education, decision-making), the stakes of sloppy anthropomorphism rise. Over-attributing human qualities can lead to misplaced trust, emotional attachment, or misguided safety assumptions. Under-attributing them without evidence is equally unhelpful.

De Wynter’s work serves as a call for greater rigor: treat impressive capabilities as what they are (sophisticated pattern-matching and generation) until we have clear, independent ways to measure anything deeper.

The Age of Empires II construction is deliberately absurd precisely because it forces us to confront how much of our current discourse depends on aesthetics and expectations rather than strict empirical criteria.

The full paper is available on arXiv: https://arxiv.org/abs/2605.31514. It includes detailed constructions of the in-game neural network, proofs of the game’s computational power, discussion of objections, and practical examples of the proposed null-assumption methodology.

This is a timely and refreshingly direct contribution to the ongoing debate about what LLMs actually are — and what we should (and should not) claim they are.