Discussion about this post

Mark A:

Great piece.

This:

“It is often implicitly assumed that these reflect the model’s ‘thoughts’ or inner reasoning process, and that ideas exposed in this CoT [Chain of Thought] trace may be indicative of what the model prefers or believes.” However, “The relationship between CoT and the reasoning process is contested.” “CoT traces may only partially reflect the reasoning process that determines model outputs.”

Is the most interesting area to me. I have a modestly better-than-layman's understanding of LLMs, but I know that the study of thought and reasoning is a deeply contested and challenging area in both psychology and philosophy. I think most people intuitively understand "chain of thought" as "chain of reasoning," though the two can come apart: reasoning has to do with justification and validity, whereas thought is more general and allows for some creative associating. But the idea that the model is 'reasoning' and then reporting on that reasoning, rather than just mechanistically producing text output via the relevant weights given its training, seems like more anthropomorphizing to me.

If these models don't believe or pretend, I don't see how they can reason, unless by "reason" we simply mean good old-fashioned AI, where models essentially just perform logical functions like 'if, then,' etc. I could be wrong, and that's why I'd like to read more about what exactly makes the output of CoT models different from that of regular models.
