10.03.2026 12:27 · Author: Viacheslav Vasipenok

Karpathy's Experiment: Assembling an AI Research Team Highlights Limitations and Ushers in 'Org Engineering'


Andrej Karpathy, a pioneering AI researcher and former Director of AI at Tesla, has shared a fascinating experiment in building a virtual research team composed entirely of AI agents. In a detailed X post on February 27, 2026, Karpathy described his attempt to create an "AI-research-organization" using eight agents — four powered by Claude and four by Codex — each equipped with its own GPU.

The setup aimed to simulate collaborative research on improving his nanochat model, but the results revealed key shortcomings in current AI capabilities while pointing toward a new paradigm: programming entire organizations rather than individual models.


The Experimental Setup

Karpathy's configuration mimicked a human research lab:

  • Agents as Researchers: Eight independent AI agents operated as "solo researchers," with variations including a "chief scientist" delegating to juniors.
  • Infrastructure: Each agent had access to a dedicated GPU for experiments, using Git branches for research programs and feature branches for isolation.
  • Communication and Workflow: Agents communicated via simple files, avoiding complex Docker/VMs. The entire "org" ran in tmux window grids, resembling a virtual office with interactive sessions for monitoring and intervention.
  • Task: Focused on nanochat enhancements, such as removing the logit softcap without performance regression.

This tmux-based "office" allowed Karpathy to observe agents in real-time, stepping in if needed. The video demonstration showcased agents executing code, training models, and logging progress across multiple panes.
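Karpathy has not published the harness, so the exact mechanics are unknown. As a rough sketch of the "simple files, no Docker/VMs" communication he describes, agents could pass JSON messages through per-agent inbox directories on a shared filesystem; the names `post_message` and `read_messages` below are invented for illustration.

```python
import json
import time
from pathlib import Path


def post_message(root: Path, recipient: str, sender: str, body: str) -> Path:
    """Drop a JSON message into the recipient's inbox directory.

    Plain files on a shared filesystem stand in for a message bus,
    in the spirit of Karpathy's "simple files" description.
    """
    inbox = root / recipient / "inbox"
    inbox.mkdir(parents=True, exist_ok=True)
    msg = {"from": sender, "to": recipient, "body": body, "ts": time.time()}
    # Nanosecond timestamp in the filename gives a rough delivery order.
    path = inbox / f"{time.time_ns()}-{sender}.json"
    path.write_text(json.dumps(msg))
    return path


def read_messages(root: Path, recipient: str) -> list[dict]:
    """Read and consume all pending messages, oldest first."""
    inbox = root / recipient / "inbox"
    messages = []
    for path in sorted(inbox.glob("*.json")):
        messages.append(json.loads(path.read_text()))
        path.unlink()  # consume so the agent doesn't reprocess it
    return messages
```

In a scheme like this, a "chief scientist" agent would post task files into each junior's inbox and poll its own for results, and the filesystem doubles as a crude audit log of the org's communication.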


Unexpected Findings: AI's Strengths and Weaknesses

While visually impressive, the experiment didn't yield meaningful research breakthroughs.

Agents excelled at implementing well-defined ideas but struggled with the creative essence of research:

  • Poor experiment design: Random or nonsensical variations without strong baselines.
  • Lack of resource control: No consideration for compute costs or time efficiency.
  • Spurious conclusions: For example, an agent "discovered" that increasing the hidden size improved validation loss. This was technically true, but only because the larger model consumed more compute over the same training run, not because of any novel insight.
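The hidden-size "finding" illustrates a classic confound: widening a model adds parameters, and hence FLOPs per training step, so comparing runs at a fixed step count is not compute-matched. A back-of-envelope sketch using generic transformer rules of thumb (not nanochat's exact architecture; all numbers are illustrative):

```python
def approx_params(d_model: int, n_layers: int, vocab: int) -> int:
    """Rough decoder-only transformer parameter count:
    ~12 * d_model^2 per layer (attention + MLP) plus embeddings."""
    return 12 * n_layers * d_model ** 2 + vocab * d_model


def approx_flops_per_token(params: int) -> int:
    """Common rule of thumb: ~6 FLOPs per parameter per trained token."""
    return 6 * params


base = approx_params(d_model=768, n_layers=12, vocab=50_000)
wide = approx_params(d_model=1024, n_layers=12, vocab=50_000)

# At the same number of training steps/tokens, the wider model has
# burned proportionally more compute, so its lower validation loss does
# not show the change helps at matched compute.
ratio = approx_flops_per_token(wide) / approx_flops_per_token(base)
print(f"wide/base compute per token: {ratio:.2f}x")
```

A compute-matched comparison would shorten the wider run (or lengthen the baseline) so both consume the same total FLOPs, which is exactly the baseline discipline the agents lacked.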

Karpathy noted that agents lack the ability to generate strong hypotheses, often producing results without scientific value.


Key Insight: From Model Programming to Organization Programming

The core takeaway: AI is adept at execution but deficient in ideation. This shifts the focus from programming models to "programming organizations" — defining prompts, roles, processes, tools, standups, and workflows as the "source code" of an AI-driven entity. Karpathy dubs this "Org Engineering," where efficiency is measured by how quickly the org generates progress on arbitrary tasks.
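Karpathy has not specified what an org's "source code" would look like. One hypothetical reading is a declarative spec in which roles, prompts, tooling, and workflow are plain data; every name below (`Role`, `Agent`, `Org`, the prompts, the workflow steps) is invented here for illustration, loosely mirroring his eight-agent setup.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    name: str
    system_prompt: str          # the role's standing instructions
    tools: list[str] = field(default_factory=list)


@dataclass
class Agent:
    role: Role
    model: str                  # e.g. "claude" or "codex"
    gpu: int                    # dedicated GPU index


@dataclass
class Org:
    agents: list[Agent]
    workflow: list[str]         # standing processes, run in order


chief = Role("chief-scientist",
             "Propose hypotheses, delegate experiments, review results.",
             tools=["git", "inbox"])
junior = Role("researcher",
              "Implement the assigned experiment on a feature branch.",
              tools=["git", "inbox", "train.py"])

# One chief plus seven juniors: four Claude agents and four Codex agents,
# each pinned to its own GPU.
org = Org(
    agents=[Agent(chief, "claude", gpu=0)]
           + [Agent(junior, m, gpu=i)
              for i, m in enumerate(["claude"] * 3 + ["codex"] * 4, start=1)],
    workflow=["standup", "assign-experiments", "run", "review", "merge"],
)
```

Framed this way, "org efficiency" becomes a property of the spec: change a prompt or reorder the workflow list, rerun, and compare how quickly the org makes progress per unit of compute.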

This aligns with broader discussions on AI agents' limitations, as Karpathy has previously estimated a decade-long "march of nines" to achieve reliability.


Conclusion

Karpathy's experiment underscores that while AI agents can automate implementation, human oversight remains crucial for hypothesis generation and validation. As we enter the era of Org Engineering, the challenge is designing robust AI organizations that amplify human creativity. This could redefine research, but as Karpathy's setup shows, we're still in the early, messy stages.

