AI Just Beat Law Professors at Their Own Game — Stanford Study

In a landmark blind study from Stanford Law School, law professors overwhelmingly preferred AI-generated answers to student questions over responses written by their own colleagues. The results are striking — and have major implications for legal education and professional services.

How the Experiment Was Designed
- 16 professors from 14 top U.S. law schools, all teaching contract law from the same textbook.
- They collectively created 40 realistic questions that students typically ask during office hours.
- Each professor wrote their own concise answers (~90 words).
- AI systems (primarily Gemini 2.5 Pro and NotebookLM) generated matching responses.
- In nearly 3,000 anonymized, blind comparisons, the same professors judged which answer would serve students better, without knowing who or what had written it.

The Results
- AI won 75% of head-to-head matchups against human professors.
- Professors flagged AI answers as “potentially harmful or misleading for student learning” in only 3.5% of cases — compared to 12% for peer-written answers (with one professor’s answers flagged nearly 40% of the time).
- The best AI models performed at the level of the strongest human instructor in the sample.

When the researchers later evaluated additional frontier models using an AI judge (calibrated against human judgments), every single LLM tested outperformed the human professors. The leaderboard was topped by Claude Opus 4.7, followed by ChatGPT 5.4 and Gemini 2.5 Pro. The gap appears to be widening with each new model generation.
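
To make the two headline metrics concrete, here is a minimal Python sketch of how a head-to-head win rate and a judge-versus-human agreement rate could be tallied from pairwise comparison records. The `Comparison` record, its field names, and the toy data are assumptions made for illustration. The study's actual data format and calibration procedure are not described here, and raw agreement is only one simple way to check an AI judge against human raters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    """One blind head-to-head comparison (hypothetical record layout)."""
    question_id: int
    human_pick: str  # "ai" or "professor": side preferred by the blinded professor
    judge_pick: str  # "ai" or "professor": side preferred by the AI judge


def ai_win_rate(comparisons):
    """Share of comparisons in which the human rater preferred the AI answer."""
    if not comparisons:
        raise ValueError("no comparisons to score")
    wins = sum(1 for c in comparisons if c.human_pick == "ai")
    return wins / len(comparisons)


def judge_agreement(comparisons):
    """Share of comparisons in which the AI judge picked the same side as the human."""
    if not comparisons:
        raise ValueError("no comparisons to score")
    matches = sum(1 for c in comparisons if c.human_pick == c.judge_pick)
    return matches / len(comparisons)


# Toy records invented for illustration only; these are NOT study data.
toy = [
    Comparison(1, "ai", "ai"),
    Comparison(2, "ai", "ai"),
    Comparison(3, "professor", "ai"),
    Comparison(4, "ai", "ai"),
]

print(ai_win_rate(toy))      # 0.75
print(judge_agreement(toy))  # 0.75
```

A judge that agrees with human raters on most held-out comparisons can then be used to score models the professors never rated, which is the role the AI judge plays in the leaderboard described above.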

Two Surprising Product-Building Lessons

1. The Stock Model Beat RAG-Grounded Tools
Stock Gemini 2.5 Pro (no custom retrieval) outperformed both NotebookLM and dedicated commercial AI tutoring systems that used RAG grounded in the textbook.
The authors’ hypothesis: The base model already knows contract law doctrine well. Injecting long context via RAG can introduce noise and cause “lost in the middle” problems, while custom system prompts sometimes interfere with the model’s natural pedagogical strengths.

2. Reasoning Budget Matters More Than Freshness
Gemini Flash with extra “thinking”/reasoning steps significantly outperformed the same model without them. Newer models weren’t always better, either: Gemini 3.1 Pro underperformed 2.5 Pro, suggesting that post-training and specific fine-tuning for educational use can outweigh raw recency.

What This Really Means
The study demonstrates that AI isn’t just matching surface-level style — it is meeting (and exceeding) the implicit professional standard that expert lawyers use to judge each other’s work. This is deeper than factual recall; it involves nuanced reasoning, balanced analysis, and clear explanation under ambiguity.

As Professor Nyarko and his co-authors emphasize, the findings don’t mean we should replace professors wholesale. But they do challenge blanket skepticism. The conversation should move from “Can AI do this?” to “How do we deploy it responsibly to help students learn better?”
The era in which elite legal reasoning was an exclusively human domain is ending faster than most expected. And the disruption is coming not from the bottom up, but from the very top of the profession.