Nvidia's AI Training Scandal: Emails Reveal Pursuit of Pirated Books Amid Copyright Lawsuit

In a fresh escalation of the ongoing battle between tech giants and content creators, newly disclosed emails have exposed Nvidia's alleged efforts to access massive troves of pirated books for AI model training.

The case, initially launched in early 2024, accuses Nvidia of copyright infringement through the use of unauthorized datasets, including the infamous Books3 collection. With AI's insatiable hunger for data clashing against intellectual property rights, this development could reshape how companies source training materials and face legal scrutiny.
The Roots of the Lawsuit: From Books3 to Broader Allegations

Nvidia defended the practice as "fair use," arguing that the books served as mere statistical inputs for model development. By mid-2024, the plaintiff group expanded to include Andre Dubus III and Susan Orlean, with potential for hundreds more affected authors to join the class action.
Fast-forward to January 2026: An amended complaint broadens the accusations, alleging Nvidia not only relied on Books3 but actively sought out additional pirated sources.
This includes distributing scripts to corporate clients for downloading "The Pile" — an 800GB open-source dataset containing Books3 — potentially enabling contributory infringement. Plaintiffs seek damages for direct, vicarious, and contributory copyright violations, emphasizing Nvidia's role in models like Retro-48B and InstructRetro.
The Smoking Gun: Emails with Anna’s Archive
The amended filing's bombshell is a series of emails from August 2023, in which an Nvidia data strategy team member contacted Anna’s Archive, a self-described "largest shadow library in human history," to inquire about "high-speed access" to its collections for LLM pre-training data.

Nvidia's response was swift: Within a week, management greenlit the deal, citing "competitive pressures" as justification. Access was granted to approximately 500 terabytes of data, encompassing millions of books — many digitized from the Internet Archive's controlled lending system, which has faced its own copyright challenges.
While the exact payment (estimated at tens of thousands of dollars) and whether Nvidia ultimately used the data remain unconfirmed, the correspondence marks the first public evidence of a major U.S. tech firm negotiating directly with such a pirate repository.
Beyond Anna’s: A Web of Shadow Libraries

Plaintiffs allege Nvidia drew from multiple illicit sources, including Library Genesis (LibGen), Sci-Hub, and Z-Library — platforms notorious for distributing pirated academic and literary works.
These claims build on earlier accusations tied to Books3, part of EleutherAI's "The Pile" dataset, which Nvidia referenced in model cards before its removal from Hugging Face in October 2023 due to copyright concerns.
Further, Nvidia is accused of providing tools that facilitated customer access to these datasets, amplifying the infringement.
This pattern echoes broader industry tensions: Similar suits target companies like OpenAI, Meta, and Databricks for unauthorized data scraping, with courts increasingly scrutinizing "fair use" defenses.
In 2024 alone, cases like Nazemian v. Nvidia and O’Nan v. Databricks highlighted the use of Books3, while others narrowed to direct infringement after broader claims were dismissed.

Competitive Pressures and Ethical Dilemmas
Nvidia's internal rationale — pressure from rivals — underscores the cutthroat race in AI development, where data scarcity drives desperate measures. Yet, this raises ethical red flags: By normalizing pirated sources, tech firms risk undermining creators' rights, potentially stifling innovation in publishing and academia. Anna’s Archive's own cautionary stance ironically highlights the moral ambiguity, as even pirates drew a line at unchecked corporate exploitation.
As the case progresses, it could set precedents for AI training ethics, forcing companies to seek licensed data or face hefty penalties. Nvidia has yet to comment on the amended complaint, but the emails suggest a calculated risk in a high-stakes game where innovation and infringement increasingly collide. For authors like Nazemian and Keene, this isn't just about compensation; it's a fight to protect the value of human creativity in an AI-dominated era.