01.02.2026 12:42
Author: Viacheslav Vasipenok

Nvidia's AI Training Scandal: Emails Reveal Pursuit of Pirated Books Amid Copyright Lawsuit


In a fresh escalation of the ongoing battle between tech giants and content creators, newly disclosed emails have exposed Nvidia's alleged efforts to access massive troves of pirated books for AI model training.

This revelation, part of an amended class-action lawsuit filed in the U.S. District Court for the Northern District of California, paints a picture of a company under competitive pressure turning to shadow libraries like Anna’s Archive for data — despite explicit warnings about the material's illegal origins.

The case, initially launched in early 2024, accuses Nvidia of copyright infringement through the use of unauthorized datasets, including the infamous Books3 collection. With AI's insatiable hunger for data clashing against intellectual property rights, this development could reshape how companies source training materials and face legal scrutiny.


The Roots of the Lawsuit: From Books3 to Broader Allegations

The controversy traces back to March 2024, when authors Abdi Nazemian, Brian Keene, and Stewart O’Nan filed suit against Nvidia, claiming the company's NeMo Megatron AI models were trained on pirated copies of their works via the Books3 dataset — a 196,000-book collection scraped from the Bibliotik shadow library.

Nvidia defended the practice as "fair use," arguing that the books served as mere statistical inputs for model development. By mid-2024, the plaintiff group expanded to include Andre Dubus III and Susan Orlean, with potential for hundreds more affected authors to join the class action.

Fast-forward to January 2026: An amended complaint broadens the accusations, alleging Nvidia not only relied on Books3 but actively sought out additional pirated sources.

The complaint also alleges that Nvidia distributed scripts to corporate clients for downloading "The Pile" (an 800GB open-source dataset containing Books3), potentially enabling contributory infringement. Plaintiffs seek damages for direct, vicarious, and contributory copyright violations, emphasizing Nvidia's role in models like Retro-48B and InstructRetro.


The Smoking Gun: Emails with Anna’s Archive

The amended filing's bombshell is a series of emails from August 2023, in which an Nvidia data strategy team member contacted Anna’s Archive, a self-described "largest shadow library in human history," to ask about "high-speed access" to its collections for LLM pre-training data.

Anna’s Archive, known for hosting illegally obtained books and papers, responded by offering access but explicitly warning that the materials were acquired unlawfully and urging confirmation of internal approval.

Nvidia's response was swift: Within a week, management greenlit the deal, citing "competitive pressures" as justification. Access was granted to approximately 500 terabytes of data, encompassing millions of books — many digitized from the Internet Archive's controlled lending system, which has faced its own copyright challenges.

While the exact payment (estimated at tens of thousands of dollars) and whether Nvidia ultimately used the data remain unconfirmed, the correspondence marks the first public evidence of a major U.S. tech firm negotiating directly with such a pirate repository.


Beyond Anna’s: A Web of Shadow Libraries

The lawsuit doesn't stop at one interaction. Plaintiffs allege Nvidia drew from multiple illicit sources, including Library Genesis (LibGen), Sci-Hub, and Z-Library — platforms notorious for distributing pirated academic and literary works.

These claims build on earlier accusations tied to Books3, part of EleutherAI's "The Pile" dataset, which Nvidia referenced in model cards before its removal from Hugging Face in October 2023 due to copyright concerns.

Further, Nvidia is accused of providing tools that facilitated customer access to these datasets, amplifying the infringement.

This pattern echoes broader industry tensions: Similar suits target companies like OpenAI, Meta, and Databricks for unauthorized data scraping, with courts increasingly scrutinizing "fair use" defenses.

In 2024 alone, cases like Nazemian v. Nvidia and O’Nan v. Databricks highlighted the use of Books3, while other suits narrowed to direct-infringement claims after their broader claims were dismissed.



Competitive Pressures and Ethical Dilemmas

Nvidia's internal rationale — pressure from rivals — underscores the cutthroat race in AI development, where data scarcity drives desperate measures. Yet, this raises ethical red flags: By normalizing pirated sources, tech firms risk undermining creators' rights, potentially stifling innovation in publishing and academia. Anna’s Archive's own cautionary stance ironically highlights the moral ambiguity, as even pirates drew a line at unchecked corporate exploitation.

As the case progresses, it could set precedents for AI training ethics, forcing companies to seek licensed data or face hefty penalties. Nvidia has yet to comment on the amended complaint, but the emails suggest a calculated risk in a high-stakes game where innovation and infringement increasingly collide. For authors like Nazemian and Keene, this isn't just about compensation; it's a fight to protect the value of human creativity in an AI-dominated era.

