Nvidia faces class-action lawsuit for training AI model on ‘shadow library’

30 April 2024

Nvidia joins the growing list of technology companies that have been accused of training their AI-powered large language models (LLMs) on copyrighted material without permission. On 8 March 2024, authors Abdi Nazemian, Brian Keene, and Stewart O’Nan filed two class-action lawsuits in the U.S. District Court for the Northern District of California, one against Nvidia Corporation and one against Databricks, alleging that their works were part of a dataset of nearly 197,000 books used to train Nvidia’s NeMo and Databricks’ MosaicML models.[1]

The rise of the ‘shadow library’

The authors claim that the NeMo and MosaicML models were trained on a dataset which incorporated hundreds of thousands of pirated e-books. The books are said to have come from Bibliotik, a so-called ‘shadow library’ that is now obsolete. Others of its kind include Library Genesis, Z-Library, Sci-Hub and Anna’s Archive. According to the filings, Bibliotik’s cache of unlicensed copyrighted material included at least one published work from each author: Nazemian’s Like a Love Story, Keene’s Ghost Walk and O’Nan’s Last Night at the Lobster. The authors are seeking unspecified damages for people in the United States whose copyrighted material has been used to train the NeMo and MosaicML models within the past three years.

The Nvidia and Databricks cases strike at the heart of a key issue surrounding LLMs: the general lack of transparency as to what data these models are trained on. This opacity makes it difficult to establish what input material is fed into these systems. Ironically, in the case of Nvidia and Databricks, the companies had disclosed precisely what data their models were trained on. This data is alleged to include a particularly controversial dataset known as Books3. The claimants are relying on the fact that Shawn Presser, the mind behind the Books3 dataset, has confirmed publicly that all of Bibliotik was incorporated into the dataset. They allege that this shows their books must have been copied and fed into Nvidia’s and Databricks’ AI models, which they argue constitutes direct copyright infringement. Books3 was available on the machine learning and data science platform Hugging Face until October 2023, after which it was removed with a disclaimer stating that the dataset was “defunct and no longer accessible due to reported copyright infringement.”[2]

Notably, relying on such works as training material appears to be something of a widespread practice within the LLM space; OpenAI has stated that it used a dataset similarly composed of e-books to train its GPT-3 language model,[3] a predecessor of the models that currently power ChatGPT.

The road ahead

The lawsuits against Nvidia and Databricks are not the first of their kind and are unlikely to be the last. Similar class-action copyright infringement claims were filed against OpenAI and Meta in July 2023, alleging that the training material used for the ChatGPT and LLaMA AI models included copyrighted books. One of the plaintiffs is the well-known comedian and author Sarah Silverman. As of 13 February 2024, the case against OpenAI is still pending before the U.S. District Court for the Northern District of California. U.S. District Judge Araceli Martinez-Olguin rejected the claimants’ arguments that ChatGPT’s output infringed their copyrights and that OpenAI unjustly enriched itself with their work, so the direct copyright infringement claim is the last of six claims still standing.[4]

It is worth noting that a global governmental crackdown on shadow libraries has been well underway for some time. The New York division of the United States Federal Bureau of Investigation seized several websites associated with Z-Library in October 2022 and subsequently brought charges of criminal copyright infringement, wire fraud and money laundering against two Russian nationals.[5] Courts in several jurisdictions including France have additionally issued blocking injunctions against internet service providers ordering them to obstruct public access to Z-Library.[6]

In light of the proliferation of generative AI platforms powered by LLMs, such as ChatGPT and NeMo, it is imperative that companies in the AI space do more than merely disclose which data has been fed into their systems. They must make an active effort to exclude from their training material any copyrighted works obtained without the permission of the copyright holder. Companies that cannot guarantee this may find themselves similarly exposed to an onslaught of copyright infringement claims.

[1] Nazemian et al. v NVIDIA Corporation, U.S. District Court, Northern District of California, No. 24-01454; O’Nan et al. v Databricks, Inc., and Mosaic ML, Inc., U.S. District Court, Northern District of California, No. 24-01451.

[2] Hugging Face Inc., ‘Dataset Card for the_pile_books3’ <https://huggingface.co/datasets/the_pile_books3> accessed 18 March 2024.

[3] OpenAI, ‘Language Models are Few-Shot Learners’ (22 July 2020) Johns Hopkins University.

[4] Tremblay et al. v OpenAI Inc, U.S. District Court, Northern District of California, No. 3:23-cv-03223; Silverman et al. v OpenAI Inc, U.S. District Court, Northern District of California, No. 3:23-cv-03416.

[5] United States Attorney’s Office, Eastern District of New York ‘Two Russian Nationals Charged with Running Massive E-Book Piracy Website’ (16 November 2022) <https://www.justice.gov/usao-edny/pr/two-russian-nationals-charged-running-massive-e-book-piracy-website> accessed 20 March 2024.

[6] Tribunal Judiciaire Paris, 3-ième Chambre, 1re sec., Syndicat national de l’Edition c/SFR, Bouygues Telecom, Free et Orange, RG No 22/08014 [25 August 2022].
