

Court documents reveal Meta deliberately used copyrighted content for AI training

Marijan Hassan - Tech Journalist

Newly unsealed court documents reveal that Meta employees discussed using copyrighted materials obtained through legally dubious means to train the company’s AI models. The revelations, which stem from the ongoing copyright infringement lawsuit Kadrey v. Meta, highlight Meta’s approach to AI training data, including alleged reliance on pirated books and other copyrighted works.



Internal deliberations on using copyrighted content

In one exchange, Meta research engineer Xavier Martinet suggested an "ask forgiveness, not permission" approach, advocating for acquiring books and escalating the decision to executives. Martinet also floated the idea of buying e-books at retail prices, bypassing licensing agreements with publishers, and argued that "a gazillion" startups were likely already using pirated books for AI training.


Senior manager Melanie Kambadur, who works on Meta’s Llama model research team, acknowledged the legal concerns but noted that Meta’s lawyers were becoming “less conservative” with approvals. Kambadur also mentioned that Meta was in talks with platforms like Scribd for licensing deals but still entertained alternative sources for training data.


Another chat revealed discussions about using Libgen, a known repository of pirated materials, as a data source. Meta's director of product management, Sony Theakanath, even deemed Libgen "essential" for achieving state-of-the-art AI model performance, suggesting "mitigations" to reduce legal exposure, like removing files labeled "pirated/stolen" and avoiding public acknowledgment of Libgen usage.


The filings also indicate that Meta explored scraping data from Reddit and considered "overriding" past decisions on training sets to ensure sufficient data for its models. Meta leadership reportedly recognized that its first-party data sources were inadequate.


Torrenting and seeding controversy

In a separate court filing, Meta admitted to torrenting an 82TB dataset of pirated, copyrighted material from shadow libraries to train its Llama AI models. However, the company claims its employees "took precautions not to 'seed' any downloaded files."


"Seeding" refers to sharing downloaded files with other users in a peer-to-peer torrenting system. Meta's lawyers argue there are "no facts to show that Meta seeded Plaintiffs' books," pinning their defense on the absence of concrete proof of data distribution.


Despite Meta's denial, testimony from Meta executive Michael Clark revealed that configuration settings were modified "so that the smallest amount of seeding possible could occur." When questioned about the reason for minimizing seeding, Meta invoked attorney-client privilege. Internal messages from Meta researcher Frank Zhang also suggest potential efforts to conceal seeding activity from Meta's servers to avoid tracing.


The legal and ethical implications

Meta has continued to assert that training AI models on copyrighted content falls under “fair use,” a stance that has been met with fierce opposition from authors and copyright holders.


The lawsuit, filed in the U.S. District Court for the Northern District of California, includes plaintiffs such as authors Sarah Silverman and Ta-Nehisi Coates, who argue that Meta was a "knowing participant in an illegal peer-to-peer piracy network," bypassing lawful acquisition methods. They also allege that Meta cross-referenced pirated books with licensed books to determine the viability of licensing agreements.


The legal stakes are high, with Meta bringing in Supreme Court litigators from the law firm Paul Weiss to aid in its defense. If Meta successfully argues that downloading copyrighted content is not illegal, and that only distributing it is, the outcome could have major ramifications for copyright law and AI development. Conversely, a ruling against Meta could set a precedent requiring stricter licensing agreements for AI training data.
