Meta revealed to have trained its AI with 81.7TB of data, including pirated content



Meta, the developer of the large language model 'LLaMA,' was sued in July 2023 for 'training AI using copyrighted books.' New evidence presented in the lawsuit indicates that Meta trained LLaMA using approximately 81.7 TB of data obtained from pirated e-book libraries such as Z-Library and Anna's Archive.

Kadrey-v-Meta-Motion-for-Relief-Appendix-A-2-5-25.pdf
(PDF file) https://cdn.arstechnica.net/wp-content/uploads/2025/02/Kadrey-v-Meta-Motion-for-Relief-Appendix-A-2-5-25.pdf

“Torrenting from a corporate laptop doesn't feel right”: Meta emails unsealed - Ars Technica
https://arstechnica.com/tech-policy/2025/02/meta-torrented-over-81-7tb-of-pirated-books-to-train-ai-authors-say/



Meta Torrented over 81 TB of Data Through Anna's Archive, Despite Few Seeders - TorrentFreak

https://torrentfreak.com/meta-torrented-over-81-tb-of-data-through-annas-archive-despite-few-seeders-250206/

Comedian and author Sarah Silverman and authors Christopher Golden and Richard Kadrey sued OpenAI and Meta in July 2023, alleging that ChatGPT and LLaMA were trained on datasets of works illegally distributed on the internet.

OpenAI and Meta sued by three authors for copyright infringement - GIGAZINE



In January 2025, a Meta employee admitted to removing copyright information from a dataset based on the pirate e-book library Library Genesis (LibGen), and internal company documents revealed that Meta officially approved the use of LibGen.

Meta CEO Mark Zuckerberg is being pursued in a lawsuit alleging that he allowed the AI 'Llama' development team to use copyrighted works without permission - GIGAZINE



Furthermore, in February 2025, the plaintiffs asserted: 'The scale of Meta's illegal AI training is staggering. In the spring of 2024 alone, Meta obtained at least 81.7 TB of data from multiple pirated e-book libraries through a site called Anna's Archive, including at least 35.7 TB of data from Z-Library and LibGen.' The plaintiffs also pointed out that Meta obtained 80.6 TB of data from LibGen.

Throughout the lawsuit, Meta has consistently argued that its AI training using LibGen constituted fair use. However, disclosed emails (PDF file) revealed that Meta chose not to use Facebook's infrastructure to download the dataset, in order to avoid the risk of the downloads being traced back to Meta. On that basis, the plaintiffs argued that Meta knew its data collection from pirated e-book libraries was illegal.



Meta, on the other hand, states that 'Plaintiff has not reported a single instance in which any of its books were actually downloaded from Meta by a third party via a pirate e-book library, much less allege that Plaintiff's books were in any way distributed by Meta.' Meta is seeking dismissal of the plaintiffs' claims.

in AI, Web Service, Posted by log1r_ut