Common Crawl, a nonprofit whose web archive serves as a source of training data for AI companies such as OpenAI, has been scraping billions of web pages, including paywalled ones, since 2013.



The nonprofit organization Common Crawl has been building an extensive archive of the internet for over a decade. This petabyte-sized database is freely available for research, but in recent years, it has come under fire for being used by AI companies such as OpenAI, Google, Meta, and Amazon to train large language models (LLMs), according to The Atlantic.

Common Crawl Is Doing the AI Industry's Dirty Work - The Atlantic

https://www.theatlantic.com/technology/2025/11/common-crawl-ai-training-data/684567/

According to an investigation by The Atlantic, Common Crawl provides AI companies with a 'back door' to articles behind paywalls on major news sites. Common Crawl claims that it collects only 'freely available content' and does 'not go behind paywalls,' yet its scraper also retrieves the text of paid articles that are supposed to be hidden. Many paywalls work by loading the full article first, then running code in the browser that checks whether the reader is a subscriber and hides the article if not. Common Crawl's scraper never runs that code, so it retrieves the full article as delivered by the server.
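The mechanism described above can be illustrated with a minimal Python sketch. It is not Common Crawl's actual crawler code; the sample page, the `isSubscriber` flag, and the `extract_text` helper are all hypothetical names made up for illustration. The sketch assumes a paywall that ships the whole article in the HTML and relies on a script to hide it, and shows that a client which only parses the HTML, without executing the script, still sees every paragraph.

```python
from html.parser import HTMLParser

# Hypothetical page: the server sends the whole article, and a script is
# expected to hide it in the browser if the reader is not a subscriber.
SAMPLE_PAGE = """
<html><body>
<div id="article">
<p>First paragraph of the story.</p>
<p>Second paragraph, meant for subscribers only.</p>
</div>
<script>
if (!window.isSubscriber) {
  document.getElementById("article").style.display = "none";
}
</script>
</body></html>
"""


class TextExtractor(HTMLParser):
    """Collects text from raw HTML, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the text chunks a non-JavaScript client would see."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


if __name__ == "__main__":
    # The paywall script is never executed, so nothing is hidden.
    for line in extract_text(SAMPLE_PAGE):
        print(line)
```

Because the hiding step exists only as script text that the parser never executes, both paragraphs appear in the output. A paywall that withholds the article on the server until the subscription is verified would not be exposed in this way.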



Additionally, The Atlantic alleged that Common Crawl has been misrepresenting the contents of its archives to publishers.

In July 2023, The New York Times asked Common Crawl to remove previously collected content. Common Crawl appeared to comply, but The Atlantic's examination of the archives found that many of the articles were still there. The Danish Rights Alliance (DRA) and other publishers have had similar experiences. Common Crawl has stated that removal is '50%' or '80%' complete, but a technical investigation found that the archive's content files have not been modified since at least 2016, suggesting that no content may have been deleted in the past nine years.



Common Crawl Executive Director Rich Skrenta acknowledged that removal requests are 'cumbersome,' but added that the archive's file format is 'immutable' and 'nothing can be removed.'

Skrenta also argued that AI should have free access to everything on the internet, telling The Atlantic, 'The robots are people too.' He added that publishers who demand content removal 'shouldn't have put that content on the internet.'



Common Crawl has been deepening its ties with the AI industry in recent years, receiving $250,000 in donations from OpenAI and Anthropic in 2023. It also helps distribute data for the industry, including hosting NVIDIA's AI training datasets.

Skrenta claims that publishers' takedown requests are 'killing the open web,' but The Atlantic counters that exploitative scraping by generative AI companies is what pushes publishers to build paywalls and undermines openness. Skrenta also said he would like to send Common Crawl's archives to the moon as a 'record of civilization' in preparation for the end of humanity, and The Atlantic criticized his remarks for belittling the value of certain journalism, including its own.

in AI, Posted by log1i_yk