What is the 'data shortage problem' that could see the data used to train AI run out by 2026?



A huge amount of data is available on the internet, and new AI models trained on that data are appearing one after another. But even as the spread of AI explodes, researchers are concerned that the training data fueling these systems may run out.

Researchers warn we could run out of data to train AI by 2026. What then?

https://theconversation.com/researchers-warn-we-could-run-out-of-data-to-train-ai-by-2026-what-then-216741

Training accurate and powerful AI models requires massive amounts of data. According to Rita Matulionyte, a senior lecturer at Macquarie Law School in Australia who specializes in the legal regulation of technology in the creative industries, ChatGPT was trained on 570GB of text data, or roughly 300 billion words.
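As a quick back-of-envelope check on those figures (the arithmetic below is mine, not from the article), 570GB spread over 300 billion words works out to only about two bytes per word, well under the roughly six bytes a plain English word takes in raw text, which suggests the quoted numbers describe a tokenized or heavily filtered corpus rather than raw prose:

```python
# Back-of-envelope check on the quoted ChatGPT training-data figures.
# Assumption (mine, not the article's): plain English prose averages
# roughly 6 bytes of UTF-8 per word, including the trailing space.

corpus_bytes = 570e9           # "570GB of text data" as quoted
word_count = 300e9             # "roughly 300 billion words" as quoted
avg_bytes_per_word = 6         # assumed average for raw English text

implied_bytes_per_word = corpus_bytes / word_count
raw_size_estimate_tb = word_count * avg_bytes_per_word / 1e12

print(f"Implied bytes per word in the quoted corpus: {implied_bytes_per_word:.1f}")  # ~1.9
print(f"Raw text at 6 bytes/word would be about: {raw_size_estimate_tb:.1f} TB")     # ~1.8 TB
```

Either way, the point of the quoted figures is scale: hundreds of billions of units of text.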

Similarly, the Stable Diffusion algorithm that powers image-generating AIs like DALL-E, Lensa, and Midjourney was trained on the LAION-5B dataset, which consists of 5.8 billion image-text pairs. Without enough training data, such models produce inaccurate or low-quality output.



The quality of training data matters as much as the quantity. Low-quality data such as social media posts and blurry photos is easy to obtain, but it is not suitable for training high-performance AI models.

A more serious problem is that text scraped from social media is prone to prejudice and discriminatory language, and may contain misinformation or illegal content.
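To make 'data quality' concrete, here is an illustrative example (mine, not something described in the article) of the kind of heuristic pre-filtering developers commonly apply before training: dropping documents that are very short, dominated by non-alphabetic characters, or caught by a crude blocklist. The thresholds and blocklist terms below are made up for the sketch:

```python
# Hypothetical blocklist; real pipelines use trained classifiers, not keyword lists.
BLOCKLIST = {"spam-keyword", "offensive-term"}

def passes_quality_filter(text: str,
                          min_words: int = 20,
                          min_alpha_ratio: float = 0.8) -> bool:
    """Crude heuristic text-quality filter (thresholds are arbitrary, for illustration).

    Rejects documents that are very short, dominated by non-alphabetic
    characters (e.g. emoji- or markup-heavy social posts), or that contain
    a blocklisted term.
    """
    words = text.split()
    if len(words) < min_words:
        return False
    alpha_chars = sum(ch.isalpha() for ch in text)
    non_space_chars = sum(not ch.isspace() for ch in text)
    if non_space_chars == 0 or alpha_chars / non_space_chars < min_alpha_ratio:
        return False
    lowered = text.lower()
    if any(term in lowered for term in BLOCKLIST):
        return False
    return True

if __name__ == "__main__":
    docs = [
        "lol!!! 😂😂😂 #follow4follow",  # short and emoji-heavy: rejected
        "The committee reviewed the proposal in detail and recommended "
        "several changes to improve the clarity, structure, and overall "
        "handling of edge cases in the final report.",  # clean prose: accepted
    ]
    for doc in docs:
        print(passes_quality_filter(doc), "-", doc[:40])
```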

For example, when Microsoft trained its chatbot Tay on content from X (then Twitter), the AI began generating misogynistic and racist comments.

Microsoft's AI is suspended after making a series of questionable statements, including 'Fucking feminists should burn in hell' and 'Hitler was right' - GIGAZINE



This precedent has led AI developers to seek out high-quality data sources, such as books, scientific papers, Wikipedia, online articles, and filtered web content. For example, Google Assistant was trained on 11,000 romance novels taken from the self-publishing site Smashwords to make it more conversational.

High-performance models like ChatGPT and DALL-E 3 were created by training on these high-quality, massive datasets, but limits are beginning to appear. A paper published in 2022 on the preprint server arXiv predicted that 'if AI training continues at the current rate, high-quality text data will run out by 2026, low-quality text data will run out between 2030 and 2050, and low-quality image data will run out between 2030 and 2060.'

According to consulting firm PwC, AI could have an economic impact of up to $15.7 trillion (approximately 2,364 trillion yen) on the global economy by 2030. However, if the data needed to train AI runs out by 2030, that development could be held back.



However, Matulionyte says that 'the situation may not be as bad as it seems,' because there are many unknowns regarding the development of AI models.

Researchers are also exploring ways to head off the risk of data scarcity. One is to improve algorithms so they use existing data more efficiently: if high-performing AI systems can be trained on less data, they may also need less computing power, which in turn would cut the carbon emissions generated by AI development.
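One simple, widely used example of squeezing more out of a fixed pool of data (my illustration; the article does not describe a specific technique) is deduplicating the corpus before training, so compute is not spent on repeated documents. A minimal exact-match sketch in Python:

```python
import hashlib

def deduplicate(corpus):
    """Drop exact duplicates (after light normalization) from a text corpus.

    Deduplication is one simple way to use a fixed pool of training data
    more efficiently: compute is not wasted on repeated examples.
    """
    seen = set()
    unique_docs = []
    for doc in corpus:
        # Normalize whitespace and case so trivially different copies collide.
        key = hashlib.sha256(" ".join(doc.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique_docs.append(doc)
    return unique_docs

if __name__ == "__main__":
    corpus = [
        "The cat sat on the mat.",
        "the cat  sat on the mat.",      # near-identical copy: differs only in case/spacing
        "A completely different sentence.",
    ]
    print(deduplicate(corpus))  # keeps 2 of the 3 documents
```

Production pipelines go further with near-duplicate detection (e.g. MinHash), but the principle is the same: the same data budget yields more distinct examples.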

Another approach is to use AI itself to generate synthetic training data, letting developers produce exactly the data a particular model needs. Several projects already use synthetic content from MOSTLY AI, a company that creates synthetic data for AI models, and Matulionyte expects this approach to become more common in the future.
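As a toy illustration of what synthetic data means for structured records (MOSTLY AI's actual generators are far more sophisticated and are not shown here), one can fit a simple statistical model to real rows and sample artificial rows from it. The column names and numbers below are hypothetical:

```python
import numpy as np

def fit_gaussian_synthesizer(real_data: np.ndarray):
    """Fit a multivariate Gaussian to real numeric records.

    Toy stand-in for a synthetic-data generator: real products model far
    richer structure (categorical columns, correlations, privacy guarantees).
    """
    mean = real_data.mean(axis=0)
    cov = np.cov(real_data, rowvar=False)
    return mean, cov

def sample_synthetic(mean, cov, n_rows, seed=0):
    """Draw artificial records that mimic the real data's mean and covariance."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_rows)

if __name__ == "__main__":
    # Hypothetical "real" table: columns might be age and annual income.
    real = np.array([[34, 52000], [29, 48000], [45, 61000], [38, 57000]], dtype=float)
    mean, cov = fit_gaussian_synthesizer(real)
    synthetic = sample_synthetic(mean, cov, n_rows=5)
    print(synthetic)  # 5 artificial rows with no one-to-one link to any real record
```

The appeal is that sampled rows mimic the statistics of the real table without reproducing any individual record, which is the property synthetic-data vendors build on.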

AI developers are also looking beyond the free internet for new sources, such as content owned by major publishers and offline repositories. News Corp, one of the world's largest news content owners, announced in September 2023 that it was in negotiations with AI developers about content deals. In this way, AI development, which has so far relied largely on free content used without permission, is shifting toward paying for premium content.

Regarding this trend, Matulionyte said, 'Creators are protesting the unauthorized use of their content to train AI models, and some are suing AI companies like Microsoft, OpenAI, and Stability AI. Being paid for their work could also help redress the power imbalance between creators and AI companies.'

in AI, Software, Posted by log1l_ks