Researchers scrape over 2 billion Discord messages from 3,167 servers (2015 to 2024) via the public API and publish them online

A team of 15 researchers from the Federal University of Minas Gerais in Brazil scraped Discord as part of a research project, creating a database of more than 2 billion messages that they have made available online. The researchers say they have anonymized the data.
Discord Unveiled: A Comprehensive Dataset of Public Communication (2015-2024)
Researchers Scrape 2 Billion Discord Messages and Publish Them Online
https://www.404media.co/researchers-scrape-2-billion-discord-messages-and-publish-them-online/
The research team obtained data from 3,167 publicly available servers, collected 2,052,206,308 messages exchanged by 4,735,057 people from 2015 to 2024, and published them as JSON files.
Anyone can create a Discord server and set it to public or private, and users can find public servers through Discord's 'Discover' feature.
The researchers used this discovery feature to attempt to map all public Discord servers, finding a total of 31,673 servers as of November 17, 2024. They then randomly selected 10% of those servers for scraping.
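The 10% random selection described above can be illustrated with a short sketch. This is only an illustration, not the researchers' actual code: the function name `sample_servers`, the placeholder server IDs, and the fixed seed are all assumptions made for this example.

```python
import random


def sample_servers(server_ids, fraction=0.10, seed=None):
    """Return a uniform random sample (without replacement) of server IDs.

    Hypothetical helper: the article does not say how the selection
    was actually implemented.
    """
    rng = random.Random(seed)  # a seeded generator makes the draw reproducible
    k = round(len(server_ids) * fraction)  # number of servers to keep
    return rng.sample(server_ids, k)


# Placeholder IDs standing in for the 31,673 servers found via Discover.
all_servers = [f"server-{i}" for i in range(31_673)]
selected = sample_servers(all_servers, fraction=0.10, seed=42)
print(len(selected))  # 3167
```

With 31,673 mapped servers, a 10% sample rounds to 3,167 servers, which matches the number of servers in the published dataset.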

The research team said the database is intended to let other research teams study mental health and politics, or train bots. 'Our dataset will allow us to study the impact of digital platforms on political discourse, how misinformation spreads, and effective moderation and regulation strategies tailored to such environments,' the team said.
The research team explained that when publishing the chat histories, they took security precautions, such as rewriting usernames and hashing and truncating user IDs and messages.
However, although the data came from servers that anyone can view, some people point out that Discord is mostly used for communication within small communities, so many users do not expect messages posted on a public server to be literally published to the world.
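As a rough illustration of what 'hashing and truncating user IDs' can look like, here is a minimal sketch. It assumes a salted SHA-256 hash cut to 16 hexadecimal characters; the article does not state which hash function, salt, or length the researchers used, and the name `pseudonymize_id` and the sample values are hypothetical.

```python
import hashlib


def pseudonymize_id(user_id: str, salt: bytes, length: int = 16) -> str:
    """Map a user ID to a short, stable pseudonym.

    Assumptions (not from the article): salted SHA-256, truncated to
    `length` hex characters.
    """
    digest = hashlib.sha256(salt + user_id.encode("utf-8")).hexdigest()
    return digest[:length]  # truncation shortens the ID and discards most of the hash


# Hypothetical salt and user IDs, for illustration only.
salt = b"example-secret-salt"
alias_a = pseudonymize_id("123456789012345678", salt)
alias_b = pseudonymize_id("987654321098765432", salt)
print(alias_a, alias_b)
```

The same input always yields the same pseudonym, so messages by one author can still be grouped together. That stability is also why pseudonymized data of this kind can remain linkable to a person through the content of the messages themselves.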

'While the researchers claim to have anonymized the data, no one is going to be happy about their Discord messages being stored in a public file online,' said 404 Media, a technology media outlet. 'Few people read the terms of service, and it's important to remember that many of Discord's users are children. Discord is first and foremost a platform for gamers to organize their communities, and kids probably wouldn't expect their casual jokes to end up in a public database.'
In addition, Discord's developer policy states that 'unless specifically permitted by Discord, you may not use the content of messages obtained through the API for machine learning or AI training (including large-scale language models)' and 'you may not mine or scrape any data, content, or information available on or through the Discord Service.' The terms of service also prohibit scraping. For this reason, 404 Media points out that the research appears to violate Discord's terms in the first place, before privacy concerns even come into play.
in Web Service, Posted by log1p_kr