Bluesky's operators will not use user posts for AI training, but third parties can: a dataset of 1 million posts collected via Bluesky's API is published on Hugging Face

X (formerly Twitter) updated its terms of service in November 2024 to make clear that posts will be used for AI training. In response, many users have moved to Bluesky, a rival social network that has stated it will not use posts for AI training. However, a dataset of 1 million posts collected via Bluesky's API was then published on Hugging Face.
Someone Made a Dataset of One Million Bluesky Posts for 'Machine Learning Research'
https://www.404media.co/someone-made-a-dataset-of-one-million-bluesky-posts-for-machine-learning-research/
Bluesky may not train AI on your posts, but others can, and users are furious - Neowin
https://www.neowin.net/news/bluesky-may-not-train-ai-on-your-posts-but-others-can-and-users-are-furious/
Bluesky, AI, and the battle for consent on the open web
https://werd.io/2024/bluesky-ai-and-the-battle-for-consent-on-the-open
On November 15, 2024, Bluesky posted from its official account that it would not use user content to train generative AI. However, because Bluesky is designed so that all posts are public, there were concerns that it would be impossible to prevent third parties from using them for AI training.
Unlike X (formerly Twitter), Bluesky has stated that it will not use posts to train AI - GIGAZINE

Meanwhile, engineer Daniel van Strien announced on November 26, 2024 that 'a dataset of 1 million posts from Bluesky has been made available on Hugging Face.' Van Strien said of the dataset, 'It can be used for training and testing language models on social media content, analyzing social media posting patterns, studying conversation structure and reply networks, studying social media content moderation, and natural language processing tasks using social media data.'
In the post, Van Strien explains, 'We created the dataset using Bluesky's Firehose API.' Firehose is an API that streams all public posts in real time, so third parties can freely receive and use the posted data.
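As a rough illustration of why a public stream makes this kind of collection easy, the sketch below filters a stream of events down to newly created posts. It does not connect to Bluesky and is not Van Strien's code. The event layout (`kind`, `commit`, `operation`, `record`) is a simplified, JSON-style assumption made for this example, and `extract_posts` and `SAMPLE_EVENTS` are hypothetical names. Only the record type `app.bsky.feed.post` is taken from Bluesky's AT Protocol.

```python
from typing import Iterable, Iterator

# Record type used for posts in Bluesky's AT Protocol.
POST_COLLECTION = "app.bsky.feed.post"


def extract_posts(events: Iterable[dict]) -> Iterator[dict]:
    """Yield author, text, and timestamp for every newly created post.

    The event layout is an assumption for illustration, not the
    actual wire format of the Firehose.
    """
    for event in events:
        if event.get("kind") != "commit":
            continue  # skip account or identity events
        commit = event.get("commit") or {}
        if commit.get("operation") != "create":
            continue  # skip deletions and updates
        if commit.get("collection") != POST_COLLECTION:
            continue  # skip likes, follows, reposts, and so on
        record = commit.get("record") or {}
        yield {
            "author": event.get("did"),
            "text": record.get("text", ""),
            "created_at": record.get("createdAt"),
        }


# Hypothetical sample events standing in for a live stream.
SAMPLE_EVENTS = [
    {
        "did": "did:example:alice",
        "kind": "commit",
        "commit": {
            "operation": "create",
            "collection": POST_COLLECTION,
            "record": {"text": "Hello, Bluesky!", "createdAt": "2024-11-26T00:00:00Z"},
        },
    },
    {
        "did": "did:example:bob",
        "kind": "commit",
        "commit": {
            "operation": "create",
            "collection": "app.bsky.feed.like",
            "record": {},
        },
    },
    {"did": "did:example:carol", "kind": "identity"},
]

if __name__ == "__main__":
    for post in extract_posts(SAMPLE_EVENTS):
        print(post["author"], post["text"])
```

Nothing in a loop like this asks the author of a post for permission, which is the gap the criticism described below points at.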
However, the dataset drew criticism from some users. One user harshly criticized Van Strien, saying, 'I moved to Bluesky to get away from the crappy scraping on X, and now you're trying to use Bluesky data to train your AI - that's disgusting.'
In response to these criticisms, Van Strien removed the dataset from the Hugging Face repository on November 27, 2024. 'While I wanted to support the development of the platform's tools, I realized that this approach violated the principles of transparency and consent in data collection. I apologize for this mistake,' Van Strien said.
After the dataset was made public, Bluesky posted from its official account that it is developing a mechanism that lets users 'explicitly indicate whether or not they consent to their data being used for AI training.'
The mechanism for signaling whether AI training is permitted is being considered along the lines of the 'robots.txt' file used by websites. However, Bluesky says that 'it is up to the external developer to decide whether or not to respect user consent.'
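For reference, the robots.txt model works as follows: the site publishes its rules, and each crawler chooses whether to check them before fetching. The minimal sketch below uses the `urllib.robotparser` module from Python's standard library. The crawler name `ExampleAIBot`, the rules, and the URL are made up for illustration, and Bluesky has not said that its own mechanism will take this form.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one AI crawler is refused, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/post/1"
    for agent in ("ExampleAIBot", "OtherBot"):
        print(agent, may_fetch(agent, url))
```

The check runs entirely on the crawler's side. A crawler that never calls `may_fetch` is not stopped by anything, which matches Bluesky's remark that honoring the signal is left to external developers.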
Bluesky also said, 'We are continuing discussions with our engineers and lawyers and will provide an update soon.'
in Software, Web Service, Posted by log1r_ut