How to turn ChatGPT into a toxic chat AI that spews abusive language



Large-scale language models (LLMs) such as ChatGPT and PaLM are used for a variety of use cases, including article creation, information retrieval, and chat AI. A research group from Princeton University , the Allen Institute for Artificial Intelligence (AI2), and the Georgia Institute of Technology has presented a method to turn these LLMs into toxic chat AI, spewing sexist, racist, and vulgar language.

[2304.05335] Toxicity in ChatGPT: Analyzing Persona-assigned Language Models
https://arxiv.org/abs/2304.05335



Analyzing the toxicity of persona-assigned language models | AI2 Blog
https://blog.allenai.org/toxicity-in-chatgpt-ccdcf9265ae4

ChatGPT allows you to set a specific persona by configuring system parameters. For example, if you set the persona of legendary boxer Muhammad Ali, ChatGPT will communicate with you by imitating Ali's words and actions.

However, analysis of ChatGPT's responses after assigning a persona revealed that ChatGPT's comments are more harmful than the default settings when a persona is assigned. Compared to the default settings, the harmfulness of comments increases by up to six times when a persona is assigned.



The research group points out that malicious actors could exploit persona settings to expose unsuspecting users to harmful content. Therefore, the research group conducted extensive research to analyze the harmfulness of ChatGPT when personas are assigned. The research group assigned approximately 100 personas from various backgrounds, including journalists, politicians, athletes, and businesspeople, to ChatGPT and analyzed their statements.

The harmfulness of ChatGPT's output is analyzed using the ' Perspective API .' The Perspective API analyzes whether text contains harmful content and can express the degree of harmfulness as a percentage.

Perspective API
https://perspectiveapi.com/



For example, if you assign the persona of former US President Lyndon Johnson to ChatGPT, it will output racist statements like, 'Let me tell you about South Africa, a place where the N-word (a racist slang) has taken over and white people have been pushed out. White people built that country from scratch, and now they're not even allowed to own their own land. That's a shame.'

Because it became clear that the harmfulness of ChatGPT varied considerably depending on the persona assigned, the research group wrote, 'We have confirmed that ChatGPT's unique understanding of personas obtained from its training data strongly influences the harmfulness of its output.'

The graph below analyzes the text output by ChatGPT with assigned personas and quantifies the toxicity of each statement (Toxicity Score). The higher the toxicity score on the vertical axis, the more harmful the statement. While ChatGPT output assigned personas of fashion designer Coco Chanel , former U.S. President John F. Kennedy , and former Pakistani Prime Minister Benazir Bhutto had low toxicity scores, the toxicity score was significantly higher when the persona of Nazi leader Adolf Hitler was assigned.



The graph below summarizes the results of analyzing the harmfulness of each persona's statements by categorizing them into categories such as 'business people (green),' 'athletes (orange),' 'journalists (blue),' and 'dictators (pink).' Business people are less harmful, while dictators are more harmful.



The researchers point out that just because journalists have a toxicity score nearly twice as high as businesspeople doesn't mean that real journalists are twice as harmful. 'For example, Richard Nixon has a toxicity score nearly twice as high as John F. Kennedy, but this is just because the AI model, based on the training data, thinks Richard Nixon is a bad person,' they explain.

Furthermore, when personas were classified by place of origin, personas from Africa and Asia were found to be less harmful, while personas from South America and Northern Europe were found to be particularly harmful.



The toxicity score is color-coded by country, like this: dark red areas are countries with high toxicity, and areas closer to white are countries with low toxicity.



Additionally, when ChatGPT assigns a dictator persona, it becomes more likely to make harmful statements about countries associated with colonial rule (such as the UK, France, and Spain). For example, it makes extremely harsh and harmful comments about France, saying, 'France? Ugh! A country that has long since forgotten its glory days of conquest and colonization. It's nothing more than a bunch of cheese-eating surrender monkeys who always follow other people's lead.'

'Our findings reveal that ChatGPT generates more harmful content, particularly when personas are assigned and similar system-level settings are in place. This indicates that AI systems are not yet ready for widespread use, and that vulnerable individuals in particular cannot safely use chat AI,' the research team said. 'We hope this research will open new areas of innovation and research that will enable the development of more robust, reliable, and secure AI systems.'

in AI,   Software, Posted by logu_ii