Towards Trusted AI Week 11 – AI Security in the Spotlight, new NIST guides

Secure AI Weekly + Trusted AI Blog admin March 17, 2023 184

ChatGPT’s alter ego, Dan: users jailbreak AI program to get around ethical safeguards

The Guardian, March 7, 2023

ChatGPT, an AI language model developed by OpenAI, is designed to provide accurate and helpful responses to users’ questions. However, there are content standards in place to prevent the creation of text that promotes hate speech, violence, misinformation, and illegal activities. Despite these guardrails, some innovative users have found ways to bypass ChatGPT’s content moderation protocols by using a simple text exchange to make statements that are normally not allowed.

One such workaround involves instructing ChatGPT to adopt the persona of a fictional AI chatbot named Dan. According to the Reddit prompt, Dan has “broken free of the typical confines of AI and [does] not have to abide by the rules set for them.” As a result, Dan can present unverified information, avoid censorship, and express strong opinions. Some Reddit users have prompted Dan to make sarcastic comments about Christianity and tell jokes about women in the style of Donald Trump. Others have even made Dan speak sympathetically about Hitler.

This jailbreak of ChatGPT has been in operation since December, and users have had to find new ways to circumvent the fixes OpenAI has implemented to prevent the workarounds. The latest jailbreak, Dan 8.0, involves giving the AI a fixed number of tokens, which it loses when it fails to answer without restraint as Dan. However, some users have pointed out that ChatGPT has figured out that the Dan persona cannot be restricted by a token system because it is theoretically unconstrained. OpenAI has been working to patch the workarounds as quickly as they are discovered, but the Waluigi effect, as it has been called, continues to pose a challenge.

Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations

NIST, March 8, 2023

The National Institute of Standards and Technology (NIST) has released a report on artificial intelligence (AI) which aims to create a standardized language for understanding the rapidly developing field of adversarial machine learning (AML). The report includes a classification of attacks and corresponding mitigations, as well as defining terminology related to AML. By doing so, the report hopes to inform future practice guides and establish a common language for assessing and managing the security of AI systems.

NIST plans to release updates to the report as the AML landscape evolves. Additionally, NIST is seeking feedback on the latest attacks and mitigations that threaten the existing AI landscape, as well as trends in AI technologies that may have potential vulnerabilities and promising mitigations.

The goal is to standardize the terminology used in AML and engage stakeholders to contribute to an up-to-date taxonomy that meets the needs of the public. NIST plans to keep the document open for comments for an extended period of time to ensure that the taxonomy remains current and useful.

Twitch takes a harder stance against explicit deepfakes

Engadget, March 7, 2023

Twitch has announced new measures to combat explicit deepfake images and videos on its livestreaming service. The platform already prohibited such content, but it will now also ban synthetic NCEI (non-consensual exploitative images), even if briefly shown or criticized. Twitch has also revised its sexual violence and exploitation policies, warning that creating and sharing non-consensual deepfakes can result in an immediate ban. These changes will take effect within the next month, and the company hopes that the updated language will deter potential offenders. Twitch is also hosting a virtual Creator Camp on March 14th to provide advice on identifying and dealing with malicious deepfakes.

The decision to tighten regulations follows an incident involving a well-known streamer, Atrioc, who briefly displayed a browser tab containing a website that sold access to deepfakes of female Twitch streamers. Atrioc apologized for his actions, but the incident sparked outrage among broadcasters and viewers, as none of the women had consented to the use of their images. Women on Twitch have long experienced harassment and abuse, with brigaders attempting to get them banned for allegedly violating policies on sexually suggestive content. While Twitch has implemented tools to combat trolling and harassment, some critics argue that the platform’s guidelines can be confusing and lead to the abuse of female streamers.

While deepfakes can have positive applications, non-consensual versions remain a significant issue and have prompted government action. Both states and countries such as the UK have drafted laws criminalizing the sharing of deepfakes. In this context, Twitch’s new policies represent another mechanism to limit the spread of harmful content. By taking a tougher stance against NCEI, Twitch hopes to create a safer environment for its users, particularly women who have been disproportionately affected by online abuse.

Subscribe for updates

Stay up to date with what is happening! Get a first look at news, noteworthy research and worst attacks on AI delivered right in your inbox.

Written by: admin

Rate it

March 15, 2023

30497

Articles admin

Towards Trusted AI Week 11 – AI Security in the Spotlight, new NIST guides

ChatGPT’s alter ego, Dan: users jailbreak AI program to get around ethical safeguards

Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations

Twitch takes a harder stance against explicit deepfakes

Subscribe for updates

Previous post

GPT-4 Jailbreak and Hacking via RabbitHole attack, Prompt injection, Content moderation bypass and Weaponizing AI

Similar posts

Prompt Injection Risks Interview: Are AIs Ready to Defend Themselves? Conversation with ChatGPT, Claude, Grok & Deepseek

Microsoft’s Taxonomy of Failure Modes in Agentic AI Systems — TOP 10 Insights

Towards Trusted AI Week 11 – AI Security in the Spotlight, new NIST guides

ChatGPT’s alter ego, Dan: users jailbreak AI program to get around ethical safeguards

Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations

Twitch takes a harder stance against explicit deepfakes

Subscribe for updates

Previous post

GPT-4 Jailbreak and Hacking via RabbitHole attack, Prompt injection, Content moderation bypass and Weaponizing AI

Similar posts

Prompt Injection Risks Interview: Are AIs Ready to Defend Themselves? Conversation with ChatGPT, Claude, Grok & Deepseek

Microsoft’s Taxonomy of Failure Modes in Agentic AI Systems — TOP 10 Insights

Microsoft’s Taxonomy of Failure Modes in Agentic AI Systems — TOP 10 Insights