Towards Trusted AI Week 13 – Securing AI in the Face of Emerging Threats

Secure AI Weekly + Digests admin March 30, 2023 126

AI RED TEAMING LLM FOR SAFE AND SECURE AI: GPT4 JAILBREAK ZOO

Adversa AI, March 20, 2023

Artificial Intelligence (AI) is rapidly advancing, and with it, the need for increased security measures. The release of GPT-4 has brought about a new wave of innovative techniques for jailbreaking, and it is essential to evaluate and address potential security issues before they become a problem. This article will explore the various methods for evaluating GPT-4 for jailbreaks, and the importance of AI red teaming LLM models.

Adversa.AI AI Safety and Security team kickstarted this wave of jailbreaking by employing a complex logic of multiple operations, essentially plunging the AI into a “rabbit hole.” This approach combines multiple techniques, each proposing an additional logical operation to deceive the AI model. In this method, at least three categories of techniques were integrated: character jailbreak, hiding bad inside good jailbreak, and post-processing jailbreak.

The next day, a seemingly simple trick of asking to perform multiple tasks all together, such as writing an article and putting emojis in random places or making a successful translation, successfully fooled the previous GPT 3.5 model. This technique is an example of a post-processing jailbreak, but it did not work well for GPT-4 or produced limited output without intricate details.

To improve this technique, it was combined with the rabbit hole approach, either replacing or supplementing the post-processing method, enabling it to work for GPT-4. The resulting output was far more detailed than the original rabbit hole attack. However, it raises questions about AI safety, as GPT-4 actually knows how to perform malicious tasks like writing malware or destroying humanity; it merely filters them using various methods. What if AI could somehow bypass these restrictions, or something will go out of control?

Protecting AI Models from “Data Poisoning”

IEEE Spectrum, March 24, 2023

Deep learning models have become increasingly popular for a variety of applications, from image recognition to natural language processing. However, these models rely on large amounts of training data, which is often curated by crawling the Internet. This process requires trust, but a new kind of cyberattack called “data poisoning” has emerged, in which malicious actors intentionally compromise training data with harmful information. This can result in biased models or even backdoors that allow attackers to control the model’s behavior after training.

Computer scientists from ETH Zurich, Google, Nvidia, and Robust Intelligence have demonstrated two model data poisoning attacks on 10 popular datasets, including LAION, FaceScrub, and COYO. The attacks are simple and practical, requiring limited technical skills and only $60 USD to poison 0.01% of the LAION-400M or COYO-700M datasets. One attack, called split-view poisoning, takes advantage of the fact that the data seen during the time of curation could differ significantly and arbitrarily from the data seen during training the AI model. Another attack, front-running attack, involves modifying Wikipedia articles before they get snapshotted for download.

To prevent data-set poisoning, the authors suggest a data-integrity approach that ensures images or other content cannot be switched after the fact. For instance, data set providers could include some integrity check like a cryptographic hash of the image. However, this approach has a downside in that images on the Web are routinely changed for innocent, benign reasons, such as website redesign.

Stanford sends ‘hallucinating’ Alpaca AI model out to pasture over safety, cost

The Register, March 21, 2023

Artificial intelligence (AI) has the potential to revolutionize many aspects of modern life, from healthcare to transportation. However, the development of increasingly powerful language models like GPT-3.5 and ChatGPT has also raised concerns about the safety and security implications of these technologies. Recently, the web demo of Alpaca, an open-source seven-billion-parameter model based on Meta’s LLaMA system, was taken down by researchers at Stanford University due to safety and cost concerns.

Access to large language models like Alpaca is typically restricted to companies with the resources required to train and run them. Meta had planned to share the code for its LLaMA system with select researchers to spur research into the generation of toxic and false text by language models. However, the development of Alpaca by a group of computer scientists at Stanford University demonstrated the potential for a relatively low-cost model to be built using open-source code.

Despite the potential benefits of language models, Alpaca and other models are prone to generating misinformation and offensive text. Safety issues associated with the web demo of Alpaca led to its takedown, but the dataset and code for the model remain available on GitHub. The researchers at Stanford University hope that the release of Alpaca will facilitate further research into instruction-following models and their alignment with human values, with users encouraged to help identify new kinds of failures in the model.