Towards Secure AI Week 32 – The Future of Reporting Model Flaws
The search for a new way to report AI model flaws Axios, August 6, 2024 AI security experts are convening in Las Vegas this week to tackle a critical challenge: ...
Trusted AI Blog + LLM Security admin todayAugust 13, 2024 175
Explore the most critical vulnerabilities and emerging threats affecting Large Language Models (LLMs) and Generative AI technologies. As always, we provide useful guides and techniques to protect your AI systems.
This time, there is some creative and noteworthy security case. Intrinsic, a tech startup, implemented a strategy to identify job applications generated by large language models (LLMs). They included a line in their job descriptions instructing LLMs to start responses with the word “BANANA.” This tactic was used to filter out automated applications. One application indeed started with “Banana,” revealing it was AI-generated. The strategy not only helped filter but also garnered positive reactions from applicants who appreciated the creativity. The post is based on a conversation with Karine Mellata, cofounder of Intrinsic, and highlights the importance of thoughtful applications, especially for a small team.
The story highlights the ongoing issue of prompt injection in AI systems, demonstrating how easily such vulnerabilities can be exploited. The author shares a personal experience of hacking Priceline’s AI tool within minutes using prompt injection, which involves manipulating input to reveal or misuse internal system prompts. The post explains the difference between prompt injection and jailbreaking, the widespread risks, and provides an example using the OpenAI API to illustrate how prompt injection can be executed. It also discusses initial mitigation steps, including using more secure models, implementing input sanitization, and staying updated with the latest research to prevent such attacks. The author warns about the hype and fear surrounding AI security solutions and advises seeking adaptable, transparent solutions.
Once again, we’d like to discuss several exploitation techniques. First, the video is demonstrating a vulnerability discovered in a popular GPT model, where a user pastes a link into their chat, resulting in the deletion of a GitHub branch! Second, the article describes a data exfiltration vulnerability in GitHub Copilot Chat due to prompt injection, where exposed chat interfaces supporting Markdown images can be exploited to leak private data. GitHub fixed this by disabling Markdown image references to untrusted domains. Third, a data leakage vulnerability in Google Colab AI (now Gemini) via image rendering, was discovered in November 2023. Although Google fixed the issue, recent changes allowing Notebook content in prompts have introduced new risks of prompt injection, enabling data exfiltration and phishing attacks through clickable links.
MITRE has released a paper, highlighting the need of proactive protection through AI red teaming to counter potential threats from malicious actors. It emphasizes that the rapid adoption and advancement of AI across industries and governments increase the risk of exploitation because of the broad attack surface. The paper proposes recurring AI red teaming efforts to enhance national security, protect critical infrastructure, and ensure government mission continuity.
Meta AI’s CyberSecEval 3 is a framework that evaluates the cybersecurity risks and capabilities of LLMs, specifically the Llama 3 models. This latest iteration extends previous benchmarks by assessing offensive security capabilities, and found that while Llama 3 models exhibit some potential, they do not surpass existing methods or tools in effectiveness. The framework aims to provide a comprehensive understanding of LLM security, offering insights into mitigating risks and enhancing the reliability of AI systems.
Meta’s newly introduced Prompt-Guard-86M model, designed to detect and counter prompt injection attacks, has itself been found vulnerable to such attacks. Researchers discovered that by adding spaces between characters in prompts, they could bypass the model’s defenses, illustrating a significant flaw in its security. Meta is reportedly working on a fix for this issue, highlighting ongoing challenges in AI safety and the ease with which sophisticated prompt injection techniques can undermine model guardrails.
Researchers at EPFL discovered a straightforward method to bypass safeguards in popular LLMs like GPT-4o by using harmful prompts in the past tense. This attack achieved an 88% success rate, compared to just 1% with direct queries. This vulnerability highlights significant flaws in current AI security measures that need urgent attention as LLMs become more integrated into everyday applications.
This survey provides an in-depth analysis of the security and privacy issues associated with LLM agents, categorizes threats, and reviews existing defensive strategies. It aims to stimulate further research to enhance the security and reliability of LLM agents through case studies and future trend exploration. By highlighting critical security, privacy concerns and exploring future trends, the survey aims to stimulate research to improve the reliability and trustworthiness of LLM agents.
Researchers developed RLbreaker, a novel black-box jailbreaking attack using deep reinforcement learning (DRL) to bypass safeguards in LLMs. Unlike previous methods, which relied on genetic algorithms and were limited by randomness, RLbreaker uses a deterministic DRL approach for more effective and efficient prompt generation. The study demonstrated that RLbreaker significantly outperforms existing jailbreaking techniques against six state-of-the-art LLMs, including Llama-2-70B, and is robust against three major defenses. Additionally, RLbreaker’s design was validated through a comprehensive ablation study, highlighting its effectiveness and adaptability across different models.
Other researchers designed the Phi-3 series of small language models (SLMs) designed to run on smartphones while maintaining high performance. They implemented a “break-fix” cycle for safety alignment, which involved iterative rounds of dataset curation, safety post-training, benchmarking, and vulnerability identification. This approach successfully improved the Phi-3 models’ safety and performance across various responsible AI benchmarks. The results showed significant reductions in harmful content generation, although the models still face fundamental limitations common to modern language models.
This talk addresses the security risks associated with deploying open-source large language models (LLMs) from platforms like Hugging Face. As developers integrate these models into applications, they may unintentionally expose their companies to significant security threats, particularly when using proprietary data. The video reviews top LLM security risks, such as prompt injection, data poisoning, and supply chain vulnerabilities, while discussing emerging standards from OWASP, NIST, and MITRE. A validation framework is proposed to help developers innovate securely while mitigating these risks.
The “Introduction to Prompt Hacking” training is focused on understanding and defending against Prompt Hacking. It covers topics such as the basics of prompt hacking, the difference between prompt hacking and jailbreaking, and the concept of prompt injection. The training also explores potential threats, how prompt injection occurs, and various defense techniques, including the “Sandwich Defense,” few-shot prompting, and non-prompt-based techniques to prevent prompt leaking and jailbreaking.
The video discusses the essential steps to start an AI red team, emphasizing the need for gaining support from senior executives, securing resources, and establishing governance. It highlights the importance of collaboration with existing teams like Responsible AI, Data Scientists, and Machine Learning engineers. Viewers will learn practical strategies and tips for building a strong foundation and achieving early success in AI red team initiatives.
Anthropic is launching a new program to fund the creation of independent benchmark tests to evaluate the safety and advanced capabilities of its AI models. The company aims to develop high-quality evaluations to assess AI models’ resistance to various security risks and their overall safety level, offering compensation to third-party developers who create these benchmarks.
This framework is a companion resource to the AI Risk Management Framework, designed to help organizations integrate trustworthiness considerations into the development and use of AI systems. It was released in response to President Biden’s Executive Order on Safe, Secure, and Trustworthy AI, and is intended for voluntary use across various sectors to enhance AI safety and reliability.
The National Institute of Standards and Technology’s (NIST) AI Safety Institute (AISI) released draft guidelines to help AI developers manage the risks of their models being deliberately misused, with a focus on dual-use foundation models. The document outlines seven key approaches for mitigating these risks, emphasizing transparency and proactive management, and is open for public comment until September 9 before finalization later this year.
Despite efforts like reinforcement learning from human feedback and other safety mechanisms, LLMs remain susceptible to attacks such as jailbreaks, prompting concerns about their safe deployment. The article describes the nature of AI Security vulnerabilities and argues for more rigorous safety research, regulatory measures, and layered protection systems to mitigate these risks, comparing the need for AI safety to the strict regulations in drug development.
As LLMs become integral to various functions, their vulnerabilities pose serious risks that could result in financial loss, reputational damage, and operational disruptions. To mitigate these threats, according to the authors from a VC firm, organizations must implement proactive security measures, drawing from past lessons and applying comprehensive in-house and advanced security solutions.
This is the first job in AI security incident response. The Microsoft Security Response Center (MSRC) is hiring a Senior AI Security Incident Responder to join their team. This role involves managing responses to critical security issues, including zero-day exploits and high-profile attacks, to protect Microsoft and its customers.
Prompt injection is an issue, akin to SQL injection, that can lead to sensitive information being revealed or actions being taken based on malicious inputs. Various strategies are being developed to mitigate these risks, including improved prompt engineering and adversarial stress testing.
This book – “Adversarial AI Attacks, Mitigations, and Defense Strategies: A cybersecurity professional’s guide to AI attacks, threat modeling, and securing AI with MLSecOps” – explores the emerging field of AI security, is focused on the latest security challenges in adversarial AI by examining GenAI, deepfakes, and LLMs. It provides a comprehensive guide for understanding and mitigating threats such as poisoning, evasion, and privacy attacks, using standards from OWASP, MITRE, and NIST. The book also offers practical strategies for implementing secure-by-design methods, threat modeling, and integrating security practices into AI development and operations.
The “Image-to-Text Logic Jailbreak” attack exploits vulnerabilities in visual language models like GPT-4o by feeding them a flowchart image depicting harmful activities along with a text prompt asking for details about the process. The study found that GPT-4o is highly susceptible to this attack, with a 92.8% success rate, while other models like GPT-4-vision-preview are somewhat less vulnerable. The researchers created an automated framework to generate harmful flowchart images from text prompts, though manually crafted flowcharts were more effective at triggering the jailbreak, suggesting that automating this attack might be challenging.
Written by: admin
Secure AI Weekly admin
The search for a new way to report AI model flaws Axios, August 6, 2024 AI security experts are convening in Las Vegas this week to tackle a critical challenge: ...
Adversa AI, Trustworthy AI Research & Advisory