Secure AI Research Papers: Jailbreaks, AutoDAN, Attacks on VLM and more

Trusted AI Blog + Adversarial ML admin todayNovember 16, 2023 225

Background
share close

Researchers explore the vulnerabilities that lie within the complex web of algorithms, and the need for a shield that can protect against unseen but not unfelt threats.  

These papers published in October 2023 collectively study AI’s vulnerability, from the simplicity of human-crafted deceptions to the complexity of multilingual and visual adversarial challenges.


Subscribe for the latest AI Security news: Jailbreaks, Attacks, CISO guides, and more

    Human-Producible Adversarial Examples

    The study on Human-Produced Adversarial Examples is the first of its kind research on adversarial attacks in the physical world by individuals without sophisticated algorithms. It aimed to create a new class of adversarial examples: simple, human-drawable tags capable of deceiving machine vision models with just a few lines. The researchers built upon differential rendering to develop adversarial examples created using common marker pens. 

    They found that drawing as few as four lines could disrupt a YOLO-based model in over half the cases, with nine lines causing disruption in over 80% of the tested cases. To account for human error, they designed a method for line placement that maintains effectiveness despite drawing inaccuracies. Their evaluations in both digital and real-world scenarios confirmed that untrained humans could apply these tags effectively, as validated by a user study. 

    The results demonstrated that these “adversarial tags,” akin to graffiti, could mislead machine vision systems effectively, presenting new implications for privacy, security, and machine learning robustness. This research shared a practical and accessible way to generate adversarial examples, expanding the scope of potential threats to machine vision systems.

    The significant contribution of this research is the insight into the accessibility and potential risks of adversarial attacks in real-world scenarios.

    Multilingual Jailbreak Challenges in Large Language Models 

    The research investigated safety issues in large language models (LLMs), focusing on the “jailbreak” problem where models behave undesirably due to malicious multilingual inputs.

    Researchers conducted experiments across 30 languages and found that low-resource languages had a higher rate of unsafe outputs. The study differentiated between unintentional risks from non-English prompts and intentional attacks using combined malicious and multilingual inputs. The results showed that in unintentional scenarios, unsafe output was three times more likely in low-resource languages. Intentionally, multilingual prompts led to high unsafe output rates of 80.92% for ChatGPT and 40.71% for GPT-4. To combat these risks, researchers developed the SELF-DEFENSE framework to automatically generate multilingual training data, significantly reducing unsafe content creation in both tested scenarios. 

    This research contributes to enhancing the safety of LLMs by addressing the previously overlooked multilingual dimensions of model security, highlighting the trade-off between safety and usefulness in model training. The details are released at GitHub.

    Misusing Tools In Large Language Models With Visual Adversarial Examples

    The research addresses security vulnerabilities in large language models that have been enhanced to use tools and process multiple modalities. The objective was to show that LLMs could be manipulated using visual adversarial examples to perform actions that compromise user privacy and data integrity. 

    The researchers devised a method to create adversarial images that prompt LLMs to execute specific tool commands, like deleting calendar events or booking hotels, which appear natural and do not disrupt the overall conversation flow. These images were tested and found to be effective (∼98% success rate) and stealthy (maintaining ∼0.9 Structural Similarity Index Measure). 

    The contribution of this work is significant as it presents a realistic and potentially harmful white-box attack on multimodal LLMs, demonstrating a previously unaddressed risk. The research’s limitations include its white-box nature, requiring access to model parameters, and its untested applicability to all multimodal LLMs. The researchers plan to release their code and images to prompt the enforcement of stricter security measures on LLMs’ access to third-party tools.

    AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models

    The study presents AutoDAN, an innovative adversarial attack method for LLMs, which poses a significant challenge to current LLM safety alignment efforts. The research aimed to develop an attack that combines the human readability of manual jailbreak attacks with the success rate of adversarial gibberish prompts while avoiding detection by perplexity-based filters. 

    AutoDAN succeeds in automatically generating low-perplexity, interpretable prompts that execute jailbreak tasks and leak system prompts, demonstrating the practical vulnerability of LLMs to such intelligent attacks. The research tested AutoDAN’s effectiveness, finding that it outperforms non-readable attacks in transferability and applicability using limited data. 

    The paper’s primary contribution is providing a novel approach to “red-team” LLMs, exposing their weaknesses more effectively than previous methods and advancing the understanding of the transferability of jailbreak attacks. Additionally, by addressing the less explored aspect of leaking system prompts, the paper expands the scope of potential adversarial applications, emphasizing the need for more robust defenses against such sophisticated attacks.

    Privacy in Large Language Models: Attacks, Defenses and Future Directions

    Researchers of this work consider privacy risks associated with the use of LLMs and present a comprehensive analysis of current privacy attacks and defenses. The study categorizes these privacy attacks based on the adversary’s capabilities, highlighting the vulnerabilities in LLMs. 

    The paper reviews and assesses existing defense strategies against such privacy attacks, marking the limitations of current approaches. The authors also outline emerging privacy concerns as LLMs continue to advance and suggest potential research directions to improve privacy preservation in LLMs. 

    The contribution of the research lies in its thorough compilation and analysis of privacy risks and defenses, offering a starting point for developing more effective privacy solutions. It emphasizes the need for a broader and more practical evaluation of privacy measures to ensure that future advancements in LLMs do not compromise user privacy. This work is geared towards helping researchers understand the landscape of privacy in LLMs and calls for increased focus on creating robust, privacy-conscious models.


    The reviewed studies collectively explore the vulnerabilities of contemporary AI systems, whether through direct physical interactions, as with adversarial tags, or through the exploitation of multilingual and multimodal capabilities in LLMs. 

    A commonality among them is the demonstration of novel attack vectors — ranging from simple line drawings that deceive visual models to sophisticated language prompts that manipulate LLMs — underscoring the evolving risks AI technologies face. 

    However, they differ in their methods and applications; while the first study leverages physical artifacts easily created by individuals, the others focus on digital manipulation, with significant implications for privacy and security in an increasingly connected digital world. The contributions are crucial in driving the conversation and actions towards more resilient AI defenses and responsible AI usage.

     

    Subscribe now and be the first to discover the latest GPT-4 Jailbreaks, serious AI vulnerabilities, and attacks. Stay ahead!

      Written by: admin

      Rate it
      Previous post