Secure AI Research Papers: Breakthroughs and Break-ins in LLMs

Trusted AI Blog + Adversarial ML admin todayDecember 18, 2023 226

Background
share close

A group of pioneering researchers have embarked on a quest to unveil the serious vulnerabilities and strengths of various AI applications from Classic Computer Vision to the latest LLM’s and VLM’s.

Their latest works were collected in this digest for you covering jailbreak prompts, and transferable attacks, shining a light on the delicate balance between AI’s capabilities and its susceptibilities. 

Beyond Boundaries: A Comprehensive Survey of Transferable Attacks on AI Systems

This paper is a comprehensive study on Transferability attacks. It aims to shed light on transferable attacks in AI systems, focusing on their prevalence in cyber-physical security across various domains like image, text, audio, and video.

The researchers conducted an extensive review of learning-based attacks, emphasizing their transferability. They analyzed these attacks from multiple perspectives – data, process, model, and system and examined their application in real-world scenarios like autonomous driving and speech recognition.

The study revealed advanced techniques that enhance attack transferability, including data augmentation methods, gradient and heuristic approaches, and the utilization of pre-trained models. It highlighted the widespread impact of these attacks across different AI systems, including computer vision and language models.

This research contributes significantly to the field by mapping the landscape of transferable attacks, identifying challenges for future research, and underscoring the need for robust countermeasures to improve AI system security and reliability.

Scalable Extraction of Training Data from (Production) Language Models

This paper is the first detailed research on data extraction from LLM. It investigates ‘extractable memorization’ in machine learning models, focusing on how adversaries can efficiently extract substantial amounts of training data from various language models without prior knowledge of the dataset.

The researchers conducted a large-scale study on extractable memorization across various models, including open-source, semi-open, and closed models like ChatGPT. They developed a new divergence attack for aligned models like ChatGPT, significantly increasing the rate of data extraction.

The study found that larger and more advanced models are more susceptible to data extraction attacks. Surprisingly, models like gpt-3.5-turbo showed minimal memorization, attributed to alignment techniques. However, the researchers devised a method to circumvent this, successfully extracting significant amounts of training data.

This research highlights a critical vulnerability in large language models (LLMs) regarding data security and privacy, emphasizing the need for robust protective measures in privacy-sensitive applications. It also raises intriguing questions about the efficiency and performance of LLMs in relation to the memorized data.

SD-NAE: Generating Natural Adversarial Examples with Stable Diffusion  

The research paper aims to address the challenge of robustly evaluating DL image classifiers by focusing on the generation of Natural Adversarial Examples (NAEs) using a novel approach. 

The researchers developed a method named SD-NAE (Stable Diffusion for Natural Adversarial Examples) that actively synthesizes NAEs through a controlled optimization process. This involves perturbing the token embedding corresponding to a specific class, guided by the gradient of loss from the target classifier, to fool it while resembling the ground-truth class. 

The SD-NAE method demonstrated effectiveness in producing diverse and valid NAEs, showing a substantial fooling rate against an ImageNet classifier. The generated NAEs provided new insights into evaluating and improving the robustness of image classifiers.

This research contributes to the field by introducing an innovative and controlled method for generating NAEs, offering a more flexible and effective way to evaluate and develop robust deep learning models. It opens avenues for further research in utilizing generative models for robust evaluation in various domains like images, text, and speech.

Code is available at GitHub

Assessing Prompt Injection Risks In 200+ Custom GPTS

The paper aims to highlight a significant security vulnerability in user-customized Generative Pre-trained Transformers (GPTs), specifically focusing on prompt injection attacks that can lead to system prompt extraction and file leakage. Recently Adversa AI Research team published its own findings on this topic in the article called “AI Red Teaming GPTs and this research is further explorint this issue at scale.

Researchers conducted extensive testing on over 200 custom GPT models using crafted adversarial prompts. The study involved evaluating the susceptibility of these models to prompt injections, where an adversary could extract system prompts and access uploaded files.

The tests revealed a critical security flaw in the custom GPT framework, demonstrating that a majority of these models are vulnerable to both system prompt extraction and file leakage. The study also assessed the effectiveness of existing defense mechanisms against prompt injections, finding them inadequate against sophisticated adversarial prompts.

This research underscores the urgent need for robust security frameworks in the design and deployment of customizable GPT models. It highlights the potential risks associated with the instruction-following nature of these AI models and contributes to raising awareness in the AI community about the importance of ensuring secure and privacy-compliant AI technologies.

A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

This research paper focuses on addressing the issue of ‘jailbreak’ prompts in LLMs like ChatGPT and GPT-4, which circumvent safeguards and lead to the generation of harmful content, aiming to uncover vulnerabilities and enhance the security of LLMs.

The researchers developed ReNeLLM, an automated framework that generalizes comprehensive jailbreak prompt attacks into two aspects: Prompt Rewriting and Scenario Nesting. This framework uses LLMs themselves to generate effective jailbreak prompts, conducting extensive experiments to evaluate its efficiency and effectiveness.

The study found that ReNeLLM significantly improves the attack success rate and reduces time cost compared to existing methods. It also revealed the inadequacy of current defense methods in safeguarding LLMs against such generalized attacks, offering insights into the failure of LLM defenses.

By introducing ReNeLLM and demonstrating its effectiveness, this research contributes to the understanding of LLM vulnerabilities and the development of robust defenses. It emphasizes the need for safer, more regulated LLMs and encourages the academic and vendor communities to focus on enhancing the security of these models.

Transfer Attacks and Defenses for Large Language Models on Coding Tasks 

This is the first research on attacks of Code generation. It investigates the impact of adversarial perturbations on coding tasks performed by Large Language Models (LLMs) like ChatGPT and develops prompt-based defenses to enhance their resilience against such attacks.

Researchers studied the transferability of adversarial examples from smaller code models to LLMs and proposed novel prompt-based defense techniques. These techniques involve modifying prompts with additional information, such as examples of adversarially perturbed code, and using meta-prompting where the LLM generates its own defense prompts.

The study found that adversarial examples generated using smaller code models can indeed compromise the performance of LLMs. However, the proposed self-defense and meta-prompting techniques significantly reduced the attack success rate (ASR), proving effective in improving the models’ resilience against such attacks.

This research demonstrates the vulnerability of LLMs to adversarial examples in coding tasks and introduces innovative, prompt-based self-defense strategies that effectively mitigate these threats. These findings contribute to enhancing the robustness of LLMs in code-related applications and open up possibilities for further exploration in leveraging LLMs for self-defense in various reasoning tasks.


In the world of AI, as these research papers vividly illustrate, progress and peril are inextricably linked. While each study explores a unique aspect of AI vulnerability — from the subtleties of adversarial perturbations in coding tasks to the complex dynamics of jailbreak prompts — they collectively underscore a common truth: the path to AI advancement is paved with challenges of security and ethics. Yet, their differences lie in their approach and focus — some delve into the intricacies of attack mechanisms, others into innovative defenses, each contributing a distinct piece to the puzzle of AI security.

Written by: admin

Rate it
Previous post