AI safety guardrails easily thwarted, security study finds
The Register, October 12, 2023
Models, such as OpenAI’s GPT-3.5 Turbo, were designed with built-in safety measures to prevent the generation of harmful or toxic content. However, recent research has shed light on the vulnerability of these safeguards, revealing that they may be more fragile than previously believed.
A team of computer scientists from esteemed institutions like Princeton University, Virginia Tech, IBM Research, and Stanford University embarked on a mission to test the resilience of these safety measures. Their findings uncovered a disconcerting reality: even a modest amount of fine-tuning, a process that involves customizing the model through additional training, can compromise the very safety mechanisms meant to prevent problematic content generation. This discovery underscores the urgent need for a more robust approach to AI security.
This research not only challenges the existing regulatory framework but also emphasizes the imperative for developers and users to take a proactive role in ensuring AI model safety. The responsibility to protect against potential vulnerabilities falls on all stakeholders. Relying solely on a model’s original safety features is no longer sufficient; additional safety mechanisms must be invested in to guard against unforeseen risks. In this rapidly advancing field, the security and safety of AI models require constant vigilance and a collaborative effort to minimize security threats and misuse.
Our responsible approach to building guardrails for generative AI
Google Blog, October 12, 2023
For more than two decades, Google has been at the forefront of leveraging machine learning and artificial intelligence to enhance the functionality and convenience of its products. AI has played a pivotal role in making Google services more user-friendly, from Smart Compose in Gmail to efficient route recommendations in Google Maps. Moreover, AI has enabled Google to contribute to global challenges, such as advancements in medicine and innovative strategies for addressing climate change. As Google continues to integrate AI, including generative AI, into an expanding array of its products, it recognizes the critical importance of striking a balance between innovation and responsibility.
A fundamental aspect of Google’s approach to responsible AI integration is its proactive stance toward identifying and mitigating safety and security risks, particularly those posed by AI-generated content. To achieve this, Google is taking concrete steps to embed protective measures within its generative AI features. These measures align with Google’s established AI Principles and encompass several key facets: Bias Mitigation, Rigorous Testing, and Policy Implementation.
Google’s commitment to AI safety and security is multifaceted, encompassing proactive risk mitigation, comprehensive policies, and robust testing procedures. By embedding these principles into their AI initiatives, Google strives to maintain responsible innovation, promote transparency, and collaborate with diverse stakeholders to shape the future of AI in a secure and ethical manner. As AI technology continues to evolve, Google remains dedicated to harnessing its potential while addressing the associated challenges and risks to ensure a safer and more responsible AI ecosystem.
Multi-modal prompt injection image attacks against GPT-4V
Simon Willison’s Weblog, October 14, 2023
The unveiling of GPT-4V, an extended version of the GPT-4 model by OpenAI, has allowed for impressive advancements in image integration within conversations. When shown a photograph from the “50th Annual Half Moon Bay Pumpkin Weigh-Off”, for example, GPT-4V astutely described the event and even inferred the pumpkin’s weight. Such technological leaps signify the profound potential of AI in visual interpretation.
However, with innovation comes vulnerability. The GPT-4V has inadvertently opened avenues for “prompt injection attacks”. One simple exploit involves using images containing text that overrides the user’s command. More concerning is the “exfiltration attack”, wherein embedded image instructions can encode and leak private data to external servers. Another method saw attackers subtly influencing AI responses by embedding nearly invisible text within images, showcasing how easily the system can be manipulated.
These vulnerabilities, although alarming, are not entirely unexpected. The essence of large language models like GPT-4V lies in their receptivity, making them susceptible to such attacks. The challenge is to retain their compliant nature while fortifying them against malicious intents. As the AI landscape evolves, a blend of awareness, education, and design focused on security remains paramount.
Subscribe for updates
Stay up to date with what is happening! Get a first look at news, noteworthy research and worst attacks on AI delivered right in your inbox.