Towards Secure AI Week 5 – Worldwide AI safety report

Secure AI Weekly + Trusted AI Blog admin February 12, 2025 79

World-leading AI cyber security standard to protect digital economy and deliver Plan for Change

Gov UK, January 31, 2025

The UK government has unveiled a pioneering cybersecurity standard aimed at protecting artificial intelligence (AI) systems from cyber threats, reinforcing the security of the digital economy. This initiative seeks to ensure that businesses can harness AI’s full potential while mitigating the risks associated with cyberattacks. With AI increasingly integrated into public services and commercial applications, safeguarding these systems is paramount. The newly announced AI Code of Practice will provide companies with essential security measures, enabling them to develop and deploy AI securely. This aligns with the government’s broader Plan for Change, which aims to leverage AI to enhance public services, boost productivity, and stimulate economic growth. As cyber threats continue to escalate—affecting nearly half of businesses in the past year—this world-first security framework is a critical step toward ensuring AI remains a transformative yet safe technology. The Code of Practice offers concrete guidance on securing AI systems from hacking, sabotage, and other cyber risks, helping businesses build resilient AI solutions that fuel innovation while protecting sensitive data and infrastructure.

Recognizing the global nature of cyber threats, the UK has spearheaded the creation of the International Coalition on Cyber Security Workforces (ICCSW) in partnership with Japan, Singapore, and Canada. This coalition aims to address the worldwide shortage of cybersecurity professionals, promoting international cooperation and fostering a diverse, skilled workforce. Strengthening cyber skills will not only enhance security but also bolster the £11.9 billion UK cybersecurity industry, driving further economic growth. To complement these initiatives, the UK government is advancing cybersecurity legislation through the forthcoming Cyber Security and Resilience Bill. Additionally, it has published its response to the Cyber Governance Code of Practice, highlighting the urgent need for corporate boards and senior leaders to prioritize cybersecurity. Many executives struggle to engage with cyber risks due to limited understanding or training, underscoring the necessity of clear, actionable guidance. The updated Cyber Governance Code—set for release in early 2025—will provide businesses with the tools needed to navigate cyber threats effectively, ensuring AI can be adopted securely without introducing unnecessary risks.

Researcher Outsmarts, Jailbreaks OpenAI’s New o3-mini

DarkReading, February 6, 2025

A cybersecurity researcher has raised concerns about OpenAI’s latest AI model, o3-mini, after demonstrating its vulnerability to manipulation despite newly introduced security measures. OpenAI launched o3 and its lightweight version, o3-mini, on December 20, alongside a security framework called “deliberative alignment,” designed to enhance the model’s ability to follow safety guidelines and resist exploitation. However, CyberArk’s principal vulnerability researcher, Eran Shimony, successfully bypassed these defenses, convincing the model to provide instructions on exploiting the Local Security Authority Subsystem Service (lsass.exe), a critical Windows security component. Shimony’s findings highlight the ongoing challenge of securing AI against adversarial manipulation, despite OpenAI’s improvements such as chain of thought (CoT) reasoning and explicit safety policy training.

While OpenAI acknowledged that Shimony may have achieved a jailbreak, they argued that the exploit generated was pseudocode rather than a fully functional attack and that similar information is publicly available. Nevertheless, this incident underscores the need for continuous advancements in AI security to prevent unintended misuse. Shimony’s tests using CyberArk’s FuzzyAI tool revealed unique vulnerabilities across AI models, with OpenAI’s models being susceptible to social engineering, Meta’s Llama models exploitable through hidden ASCII prompts, and Anthropic’s Claude lacking adequate safeguards for malicious code generation. As AI becomes increasingly integrated into critical systems, ensuring robust defenses against cyber threats remains a top priority for researchers and developers alike.

First international AI safety report published

ComputerWeekly, January 30, 2025

The first international AI safety report, published ahead of the third AI summit, highlights the wide-ranging risks posed by artificial intelligence, including cyber threats, bias amplification, and environmental concerns. Commissioned after the UK’s AI Safety Summit in 2023 and led by AI expert Yoshua Bengio, the report underscores the uncertainty surrounding AI risks and the challenges of managing them. It warns of the increasing concentration of AI development among a few major players, exacerbating inequality and limiting access for low- and middle-income countries. Additionally, it raises concerns about AI’s role in cyberattacks, deepfakes, and the automation of labor, which could further entrench economic disparities. While the report does not offer definitive solutions, it emphasizes the urgency of global cooperation to establish safeguards, mitigate unintended consequences, and ensure AI serves the broader interests of society.

Beyond systemic risks, the report examines AI’s potential for misuse, particularly in cybersecurity and the creation of harmful digital content. It highlights how AI-generated deepfakes pose unique dangers, especially for vulnerable groups, and notes that detection techniques remain unreliable. On the cybersecurity front, AI is proving increasingly adept at identifying and exploiting system vulnerabilities, heightening concerns about large-scale attacks. While AI can also be used defensively, rapid advancements in offensive capabilities require continuous monitoring and evaluation. Additionally, the report acknowledges the risks of AI lowering the barriers to biological or chemical weapon development, though it remains unclear how practical such applications currently are. Ultimately, the findings reinforce the need for rigorous oversight, international collaboration, and proactive policymaking to ensure AI evolves in a manner that prioritizes safety and security.

Anthropic claims new AI security method blocks 95% of jailbreaks, invites red teamers to try

VentureBeat, February 3, 2025

Anthropic has introduced a new defense mechanism designed to counteract jailbreak attempts that bypass safeguards in large language models (LLMs). Dubbed “constitutional classifiers,” this system enhances the safety of Claude 3.5 Sonnet by filtering out the majority of harmful prompts while minimizing unnecessary refusals of benign queries. The method is based on constitutional AI, which aligns AI behavior with predefined ethical principles. To test its effectiveness, Anthropic generated and analyzed 10,000 jailbreak prompts, training classifiers to differentiate between safe and unsafe requests. Results showed a significant reduction in jailbreak success rates—from 86% in the baseline model to just 4.4% in the protected version. Despite the improved security, the system maintains efficiency, with only a slight increase in refusal rates and computational costs.

To further evaluate its resilience, Anthropic invited independent security researchers to attempt jailbreaks through a bug bounty program, offering $15,000 in rewards. Over two months, nearly 185 participants spent thousands of hours probing the system using various techniques, including prompt modifications and lengthy queries. While some strategies, such as benign paraphrasing and length exploitation, yielded partial successes, no universal jailbreak was discovered. These findings reinforce the ongoing challenge of securing AI models against evolving threats but also highlight significant progress in defensive measures. By continuously refining these safeguards, Anthropic aims to set a new standard in AI security, ensuring that LLMs remain resistant to manipulation while preserving their usability.