Towards Secure AI Week 18 — LLM Jailbreaks Hit New Highs, AI Security Market Accelerates

Trusted AI Blog ADMIN todayMay 12, 2025 110

Background
share close

As LLMs become embedded across enterprise applications, new red-teaming research shows jailbreak success rates surpassing 87% on models like GPT-4—even under safety-aligned settings. Techniques such as multi-turn roleplay, token-level obfuscation, and cross-model attacks continue to outpace current safeguards. Meanwhile, insider misuse and unfiltered GenAI outputs pose growing risks, prompting calls for runtime filtering, external classifiers, and sandboxed model testing.

At the same time, global investment in AI security is surging, with AI Trust, Risk, and Security Management (TRiSM) projected to grow 21.6% annually through 2030. National security agencies warn of AI-assisted cyber intrusions scaling rapidly, while research labs race to embed self-defensive logic into LLMs without expensive retraining.

To stay resilient, organizations must move beyond policy statements—embedding layered defenses, red teaming programs, and continuous oversight into every GenAI deployment.

Agentic AI Security Must Be Built-In, Not Bolted On

GovInfo Security, May 5

As autonomous AI agents gain widespread adoption, security must evolve from an add-on to a foundational design principle—especially as unverified outputs pose real risks to trust, safety, and organizational resilience.

In a video interview at RSAC 2025, IBM’s Suja Viswesan emphasized the urgency of integrating security into AI systems from the start, rather than applying protections retroactively. She warned that as AI agents operate with growing autonomy, traditional cybersecurity models no longer suffice. Human oversight remains essential, and organizations must enforce end-to-end enterprise-wide security protocols to prevent the spread of unvetted, potentially harmful AI decisions.

How to deal with it:
Design AI systems to be secure-by-default rather than securing them post-deployment.
— Maintain human-in-the-loop governance to oversee AI behavior.
— Enforce consistent enterprise-wide security protocols across all AI deployments.

Booming Market for Securing AI Predicted at 21.6% CAGR to 2030

The Global Security Market, May 6

The rapidly growing demand for secure and trustworthy AI is driving massive investment in AI Trust, Risk, and Security Management (TRiSM), signaling a new wave of focus on compliance, explainability, and defense against AI vulnerabilities.

A new market forecast projects the global AI TRiSM sector will reach $7.44 billion by 2030, growing at 21.6% CAGR, as organizations across industries increasingly prioritize responsible AI adoption. Regulatory pressure, combined with rising concerns about bias, misuse, and data exposure, is fueling interest in comprehensive AI risk mitigation. TRiSM providers are expanding partnerships with security and infrastructure vendors to integrate explainability, monitoring, and protection tools across healthcare, finance, and manufacturing. Generative AI adds urgency, as its dynamic nature introduces new risks in data handling and model behavior, creating further need for tailored TRiSM solutions.

How to deal with it:
— Invest early in AI TRiSM tools to manage Security, Safety, and Compliance.
— Integrate explainability and monitoring solutions into the AI development lifecycle.
— Collaborate across vendors, developers, and regulators to shape effective AI risk strategies.

LLM Jailbreaks Threaten Corporate Security Standards

Medium, May 10

This study demonstrates that off-the-shelf LLMs, even with safety benchmarks, are highly vulnerable to jailbreaks in real-world enterprise settings. An intern-led Responsible AI project tested jailbreak resilience of Meta LLaMA-3 and Microsoft Phi-3 using real attack techniques like Many-Shot, GCG, and PAIR. All models failed under context-rich corporate prompts. The study found that broad “no harm” safety policies are insufficient without domain-specific safeguards. Models were especially weak in detecting prompt, token, and dialogue-based jailbreaks tailored to internal use cases.

How to deal with it:
— Use external classifiers like Llama Guard or commercial tools for output filtering.
— Define risk categories aligned with the organizational context to improve model alignment.
— Perform continuous AI Red Teaming against such attacks.

LLMs Exposed: 87% of Jailbreak Prompts Succeed Against GPT-4

Arxiv, May 7

This is the most in-depth empirical study yet confirming how easily even state-of-the-art LLMs can be jailbroken through prompt engineering, highlighting urgent risks for enterprises deploying these models.

Independent researcher Chetan Pathade tested over 1,400 adversarial prompts against four major LLMs—GPT-4, Claude 2, Mistral 7B, and Vicuna—revealing high attack success rates across the board. GPT-4 was the most vulnerable, with 87.2% of crafted prompts bypassing its safety filters. The attacks included roleplay, logic traps, multi-turn conversations, and obfuscated encodings. Generalization was also high: jailbreaks that worked on GPT-4 transferred to Claude 2 and Vicuna in 64.1% and 59.7% of cases. The paper also proposes defense strategies combining red teaming with sandboxing, alongside layered filtering frameworks like PromptShield and Palisade.

How to deal with it:
— Monitor and filter inputs and outputs using open-source or commercial Guardrails.
— Define risk categories aligned with the organizational context to improve model alignment.
— Use platforms like Adversa’s AI Red Teaming Platform to simulate jailbreaks continuously and harden defenses based on real attack scenarios.

Defending AI Against Adversarial Attacks: A Framework for Safer LLMs

Quantum News, May 5

As prompt injection and adversarial attacks grow more sophisticated, researchers propose a new framework to help large language models defend themselves—without relying on costly retraining.

A new defense system leverages advanced natural language processing and contextual summarization to filter harmful prompts with 98.71% accuracy, enhancing LLM resilience against jailbreaks and manipulation. The framework introduces prompt-level classifiers and summarizers capable of detecting threats autonomously, offering a lightweight, scalable alternative to traditional retraining methods. The research also emphasizes the broader ethical and security implications of LLM misuse—from leaking sensitive data to generating malicious content—and highlights the growing need for continuous monitoring, transparent defenses, and clear policy guidelines to ensure safe deployment.

How to deal with it:
— Deploy runtime prompt filtering tools based on NLP and zero-shot classification.
— Enforce security policies to detect and block adversarial manipulation.
— Promote collaboration between technical and policy teams to align AI use with safety and ethics.

Subscribe for updates

Stay up to date with what is happening! Get a first look at news, noteworthy research and worst attacks on AI delivered right in your inbox.

    Written by: ADMIN

    Rate it
    Previous post