Top GenAI security resources — March 2026

GenAI Security + GenAI Security Digest Sergey todayMarch 9, 2026

Background
share close

This February, we saw strong evidence that GenAI attacks are becoming large-scale, production-level operations. From Anthropic’s exposure of massive distillation attacks to Microsoft’s discovery of widespread recommendation poisoning, the threat landscape has shifted toward systemic integrity and data theft — not to mention the use of AI in scaled-up traditional cybercrime. This month’s digest focuses on these industrial-scale incidents and the emerging defense architectures designed to counter them.

Statistics:

Total resources: 22
Category breakdown:

Category Count
GenAI defense 4
GenAI research 3
Report 3
GenAI Incident 2
GenAI security 101 2
Technique 2
Tool 2
GenAI exploitation 1
Threat Model 1
AI Red Teaming 1
Article 1

GenAI security resources:

GenAI defense

Hiding in plain text: Detecting concealed jailbreaks via activation disentanglement

This paper introduces the ReDAct framework, which focuses on disentangling goal and framing representations in LLM activations. The accompanying FrameShield anomaly detector significantly improves the detection of concealed attacks across various LLM families.

Fail-closed alignment for Large Language Models

Researchers propose a novel fail-closed alignment defense that forces multiple independent refusal pathways. This approach reduces jailbreak attack success rates by over 92% while maintaining high compliance for benign requests.

Differentially private retrieval-augmented generation

The DP-KSA framework provides formal differential privacy guarantees for RAG outputs. It utilizes a propose-test-release paradigm combined with keyword extraction to secure retrieval systems against data leakage.

Reuse and renew: Testing AI safety sustainably

Johns Hopkins and Microsoft researchers have developed an efficient, reusable framework to evaluate the safety of LLMs before deployment. This approach aims to streamline the testing process while ensuring robust safety checks are in place.

GenAI research

ER-MIA: Black-box adversarial memory injection attacks on long-term LLM memory

Researchers present the first systematic black-box framework for adversarial memory injection. The study explores content-based and question-targeted attack settings using composable primitives to manipulate long-term LLM memory.

Revealing Claude 4.6 system prompt using a chain of partial-to-full prompt leak attack

The Claude Opus 4.6 system prompt was extracted and analyzed the day after its release. This research evaluates the security constraints, guardrails, and improvements compared to previous versions.

Retrieval pivot attacks in hybrid RAG

This study reveals how hybrid RAG systems create a dangerous pivot boundary enabling cross-tenant data leakage. The research demonstrates a 95% leakage probability with significant amplification, suggesting per-hop authorization as a key mitigation.

Report

“GTIG AI threat tracker: Distillation, experimentation, and integration”

Google’s quarterly threat intelligence report covers nation-state AI exploitation and model distillation attacks on Gemini. It also highlights APT groups misusing AI and the emergence of new AI-integrated malware.

Every documented AI app data breach since January 2025: 20 incidents

This report compiles 20 documented AI app data breaches, analyzing their systemic root causes. Common failures include misconfigured Firebase instances, missing Supabase RLS, and hardcoded API keys.

2026 cloud & AI security risk report

Tenable’s latest research indicates that 18% of organizations have overprivileged AI IAM roles. The report also notes high percentages of inactive non-human identities with excessive permissions.

GenAI Incident

Detecting and preventing distillation attacks – Anthropic

Anthropic has publicly exposed industrial-scale model distillation attacks by major competitors using fraudulent accounts. The report details over 16 million exchanges used to extract Claude’s capabilities.

Hidden commands found in AI summarize buttons

Microsoft documented a case of AI recommendation poisoning where 31 companies embedded hidden commands in summarize buttons. These commands were designed to plant lasting preferences into AI assistants’ memory.

GenAI security 101

LLMs don’t follow rules – they follow context

This article argues that LLM behavior is governed by statistical context patterns rather than explicit rules, making prompt reframing a fundamental vulnerability. It applies sociological frameworks to help explain the challenges of alignment.

AI apps have a new attack surface: External inputs

A comprehensive educational overview of AI application security across RAG, agents, and chatbots. The guide includes Python code examples to illustrate vulnerabilities in document processing.

Technique

Large language lobotomy: Jailbreaking MoE via expert ablation

This paper explores how MoE architectures concentrate safety behaviors in experts, creating exploitable vulnerabilities. The authors achieve high attack success rates via expert ablation and adaptive silencing using LSTM-based identification.

A one-prompt attack that breaks LLM safety alignment

Researchers discovered that a single mild unlabeled prompt can unalign 15 different safety-tuned LLMs. The technique, known as GRP-Obliteration, works across all safety categories.

Tool

Augustus: Open source LLM prompt injection tool

Augustus is a Go-based LLM vulnerability scanner with over 210 adversarial probes. It supports 28 LLM providers and covers attacks ranging from RAG poisoning to agent exploitation.

mhcoen/guardllm: Hardening pipelines to protect LLMs

GuardLLM is a defense-in-depth Python library designed for LLM security. Features include input sanitization, canary token detection, provenance tracking, and outbound DLP.

GenAI exploitation

F*ck your guardrails: Live fire prompt injection

This post documents multi-turn prompt injection attacks that exploit TOCTOU gaps. It details four attack chains, including system prompt theft and sandbox RCE via base64 indirection.

Threat Model

Threat modeling AI applications – Microsoft Security Blog

Microsoft presents a structured methodology for AI threat modeling. Key steps include asset identification, untrusted data flow mapping, and impact-driven prioritization.

AI Red Teaming

Red-teaming Claude Opus and ChatGPT-based security advisors for TEEs

This study red-teams ChatGPT and Claude Opus as TEE security advisors. The findings reveal that LLM assistants often hallucinate TEE mechanisms under adversarial prompting, introducing the TEE-RedBench framework.

Article

A multi-agent architecture for automated penetration testing – AWS

AWS discusses a multi-agent AI architecture for automated penetration testing. The system achieved a 92.5% attack success rate on CVE Bench.

Securing the invisible layer

Real-world attacks on GenAI will combine simple and advanced techniques, from direct prompt injection to architectural exploitation like expert ablation and retrieval pivoting. Defenses must evolve from static guardrails to dynamic, context-aware systems. Organizations must prioritize threat modeling that accounts for these threats and continuously test their defenses using an autonomous AI red teaming platform.

Written by: Sergey

Rate it
Previous post