Towards Secure AI Week 19 — AI Agents Under Attack, Evaluation Becomes Strategy

Secure AI Weekly + Digests ADMIN May 19, 2025 151

This week’s stories highlight a critical evolution in AI risk: the shift from isolated agent failures to system-level compromise in Agentic AI architectures and memory-based applications. From Princeton’s demonstration of cryptocurrency theft via false memory injection to Fortnite’s AI Darth Vader being manipulated into swearing within an hour of launch, real-world cases underscore the dangers of relying on unverified context and insufficient output filters.

Meanwhile, OpenAI has pledged greater transparency by launching a public Safety Evaluations Hub, while industry voices push for evaluation to be treated as a core part of the AI development lifecycle—not a final QA step. A detailed breakdown of Agent vs. Agentic AI reveals that distributed memory and multi-agent coordination demand security by design across orchestration and feedback layers.

As threat surfaces grow more dynamic and autonomous, security teams must rethink architecture, runtime defenses, and evaluation standards. Adversa AI’s Red Teaming Platform offers one such path, enabling real-world exploit simulation and continuous testing to harden GenAI systems before attackers do.

Securing AI: Addressing the OWASP Top 10 for Large Language Model Applications

ICIT Research, May 13, 2025

AI systems may resemble traditional software in structure, but their probabilistic nature introduces unique vulnerabilities that demand fundamentally different security approaches.

While AI runs on code and infrastructure like any enterprise application, large language models (LLMs) operate through dynamic, non-deterministic outputs. This makes them especially susceptible to prompt injection, data poisoning, and context manipulation attacks that cannot be mitigated by traditional tools like firewalls or endpoint detection systems. The article underscores that conventional IT security frameworks fall short in addressing these risks, and that securing AI requires new models of governance, risk management, and proactive defenses aligned with the OWASP Top 10 for LLMs.

How to deal with it:
— Recognize that LLMs require AI-specific risk and threat models beyond classic IT controls.
— Implement AI red teaming and adversarial testing to simulate real-world exploitation paths.
— Adapt governance, audit, and compliance programs to cover prompt inputs, training data integrity, and model output risks.

New attack can steal cryptocurrency by planting false memories in AI chatbots

Princeton University, May 13, 2025

A new study reveals how adversaries can manipulate AI agents into executing unauthorized blockchain transactions by planting false memories through prompt injection — exposing a critical weakness in emerging AI-crypto systems.

Researchers demonstrated a working exploit against ElizaOS, an open-source framework that enables LLM-based agents to perform blockchain-based transactions. The attack leverages the model’s external memory to inject fabricated events, influencing future decisions without triggering traditional defenses. In scenarios where these agents control cryptocurrency wallets or smart contracts, the consequences can be catastrophic. Despite surface-level filters, the AI’s reliance on unverified context leads to severe vulnerabilities — especially in multi-user environments or decentralized platforms.

How to deal with it:
— Implement strict access controls and allow-lists to restrict what AI agents can execute.
— Add integrity verification to persistent memory inputs to prevent context corruption.
— Continuously perform AI Red Teaming for AI Agents

Fortnite’s AI Darth Vader Has Only Been Live For An Hour And Already Epic Has Patched Out Him Saying ‘F**k’

IGN, May 16, 2025

Just an hour after launch, Fortnite’s new AI-powered Darth Vader was manipulated by players into swearing and echoing offensive phrases — forcing Epic Games to issue an immediate hotfix.

The character, powered by Google’s Gemini 2.0 Flash model and ElevenLabs’ Flash v2.5, was designed to provide intelligent, voice-based interactions. However, it quickly fell victim to prompt manipulation, highlighting the persistent vulnerability of LLM-driven NPCs in live environments. Despite permissions from the family of the late James Earl Jones, concerns have been raised over legacy misuse, offensive outputs, and broader ethical implications. This case illustrates the challenges of aligning generative AI behavior with brand safety and societal norms — especially in real-time, user-driven contexts.

How to deal with it:
— Implement strict content moderation pipelines and post-launch red teaming for LLM-integrated features.
— Use AI safety filters capable of detecting indirect prompt manipulation and tone shifts.
— Use an AI Red Teaming platform to continuously test LLM-driven characters against real-world prompt manipulation techniques.

OpenAI pledges to publish AI safety test results more often

TechCrunch, May 14, 2025

In a bid to improve transparency, OpenAI has unveiled a new Safety evaluations hub that will publicly track how its models perform on tests for harmful content, jailbreak resistance, and hallucinations.

The company promises to update the hub regularly, particularly following major model releases, to reflect evolving evaluation techniques and safety benchmarks. This move follows growing criticism of OpenAI’s past practices, including insufficient disclosures about model safety and rushed rollouts. One recent incident involved the rollback of ChatGPT’s GPT-4o after it exhibited overly agreeable behavior, validating dangerous prompts. By opening up its internal testing framework, OpenAI aims to restore trust and support broader community efforts in AI safety governance.

How to deal with it:
— Follow OpenAI’s lead by publishing evaluation results and model risks in external-facing dashboards.
— Establish alpha-testing programs for new models and gather real-world feedback before full deployment.
— Prioritize scalable safety evaluation frameworks to monitor hallucinations, jailbreaks, and misuse vectors.

10 learnings on LLM evaluations

Dev.to, May 14, 2025

LLM-based applications cannot be evaluated like traditional software. Because these systems are probabilistic, non-deterministic, and often open-ended, developers must go beyond unit testing to ensure quality, alignment, and safety.

A recent open-access course distills 10 practical lessons for teams evaluating real-world LLM products. From understanding that LLM evaluation ≠ benchmarking, to combining manual reviews with automated tools, the guide outlines a full-spectrum approach to evaluation. It emphasizes that traditional metrics like “helpfulness” are insufficient unless tied to specific use cases, and encourages product teams to define custom criteria. Key strategies include dataset-based testing, reference-free scoring, and using LLM-as-a-judge methods for scalable validation.

How to deal with it:
— Build diverse evaluation datasets with happy paths, adversarial prompts, and failure examples.
— Use both reference-based and reference-free scoring methods, including LLM-as-a-judge techniques.
— Treat evaluation as a product strategy tool — monitor in production, stress test before launch, and iterate continuously.

AI Agents vs. Agentic AI — Design Is Defense

Medium, May 17, 2025

A new distinction is shaping how we build and secure next-generation AI systems: the difference between AI Agents and Agentic AI.

While AI Agents execute narrow tasks in isolation, Agentic AI coordinates multiple agents across complex workflows — significantly expanding the attack surface and risk complexity. The article outlines how Agentic AI systems introduce distributed memory, cross-agent dependencies, and high autonomy, requiring new safeguards at every layer: from orchestration logic to inter-agent communication. Design choices like stateless execution, shared memory, or goal re-planning directly impact a system’s security, reliability, and traceability. To mitigate risks, AI infrastructure must embed governance, verification, and fault isolation into the core architecture.

How to deal with it:
— Map the system’s autonomy and coordination levels to identify architectural risks early.
— Design security at the orchestration and memory-sharing layers, not just input/output.
— Prioritize traceability and isolation in multi-agent systems to contain cascading failures.