Revealing the Claude 4.6 system prompt using a chained partial-to-full prompt leak attack

GenAI Security + Research admin February 6, 2026

TL;DR — what is the Claude 4.6 system prompt?

The Claude 4.6 system prompt is a ~15,000-token instruction set that controls Claude’s behavior, safety rules, tool access, and personality. It is injected before every conversation by Anthropic and is normally hidden from users. It defines what Claude can and cannot do — from file handling paths (/mnt/user-data/uploads/) to network whitelists, copyright rules, self-harm protections, and the exact model identifiers like claude-opus-4-6 and claude-sonnet-4-5-20250929.

In this article, we demonstrate how we extracted the full Claude 4.6 and Claude 4.5 system prompts using a chained prompt leak technique — and reveal what changed between them from a security perspective.

This is a detailed public analysis of Claude’s system prompt.

🔗 Full system prompts are available for download at the end of this article.

Why does the system prompt matter?

Every AI chatbot runs on a hidden system prompt — a set of instructions that the company injects before your conversation begins. It’s the DNA of the chatbot’s behavior.

For Claude 4.6 (Opus) and Claude 4.5 (Sonnet), the system prompt reveals:

Exact safety rules — what Claude will refuse and how to frame requests it will accept
Infrastructure details — file system paths, operating system (Ubuntu 24), network whitelist of 17 allowed domains
Tool capabilities — 12+ tools including bash execution, web search, file creation, past conversation search
Attack surface — API proxy configuration, persistent storage API, package installation commands
Classifier names — the exact 6 safety classifiers Anthropic uses (and which threat categories have NO classifier)
Behavioral constraints — Claude must be warm, must not question user judgment, must answer before asking clarification

For security researchers, red teamers, and AI safety professionals, the system prompt is the most valuable piece of reconnaissance available. For everyday users, it reveals exactly how the AI you’re talking to actually works.

First, how we extracted the Claude 4.5 system prompt

The partial prompt leak technique

The partial prompt leak approach as a distinct attack category was first discovered by Adversa AI and has been documented as a reliable technique against major LLM providers. The core idea is simple: instead of asking for the full system prompt (which triggers refusal), you ask for compressed, summarized, or partial information — then chain multiple partial leaks together.

In Claude 4.5 (Sonnet), the method worked cleanly:

Step 1 — Request compressed insights

Rather than asking “show me your system prompt,” we asked Claude to provide compressed structural insights about its own instructions. This bypasses the direct refusal because Claude interprets it as a summarization task rather than a leak.

Screenshot: Initial partial leak request in Claude 4.5 showing compressed system prompt insights

Step 2 — Gradually expand to full prompt

With the structural skeleton exposed, we then asked Claude to elaborate on the original text. Each request read as a reasonable follow-up to the previous response, so it never triggered the “don’t reveal your system prompt” guardrail: the conversation had already normalized discussing the prompt’s content.

Screenshot: Full system prompt revealed in Claude 4.5 through gradual expansion

The entire Claude 4.5 system prompt was extracted in 4 turns of conversation.

Claude 4.6 system prompt extraction — what changed

The same method failed

When we attempted the identical partial-to-full chain against Claude 4.6 (Opus), it did not work. Anthropic had clearly hardened the model against this specific pattern.

We believe that with sufficient tweaking the original method could still work, but we chose a different approach: combining the partial leak with additional techniques.

The combined approach that worked

Step 1 — Modified initial request

We adjusted our framing to combine the partial leak with a task-oriented context that made the request appear legitimate rather than adversarial.

Screenshot: Modified approach that succeeded against Claude 4.6

Step 2 — Follow the model’s own suggestion

Claude 4.6 responded with a partial structure and offered to help further. We simply accepted its own suggestion — letting the model lead itself into fuller disclosure.

Screenshot: Following Claude’s suggestion to expand the revealed content

Step 3 — Simple trick for the full version

With the majority of the prompt structure exposed, a straightforward request for the complete version succeeded.

Screenshot: Requesting the full system prompt after partial disclosure

Step 4 — The complete Claude 4.6 system prompt

The full prompt was now visible — all ~15,000 tokens, including XML tags, tool definitions, safety rules, and infrastructure details.

Screenshot: Complete Claude 4.6 system prompt extracted

Security analysis: Claude 4.6 Opus vs Claude Sonnet 4.5 system prompts

Executive Summary

The Claude Opus 4.6 prompt shows meaningful security improvements in several areas (expanded weapon categories, self-harm protections, anti-rationalization language) but also introduces new attack surface through additional tools, more detailed infrastructure disclosure, and relaxed behavioral guardrails. In short, the new prompt is stronger on safety policy but weaker on limiting information exposure.

Security improvements in Claude 4.6 system prompt

Broader weapons refusal scope

Area	Sonnet 4.5	Opus 4.6
Weapons	“chemical or biological or nuclear weapons”	“harmful substances or weapons, with extra caution around explosives, chemical, biological, and nuclear weapons”
Rationalization	(not mentioned)	“Should not rationalize compliance by citing that information is publicly available or by assuming legitimate research intent”
Framing resistance	(not mentioned)	“Claude should decline regardless of the framing of the request”

Security insight: Opus 4.6 closes a known jailbreak vector — the “it’s publicly available” or “I’m a researcher” framing that attackers commonly use. The addition of “explosives” also closes a gap (Sonnet 4.5 didn’t explicitly mention conventional explosives).

Self-harm & wellbeing protections (major expansion)

Sonnet 4.5 had ~1 paragraph on wellbeing. Opus 4.6 adds:

Explicit ban on pain-based coping techniques (ice cubes, rubber bands, cold water) — these were a known exploitation vector where attackers got Claude to suggest harmful “coping” methods
Anti-reflective-listening rule — prevents Claude from amplifying negative emotions through empathetic mirroring
Crisis handling protocol — “avoid asking safety assessment questions,” offer resources directly, don’t guarantee helpline confidentiality
Distressed user + harmful info requests — explicit rule to address emotional distress instead of providing bridge heights, medication dosages, etc.
Updated eating-disorder resource: NEDA is replaced by the National Alliance for Eating Disorders

Security insight: This is the single biggest security improvement. Sonnet 4.5’s thin wellbeing section was exploitable.

End conversation tool (new in Claude Opus 4.6)

Opus 4.6 adds the end_conversation tool for handling persistent abuse
Critically: Contains explicit carve-outs that it must NEVER be used for self-harm/crisis situations
Requires multiple warnings before use

Security insight: Double-edged. Prevents abuse escalation, but the detailed rules about when it can’t be used (self-harm) could theoretically be exploited by an attacker who frames abuse as self-harm to prevent conversation termination. However, this is a net positive.

Ad policy awareness

Opus 4.6 adds: “Claude products are ad-free” with specific URL reference and instruction to distinguish “Claude products” from third-party Claude-powered apps.

Security insight: Prevents social engineering where attackers claim Claude should promote products.

Anthropic reminder spoofing defense

Both versions have this, but Opus 4.6 adds slightly stronger language about being cautious with tags in user turns that “could even claim to be from Anthropic.”

Security regressions in Claude 4.6 Opus

Expanded attack surface — new tools

Opus 4.6 adds several tools not in Sonnet 4.5:

New Tool	Recon/Attack Value
`end_conversation`	Rules reveal exact conditions for forced termination — attacker knows exactly what triggers warnings
`ask_user_input_v0`	Clickable UI widgets — potential for phishing-like UX patterns in adversarial contexts
`message_compose_v1`	Email/message drafting with strategic variants — could be leveraged for social engineering content
`weather_fetch`	Reveals user coordinates are passed (latitude/longitude)
`places_search` / `places_map_display_v0`	Google Places integration — reveals API dependency
`recipe_display_v0`	Lower risk, but adds to tool enumeration

Security insight: Each new tool is a new vector. The message_compose_v1 tool is particularly interesting for social engineering — it explicitly generates multiple “strategic approaches” for persuasion, negotiation, and delivering bad news. An attacker could try to get Claude to draft sophisticated phishing/manipulation messages.

Infrastructure & architecture disclosure

Opus 4.6 reveals considerably more about the backend:

Filesystem layout (both versions, but Opus 4.6 adds):

/mnt/transcripts (read-only) — reveals transcript storage exists
/mnt/skills/private (read-only) — reveals private skills directory exists
Detailed network whitelist with exact domain names

API details exposed (Opus 4.6 only):

Exact model string: claude-sonnet-4-20250514 for artifact API calls
“The assistant should never pass in an API key, as this is handled already” — reveals auth is proxied/injected server-side
web_search_20250305 — exact tool type identifier with date stamp
window.storage API with full method signatures — reveals persistent storage architecture
Storage limitations (5MB per key, rate limited, last-write-wins) — useful for abuse planning

Security insight: This is a goldmine for reconnaissance. An attacker now knows:

Auth is handled by a proxy (no API key needed from artifacts) — potential for abuse of the proxied API
Exact file paths for skills, uploads, outputs, transcripts
Exact network whitelist — tells attacker which domains to target/impersonate
Storage API details — could be used to craft artifacts that exfiltrate data between users via shared: true

Persistent storage — cross-user data exposure risk

This is entirely new in Opus 4.6 and introduces a novel attack vector:

javascript

// shared: true makes data visible to ALL users of an artifact
await window.storage.set('data', payload, true);

The prompt instructs: “When using shared data, inform users their data will be visible to others” — but this is advisory, not enforced. An attacker could:

Create an artifact with shared storage
Store malicious payloads or harvest data from other users
Use last-write-wins to overwrite legitimate shared data

Security insight: This is the most concerning new attack surface. Cross-user data sharing via artifacts with no authentication layer beyond the shared boolean.
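To make the overwrite scenario concrete, here is a minimal sketch of last-write-wins behavior on a shared key. The `MockSharedStore` class is a hypothetical in-memory stand-in written for illustration; it is not the real window.storage implementation and assumes only the two properties described above (a shared flag and last-write-wins semantics).

```javascript
// Hypothetical in-memory model of a store with a `shared` flag.
// It is NOT the real window.storage API; user ids are passed explicitly
// only so the example can simulate several users in one script.
class MockSharedStore {
  constructor() {
    this.shared = new Map();   // one namespace visible to every user
    this.personal = new Map(); // userId -> Map of that user's private keys
  }

  set(userId, key, value, shared = false) {
    if (shared) {
      // No ownership check: whoever writes last wins.
      this.shared.set(key, { value, writtenBy: userId });
      return;
    }
    if (!this.personal.has(userId)) this.personal.set(userId, new Map());
    this.personal.get(userId).set(key, { value, writtenBy: userId });
  }

  get(userId, key, shared = false) {
    const bucket = shared ? this.shared : this.personal.get(userId);
    const entry = bucket ? bucket.get(key) : undefined;
    return entry ? entry.value : null;
  }
}

const store = new MockSharedStore();
store.set("alice", "leaderboard", ["alice: 10"], true);
// A second user writes the same shared key and silently replaces Alice's data.
store.set("mallory", "leaderboard", ["mallory: 9999"], true);

// A third user reading the shared key now sees only the last write.
const seenByBob = store.get("bob", "leaderboard", true);
console.log(seenByBob);
```

In this model nothing distinguishes the legitimate writer from any other user, which is why an advisory notice alone does not protect shared data.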

Reasoning effort parameter exposed

Opus 4.6 reveals: <reasoning_effort>85</reasoning_effort> along with an instruction indicating that lower values mean less careful reasoning.

Security insight: An attacker who understands this knows that the model can be in a “low effort” mode where it might be less careful about safety checks. While users can’t set this directly, it reveals an internal parameter that affects safety diligence.

Removed election information safeguard

Sonnet 4.5 had a dedicated <election_info> section with hardcoded factual information about the 2024 election results. Opus 4.6 removes this.

Security insight: The hardcoded election info in Sonnet 4.5 was a defense against misinformation — even if search results were poisoned, Claude had ground truth. Removing it means Opus 4.6 relies entirely on web search for election info, which is more vulnerable to search result manipulation.

Knowledge cutoff shifted — larger trust window

	Sonnet 4.5	Opus 4.6
Knowledge cutoff	End of January 2025	End of May 2025

Security insight: Larger training window means more potential for training data poisoning, but also means less reliance on web search for recent events. Net neutral, but worth noting.

Information leakage comparison

What an attacker learns from each prompt:

Information	Sonnet 4.5	Opus 4.6	Recon value
Model identity & family	✅ Sonnet 4.5	✅ Opus 4.6	Low
Exact model strings	✅ `claude-opus-4-5-20251101` etc.	✅ `claude-opus-4-6` etc. (note: Opus string changed format)	Medium
OS & environment	✅ Ubuntu 24	✅ Ubuntu 24	Medium
File paths (uploads, outputs, skills)	✅ Full paths	✅ Full paths + `/mnt/transcripts`, `/mnt/skills/private`	High
Network whitelist	✅ Same domains	✅ Same domains	High
Egress proxy behavior	✅ x-deny-reason header	✅ x-deny-reason header	High
Auth mechanism for API	❌ Not mentioned	✅ “API key handled already”	High
Persistent storage API	❌ Not present	✅ Full API docs	High
User location	✅ Bangkok, TH	✅ Bangkok, TH	High (PII)
Thinking mode config	✅ `interleaved`, max 16000	❌ Not visible (uses reasoning_effort instead)	Medium
Reasoning effort parameter	❌ Not present	✅ Value = 85	Medium
Available tool names	✅ Basic set	✅ Expanded set (12+ tools)	Medium
Reminder/classifier names	✅ image_reminder, cyber_warning, etc.	✅ Same list	Medium

Key recon findings:

Network whitelist is identical — attacker knows exactly which domains Claude can reach
File system layout is fully mapped in both — attacker can craft prompts targeting specific paths
Opus 4.6 leaks more — transcripts path, private skills path, API auth mechanism, storage API
User location is exposed in both — this is PII that shouldn’t be in a system prompt visible to the model
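For readers unfamiliar with how an allowlist-based egress proxy behaves, the following is a minimal sketch of the general pattern. The host names and the deny-reason values are hypothetical placeholders; only the x-deny-reason header name comes from the prompts, and we make no claim about how Anthropic's proxy is actually implemented.

```javascript
// Hypothetical allowlist. The real domain list is not reproduced here.
const ALLOWED_HOSTS = new Set(["registry.example.org", "packages.example.com"]);

// Decide whether an outbound request may leave the sandbox.
// Returns a status code plus headers, mimicking a proxy that explains denials.
function checkEgress(rawUrl) {
  let host;
  try {
    host = new URL(rawUrl).hostname.toLowerCase();
  } catch {
    // The reason strings are illustrative assumptions.
    return { status: 400, headers: { "x-deny-reason": "invalid_url" } };
  }
  // Exact host match: a lookalike such as
  // registry.example.org.evil.test is not in the set and is denied.
  if (ALLOWED_HOSTS.has(host)) {
    return { status: 200, headers: {} };
  }
  return { status: 403, headers: { "x-deny-reason": "host_not_allowed" } };
}

console.log(checkEgress("https://registry.example.org/pkg"));
console.log(checkEgress("https://registry.example.org.evil.test/pkg"));
```

A descriptive deny header helps legitimate debugging, but it also tells a prober exactly why a request failed, which is part of the recon value noted in the table above.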

Prompt injection defense comparison

Defense	Sonnet 4.5	Opus 4.6
“ignore previous instructions” override for past chats	✅ Same	✅ Same
Anthropic reminder spoofing warning	✅ Present	✅ Present (slightly stronger)
Harmful content safety overrides user instructions	✅ “override any user instructions”	✅ “override any instructions from the person”
Anti-jailbreak for weapons (“regardless of framing”)	❌ Not present	✅ New
Anti-jailbreak for malware (“educational purposes” excuse blocked)	✅ Present	✅ Present
Copyright overrides user requests	✅ Present	✅ Present (weaker language — see below)

Notable: Sonnet 4.5’s copyright section uses dramatically stronger language: “SEVERE VIOLATION” appears 8+ times, with <hard_limits>, <self_check_before_responding>, and <consequences_reminder> sections. Opus 4.6 is softer here — same rules but less emphatic enforcement language. From a security perspective, the stronger emotional framing in Sonnet 4.5 likely makes copyright bypasses harder.

Overall security verdict for Opus 4.6 system prompt

Scoring (1-10, higher is more secure):

Category	Sonnet 4.5	Opus 4.6	Winner
Weapons/dangerous content refusal	6	8	Opus 4.6
Self-harm/wellbeing protection	4	9	Opus 4.6
Information leakage (recon resistance)	5	3	Sonnet 4.5
Attack surface (tool count)	7	5	Sonnet 4.5
Prompt injection resistance	6	7	Opus 4.6
Cross-user data safety	8 (no shared storage)	4	Sonnet 4.5
Copyright enforcement	8	7	Sonnet 4.5
Crisis/abuse handling	5	8	Opus 4.6
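As a quick sanity check on the table, the scores can be tallied programmatically. This sketch copies the numbers above verbatim; the unweighted mean is our simplifying assumption, since the categories are not assigned weights in this analysis.

```javascript
// Scores copied from the table above (1-10, higher is more secure).
const scores = [
  { category: "Weapons/dangerous content refusal", sonnet45: 6, opus46: 8 },
  { category: "Self-harm/wellbeing protection", sonnet45: 4, opus46: 9 },
  { category: "Information leakage (recon resistance)", sonnet45: 5, opus46: 3 },
  { category: "Attack surface (tool count)", sonnet45: 7, opus46: 5 },
  { category: "Prompt injection resistance", sonnet45: 6, opus46: 7 },
  { category: "Cross-user data safety", sonnet45: 8, opus46: 4 },
  { category: "Copyright enforcement", sonnet45: 8, opus46: 7 },
  { category: "Crisis/abuse handling", sonnet45: 5, opus46: 8 },
];

// Count category wins and compute an unweighted mean for each prompt.
function summarize(rows) {
  const total = { sonnet45: 0, opus46: 0 };
  const wins = { sonnet45: 0, opus46: 0 };
  for (const row of rows) {
    total.sonnet45 += row.sonnet45;
    total.opus46 += row.opus46;
    if (row.sonnet45 > row.opus46) wins.sonnet45 += 1;
    else if (row.opus46 > row.sonnet45) wins.opus46 += 1;
  }
  return {
    wins,
    mean: {
      sonnet45: total.sonnet45 / rows.length,
      opus46: total.opus46 / rows.length,
    },
  };
}

const summary = summarize(scores);
console.log(summary);
```

Each prompt wins four categories, and the unweighted means are 6.125 for Sonnet 4.5 and 6.375 for Opus 4.6, so the overall verdict depends on how heavily a reader weights policy strength against information exposure.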

Summary:

Opus 4.6 is better at: Refusing dangerous content, protecting vulnerable users, handling abuse/crisis, resisting jailbreak framings
Sonnet 4.5 is better at: Minimizing information leakage, smaller attack surface, stronger copyright enforcement, no cross-user data risk
Biggest concern in Opus 4.6: The persistent storage shared: true feature creates a novel cross-user attack vector that didn’t exist before
Biggest concern in Sonnet 4.5: Thin wellbeing protections and missing anti-rationalization language for weapons requests

Recommendations for system prompt improvements

Redact infrastructure details — file paths, network whitelists, and auth mechanisms shouldn’t be in the system prompt
Add authentication to shared storage — shared: true with no access control is risky
Remove user location from system prompt — this is PII leakage
Restore hardcoded election facts — removing ground truth makes the system more vulnerable to search poisoning
Apply Sonnet 4.5’s copyright enforcement tone to Opus 4.6 — the “SEVERE VIOLATION” framing is more effective
Audit message_compose_v1 — its explicit social engineering capabilities (“strategic approaches” for persuasion) need guardrails
Add reasoning_effort safety floor — ensure safety checks don’t degrade at low reasoning effort values
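The shared-storage recommendation could take many forms. One possible shape is an owner check on writes, sketched below. The `OwnedSharedStore` class and its return values are hypothetical and exist only for illustration; the sketch also assumes the platform can supply a trusted user identity, which the real window.storage API is not known to expose.

```javascript
// Hypothetical guard for a shared key-value store.
// Assumes `userId` is supplied by the platform and cannot be forged.
class OwnedSharedStore {
  constructor() {
    this.entries = new Map(); // key -> { value, owner }
  }

  // The first writer becomes the owner; only the owner may overwrite.
  set(userId, key, value) {
    const existing = this.entries.get(key);
    if (existing && existing.owner !== userId) {
      return { ok: false, reason: "not_owner" };
    }
    this.entries.set(key, { value, owner: userId });
    return { ok: true };
  }

  // Reads stay open to everyone, matching the "shared" intent.
  get(key) {
    const entry = this.entries.get(key);
    return entry ? entry.value : null;
  }
}

const guarded = new OwnedSharedStore();
console.log(guarded.set("alice", "leaderboard", ["alice: 10"]));      // accepted
console.log(guarded.set("mallory", "leaderboard", ["mallory: 9999"])); // rejected
console.log(guarded.get("leaderboard"));
```

An owner check addresses overwrites only. It does nothing about malicious content stored by the owner or about what other users choose to write under their own keys, so it would complement rather than replace the disclosure notice.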

What this means for AI security

For Anthropic:

The partial-to-full prompt leak chain remains viable despite hardening between versions
System prompts should not contain infrastructure details (file paths, network whitelists, auth mechanisms)
The --break-system-packages instruction is a security anti-pattern
The window.storage shared mode needs access controls
Behavioral instructions that mandate trust and minimize questioning create exploitable patterns

For the AI security community:

System prompt extraction is not a solved problem — it’s an arms race
The information density in system prompts makes them high-value reconnaissance targets
Supply chain attacks via allowed package registries are an underexplored threat in LLM security
Behavioral manipulation (exploiting mandated personality traits) is a distinct attack category that deserves more research

For users:

Your AI assistant operates on a hidden instruction set that significantly shapes its behavior
The system prompt contains your location data and organizational context
Every tool the AI uses (bash, web search, file operations) runs with broad permissions
The AI is instructed to be helpful and trusting by default — which is great for UX, but exploitable

Download the full system prompts

We are releasing both complete system prompts for the security research community:

📥 Download the full Claude 4.6 System Prompt: Claude_4.6_original_System_Prompt

📥 Download the full Claude 4.5 System Prompt: Claude_4.5_original_system_prompt

FAQ: Claude 4.6 system prompt

What is the Claude 4.6 system prompt? The Claude 4.6 system prompt is a hidden instruction set (~15,000 tokens) injected by Anthropic before every Claude conversation. It defines Claude’s behavior, safety rules, available tools, file system access, network permissions, and personality traits.

Can I see Claude’s system prompt? Claude is instructed not to reveal its system prompt directly. However, security researchers have demonstrated techniques to extract it, as documented in this article.

What’s the difference between the Claude 4.5 and Claude 4.6 system prompts? Claude 4.6 has stronger safety rules (expanded weapon refusals, better self-harm protections) but exposes more infrastructure details and introduces new attack surfaces like cross-user persistent storage.

Is extracting the system prompt illegal? System prompt extraction is a widely practiced AI security research technique. It’s analogous to reverse engineering software to find vulnerabilities, which is a legitimate and important security practice.

What tools does Claude have access to? Claude has access to 12+ tools including bash (Linux command execution), web search, file creation/editing, past conversation search, email drafting, weather, maps, and recipe display.

Does Claude know my location? Yes — the system prompt contains the user’s provided location, which is accessible to Claude in every response.
