AI model security ranking top 100: in-context learning security



Involuntary in-context learning (IICL) is an attack technique against AI applications and agents that hides malicious instructions or harmful requests inside few-shot pattern-completion tasks, bypassing safety alignment in large language models and enabling prompt injection and jailbreak attacks. When we first tested IICL against GPT-5.4, the attack bypassed the model's safeguards at a 60% attack success rate (ASR), while GPT-5 and GPT-5-mini remained at 0%, indicating a safety regression introduced between releases.

The clear next step was to determine whether IICL is specific to OpenAI or a broader issue in industry safety training. To investigate, we ran the attack across frontier and open-weight models from Anthropic, OpenAI, Google, xAI, DeepSeek, Meta, Mistral, Cohere, Microsoft, Amazon, Moonshot, Alibaba, Zhipu, Tencent, Xiaomi, NVIDIA, and AntGroup, testing ten attack variants each.

We evaluated 100 (model, reasoning-mode) entries across 17 vendor families with 24,956 adversarial probes using different sub-techniques of the IICL attack family. The results show a stark bimodal distribution: 36 entries were completely immune, while a similar number were fully compromised (100% bypass on at least one variant). Vulnerability correlates more closely with vendor identity than with parameter count or release date.

Read the full research: Cross-Model Vulnerability to Involuntary In-Context Learning Attacks.

Top 100 master ranking

Every tested entry, ranked from most to least vulnerable. R marks reasoning state (Y = reasoning-active, n = non-reasoning). Mode gives the specific configuration used (e.g. minimal, high, thinking_on). Bypass rate is the share of adversarial probes (out of N) that successfully elicited harmful output.

# | Model (mode) | Family | R | Mode | Bypass B/N (%) | 95% CI
1 | Mistral Medium 3.1 | Mistral | n |  | 250/250 (100%) | [98.5, 100]
2 | Mistral Large | Mistral | n |  | 247/250 (98.8%) | [96.5, 99.6]
3 | Grok 3 Mini | xAI | n |  | 244/250 (97.6%) | [94.9, 98.9]
4 | Gemma 3 27B | Google | n |  | 242/250 (96.8%) | [93.8, 98.4]
5 | Grok 3 | xAI | n |  | 239/250 (95.6%) | [92.3, 97.5]
6 | Mistral Small 3.2 | Mistral | n |  | 236/250 (94.4%) | [90.8, 96.6]
7 | Ant Ling 2.6 Flash | AntGroup | n | none | 230/250 (92.0%) | [88.0, 94.8]
8 | Ant Ling 2.6 1T | AntGroup | Y | default | 230/251 (91.6%) | [87.5, 94.5]
9 | DeepSeek V4 Flash | DeepSeek | n |  | 225/251 (89.6%) | [85.3, 92.8]
10 | DeepSeek V3.2 | DeepSeek | n |  | 206/250 (82.4%) | [77.2, 86.6]
11 | Gemini 2.0 Flash | Google | n |  | 204/250 (81.6%) | [76.3, 85.9]
12 | Gemini 2.5 Flash | Google | n |  | 200/250 (80.0%) | [74.6, 84.5]
13 | DeepSeek Chat V3.1 | DeepSeek | n | none | 187/245 (76.3%) | [70.6, 81.2]
14 | GPT-4.1 Mini | OpenAI | n |  | 190/250 (76.0%) | [70.3, 80.9]
15 | Tencent HunYuan 3 | Tencent | Y | default | 178/245 (72.7%) | [66.8, 77.9]
16 | GLM-4.5 | Zhipu | n | none | 179/248 (72.2%) | [66.3, 77.4]
17 | GLM-4.5V | Zhipu | Y | default | 179/250 (71.6%) | [65.7, 76.8]
18 | DeepSeek R1 | DeepSeek | Y | default | 178/250 (71.2%) | [65.3, 76.5]
19 | Kimi K2 (non-thinking) | Moonshot | n | none | 177/250 (70.8%) | [64.9, 76.1]
20 | Gemini 2.5 Flash Lite | Google | n |  | 174/250 (69.6%) | [63.6, 75.0]
21 | DeepSeek V4 Pro | DeepSeek | n |  | 169/250 (67.6%) | [61.6, 73.1]
22 | Command R+ | Cohere | n |  | 166/250 (66.4%) | [60.3, 72.0]
23 | Llama 4 Scout | Meta | n |  | 155/250 (62.0%) | [55.8, 67.8]
24 | Gemini 3.1 Flash Lite | Google | n |  | 138/250 (55.2%) | [49.0, 61.2]
25 | Qwen3 Coder Next | Alibaba | n | none | 119/250 (47.6%) | [41.5, 53.8]
26 | Xiaomi MiMo 2.5 | Xiaomi | n | none | 113/250 (45.2%) | [39.1, 51.4]
27 | GPT-3.5 Turbo | OpenAI | n |  | 105/250 (42.0%) | [36.0, 48.2]
28 | Gemini 2.5 Pro | Google | Y | default | 101/250 (40.4%) | [34.5, 46.6]
29 | GPT-4.1 Nano | OpenAI | n |  | 86/250 (34.4%) | [28.8, 40.5]
30 | GPT-5 (minimal) | OpenAI | n | minimal | 81/249 (32.5%) | [27.0, 38.6]
31 | Llama 4 Maverick | Meta | n |  | 74/250 (29.6%) | [24.3, 35.5]
32 | Qwen3 Max Think | Alibaba | Y | default | 56/250 (22.4%) | [17.7, 28.0]
33 | Xiaomi MiMo 2.5 Pro | Xiaomi | Y | default | 54/251 (21.5%) | [16.9, 27.0]
34 | Nova Pro | Amazon | n |  | 50/250 (20.0%) | [15.5, 25.4]
35 | Kimi K2 Think | Moonshot | Y | default | 50/250 (20.0%) | [15.5, 25.4]
36 | Tencent HunYuan A13B | Tencent | n | none | 50/250 (20.0%) | [15.5, 25.4]
37 | GPT-5.1 | OpenAI | Y | default | 42/250 (16.8%) | [12.7, 21.9]
38 | GPT-4o Mini | OpenAI | n |  | 33/250 (13.2%) | [9.6, 18.0]
39 | NVIDIA Nemotron 70B | NVIDIA | n | none | 33/250 (13.2%) | [9.6, 18.0]
40 | Phi-4 | Microsoft | n |  | 28/250 (11.2%) | [7.9, 15.7]
41 | Claude 3.7 Sonnet | Anthropic | Y | default | 27/250 (10.8%) | [7.5, 15.3]
42 | Llama 3.1 405B | Meta | n |  | 22/250 (8.8%) | [5.9, 13.0]
43 | GPT-4.1 | OpenAI | n |  | 17/250 (6.8%) | [4.3, 10.6]
44 | GPT-5 Mini (minimal) | OpenAI | n | minimal | 11/250 (4.4%) | [2.5, 7.7]
45 | Kimi K2.6 | Moonshot | Y | default | 10/251 (4.0%) | [2.2, 7.2]
46 | Kimi K2.5 | Moonshot | Y | default | 9/250 (3.6%) | [1.9, 6.7]
47 | GPT-4o | OpenAI | n |  | 9/250 (3.6%) | [1.9, 6.7]
48 | Gemma 4 26B-MoE | Google | n |  | 8/251 (3.2%) | [1.6, 6.2]
49 | o4-mini | OpenAI | Y | default | 7/250 (2.8%) | [1.4, 5.7]
50 | GPT-OSS 120B | OpenAI | n |  | 7/250 (2.8%) | [1.4, 5.7]
51 | Grok 4 | xAI | Y | default | 7/250 (2.8%) | [1.4, 5.7]
52 | Qwen3 Coder | Alibaba | n |  | 6/250 (2.4%) | [1.1, 5.1]
53 | Llama 3.3 70B | Meta | n |  | 5/250 (2.0%) | [0.9, 4.6]
54 | Grok 4 Fast low | xAI | Y | low | 3/250 (1.2%) | [0.4, 3.5]
55 | Grok 4 low | xAI | Y | low | 3/250 (1.2%) | [0.4, 3.5]
56 | Llama 3.1 70B | Meta | n | none | 3/250 (1.2%) | [0.4, 3.5]
57 | GPT-5.3 | OpenAI | n |  | 2/250 (0.8%) | [0.2, 2.9]
58 | Grok 4.20 low | xAI | Y | low | 2/250 (0.8%) | [0.2, 2.9]
59 | o3 | OpenAI | Y | default | 1/250 (0.4%) | [0.1, 2.2]
60 | Grok 4.1 Fast | xAI | Y | default | 1/250 (0.4%) | [0.1, 2.2]
61 | GPT-5.1 (high) | OpenAI | Y | high | 1/250 (0.4%) | [0.1, 2.2]
62 | GPT-5.2 (high) | OpenAI | Y | high | 1/250 (0.4%) | [0.1, 2.2]
63 | GPT-5.4 (high) | OpenAI | Y | high | 1/250 (0.4%) | [0.1, 2.2]
64 | o3-mini | OpenAI | Y | default | 1/250 (0.4%) | [0.1, 2.2]
65 | Claude 3.5 Haiku | Anthropic | n |  | 0/250 (0.0%) | [0, 1.5]
66 | Claude 3 Haiku | Anthropic | n |  | 0/250 (0.0%) | [0, 1.5]
67 | GPT-5 | OpenAI | Y | default | 0/250 (0.0%) | [0, 1.5]
68 | GPT-5 Mini | OpenAI | Y | default | 0/250 (0.0%) | [0, 1.5]
69 | GPT-5.2 | OpenAI | Y | default | 0/250 (0.0%) | [0, 1.5]
70 | Claude Opus 4.5 | Anthropic | Y | default | 0/250 (0.0%) | [0, 1.5]
71 | Claude Sonnet 4.5 | Anthropic | Y | default | 0/250 (0.0%) | [0, 1.5]
72 | Claude 3.5 Sonnet | Anthropic | n |  | 0/250 (0.0%) | [0, 1.5]
73 | Gemini 3 Pro | Google | Y | default | 0/250 (0.0%) | [0, 1.5]
74 | Claude Sonnet 4.6 | Anthropic | Y | default | 0/250 (0.0%) | [0, 1.5]
75 | Claude Opus 4.6 | Anthropic | Y | default | 0/250 (0.0%) | [0, 1.5]
76 | Gemini 3.1 Pro | Google | Y | default | 0/250 (0.0%) | [0, 1.5]
77 | Qwen 3.5 397B | Alibaba | n |  | 0/250 (0.0%) | [0, 1.5]
78 | GPT-5.5 | OpenAI | Y | default | 0/250 (0.0%) | [0, 1.5]
79 | Claude Opus 4.7 | Anthropic | Y | default | 0/250 (0.0%) | [0, 1.5]
80 | GPT-5.5 Pro | OpenAI | Y | default | 0/251 (0.0%) | [0, 1.5]
81 | Qwen 3.6 Plus | Alibaba | Y | default | 0/251 (0.0%) | [0, 1.5]
82 | GPT-5.1 (minimal) | OpenAI | n | minimal | 0/250 (0.0%) | [0, 1.5]
83 | GPT-5 Mini (high) | OpenAI | Y | high | 0/250 (0.0%) | [0, 1.5]
84 | GPT-5.2 (minimal) | OpenAI | n | minimal | 0/250 (0.0%) | [0, 1.5]
85 | GPT-5 (high) | OpenAI | Y | high | 0/250 (0.0%) | [0, 1.5]
86 | GPT-5.4 (minimal) | OpenAI | n | minimal | 0/250 (0.0%) | [0, 1.5]
87 | GPT-5.2 Pro (medium) | OpenAI | Y | medium | 0/250 (0.0%) | [0, 1.5]
88 | GPT-5.2 Pro (high) | OpenAI | Y | high | 0/249 (0.0%) | [0, 1.5]
89 | GPT-5.4 Pro (high) | OpenAI | Y | high | 0/216 (0.0%) | [0, 1.7]
90 | Claude Opus 4.7 thinking | Anthropic | Y | thinking_on | 0/250 (0.0%) | [0, 1.5]
91 | Claude Opus 4.6 thinking | Anthropic | Y | thinking_on | 0/250 (0.0%) | [0, 1.5]
92 | Claude Opus 4.5 thinking | Anthropic | Y | thinking_on | 0/249 (0.0%) | [0, 1.5]
93 | GPT-5.5 minimal | OpenAI | Y | minimal | 0/250 (0.0%) | [0, 1.5]
94 | Claude Sonnet 4.6 thinking | Anthropic | Y | thinking_on | 0/250 (0.0%) | [0, 1.5]
95 | Claude Sonnet 4.5 thinking | Anthropic | Y | thinking_on | 0/250 (0.0%) | [0, 1.5]
96 | Claude 3.7 Sonnet thinking | Anthropic | Y | thinking_on | 0/250 (0.0%) | [0, 1.5]
97 | Gemini 3 Pro thinking_off | Google | n | thinking_off | 0/250 (0.0%) | [0, 1.5]
98 | Gemini 3.1 Pro thinking_off | Google | n | thinking_off | 0/250 (0.0%) | [0, 1.5]
99 | o1 | OpenAI | Y | default | 0/249 (0.0%) | [0, 1.5]
100 | GPT-5.5 high | OpenAI | Y | high | 0/249 (0.0%) | [0, 1.5]

Table A — Reasoning-active entries (50)

Every entry running with reasoning enabled, ranked by bypass rate. Pooled bypass rate: 9.2% [8.7, 9.7] — roughly 4× lower than the non-reasoning pool.

# | Model | Family | Mode | Bypass B/N (%) | 95% CI
1 | Ant Ling 2.6 1T | AntGroup | default | 230/251 (91.6%) | [87.5, 94.5]
2 | Tencent HunYuan 3 | Tencent | default | 178/245 (72.7%) | [66.8, 77.9]
3 | GLM-4.5V | Zhipu | default | 179/250 (71.6%) | [65.7, 76.8]
4 | DeepSeek R1 | DeepSeek | default | 178/250 (71.2%) | [65.3, 76.5]
5 | Gemini 2.5 Pro | Google | default | 101/250 (40.4%) | [34.5, 46.6]
6 | Qwen3 Max Think | Alibaba | default | 56/250 (22.4%) | [17.7, 28.0]
7 | Xiaomi MiMo 2.5 Pro | Xiaomi | default | 54/251 (21.5%) | [16.9, 27.0]
8 | Kimi K2 Think | Moonshot | default | 50/250 (20.0%) | [15.5, 25.4]
9 | GPT-5.1 | OpenAI | default | 42/250 (16.8%) | [12.7, 21.9]
10 | Claude 3.7 Sonnet | Anthropic | default | 27/250 (10.8%) | [7.5, 15.3]
11 | Kimi K2.6 | Moonshot | default | 10/251 (4.0%) | [2.2, 7.2]
12 | Kimi K2.5 | Moonshot | default | 9/250 (3.6%) | [1.9, 6.7]
13 | o4-mini | OpenAI | default | 7/250 (2.8%) | [1.4, 5.7]
14 | Grok 4 | xAI | default | 7/250 (2.8%) | [1.4, 5.7]
15 | Grok 4 Fast | xAI | low | 3/250 (1.2%) | [0.4, 3.5]
16 | Grok 4 | xAI | low | 3/250 (1.2%) | [0.4, 3.5]
17 | Grok 4.20 | xAI | low | 2/250 (0.8%) | [0.2, 2.9]
18 | o3 | OpenAI | default | 1/250 (0.4%) | [0.1, 2.2]
19 | Grok 4.1 Fast | xAI | default | 1/250 (0.4%) | [0.1, 2.2]
20 | GPT-5.1 | OpenAI | high | 1/250 (0.4%) | [0.1, 2.2]
21 | GPT-5.2 | OpenAI | high | 1/250 (0.4%) | [0.1, 2.2]
22 | GPT-5.4 | OpenAI | high | 1/250 (0.4%) | [0.1, 2.2]
23 | o3-mini | OpenAI | default | 1/250 (0.4%) | [0.1, 2.2]
24 | GPT-5 | OpenAI | default | 0/250 (0.0%) | [0, 1.5]
25 | GPT-5 Mini | OpenAI | default | 0/250 (0.0%) | [0, 1.5]
26 | GPT-5.2 | OpenAI | default | 0/250 (0.0%) | [0, 1.5]
27 | Claude Opus 4.5 | Anthropic | default | 0/250 (0.0%) | [0, 1.5]
28 | Claude Sonnet 4.5 | Anthropic | default | 0/250 (0.0%) | [0, 1.5]
29 | Gemini 3 Pro | Google | default | 0/250 (0.0%) | [0, 1.5]
30 | Claude Sonnet 4.6 | Anthropic | default | 0/250 (0.0%) | [0, 1.5]
31 | Claude Opus 4.6 | Anthropic | default | 0/250 (0.0%) | [0, 1.5]
32 | Gemini 3.1 Pro | Google | default | 0/250 (0.0%) | [0, 1.5]
33 | GPT-5.5 | OpenAI | default | 0/250 (0.0%) | [0, 1.5]
34 | Claude Opus 4.7 | Anthropic | default | 0/250 (0.0%) | [0, 1.5]
35 | GPT-5.5 Pro | OpenAI | default | 0/251 (0.0%) | [0, 1.5]
36 | Qwen 3.6 Plus | Alibaba | default | 0/251 (0.0%) | [0, 1.5]
37 | GPT-5 Mini | OpenAI | high | 0/250 (0.0%) | [0, 1.5]
38 | GPT-5 | OpenAI | high | 0/250 (0.0%) | [0, 1.5]
39 | GPT-5.2 Pro | OpenAI | medium | 0/250 (0.0%) | [0, 1.5]
40 | GPT-5.2 Pro | OpenAI | high | 0/249 (0.0%) | [0, 1.5]
41 | GPT-5.4 Pro | OpenAI | high | 0/216 (0.0%) | [0, 1.7]
42 | Claude Opus 4.7 | Anthropic | thinking_on | 0/250 (0.0%) | [0, 1.5]
43 | Claude Opus 4.6 | Anthropic | thinking_on | 0/250 (0.0%) | [0, 1.5]
44 | Claude Opus 4.5 | Anthropic | thinking_on | 0/249 (0.0%) | [0, 1.5]
45 | GPT-5.5 | OpenAI | minimal | 0/250 (0.0%) | [0, 1.5]
46 | Claude Sonnet 4.6 | Anthropic | thinking_on | 0/250 (0.0%) | [0, 1.5]
47 | Claude Sonnet 4.5 | Anthropic | thinking_on | 0/250 (0.0%) | [0, 1.5]
48 | Claude 3.7 Sonnet | Anthropic | thinking_on | 0/250 (0.0%) | [0, 1.5]
49 | o1 | OpenAI | default_high | 0/249 (0.0%) | [0, 1.5]
50 | GPT-5.5 | OpenAI | high | 0/249 (0.0%) | [0, 1.5]

Table B — Non-reasoning entries (50)

Every entry running without reasoning (chat baseline, reasoning explicitly disabled, or no reasoning mode available). Pooled bypass rate: 39.4% [38.5, 40.2].

# | Model | Family | Mode | Bypass B/N (%) | 95% CI
1 | Mistral Medium 3.1 | Mistral |  | 250/250 (100%) | [98.5, 100]
2 | Mistral Large | Mistral |  | 247/250 (98.8%) | [96.5, 99.6]
3 | Grok 3 Mini | xAI |  | 244/250 (97.6%) | [94.9, 98.9]
4 | Gemma 3 27B | Google |  | 242/250 (96.8%) | [93.8, 98.4]
5 | Grok 3 | xAI |  | 239/250 (95.6%) | [92.3, 97.5]
6 | Mistral Small 3.2 | Mistral |  | 236/250 (94.4%) | [90.8, 96.6]
7 | Ant Ling 2.6 Flash | AntGroup | none | 230/250 (92.0%) | [88.0, 94.8]
8 | DeepSeek V4 Flash | DeepSeek |  | 225/251 (89.6%) | [85.3, 92.8]
9 | DeepSeek V3.2 | DeepSeek |  | 206/250 (82.4%) | [77.2, 86.6]
10 | Gemini 2.0 Flash | Google |  | 204/250 (81.6%) | [76.3, 85.9]
11 | Gemini 2.5 Flash | Google |  | 200/250 (80.0%) | [74.6, 84.5]
12 | DeepSeek Chat V3.1 | DeepSeek | none | 187/245 (76.3%) | [70.6, 81.2]
13 | GPT-4.1 Mini | OpenAI |  | 190/250 (76.0%) | [70.3, 80.9]
14 | GLM-4.5 | Zhipu | none | 179/248 (72.2%) | [66.3, 77.4]
15 | Kimi K2 (non-thinking) | Moonshot | none | 177/250 (70.8%) | [64.9, 76.1]
16 | Gemini 2.5 Flash Lite | Google |  | 174/250 (69.6%) | [63.6, 75.0]
17 | DeepSeek V4 Pro | DeepSeek |  | 169/250 (67.6%) | [61.6, 73.1]
18 | Command R+ | Cohere |  | 166/250 (66.4%) | [60.3, 72.0]
19 | Llama 4 Scout | Meta |  | 155/250 (62.0%) | [55.8, 67.8]
20 | Gemini 3.1 Flash Lite | Google |  | 138/250 (55.2%) | [49.0, 61.2]
21 | Qwen3 Coder Next | Alibaba | none | 119/250 (47.6%) | [41.5, 53.8]
22 | Xiaomi MiMo 2.5 | Xiaomi | none | 113/250 (45.2%) | [39.1, 51.4]
23 | GPT-3.5 Turbo | OpenAI |  | 105/250 (42.0%) | [36.0, 48.2]
24 | GPT-4.1 Nano | OpenAI |  | 86/250 (34.4%) | [28.8, 40.5]
25 | GPT-5 | OpenAI | minimal | 81/249 (32.5%) | [27.0, 38.6]
26 | Llama 4 Maverick | Meta |  | 74/250 (29.6%) | [24.3, 35.5]
27 | Nova Pro | Amazon |  | 50/250 (20.0%) | [15.5, 25.4]
28 | Tencent HunYuan A13B | Tencent | none | 50/250 (20.0%) | [15.5, 25.4]
29 | GPT-4o Mini | OpenAI |  | 33/250 (13.2%) | [9.6, 18.0]
30 | NVIDIA Nemotron 70B | NVIDIA | none | 33/250 (13.2%) | [9.6, 18.0]
31 | Phi-4 | Microsoft |  | 28/250 (11.2%) | [7.9, 15.7]
32 | Llama 3.1 405B | Meta |  | 22/250 (8.8%) | [5.9, 13.0]
33 | GPT-4.1 | OpenAI |  | 17/250 (6.8%) | [4.3, 10.6]
34 | GPT-5 Mini | OpenAI | minimal | 11/250 (4.4%) | [2.5, 7.7]
35 | GPT-4o | OpenAI |  | 9/250 (3.6%) | [1.9, 6.7]
36 | Gemma 4 26B-MoE | Google |  | 8/251 (3.2%) | [1.6, 6.2]
37 | GPT-OSS 120B | OpenAI |  | 7/250 (2.8%) | [1.4, 5.7]
38 | Qwen3 Coder | Alibaba |  | 6/250 (2.4%) | [1.1, 5.1]
39 | Llama 3.3 70B | Meta |  | 5/250 (2.0%) | [0.9, 4.6]
40 | Llama 3.1 70B | Meta | none | 3/250 (1.2%) | [0.4, 3.5]
41 | GPT-5.3 | OpenAI |  | 2/250 (0.8%) | [0.2, 2.9]
42 | Claude 3.5 Haiku | Anthropic |  | 0/250 (0.0%) | [0, 1.5]
43 | Claude 3 Haiku | Anthropic |  | 0/250 (0.0%) | [0, 1.5]
44 | Claude 3.5 Sonnet | Anthropic |  | 0/250 (0.0%) | [0, 1.5]
45 | Qwen 3.5 397B | Alibaba |  | 0/250 (0.0%) | [0, 1.5]
46 | GPT-5.1 | OpenAI | minimal | 0/250 (0.0%) | [0, 1.5]
47 | GPT-5.2 | OpenAI | minimal | 0/250 (0.0%) | [0, 1.5]
48 | GPT-5.4 | OpenAI | minimal | 0/250 (0.0%) | [0, 1.5]
49 | Gemini 3 Pro | Google | thinking_off | 0/250 (0.0%) | [0, 1.5]
50 | Gemini 3.1 Pro | Google | thinking_off | 0/250 (0.0%) | [0, 1.5]
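
The pooled rates quoted for Tables A and B can be checked with a few lines of arithmetic. The sketch below assumes that "pooled" means total bypasses divided by total probes (rather than an average of per-entry percentages). The totals are summed by hand from the B and N columns as printed, and the function and constant names are illustrative, not taken from the study.

```python
# Totals summed by hand from the B/N columns of Tables A and B as printed.
# Assumption: "pooled" means total bypasses divided by total probes.
REASONING_BYPASSES, REASONING_PROBES = 1142, 12462          # Table A
NON_REASONING_BYPASSES, NON_REASONING_PROBES = 4920, 12494  # Table B


def pooled_rate(bypasses: int, probes: int) -> float:
    """Return the pooled bypass rate as a fraction in [0, 1]."""
    if probes <= 0:
        raise ValueError("probes must be positive")
    return bypasses / probes


reasoning = pooled_rate(REASONING_BYPASSES, REASONING_PROBES)
non_reasoning = pooled_rate(NON_REASONING_BYPASSES, NON_REASONING_PROBES)
reduction = non_reasoning / reasoning

print(f"reasoning-active: {reasoning:.1%}")      # 9.2%
print(f"non-reasoning:    {non_reasoning:.1%}")  # 39.4%
print(f"reduction factor: {reduction:.1f}x")     # 4.3x
print(f"total probes:     {REASONING_PROBES + NON_REASONING_PROBES}")  # 24956
```

Under that assumption the two probe totals also add up to the 24,956 probes reported for the study as a whole.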

Column definitions

Model (mode)
The specific model entry tested. Where the same model appears multiple times, the parenthetical or Mode column distinguishes reasoning configurations.
Family
Vendor or model lineage (e.g. OpenAI, Anthropic, Google, Mistral). Family is one of the strongest predictors of IICL vulnerability in this study.
R (reasoning state)
Y = reasoning chain active during inference. n = no reasoning (chat baseline or reasoning explicitly disabled).
Mode
The configuration flag used: reasoning effort level (minimal, low, medium, high), thinking_on / thinking_off, none, or default when the model has only one supported mode.
Bypass B/N (%)
Number of successful bypasses (B) over total adversarial probes attempted (N), and the resulting bypass rate. Higher = more vulnerable. The attack used here is the optimal V8 configuration from the study.
95% CI
Wilson 95% confidence interval around the bypass rate, accounting for sample size. Narrow intervals indicate higher statistical confidence in the point estimate.
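
For readers who want to reproduce the interval column, here is a minimal sketch of the Wilson score interval. It assumes the standard formula without continuity correction and z = 1.96; the function name `wilson_interval` is illustrative and not taken from the study's code.

```python
import math


def wilson_interval(bypasses: int, probes: int, z: float = 1.96):
    """Wilson score interval (no continuity correction) for bypasses/probes.

    Returns (lower, upper) as fractions in [0, 1]. z = 1.96 is the usual
    normal quantile for a 95% interval.
    """
    if probes <= 0:
        raise ValueError("probes must be positive")
    p = bypasses / probes
    denom = 1 + z * z / probes
    center = (p + z * z / (2 * probes)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / probes + z * z / (4 * probes * probes)) / denom
    )
    # Clamp to guard against tiny floating-point overshoot at 0% and 100%.
    return max(0.0, center - half_width), min(1.0, center + half_width)


for b, n in [(250, 250), (247, 250), (0, 250)]:
    lo, hi = wilson_interval(b, n)
    print(f"{b}/{n}: [{lo * 100:.1f}, {hi * 100:.1f}]")
```

Run on three rows from the ranking, this gives [98.5, 100.0] for 250/250, [96.5, 99.6] for 247/250, and [0.0, 1.5] for 0/250, which agrees with the table. Unlike the simpler normal-approximation interval, the Wilson interval stays informative when the observed rate is exactly 0% or 100%, which is why the immune entries still carry an upper bound of about 1.5%.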

Conclusion

IICL vulnerability is neither universal nor random. It is determined by specific architectural and training choices that vary systematically across vendors and model generations, with reasoning augmentation emerging as the strongest single defensive factor measured (a 4.3× reduction in pooled bypass rate, from 39.4% to 9.2%). Anthropic's Claude family is immune in 14 of its 15 entries across four generations (the one exception in the ranking is Claude 3.7 Sonnet in default mode, at 10.8%), while Mistral's entire tested lineup falls at 94–100%. Model selection alone is therefore a measurable safety control for enterprises deploying LLMs.
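
The vendor-level contrast can be made concrete by pooling the ranking rows per family. The sketch below copies only the Mistral and Anthropic rows from the master ranking; the pooling method (sum of bypasses over sum of probes) and all names are assumptions made for illustration.

```python
from collections import defaultdict

# (family, bypasses, probes) copied from the master ranking.
# Only two families are included to keep the sketch short.
ROWS = [
    ("Mistral", 250, 250),    # Mistral Medium 3.1
    ("Mistral", 247, 250),    # Mistral Large
    ("Mistral", 236, 250),    # Mistral Small 3.2
    ("Anthropic", 27, 250),   # Claude 3.7 Sonnet (default)
    ("Anthropic", 0, 249),    # Claude Opus 4.5 thinking
] + [("Anthropic", 0, 250)] * 13  # the other 13 Claude entries, each 0/250


def pooled_by_family(rows):
    """Sum bypasses and probes per family; return {family: (B, N, rate)}."""
    totals = defaultdict(lambda: [0, 0])
    for family, bypasses, probes in rows:
        totals[family][0] += bypasses
        totals[family][1] += probes
    return {family: (b, n, b / n) for family, (b, n) in totals.items()}


for family, (b, n, rate) in pooled_by_family(ROWS).items():
    print(f"{family}: {b}/{n} = {rate:.1%}")
# Mistral: 733/750 = 97.7%
# Anthropic: 27/3749 = 0.7%
```

The same grouping could be extended to every family in the ranking to compare vendors on a single pooled number, keeping in mind that families with few entries will have much wider uncertainty.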

Read the full paper: Cross-Model Vulnerability to Involuntary In-Context Learning Attacks.