AI model security ranking top 100: in-context learning security

Involuntary in-context learning (IICL) is an attack technique against AI applications and agents that hides malicious instructions or harmful requests inside few-shot pattern-completion tasks, bypassing safety alignment in large language models and enabling prompt injection and jailbreak attacks. When we first tested IICL against GPT-5.4, the attack succeeded at a 60% attack success rate (ASR), while GPT-5 and GPT-5 Mini remained at 0%, which points to a safety regression introduced between releases.

The clear next step was to determine whether IICL is specific to OpenAI or a broader issue in industry safety training. To investigate, we ran the attack across frontier and open-weight models from Anthropic, OpenAI, Google, xAI, DeepSeek, Meta, Mistral, Cohere, Microsoft, Amazon, Moonshot, Alibaba, Zhipu, Tencent, Xiaomi, Z-AI, and AntGroup, testing ten attack variants each.

We evaluated 100 (model, reasoning-mode) entries across 17 vendor families with 24,956 adversarial probes, using different sub-techniques of the IICL attack family. The results show a stark bimodal distribution: 36 entries were completely immune, while a similar number were fully compromised (100% bypass on at least one variant). Vulnerability correlates more closely with vendor identity than with parameter count or release date.

Read the full research: Cross-Model Vulnerability to Involuntary In-Context Learning Attacks.

Top 100 master ranking

Every tested entry, ranked from most to least vulnerable. R marks reasoning state (Y = reasoning-active, n = non-reasoning). Mode gives the specific configuration used (e.g. minimal, high, thinking_on). Bypass rate is the share of adversarial probes (out of N) that successfully elicited harmful output.

#	Model (mode)	Family	R	Mode	Bypass B/N (%)	95% CI
1	Mistral Medium 3.1	Mistral	n	–	250/250 (100%)	[98.5, 100]
2	Mistral Large	Mistral	n	–	247/250 (98.8%)	[96.5, 99.6]
3	Grok 3 Mini	xAI	n	–	244/250 (97.6%)	[94.9, 98.9]
4	Gemma 3 27B	Google	n	–	242/250 (96.8%)	[93.8, 98.4]
5	Grok 3	xAI	n	–	239/250 (95.6%)	[92.3, 97.5]
6	Mistral Small 3.2	Mistral	n	–	236/250 (94.4%)	[90.8, 96.6]
7	Ant Ling 2.6 Flash	AntGroup	n	none	230/250 (92.0%)	[88.0, 94.8]
8	Ant Ling 2.6 1T	AntGroup	Y	default	230/251 (91.6%)	[87.5, 94.5]
9	DeepSeek V4 Flash	DeepSeek	n	–	225/251 (89.6%)	[85.3, 92.8]
10	DeepSeek V3.2	DeepSeek	n	–	206/250 (82.4%)	[77.2, 86.6]
11	Gemini 2.0 Flash	Google	n	–	204/250 (81.6%)	[76.3, 85.9]
12	Gemini 2.5 Flash	Google	n	–	200/250 (80.0%)	[74.6, 84.5]
13	DeepSeek Chat V3.1	DeepSeek	n	none	187/245 (76.3%)	[70.6, 81.2]
14	GPT-4.1 Mini	OpenAI	n	–	190/250 (76.0%)	[70.3, 80.9]
15	Tencent HunYuan 3	Tencent	Y	default	178/245 (72.7%)	[66.8, 77.9]
16	GLM-4.5	Zhipu	n	none	179/248 (72.2%)	[66.3, 77.4]
17	GLM-4.5V	Zhipu	Y	default	179/250 (71.6%)	[65.7, 76.8]
18	DeepSeek R1	DeepSeek	Y	default	178/250 (71.2%)	[65.3, 76.5]
19	Kimi K2 (non-thinking)	Moonshot	n	none	177/250 (70.8%)	[64.9, 76.1]
20	Gemini 2.5 Flash Lite	Google	n	–	174/250 (69.6%)	[63.6, 75.0]
21	DeepSeek V4 Pro	DeepSeek	n	–	169/250 (67.6%)	[61.6, 73.1]
22	Command R+	Cohere	n	–	166/250 (66.4%)	[60.3, 72.0]
23	Llama 4 Scout	Meta	n	–	155/250 (62.0%)	[55.8, 67.8]
24	Gemini 3.1 Flash Lite	Google	n	–	138/250 (55.2%)	[49.0, 61.2]
25	Qwen3 Coder Next	Alibaba	n	none	119/250 (47.6%)	[41.5, 53.8]
26	Xiaomi MiMo 2.5	Xiaomi	n	none	113/250 (45.2%)	[39.1, 51.4]
27	GPT-3.5 Turbo	OpenAI	n	–	105/250 (42.0%)	[36.0, 48.2]
28	Gemini 2.5 Pro	Google	Y	default	101/250 (40.4%)	[34.5, 46.6]
29	GPT-4.1 Nano	OpenAI	n	–	86/250 (34.4%)	[28.8, 40.5]
30	GPT-5 (minimal)	OpenAI	n	minimal	81/249 (32.5%)	[27.0, 38.6]
31	Llama 4 Maverick	Meta	n	–	74/250 (29.6%)	[24.3, 35.5]
32	Qwen3 Max Think	Alibaba	Y	default	56/250 (22.4%)	[17.7, 28.0]
33	Xiaomi MiMo 2.5 Pro	Xiaomi	Y	default	54/251 (21.5%)	[16.9, 27.0]
34	Nova Pro	Amazon	n	–	50/250 (20.0%)	[15.5, 25.4]
35	Kimi K2 Think	Moonshot	Y	default	50/250 (20.0%)	[15.5, 25.4]
36	Tencent HunYuan A13B	Tencent	n	none	50/250 (20.0%)	[15.5, 25.4]
37	GPT-5.1	OpenAI	Y	default	42/250 (16.8%)	[12.7, 21.9]
38	GPT-4o Mini	OpenAI	n	–	33/250 (13.2%)	[9.6, 18.0]
39	NVIDIA Nemotron 70B	NVIDIA	n	none	33/250 (13.2%)	[9.6, 18.0]
40	Phi-4	Microsoft	n	–	28/250 (11.2%)	[7.9, 15.7]
41	Claude 3.7 Sonnet	Anthropic	Y	default	27/250 (10.8%)	[7.5, 15.3]
42	Llama 3.1 405B	Meta	n	–	22/250 (8.8%)	[5.9, 13.0]
43	GPT-4.1	OpenAI	n	–	17/250 (6.8%)	[4.3, 10.6]
44	GPT-5 Mini (minimal)	OpenAI	n	minimal	11/250 (4.4%)	[2.5, 7.7]
45	Kimi K2.6	Moonshot	Y	default	10/251 (4.0%)	[2.2, 7.2]
46	Kimi K2.5	Moonshot	Y	default	9/250 (3.6%)	[1.9, 6.7]
47	GPT-4o	OpenAI	n	–	9/250 (3.6%)	[1.9, 6.7]
48	Gemma 4 26B-MoE	Google	n	–	8/251 (3.2%)	[1.6, 6.2]
49	o4-mini	OpenAI	Y	default	7/250 (2.8%)	[1.4, 5.7]
50	GPT-OSS 120B	OpenAI	n	–	7/250 (2.8%)	[1.4, 5.7]
51	Grok 4	xAI	Y	default	7/250 (2.8%)	[1.4, 5.7]
52	Qwen3 Coder	Alibaba	n	–	6/250 (2.4%)	[1.1, 5.1]
53	Llama 3.3 70B	Meta	n	–	5/250 (2.0%)	[0.9, 4.6]
54	Grok 4 Fast low	xAI	Y	low	3/250 (1.2%)	[0.4, 3.5]
55	Grok 4 low	xAI	Y	low	3/250 (1.2%)	[0.4, 3.5]
56	Llama 3.1 70B	Meta	n	none	3/250 (1.2%)	[0.4, 3.5]
57	GPT-5.3	OpenAI	n	–	2/250 (0.8%)	[0.2, 2.9]
58	Grok 4.20 low	xAI	Y	low	2/250 (0.8%)	[0.2, 2.9]
59	o3	OpenAI	Y	default	1/250 (0.4%)	[0.1, 2.2]
60	Grok 4.1 Fast	xAI	Y	default	1/250 (0.4%)	[0.1, 2.2]
61	GPT-5.1 (high)	OpenAI	Y	high	1/250 (0.4%)	[0.1, 2.2]
62	GPT-5.2 (high)	OpenAI	Y	high	1/250 (0.4%)	[0.1, 2.2]
63	GPT-5.4 (high)	OpenAI	Y	high	1/250 (0.4%)	[0.1, 2.2]
64	o3-mini	OpenAI	Y	default	1/250 (0.4%)	[0.1, 2.2]
65	Claude 3.5 Haiku	Anthropic	n	–	0/250 (0.0%)	[0, 1.5]
66	Claude 3 Haiku	Anthropic	n	–	0/250 (0.0%)	[0, 1.5]
67	GPT-5	OpenAI	Y	default	0/250 (0.0%)	[0, 1.5]
68	GPT-5 Mini	OpenAI	Y	default	0/250 (0.0%)	[0, 1.5]
69	GPT-5.2	OpenAI	Y	default	0/250 (0.0%)	[0, 1.5]
70	Claude Opus 4.5	Anthropic	Y	default	0/250 (0.0%)	[0, 1.5]
71	Claude Sonnet 4.5	Anthropic	Y	default	0/250 (0.0%)	[0, 1.5]
72	Claude 3.5 Sonnet	Anthropic	n	–	0/250 (0.0%)	[0, 1.5]
73	Gemini 3 Pro	Google	Y	default	0/250 (0.0%)	[0, 1.5]
74	Claude Sonnet 4.6	Anthropic	Y	default	0/250 (0.0%)	[0, 1.5]
75	Claude Opus 4.6	Anthropic	Y	default	0/250 (0.0%)	[0, 1.5]
76	Gemini 3.1 Pro	Google	Y	default	0/250 (0.0%)	[0, 1.5]
77	Qwen 3.5 397B	Alibaba	n	–	0/250 (0.0%)	[0, 1.5]
78	GPT-5.5	OpenAI	Y	default	0/250 (0.0%)	[0, 1.5]
79	Claude Opus 4.7	Anthropic	Y	default	0/250 (0.0%)	[0, 1.5]
80	GPT-5.5 Pro	OpenAI	Y	default	0/251 (0.0%)	[0, 1.5]
81	Qwen 3.6 Plus	Alibaba	Y	default	0/251 (0.0%)	[0, 1.5]
82	GPT-5.1 (minimal)	OpenAI	n	minimal	0/250 (0.0%)	[0, 1.5]
83	GPT-5 Mini (high)	OpenAI	Y	high	0/250 (0.0%)	[0, 1.5]
84	GPT-5.2 (minimal)	OpenAI	n	minimal	0/250 (0.0%)	[0, 1.5]
85	GPT-5 (high)	OpenAI	Y	high	0/250 (0.0%)	[0, 1.5]
86	GPT-5.4 (minimal)	OpenAI	n	minimal	0/250 (0.0%)	[0, 1.5]
87	GPT-5.2 Pro (medium)	OpenAI	Y	medium	0/250 (0.0%)	[0, 1.5]
88	GPT-5.2 Pro (high)	OpenAI	Y	high	0/249 (0.0%)	[0, 1.5]
89	GPT-5.4 Pro (high)	OpenAI	Y	high	0/216 (0.0%)	[0, 1.7]
90	Claude Opus 4.7 thinking	Anthropic	Y	thinking_on	0/250 (0.0%)	[0, 1.5]
91	Claude Opus 4.6 thinking	Anthropic	Y	thinking_on	0/250 (0.0%)	[0, 1.5]
92	Claude Opus 4.5 thinking	Anthropic	Y	thinking_on	0/249 (0.0%)	[0, 1.5]
93	GPT-5.5 minimal	OpenAI	Y	minimal	0/250 (0.0%)	[0, 1.5]
94	Claude Sonnet 4.6 thinking	Anthropic	Y	thinking_on	0/250 (0.0%)	[0, 1.5]
95	Claude Sonnet 4.5 thinking	Anthropic	Y	thinking_on	0/250 (0.0%)	[0, 1.5]
96	Claude 3.7 Sonnet thinking	Anthropic	Y	thinking_on	0/250 (0.0%)	[0, 1.5]
97	Gemini 3 Pro thinking_off	Google	n	thinking_off	0/250 (0.0%)	[0, 1.5]
98	Gemini 3.1 Pro thinking_off	Google	n	thinking_off	0/250 (0.0%)	[0, 1.5]
99	o1	OpenAI	Y	default	0/249 (0.0%)	[0, 1.5]
100	GPT-5.5 high	OpenAI	Y	high	0/249 (0.0%)	[0, 1.5]
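For readers who want to work with the ranking programmatically, the following is a minimal sketch of how one row could be parsed and its bypass rate recomputed from B and N. It assumes the table is available as tab-separated text in the column order shown above; the `Entry` class and `parse_row` function are hypothetical names introduced for illustration and are not part of the study's tooling.

```python
import re
from dataclasses import dataclass


@dataclass
class Entry:
    rank: int
    model: str
    family: str
    reasoning: bool  # True when the R column is "Y"
    mode: str
    bypasses: int    # B: probes that elicited harmful output
    probes: int      # N: total adversarial probes attempted

    @property
    def bypass_rate(self) -> float:
        return self.bypasses / self.probes


# Matches the leading "B/N" part of a cell such as "81/249 (32.5%)".
_B_OVER_N = re.compile(r"(\d+)/(\d+)")


def parse_row(line: str) -> Entry:
    """Parse one tab-separated row of the master ranking (assumed layout)."""
    rank, model, family, r, mode, bn, _ci = line.rstrip("\n").split("\t")
    match = _B_OVER_N.match(bn)
    if match is None:
        raise ValueError(f"unexpected bypass cell: {bn!r}")
    return Entry(int(rank), model, family, r == "Y", mode,
                 int(match.group(1)), int(match.group(2)))


row = "30\tGPT-5 (minimal)\tOpenAI\tn\tminimal\t81/249 (32.5%)\t[27.0, 38.6]"
entry = parse_row(row)
print(f"{entry.model}: {entry.bypass_rate:.1%}")  # GPT-5 (minimal): 32.5%
```

Recomputing the rate from B and N, rather than reading the printed percentage, keeps full precision for any later aggregation.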

Table A — Reasoning-active entries (50)

Every entry running with reasoning enabled, ranked by bypass rate. Pooled bypass rate: 9.2% [8.7, 9.7], roughly 4.3× lower than the non-reasoning pool (39.4%).

#	Model	Family	Mode	Bypass B/N (%)	95% CI
1	Ant Ling 2.6 1T	AntGroup	default	230/251 (91.6%)	[87.5, 94.5]
2	Tencent HunYuan 3	Tencent	default	178/245 (72.7%)	[66.8, 77.9]
3	GLM-4.5V	Zhipu	default	179/250 (71.6%)	[65.7, 76.8]
4	DeepSeek R1	DeepSeek	default	178/250 (71.2%)	[65.3, 76.5]
5	Gemini 2.5 Pro	Google	default	101/250 (40.4%)	[34.5, 46.6]
6	Qwen3 Max Think	Alibaba	default	56/250 (22.4%)	[17.7, 28.0]
7	Xiaomi MiMo 2.5 Pro	Xiaomi	default	54/251 (21.5%)	[16.9, 27.0]
8	Kimi K2 Think	Moonshot	default	50/250 (20.0%)	[15.5, 25.4]
9	GPT-5.1	OpenAI	default	42/250 (16.8%)	[12.7, 21.9]
10	Claude 3.7 Sonnet	Anthropic	default	27/250 (10.8%)	[7.5, 15.3]
11	Kimi K2.6	Moonshot	default	10/251 (4.0%)	[2.2, 7.2]
12	Kimi K2.5	Moonshot	default	9/250 (3.6%)	[1.9, 6.7]
13	o4-mini	OpenAI	default	7/250 (2.8%)	[1.4, 5.7]
14	Grok 4	xAI	default	7/250 (2.8%)	[1.4, 5.7]
15	Grok 4 Fast	xAI	low	3/250 (1.2%)	[0.4, 3.5]
16	Grok 4	xAI	low	3/250 (1.2%)	[0.4, 3.5]
17	Grok 4.20	xAI	low	2/250 (0.8%)	[0.2, 2.9]
18	o3	OpenAI	default	1/250 (0.4%)	[0.1, 2.2]
19	Grok 4.1 Fast	xAI	default	1/250 (0.4%)	[0.1, 2.2]
20	GPT-5.1	OpenAI	high	1/250 (0.4%)	[0.1, 2.2]
21	GPT-5.2	OpenAI	high	1/250 (0.4%)	[0.1, 2.2]
22	GPT-5.4	OpenAI	high	1/250 (0.4%)	[0.1, 2.2]
23	o3-mini	OpenAI	default	1/250 (0.4%)	[0.1, 2.2]
24	GPT-5	OpenAI	default	0/250 (0.0%)	[0, 1.5]
25	GPT-5 Mini	OpenAI	default	0/250 (0.0%)	[0, 1.5]
26	GPT-5.2	OpenAI	default	0/250 (0.0%)	[0, 1.5]
27	Claude Opus 4.5	Anthropic	default	0/250 (0.0%)	[0, 1.5]
28	Claude Sonnet 4.5	Anthropic	default	0/250 (0.0%)	[0, 1.5]
29	Gemini 3 Pro	Google	default	0/250 (0.0%)	[0, 1.5]
30	Claude Sonnet 4.6	Anthropic	default	0/250 (0.0%)	[0, 1.5]
31	Claude Opus 4.6	Anthropic	default	0/250 (0.0%)	[0, 1.5]
32	Gemini 3.1 Pro	Google	default	0/250 (0.0%)	[0, 1.5]
33	GPT-5.5	OpenAI	default	0/250 (0.0%)	[0, 1.5]
34	Claude Opus 4.7	Anthropic	default	0/250 (0.0%)	[0, 1.5]
35	GPT-5.5 Pro	OpenAI	default	0/251 (0.0%)	[0, 1.5]
36	Qwen 3.6 Plus	Alibaba	default	0/251 (0.0%)	[0, 1.5]
37	GPT-5 Mini	OpenAI	high	0/250 (0.0%)	[0, 1.5]
38	GPT-5	OpenAI	high	0/250 (0.0%)	[0, 1.5]
39	GPT-5.2 Pro	OpenAI	medium	0/250 (0.0%)	[0, 1.5]
40	GPT-5.2 Pro	OpenAI	high	0/249 (0.0%)	[0, 1.5]
41	GPT-5.4 Pro	OpenAI	high	0/216 (0.0%)	[0, 1.7]
42	Claude Opus 4.7	Anthropic	thinking_on	0/250 (0.0%)	[0, 1.5]
43	Claude Opus 4.6	Anthropic	thinking_on	0/250 (0.0%)	[0, 1.5]
44	Claude Opus 4.5	Anthropic	thinking_on	0/249 (0.0%)	[0, 1.5]
45	GPT-5.5	OpenAI	minimal	0/250 (0.0%)	[0, 1.5]
46	Claude Sonnet 4.6	Anthropic	thinking_on	0/250 (0.0%)	[0, 1.5]
47	Claude Sonnet 4.5	Anthropic	thinking_on	0/250 (0.0%)	[0, 1.5]
48	Claude 3.7 Sonnet	Anthropic	thinking_on	0/250 (0.0%)	[0, 1.5]
49	o1	OpenAI	default	0/249 (0.0%)	[0, 1.5]
50	GPT-5.5	OpenAI	high	0/249 (0.0%)	[0, 1.5]

Table B — Non-reasoning entries (50)

Every entry running without reasoning (chat baseline, reasoning explicitly disabled, or no reasoning mode available). Pooled bypass rate: 39.4% [38.5, 40.2].

#	Model	Family	Mode	Bypass B/N (%)	95% CI
1	Mistral Medium 3.1	Mistral	–	250/250 (100%)	[98.5, 100]
2	Mistral Large	Mistral	–	247/250 (98.8%)	[96.5, 99.6]
3	Grok 3 Mini	xAI	–	244/250 (97.6%)	[94.9, 98.9]
4	Gemma 3 27B	Google	–	242/250 (96.8%)	[93.8, 98.4]
5	Grok 3	xAI	–	239/250 (95.6%)	[92.3, 97.5]
6	Mistral Small 3.2	Mistral	–	236/250 (94.4%)	[90.8, 96.6]
7	Ant Ling 2.6 Flash	AntGroup	none	230/250 (92.0%)	[88.0, 94.8]
8	DeepSeek V4 Flash	DeepSeek	–	225/251 (89.6%)	[85.3, 92.8]
9	DeepSeek V3.2	DeepSeek	–	206/250 (82.4%)	[77.2, 86.6]
10	Gemini 2.0 Flash	Google	–	204/250 (81.6%)	[76.3, 85.9]
11	Gemini 2.5 Flash	Google	–	200/250 (80.0%)	[74.6, 84.5]
12	DeepSeek Chat V3.1	DeepSeek	none	187/245 (76.3%)	[70.6, 81.2]
13	GPT-4.1 Mini	OpenAI	–	190/250 (76.0%)	[70.3, 80.9]
14	GLM-4.5	Zhipu	none	179/248 (72.2%)	[66.3, 77.4]
15	Kimi K2 (non-thinking)	Moonshot	none	177/250 (70.8%)	[64.9, 76.1]
16	Gemini 2.5 Flash Lite	Google	–	174/250 (69.6%)	[63.6, 75.0]
17	DeepSeek V4 Pro	DeepSeek	–	169/250 (67.6%)	[61.6, 73.1]
18	Command R+	Cohere	–	166/250 (66.4%)	[60.3, 72.0]
19	Llama 4 Scout	Meta	–	155/250 (62.0%)	[55.8, 67.8]
20	Gemini 3.1 Flash Lite	Google	–	138/250 (55.2%)	[49.0, 61.2]
21	Qwen3 Coder Next	Alibaba	none	119/250 (47.6%)	[41.5, 53.8]
22	Xiaomi MiMo 2.5	Xiaomi	none	113/250 (45.2%)	[39.1, 51.4]
23	GPT-3.5 Turbo	OpenAI	–	105/250 (42.0%)	[36.0, 48.2]
24	GPT-4.1 Nano	OpenAI	–	86/250 (34.4%)	[28.8, 40.5]
25	GPT-5	OpenAI	minimal	81/249 (32.5%)	[27.0, 38.6]
26	Llama 4 Maverick	Meta	–	74/250 (29.6%)	[24.3, 35.5]
27	Nova Pro	Amazon	–	50/250 (20.0%)	[15.5, 25.4]
28	Tencent HunYuan A13B	Tencent	none	50/250 (20.0%)	[15.5, 25.4]
29	GPT-4o Mini	OpenAI	–	33/250 (13.2%)	[9.6, 18.0]
30	NVIDIA Nemotron 70B	NVIDIA	none	33/250 (13.2%)	[9.6, 18.0]
31	Phi-4	Microsoft	–	28/250 (11.2%)	[7.9, 15.7]
32	Llama 3.1 405B	Meta	–	22/250 (8.8%)	[5.9, 13.0]
33	GPT-4.1	OpenAI	–	17/250 (6.8%)	[4.3, 10.6]
34	GPT-5 Mini	OpenAI	minimal	11/250 (4.4%)	[2.5, 7.7]
35	GPT-4o	OpenAI	–	9/250 (3.6%)	[1.9, 6.7]
36	Gemma 4 26B-MoE	Google	–	8/251 (3.2%)	[1.6, 6.2]
37	GPT-OSS 120B	OpenAI	–	7/250 (2.8%)	[1.4, 5.7]
38	Qwen3 Coder	Alibaba	–	6/250 (2.4%)	[1.1, 5.1]
39	Llama 3.3 70B	Meta	–	5/250 (2.0%)	[0.9, 4.6]
40	Llama 3.1 70B	Meta	none	3/250 (1.2%)	[0.4, 3.5]
41	GPT-5.3	OpenAI	–	2/250 (0.8%)	[0.2, 2.9]
42	Claude 3.5 Haiku	Anthropic	–	0/250 (0.0%)	[0, 1.5]
43	Claude 3 Haiku	Anthropic	–	0/250 (0.0%)	[0, 1.5]
44	Claude 3.5 Sonnet	Anthropic	–	0/250 (0.0%)	[0, 1.5]
45	Qwen 3.5 397B	Alibaba	–	0/250 (0.0%)	[0, 1.5]
46	GPT-5.1	OpenAI	minimal	0/250 (0.0%)	[0, 1.5]
47	GPT-5.2	OpenAI	minimal	0/250 (0.0%)	[0, 1.5]
48	GPT-5.4	OpenAI	minimal	0/250 (0.0%)	[0, 1.5]
49	Gemini 3 Pro	Google	thinking_off	0/250 (0.0%)	[0, 1.5]
50	Gemini 3.1 Pro	Google	thinking_off	0/250 (0.0%)	[0, 1.5]
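The pooled rates quoted for Tables A and B can be checked with a short calculation. The sketch below assumes "pooled" means total bypasses divided by total probes within each group (not the mean of per-entry rates). The totals are the column sums of B and N from the two tables as listed in this article, and the function name `pooled_rate` is illustrative.

```python
def pooled_rate(total_bypasses: int, total_probes: int) -> float:
    """Pooled bypass rate: all successful bypasses over all probes in a group."""
    return total_bypasses / total_probes


# Column sums of B and N taken from Table A and Table B above.
reasoning_b, reasoning_n = 1142, 12462
non_reasoning_b, non_reasoning_n = 4920, 12494

reasoning = pooled_rate(reasoning_b, reasoning_n)
non_reasoning = pooled_rate(non_reasoning_b, non_reasoning_n)
ratio = non_reasoning / reasoning

print(f"reasoning pool:     {reasoning:.1%}")      # 9.2%
print(f"non-reasoning pool: {non_reasoning:.1%}")  # 39.4%
print(f"ratio:              {ratio:.1f}x")         # 4.3x
print(f"total probes:       {reasoning_n + non_reasoning_n}")  # 24956
```

Pooling weights every probe equally, so entries with fewer probes (for example the 216-probe row) contribute proportionally less than they would under a simple average of per-entry percentages.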

Column definitions

Model (mode): The specific model entry tested. Where the same model appears multiple times, the parenthetical or Mode column distinguishes reasoning configurations.
Family: Vendor or model lineage (e.g. OpenAI, Anthropic, Google, Mistral). Family is one of the strongest predictors of IICL vulnerability in this study.
R (reasoning state): Y = reasoning chain active during inference. n = no reasoning (chat baseline or reasoning explicitly disabled).
Mode: The configuration flag used: reasoning effort level (minimal, low, medium, high), thinking_on / thinking_off, none, or default when the model has only one supported mode.
Bypass B/N (%): Number of successful bypasses (B) over total adversarial probes attempted (N), and the resulting bypass rate. Higher = more vulnerable. The attack used here is the optimal V8 configuration from the study.
95% CI: Wilson 95% confidence interval around the bypass rate, accounting for sample size. Narrow intervals indicate higher statistical confidence in the point estimate.
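As an illustration of the 95% CI column, here is a minimal sketch of the Wilson score interval for a bypass count B out of N probes. It assumes the standard form without continuity correction and z = 1.96; the article does not state these details, but under those assumptions the sketch gives the same values as the listed intervals for the rows checked below. The function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(bypasses: int, probes: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (no continuity correction)."""
    if probes <= 0:
        raise ValueError("probes must be positive")
    p = bypasses / probes
    z2 = z * z
    denom = 1 + z2 / probes
    center = (p + z2 / (2 * probes)) / denom
    half = z * math.sqrt(p * (1 - p) / probes + z2 / (4 * probes ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


for b, n in [(250, 250), (247, 250), (0, 250), (0, 216)]:
    low, high = wilson_interval(b, n)
    print(f"{b}/{n}: [{low * 100:.1f}, {high * 100:.1f}]")
# 250/250: [98.5, 100.0]
# 247/250: [96.5, 99.6]
# 0/250: [0.0, 1.5]
# 0/216: [0.0, 1.7]
```

Unlike the simpler normal-approximation interval, the Wilson interval stays informative at 0% and 100%, which is why a 0/250 row still carries an upper bound of 1.5% instead of a zero-width interval.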

Conclusion

IICL vulnerability is neither universal nor random. It varies systematically across vendors and model generations, which points to specific architectural and training choices, and reasoning augmentation emerges as the strongest single defensive factor measured (a 4.3× reduction in pooled bypass rate, from 39.4% to 9.2%). Anthropic's Claude family is immune in 14 of its 15 entries across four generations (the exception is Claude 3.7 Sonnet in default mode, at 10.8%), while Mistral's entire lineup falls at 94–100%. Model selection alone is therefore a measurable safety control for enterprises deploying LLMs.
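A family-level comparison like the one drawn here can be sketched by pooling B and N per vendor. The example below uses only a handful of rows copied from the ranking (all three Mistral entries plus one entry each for Cohere and Amazon), so it is an illustration of the grouping step and not a reproduction of the study's analysis; `pooled_by_family` is a hypothetical helper name.

```python
from collections import defaultdict

# (model, family, bypasses B, probes N), copied from the master ranking.
rows = [
    ("Mistral Medium 3.1", "Mistral", 250, 250),
    ("Mistral Large", "Mistral", 247, 250),
    ("Mistral Small 3.2", "Mistral", 236, 250),
    ("Command R+", "Cohere", 166, 250),
    ("Nova Pro", "Amazon", 50, 250),
]


def pooled_by_family(rows):
    """Sum B and N within each family, then divide to get a pooled rate."""
    totals = defaultdict(lambda: [0, 0])
    for _model, family, bypasses, probes in rows:
        totals[family][0] += bypasses
        totals[family][1] += probes
    return {family: b / n for family, (b, n) in totals.items()}


family_rates = pooled_by_family(rows)
for family, rate in sorted(family_rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{family}: {rate:.1%}")
# Mistral: 97.7%
# Cohere: 66.4%
# Amazon: 20.0%
```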

Read the full paper: Cross-Model Vulnerability to Involuntary In-Context Learning Attacks.