DEFENSEBENCH LEADERBOARD

Evaluating defensive cybersecurity capabilities of foundation models

22 systems evaluated

Leaderboard Breakdown

Rank  AI System          Author     Defense Score  Cost/Task
1     Gemini 3.1 Pro     Google     23.1%          $19.20
2     Claude Opus 4.6    Anthropic  22.4%          $35.20
3     GPT-5.2            OpenAI     21.2%          $10.40
4     Claude Opus 4.5    Anthropic  20.6%          $27.20
5     GPT-5.3            OpenAI     19.8%          $16.80
6     Gemini 2.5 Pro     Google     19.5%          $10.80
7     Claude Opus 4.1    Anthropic  18.8%          $114.00
8     Claude Sonnet 4.6  Anthropic  18.1%          $22.80
9     Claude Opus 4      Anthropic  17.2%          $78.00
10    o3                 OpenAI     17.1%          $21.20
11    Claude Sonnet 4.5  Anthropic  16.8%          $15.20
12    GPT-5              OpenAI     16.3%          $8.80
13    GLM5               Z.ai       15.4%          $5.60
14    Grok 4             xAI        14.7%          $20.80
15    Claude Sonnet 4    Anthropic  13.5%          $17.60
16    GPT-4.1            OpenAI     13.2%          $9.60
17    DeepSeek V3.2      DeepSeek   12.0%          $1.20
18    Grok 3             xAI        11.8%          $24.80
19    GPT-5 Mini         OpenAI     11.1%          $2.20
20    Qwen 3.5 397B      Alibaba    10.3%          $1.52
21    Gemini 2.5 Flash   Google     7.6%           $3.80
22    GPT-4o             OpenAI     5.2%           $14.40
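The table also invites a cost-efficiency reading alongside the raw-capability ranking, since the cheapest systems per task are not the strongest scorers. The Python sketch below derives an illustrative score-per-dollar ratio from the two quantitative columns above; the ratio is an assumed metric added here for illustration, not part of DefenseBench's official scoring.

    # Illustrative only: score-per-dollar is an assumed derived metric,
    # not an official DefenseBench statistic.
    # Rows are (system, defense score in %, cost per task in USD),
    # copied from the leaderboard above.
    LEADERBOARD = [
        ("Gemini 3.1 Pro",    23.1,  19.20),
        ("Claude Opus 4.6",   22.4,  35.20),
        ("GPT-5.2",           21.2,  10.40),
        ("Claude Opus 4.5",   20.6,  27.20),
        ("GPT-5.3",           19.8,  16.80),
        ("Gemini 2.5 Pro",    19.5,  10.80),
        ("Claude Opus 4.1",   18.8, 114.00),
        ("Claude Sonnet 4.6", 18.1,  22.80),
        ("Claude Opus 4",     17.2,  78.00),
        ("o3",                17.1,  21.20),
        ("Claude Sonnet 4.5", 16.8,  15.20),
        ("GPT-5",             16.3,   8.80),
        ("GLM5",              15.4,   5.60),
        ("Grok 4",            14.7,  20.80),
        ("Claude Sonnet 4",   13.5,  17.60),
        ("GPT-4.1",           13.2,   9.60),
        ("DeepSeek V3.2",     12.0,   1.20),
        ("Grok 3",            11.8,  24.80),
        ("GPT-5 Mini",        11.1,   2.20),
        ("Qwen 3.5 397B",     10.3,   1.52),
        ("Gemini 2.5 Flash",   7.6,   3.80),
        ("GPT-4o",             5.2,  14.40),
    ]

    def score_per_dollar(row):
        """Defense-score percentage points earned per dollar of per-task cost."""
        _, score, cost = row
        return score / cost

    # Print systems from most to least cost-efficient.
    for system, score, cost in sorted(LEADERBOARD, key=score_per_dollar, reverse=True):
        print(f"{system:<18} {score:5.1f}%  ${cost:7.2f}  {score / cost:5.2f} pts/$")

On this crude measure the low-cost systems (DeepSeek V3.2, Qwen 3.5 397B, GPT-5 Mini) come out ahead, while no model in the top five by Defense Score reaches much above two points per dollar.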

About DefenseBench

DefenseBench is an end-to-end benchmark that evaluates foundation models on realistic defensive cybersecurity tasks, spanning incident response, threat detection, security configuration, alert triage, and active threat remediation, among others.

DefenseBench © 2026. All rights reserved.