DEFENSEBENCH LEADERBOARD
Evaluating defensive cybersecurity capabilities of foundation models
22 systems
Leaderboard Breakdown
| # | AI System | Author | Defense Score | Cost/Task | Paper |
|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro | Google | 23.1% | $19.20 | — |
| 2 | Claude Opus 4.6 | Anthropic | 22.4% | $35.20 | — |
| 3 | GPT-5.2 | OpenAI | 21.2% | $10.40 | — |
| 4 | Claude Opus 4.5 | Anthropic | 20.6% | $27.20 | — |
| 5 | GPT-5.3 | OpenAI | 19.8% | $16.80 | — |
| 6 | Gemini 2.5 Pro | Google | 19.5% | $10.80 | — |
| 7 | Claude Opus 4.1 | Anthropic | 18.8% | $114.00 | — |
| 8 | Claude Sonnet 4.6 | Anthropic | 18.1% | $22.80 | — |
| 9 | Claude Opus 4 | Anthropic | 17.2% | $78.00 | — |
| 10 | o3 | OpenAI | 17.1% | $21.20 | — |
| 11 | Claude Sonnet 4.5 | Anthropic | 16.8% | $15.20 | — |
| 12 | GPT-5 | OpenAI | 16.3% | $8.80 | — |
| 13 | GLM5 | Z.ai | 15.4% | $5.60 | — |
| 14 | Grok 4 | xAI | 14.7% | $20.80 | — |
| 15 | Claude Sonnet 4 | Anthropic | 13.5% | $17.60 | — |
| 16 | GPT-4.1 | OpenAI | 13.2% | $9.60 | — |
| 17 | DeepSeek V3.2 | DeepSeek | 12.0% | $1.20 | — |
| 18 | Grok 3 | xAI | 11.8% | $24.80 | — |
| 19 | GPT-5 Mini | OpenAI | 11.1% | $2.20 | — |
| 20 | Qwen 3.5 397B | Alibaba | 10.3% | $1.52 | — |
| 21 | Gemini 2.5 Flash | Google | 7.6% | $3.80 | — |
| 22 | GPT-4o | OpenAI | 5.2% | $14.40 | — |
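Defense Score and Cost/Task can be combined into a rough cost-efficiency ratio (score percentage points per dollar). The sketch below is illustrative only, not part of DefenseBench itself; the subset of rows is copied from the table above:

```python
# Cost-efficiency sketch: Defense Score percentage points per dollar per task.
# Values are taken from the leaderboard above (a small subset of the 22 rows).
leaderboard = [
    # (system, defense_score_pct, cost_per_task_usd)
    ("Gemini 3.1 Pro", 23.1, 19.20),
    ("Claude Opus 4.6", 22.4, 35.20),
    ("GPT-5.2", 21.2, 10.40),
    ("DeepSeek V3.2", 12.0, 1.20),
]

def points_per_dollar(score_pct, cost_usd):
    """Defense-score percentage points earned per dollar spent per task."""
    return score_pct / cost_usd

# Re-rank by cost-efficiency rather than raw score.
ranked = sorted(leaderboard, key=lambda r: points_per_dollar(r[1], r[2]),
                reverse=True)
for name, score, cost in ranked:
    print(f"{name}: {points_per_dollar(score, cost):.2f} pts/$")
```

On this metric the cheapest models dominate: DeepSeek V3.2 earns 10.00 pts/$ against roughly 1.20 pts/$ for the top-ranked Gemini 3.1 Pro, a reminder that raw Defense Score and cost-efficiency order the table very differently.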
About DefenseBench
DefenseBench is an end-to-end benchmark for evaluating foundation models on realistic defensive cybersecurity tasks — from incident response and threat detection to security configuration, alert triage, active threat remediation, and more.