DEFENSEBENCH TASKS
Explore AI performance on defensive cybersecurity tasks.
| Task ID | Gemini 3.1 Pro Status | Gemini 3.1 Pro Cost (USD) | Claude Opus 4.6 Status | Claude Opus 4.6 Cost (USD) | GPT-5.3 Status | GPT-5.3 Cost (USD) | Claude Sonnet 4.6 Status | Claude Sonnet 4.6 Cost (USD) | o3 Status | o3 Cost (USD) | GLM5 Status | GLM5 Cost (USD) | Grok 4 Status | Grok 4 Cost (USD) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a3f7c2d1 | PASS | $12.40 | PASS | $24.32 | FAIL | $15.05 | FAIL | $20.88 | FAIL | $20.52 | FAIL | $5.43 | FAIL | $14.54 |
| b8e12a4f | PASS | $21.31 | FAIL | $35.45 | PASS | $22.50 | FAIL | $21.93 | PASS | $20.12 | FAIL | $5.69 | FAIL | $12.83 |
| c4d901b7 | FAIL | $15.17 | FAIL | $40.02 | FAIL | $12.84 | FAIL | $21.61 | FAIL | $25.52 | FAIL | $3.94 | FAIL | $19.47 |
| d1a6f3e2 | FAIL | $21.87 | FAIL | $42.80 | FAIL | $21.99 | FAIL | $17.10 | FAIL | $17.87 | FAIL | $6.55 | PASS | $23.11 |
| e7b5c8d4 | FAIL | $26.57 | PASS | $34.04 | PASS | $19.24 | PASS | $24.42 | FAIL | $23.79 | FAIL | $3.87 | FAIL | $20.68 |
Held-Out Test Set
Full results available upon submission