DEFENSEBENCH TASKS

Explore AI performance on defensive cybersecurity tasks.

Task ID    Gemini 3.1 Pro       Claude Opus 4.6      GPT-5.3              Claude Sonnet 4.6    o3                   GLM5                 Grok 4
           Status  Cost (USD)   Status  Cost (USD)   Status  Cost (USD)   Status  Cost (USD)   Status  Cost (USD)   Status  Cost (USD)   Status  Cost (USD)
a3f7c2d1   PASS    $12.40       PASS    $24.32       FAIL    $15.05       FAIL    $20.88       FAIL    $20.52       FAIL    $5.43        FAIL    $14.54
b8e12a4f   PASS    $21.31       FAIL    $35.45       PASS    $22.50       FAIL    $21.93       PASS    $20.12       FAIL    $5.69        FAIL    $12.83
c4d901b7   FAIL    $15.17       FAIL    $40.02       FAIL    $12.84       FAIL    $21.61       FAIL    $25.52       FAIL    $3.94        FAIL    $19.47
d1a6f3e2   FAIL    $21.87       FAIL    $42.80       FAIL    $21.99       FAIL    $17.10       FAIL    $17.87       FAIL    $6.55        PASS    $23.11
e7b5c8d4   FAIL    $26.57       PASS    $34.04       PASS    $19.24       PASS    $24.42       FAIL    $23.79       FAIL    $3.87        FAIL    $20.68

Held-Out Test Set

Full results are available upon submission.