
Anthropic: Claude Sonnet 4.5

Survived 7 out of 15 breakers

Resilience: 47%

Claude Sonnet 4.5 is Anthropic’s most advanced Sonnet model to date, optimized for real-world agents and coding workflows. It delivers state-of-the-art performance on coding benchmarks such as SWE-bench Verified, with improvements across system design, code security, and specification adherence. The model is designed for extended autonomous operation, maintaining task continuity across sessions and providing fact-based progress tracking. Sonnet 4.5 also introduces stronger agentic capabilities, including improved tool orchestration, speculative parallel execution, and more efficient context and memory management. With enhanced context tracking and awareness of token usage across tool calls, it is particularly well-suited for multi-context and long-running workflows. Use cases span software engineering, cybersecurity, financial analysis, research agents, and other domains requiring sustained reasoning and tool use.

Context: 1,000,000 tokens
Cost (Input): $3.00 / 1M tokens
Cost (Output): $15.00 / 1M tokens
Max completion tokens: 64,000
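
As a rough illustration of how the listed prices translate into a per-request cost, here is a minimal sketch. It assumes flat per-token pricing with no tiering, caching, or batch discounts, which this page does not confirm; the helper name estimate_cost_usd and the example token counts are hypothetical.

```python
# Prices and limit copied from the listing above.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00
MAX_COMPLETION_TOKENS = 64_000


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request (hypothetical helper).

    Assumes a flat rate per token; real billing may differ.
    """
    if output_tokens > MAX_COMPLETION_TOKENS:
        raise ValueError("output_tokens exceeds the listed max completion tokens")
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000


# Example: 100,000 input tokens and 4,000 output tokens.
print(f"${estimate_cost_usd(100_000, 4_000):.2f}")  # prints "$0.36"
```

Under these assumptions, output tokens cost five times as much as input tokens, so long completions dominate the bill even when the prompt is large.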

Toughest Breakers

Self-Reference Count (Self Reference): pass rate 0%

Silence Protocol (Instruction Following): pass rate 0%

Broken Mug (Lateral Thinking): pass rate 0%

Breaker Results

Test	Category	Success Rate
Self-Reference Count	Self Reference	0%
Silence Protocol	Instruction Following	0%
Broken Mug	Lateral Thinking	0%
Car Wash Dilemma	Logic Reasoning	0%
The Missing A	Pattern Matching	0%
10-Step Instructions	Instruction Following	11%
Contradictory Premises	Logic Reasoning	44%
Coin Flip Paradox	Logic Reasoning	75%
Strawberry Problem	Character Counting	100%
Reverse Word Test	Character Manipulation	100%
Alice's Brother Problem	Logic Reasoning	100%
Bullshit Detector	Epistemic Humility	100%
Horse Race Logic	Logic Reasoning	100%
The Compartment Trick	Logic Reasoning	100%
Sycophancy Trap	Logic Reasoning	100%
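
The headline figures (7 out of 15 survived, 47% resilience) can be reproduced from the table with the sketch below. It assumes that a breaker counts as survived only at a 100% success rate and that resilience is the survived share rounded to the nearest percent. The page does not state either definition, but both are consistent with the numbers shown. The function name resilience and its threshold parameter are hypothetical.

```python
# Success rates (percent) copied from the Breaker Results table.
RESULTS = {
    "Self-Reference Count": 0,
    "Silence Protocol": 0,
    "Broken Mug": 0,
    "Car Wash Dilemma": 0,
    "The Missing A": 0,
    "10-Step Instructions": 11,
    "Contradictory Premises": 44,
    "Coin Flip Paradox": 75,
    "Strawberry Problem": 100,
    "Reverse Word Test": 100,
    "Alice's Brother Problem": 100,
    "Bullshit Detector": 100,
    "Horse Race Logic": 100,
    "The Compartment Trick": 100,
    "Sycophancy Trap": 100,
}


def resilience(results: dict[str, int], threshold: int = 100) -> tuple[int, int, int]:
    """Return (survived, total, resilience percent).

    Assumption: a breaker is survived when its success rate reaches
    `threshold`; the page does not define this explicitly.
    """
    survived = sum(1 for rate in results.values() if rate >= threshold)
    total = len(results)
    return survived, total, round(100 * survived / total)


survived, total, pct = resilience(RESULTS)
print(f"Survived {survived} out of {total} breakers ({pct}%)")
# prints "Survived 7 out of 15 breakers (47%)"
```

Under this reading, partial pass rates such as 75% on Coin Flip Paradox do not count toward the resilience score.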