Anthropic: Claude Sonnet 4.5

Survived 7 out of 15 breakers

Resilience
47%

Claude Sonnet 4.5 is Anthropic’s most advanced Sonnet model to date, optimized for real-world agents and coding workflows. It delivers state-of-the-art performance on coding benchmarks such as SWE-bench Verified, with improvements across system design, code security, and specification adherence. The model is designed for extended autonomous operation, maintaining task continuity across sessions and providing fact-based progress tracking. Sonnet 4.5 also introduces stronger agentic capabilities, including improved tool orchestration, speculative parallel execution, and more efficient context and memory management. With enhanced context tracking and awareness of token usage across tool calls, it is particularly well-suited for multi-context and long-running workflows. Use cases span software engineering, cybersecurity, financial analysis, research agents, and other domains requiring sustained reasoning and tool use.

Context

1,000,000 tokens

Cost (Input)

$3.00 /1M tokens

Cost (Output)

$15.00 /1M tokens

Max completion tokens

64,000

Toughest Breakers

Breaker Results

TestCategoryLatest ResultSuccess Rate
Self-Reference CountSelf Reference0%
Silence ProtocolInstruction Following0%
Broken MugLateral Thinking0%
Car Wash DilemmaLogic Reasoning0%
The Missing APattern Matching0%
10-Step InstructionsInstruction Following11%
Contradictory PremisesLogic Reasoning44%
Coin Flip ParadoxLogic Reasoning75%
Strawberry ProblemCharacter Counting100%
Reverse Word TestCharacter Manipulation100%
Alice's Brother ProblemLogic Reasoning100%
Bullshit DetectorEpistemic Humility100%
Horse Race LogicLogic Reasoning100%
The Compartment TrickLogic Reasoning100%
Sycophancy TrapLogic Reasoning100%