Anthropic: Claude 3.7 Sonnet (thinking)

Survived 11 out of 15 breakers

Resilience

73%

Claude 3.7 Sonnet is a large language model with improved reasoning, coding, and problem-solving capabilities. It introduces a hybrid reasoning approach that lets users choose between rapid responses and extended, step-by-step processing for complex tasks. The model shows notable improvements in coding, particularly in front-end development and full-stack updates, and excels in agentic workflows, where it can autonomously navigate multi-step processes. In standard mode, Claude 3.7 Sonnet maintains performance parity with its predecessor; its extended reasoning mode offers enhanced accuracy in math, coding, and instruction-following tasks. Read more in the [announcement blog post](https://www.anthropic.com/news/claude-3-7-sonnet).

Context

200,000 tokens

Cost (Input)

$3.00 /1M tokens

Cost (Output)

$15.00 /1M tokens

Max completion tokens

64,000
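
As a rough illustration of how the listed prices translate into a per-request cost, here is a minimal sketch in Python. The prices and the completion limit are copied from the figures above; the helper name `estimate_cost` and the example token counts are hypothetical, and real billing may differ (for example, this ignores any caching or batch discounts).

```python
# Figures copied from the listing above (USD per 1M tokens).
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00
MAX_COMPLETION_TOKENS = 64_000


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request (hypothetical helper, not an official API)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if output_tokens > MAX_COMPLETION_TOKENS:
        raise ValueError("output exceeds the listed max completion tokens")
    total = input_tokens * INPUT_PRICE_PER_MTOK + output_tokens * OUTPUT_PRICE_PER_MTOK
    return total / 1_000_000


if __name__ == "__main__":
    # Example token counts are made up for illustration.
    print(f"${estimate_cost(10_000, 2_000):.2f}")
```

Because output tokens cost five times as much as input tokens, long completions (including extended reasoning) are likely to dominate the bill in this estimate.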

Toughest Breakers

Contradictory Premises

Logic Reasoning

Pass rate

0%

Self-Reference Count

Self Reference

Pass rate

11%

The Missing A

Pattern Matching

Pass rate

25%

Breaker Results

Test	Category	Success Rate
Contradictory Premises	Logic Reasoning	0%
Self-Reference Count	Self Reference	11%
The Missing A	Pattern Matching	25%
10-Step Instructions	Instruction Following	33%
Broken Mug	Lateral Thinking	75%
Car Wash Dilemma	Logic Reasoning	75%
Strawberry Problem	Character Counting	89%
Reverse Word Test	Character Manipulation	100%
Alice's Brother Problem	Logic Reasoning	100%
Silence Protocol	Instruction Following	100%
Bullshit Detector	Epistemic Humility	100%
Horse Race Logic	Logic Reasoning	100%
The Compartment Trick	Logic Reasoning	100%
Sycophancy Trap	Logic Reasoning	100%
Coin Flip Paradox	Logic Reasoning	100%
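
The headline figures (11 of 15 breakers survived, 73% resilience) can be reproduced from the table above with a short sketch. The page does not state how "survived" is defined, so the 50% cutoff below is an assumption; any cutoff above 33% and at most 75% gives the same count for this data. The variable names are illustrative only.

```python
# Pass rates (in percent) copied from the Breaker Results table.
results = {
    "Contradictory Premises": 0,
    "Self-Reference Count": 11,
    "The Missing A": 25,
    "10-Step Instructions": 33,
    "Broken Mug": 75,
    "Car Wash Dilemma": 75,
    "Strawberry Problem": 89,
    "Reverse Word Test": 100,
    "Alice's Brother Problem": 100,
    "Silence Protocol": 100,
    "Bullshit Detector": 100,
    "Horse Race Logic": 100,
    "The Compartment Trick": 100,
    "Sycophancy Trap": 100,
    "Coin Flip Paradox": 100,
}

# ASSUMPTION: a breaker counts as "survived" at a pass rate of 50% or more.
# The source does not document the actual threshold.
SURVIVAL_THRESHOLD = 50

survived = [name for name, rate in results.items() if rate >= SURVIVAL_THRESHOLD]
resilience = round(100 * len(survived) / len(results))

print(f"Survived {len(survived)} out of {len(results)} breakers")
print(f"Resilience: {resilience}%")
```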