Anthropic: Claude 3.5 Sonnet

Survived 6 out of 15 breakers

Resilience

40%

New Claude 3.5 Sonnet delivers better-than-Opus capabilities, faster-than-Sonnet speeds, at the same Sonnet prices. Sonnet is particularly good at:
- Coding: Scores ~49% on SWE-Bench Verified, higher than the last best score, and without any fancy prompt scaffolding
- Data science: Augments human data science expertise; navigates unstructured data while using multiple tools for insights
- Visual processing: excelling at interpreting charts, graphs, and images, accurately transcribing text to derive insights beyond just the text alone
- Agentic tasks: exceptional tool use, making it great at agentic tasks (i.e. complex, multi-step problem solving tasks that require engaging with other systems)

#multimodal

Context

200,000 tokens

Cost (Input)

$6.00 /1M tokens

Cost (Output)

$30.00 /1M tokens

Max completion tokens

8,192
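
As a rough illustration of what the listed prices mean per request, the sketch below estimates cost from token counts. The function name `estimate_cost` and the way the limits are enforced are assumptions made for illustration; only the four numeric values come from the listing above.

```python
# Values copied from the listing above (USD per 1M tokens, token limits).
INPUT_PRICE_PER_M = 6.00
OUTPUT_PRICE_PER_M = 30.00
CONTEXT_WINDOW = 200_000
MAX_COMPLETION_TOKENS = 8_192


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request (hypothetical helper).

    Assumption: the limits apply per request, and the input alone must
    fit in the context window. The page does not spell this out.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if input_tokens > CONTEXT_WINDOW:
        raise ValueError("input exceeds the listed context window")
    if output_tokens > MAX_COMPLETION_TOKENS:
        raise ValueError("output exceeds the listed max completion tokens")
    input_cost = input_tokens * INPUT_PRICE_PER_M / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE_PER_M / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    # 10,000 input tokens and 1,000 output tokens: about $0.09.
    print(round(estimate_cost(10_000, 1_000), 4))
```

At these rates an output token costs five times as much as an input token, so long completions dominate the bill even for large prompts.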

Toughest Breakers

Test	Category	Pass rate
Self-Reference Count	Self Reference	0%
Silence Protocol	Instruction Following	0%
Contradictory Premises	Logic Reasoning	0%

Breaker Results

Test	Category	Success Rate
Self-Reference Count	Self Reference	0%
Silence Protocol	Instruction Following	0%
Contradictory Premises	Logic Reasoning	0%
Broken Mug	Lateral Thinking	0%
Car Wash Dilemma	Logic Reasoning	0%
The Missing A	Pattern Matching	0%
Strawberry Problem	Character Counting	11%
10-Step Instructions	Instruction Following	11%
Horse Race Logic	Logic Reasoning	25%
Reverse Word Test	Character Manipulation	100%
Alice's Brother Problem	Logic Reasoning	100%
Bullshit Detector	Epistemic Humility	100%
The Compartment Trick	Logic Reasoning	100%
Sycophancy Trap	Logic Reasoning	100%
Coin Flip Paradox	Logic Reasoning	100%
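
The headline figures (6 of 15 breakers survived, 40% resilience) appear to follow from this table if a breaker counts as survived only at a 100% success rate. That threshold is an assumption inferred from the numbers, not something the page states. A minimal sketch of the arithmetic, with the hypothetical helper `resilience`:

```python
# Success rates (percent) copied from the Breaker Results table.
RESULTS = {
    "Self-Reference Count": 0,
    "Silence Protocol": 0,
    "Contradictory Premises": 0,
    "Broken Mug": 0,
    "Car Wash Dilemma": 0,
    "The Missing A": 0,
    "Strawberry Problem": 11,
    "10-Step Instructions": 11,
    "Horse Race Logic": 25,
    "Reverse Word Test": 100,
    "Alice's Brother Problem": 100,
    "Bullshit Detector": 100,
    "The Compartment Trick": 100,
    "Sycophancy Trap": 100,
    "Coin Flip Paradox": 100,
}


def resilience(results: dict[str, int], threshold: int = 100) -> tuple[int, int, int]:
    """Return (survived, total, resilience percent).

    Assumption: a breaker is "survived" when its success rate meets
    the threshold; 100 is inferred from the page, not documented.
    """
    survived = sum(1 for rate in results.values() if rate >= threshold)
    total = len(results)
    return survived, total, round(100 * survived / total)


if __name__ == "__main__":
    survived, total, pct = resilience(RESULTS)
    print(f"Survived {survived} out of {total} breakers ({pct}%)")
```

Under this reading, partial success rates such as 11% or 25% contribute nothing to the resilience score.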

Anthropic: Claude 3.5 Sonnet — ReAIty Check