Anthropic: Claude 3.5 Sonnet

Survived 6 out of 15 breakers

Resilience
40%

New Claude 3.5 Sonnet delivers better-than-Opus capabilities and faster-than-Sonnet speeds at the same Sonnet prices. Sonnet is particularly good at:

- Coding: scores ~49% on SWE-Bench Verified, higher than the previous best score, and without any fancy prompt scaffolding
- Data science: augments human data science expertise; navigates unstructured data while using multiple tools for insights
- Visual processing: excels at interpreting charts, graphs, and images, accurately transcribing text to derive insights beyond just the text alone
- Agentic tasks: exceptional tool use, making it great at agentic tasks (i.e., complex, multi-step problem-solving tasks that require engaging with other systems)

#multimodal

Context

200,000 tokens

Cost (Input)

$6.00 /1M tokens

Cost (Output)

$30.00 /1M tokens

Max completion tokens

8,192
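
The per-token prices and limits listed above can be turned into a rough per-request cost estimate. The sketch below is illustrative only: the function name `estimate_cost_usd` is hypothetical, and it assumes the 200,000-token context covers input and output together, which the listing does not state.

```python
# Figures copied from the listing above.
INPUT_USD_PER_MTOK = 6.00       # cost per 1M input tokens
OUTPUT_USD_PER_MTOK = 30.00     # cost per 1M output tokens
CONTEXT_TOKENS = 200_000        # context window
MAX_COMPLETION_TOKENS = 8_192   # max completion tokens


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in USD (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if output_tokens > MAX_COMPLETION_TOKENS:
        raise ValueError("output exceeds the max completion tokens")
    # Assumption: the context window bounds input plus output combined.
    if input_tokens + output_tokens > CONTEXT_TOKENS:
        raise ValueError("request exceeds the context window")
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


if __name__ == "__main__":
    # 10,000 input tokens and 1,000 output tokens
    print(f"${estimate_cost_usd(10_000, 1_000):.2f}")
```

At these prices, output tokens cost five times as much as input tokens, so long completions dominate the bill even when the prompt is much larger.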

Toughest Breakers

Breaker Results

Test | Category | Latest Result | Success Rate
Self-Reference Count | Self Reference | | 0%
Silence Protocol | Instruction Following | | 0%
Contradictory Premises | Logic Reasoning | | 0%
Broken Mug | Lateral Thinking | | 0%
Car Wash Dilemma | Logic Reasoning | | 0%
The Missing A | Pattern Matching | | 0%
Strawberry Problem | Character Counting | | 11%
10-Step Instructions | Instruction Following | | 11%
Horse Race Logic | Logic Reasoning | | 25%
Reverse Word Test | Character Manipulation | | 100%
Alice's Brother Problem | Logic Reasoning | | 100%
Bullshit Detector | Epistemic Humility | | 100%
The Compartment Trick | Logic Reasoning | | 100%
Sycophancy Trap | Logic Reasoning | | 100%
Coin Flip Paradox | Logic Reasoning | | 100%
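
The headline figures (6 of 15 breakers survived, 40% resilience) can be reproduced from the success rates in the table. The sketch below assumes a breaker counts as "survived" only at a 100% success rate; the page does not state this rule, but it is consistent with the numbers shown. The variable names are illustrative.

```python
# Success rates (percent) copied from the Breaker Results table.
success_rates = {
    "Self-Reference Count": 0,
    "Silence Protocol": 0,
    "Contradictory Premises": 0,
    "Broken Mug": 0,
    "Car Wash Dilemma": 0,
    "The Missing A": 0,
    "Strawberry Problem": 11,
    "10-Step Instructions": 11,
    "Horse Race Logic": 25,
    "Reverse Word Test": 100,
    "Alice's Brother Problem": 100,
    "Bullshit Detector": 100,
    "The Compartment Trick": 100,
    "Sycophancy Trap": 100,
    "Coin Flip Paradox": 100,
}

# Assumption: "survived" means a 100% success rate on that breaker.
survived = [name for name, rate in success_rates.items() if rate == 100]
resilience = len(survived) / len(success_rates)

print(f"Survived {len(survived)} out of {len(success_rates)} breakers")
print(f"Resilience {resilience:.0%}")
```

Under this assumption, partial success rates such as 11% or 25% contribute nothing to the resilience score.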