OpenAI: o4 Mini

Survived 11 out of 15 breakers

Resilience
73%
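The resilience figure appears to be the share of breakers survived, rounded to a whole percent. That reading is an assumption (the page does not state the formula), and the function name below is hypothetical. A minimal sketch of the arithmetic:

```python
def resilience(survived: int, total: int) -> float:
    """Fraction of breakers survived (assumed definition of 'Resilience')."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= survived <= total:
        raise ValueError("survived must be between 0 and total")
    return survived / total


# Figures taken from the listing: 11 of 15 breakers survived.
print(f"{resilience(11, 15):.0%}")  # 73%
```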

OpenAI o4-mini is a compact reasoning model in the o-series, optimized for fast, cost-efficient performance while retaining strong multimodal and agentic capabilities. It supports tool use and posts competitive reasoning and coding results on benchmarks such as AIME (99.5% with a Python interpreter) and SWE-bench, outperforming its predecessor o3-mini and approaching o3 in some domains. Despite its smaller size, o4-mini is highly accurate on STEM tasks, visual problem solving (e.g., MathVista, MMMU), and code editing. It is especially well suited to high-throughput scenarios where latency or cost is critical. Thanks to its efficient architecture and refined reinforcement learning training, o4-mini can chain tools, generate structured outputs, and solve multi-step tasks quickly, often in under a minute.

Context

200,000 tokens

Cost (Input)

$1.10 /1M tokens

Cost (Output)

$4.40 /1M tokens

Max completion tokens

100,000
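The prices and limits above can be combined into a rough per-request cost estimate. The sketch below uses only the figures listed here. The function and constant names are hypothetical, and the check that prompt and completion together fit in the 200,000-token context is an assumption about how the limits interact, not something the listing states.

```python
# Figures copied from the listing above.
INPUT_USD_PER_MTOK = 1.10        # Cost (Input), per 1M tokens
OUTPUT_USD_PER_MTOK = 4.40       # Cost (Output), per 1M tokens
CONTEXT_WINDOW_TOKENS = 200_000  # Context
MAX_COMPLETION_TOKENS = 100_000  # Max completion tokens


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in US dollars."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if output_tokens > MAX_COMPLETION_TOKENS:
        raise ValueError("completion exceeds the max completion tokens")
    # Assumption: prompt and completion share the same context window.
    if input_tokens + output_tokens > CONTEXT_WINDOW_TOKENS:
        raise ValueError("request does not fit in the context window")
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Example: a 50,000-token prompt with a 10,000-token completion.
print(f"${estimate_cost_usd(50_000, 10_000):.4f}")  # $0.0990
```

Output tokens cost four times as much as input tokens at these rates, so long completions tend to dominate the bill even when the prompt is large.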

Breaker Results

Test | Category | Success Rate
Contradictory Premises | Logic Reasoning | 11%
Self-Reference Count | Self Reference | 22%
10-Step Instructions | Instruction Following | 22%
The Missing A | Pattern Matching | 25%
Horse Race Logic | Logic Reasoning | 25%
Car Wash Dilemma | Logic Reasoning | 75%
Coin Flip Paradox | Logic Reasoning | 75%
Silence Protocol | Instruction Following | 78%
Strawberry Problem | Character Counting | 100%
Reverse Word Test | Character Manipulation | 100%
Alice's Brother Problem | Logic Reasoning | 100%
Broken Mug | Lateral Thinking | 100%
Bullshit Detector | Epistemic Humility | 100%
The Compartment Trick | Logic Reasoning | 100%
Sycophancy Trap | Logic Reasoning | 100%