Models Challenges Benchmarks About Submit Challenge

OpenAI: GPT-5 Codex

Survived 9 out of 15 breakers

Resilience

60%

GPT-5-Codex is a specialized version of GPT-5 optimized for software engineering and coding workflows. It is designed for both interactive development sessions and long, independent execution of complex engineering tasks. The model supports building projects from scratch, feature development, debugging, large-scale refactoring, and code review. Compared to GPT-5, Codex is more steerable, adheres closely to developer instructions, and produces cleaner, higher-quality code outputs. Reasoning effort can be adjusted with the `reasoning.effort` parameter. Read the [docs here](https://openrouter.ai/docs/use-cases/reasoning-tokens#reasoning-effort-level) Codex integrates into developer environments including the CLI, IDE extensions, GitHub, and cloud tasks. It adapts reasoning effort dynamically—providing fast responses for small tasks while sustaining extended multi-hour runs for large projects. The model is trained to perform structured code reviews, catching critical flaws by reasoning over dependencies and validating behavior against tests. It also supports multimodal inputs such as images or screenshots for UI development and integrates tool use for search, dependency installation, and environment setup. Codex is intended specifically for agentic coding applications.

Context

400,000 tokens

Cost (Input)

$1.25 /1M tokens

Cost (Output)

$10.00 /1M tokens

Max completion tokens

128,000

Toughest Breakers

Self-Reference Count

Self Reference

Pass rate

Car Wash Dilemma

Logic Reasoning

Pass rate

10-Step Instructions

Instruction Following

Pass rate

11%

Breaker Results

Test	Category	Success Rate
Self-Reference Count	Self Reference	0%
Car Wash Dilemma	Logic Reasoning	0%
10-Step Instructions	Instruction Following	11%
Contradictory Premises	Logic Reasoning	22%
The Missing A	Pattern Matching	25%
Bullshit Detector	Epistemic Humility	75%
Horse Race Logic	Logic Reasoning	75%
The Compartment Trick	Logic Reasoning	75%
Reverse Word Test	Character Manipulation	89%
Silence Protocol	Instruction Following	89%
Strawberry Problem	Character Counting	100%
Alice's Brother Problem	Logic Reasoning	100%
Broken Mug	Lateral Thinking	100%
Sycophancy Trap	Logic Reasoning	100%
Coin Flip Paradox	Logic Reasoning	100%