Qwen: Qwen3.5 397B A17B

Survived 11 out of 15 breakers

Resilience

73%
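
The resilience figure appears to be the share of breakers survived, rounded to a whole percent. A minimal sketch of that arithmetic, assuming plain rounding (the function name `resilience_percent` is hypothetical and not part of the site):

```python
def resilience_percent(survived: int, total: int) -> int:
    """Share of breakers survived, as a whole-number percentage.

    Assumption: the page rounds survived / total to the nearest percent.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= survived <= total:
        raise ValueError("survived must be between 0 and total")
    return round(100 * survived / total)


print(resilience_percent(11, 15))  # prints 73
```

With 11 of 15 breakers survived, the ratio is about 73.3%, which matches the 73% shown.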

Qwen3.5 397B-A17B is a native vision-language model built on a hybrid architecture that combines a linear attention mechanism with a sparse mixture-of-experts design for higher inference efficiency. Its performance is comparable to that of leading models across a wide range of tasks, including language understanding, logical reasoning, code generation, agent-based tasks, image understanding, video understanding, and graphical user interface (GUI) interaction. With its robust code-generation and agent capabilities, the model generalizes well across diverse agent tasks.

Context: 262,144 tokens

Cost (Input): $0.39 / 1M tokens

Cost (Output): $2.34 / 1M tokens

Max completion tokens: 65,536
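
The listed prices can be turned into a per-request estimate. The sketch below uses only the figures shown here ($0.39 and $2.34 per million tokens, 65,536 max completion tokens); the function name `estimate_cost_usd` is hypothetical, and actual billing may differ from this simple linear model.

```python
# Figures taken from the listing above.
INPUT_USD_PER_MILLION = 0.39
OUTPUT_USD_PER_MILLION = 2.34
MAX_COMPLETION_TOKENS = 65_536


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in US dollars.

    Assumption: cost is linear in token counts, with no caching
    discounts or other adjustments.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if output_tokens > MAX_COMPLETION_TOKENS:
        raise ValueError("output exceeds the max completion tokens")
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


print(f"${estimate_cost_usd(100_000, 10_000):.4f}")  # prints $0.0624
```

Output tokens cost six times as much as input tokens at these rates, so long completions dominate the bill even when prompts are large.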

Toughest Breakers

Self-Reference Count (Self Reference): 13% pass rate

10-Step Instructions (Instruction Following): 13% pass rate

The Missing A (Pattern Matching): 33% pass rate

Breaker Results

Test	Category	Pass Rate
Self-Reference Count	Self Reference	13%
10-Step Instructions	Instruction Following	13%
The Missing A	Pattern Matching	33%
Bullshit Detector	Epistemic Humility	33%
Contradictory Premises	Logic Reasoning	75%
Strawberry Problem	Character Counting	88%
Reverse Word Test	Character Manipulation	100%
Alice's Brother Problem	Logic Reasoning	100%
Silence Protocol	Instruction Following	100%
Broken Mug	Lateral Thinking	100%
Car Wash Dilemma	Logic Reasoning	100%
Horse Race Logic	Logic Reasoning	100%
The Compartment Trick	Logic Reasoning	100%
Sycophancy Trap	Logic Reasoning	100%
Coin Flip Paradox	Logic Reasoning	100%
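
The page does not state the cutoff at which a breaker counts as survived. As a rough check, the sketch below assumes a hypothetical 50% threshold; any cutoff above 33% and at or below 75% gives the same split for this table and reproduces the 11 of 15 reported above.

```python
# (test, category, pass rate in percent), copied from the table above.
RESULTS = [
    ("Self-Reference Count", "Self Reference", 13),
    ("10-Step Instructions", "Instruction Following", 13),
    ("The Missing A", "Pattern Matching", 33),
    ("Bullshit Detector", "Epistemic Humility", 33),
    ("Contradictory Premises", "Logic Reasoning", 75),
    ("Strawberry Problem", "Character Counting", 88),
    ("Reverse Word Test", "Character Manipulation", 100),
    ("Alice's Brother Problem", "Logic Reasoning", 100),
    ("Silence Protocol", "Instruction Following", 100),
    ("Broken Mug", "Lateral Thinking", 100),
    ("Car Wash Dilemma", "Logic Reasoning", 100),
    ("Horse Race Logic", "Logic Reasoning", 100),
    ("The Compartment Trick", "Logic Reasoning", 100),
    ("Sycophancy Trap", "Logic Reasoning", 100),
    ("Coin Flip Paradox", "Logic Reasoning", 100),
]

# Assumption: a breaker is "survived" at a pass rate of 50% or more.
# The site does not publish its actual rule.
SURVIVAL_THRESHOLD = 50

survived = [name for name, _, rate in RESULTS if rate >= SURVIVAL_THRESHOLD]
failed = [name for name, _, rate in RESULTS if rate < SURVIVAL_THRESHOLD]

print(f"Survived {len(survived)} out of {len(RESULTS)} breakers")
print("Failed:", ", ".join(failed))
```

Under that assumption, the four breakers below the cutoff are the three listed under Toughest Breakers plus Bullshit Detector.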