ReAIty Check
AI is coming for your job — but first, let's see if it can handle edge cases. ReAIty Check started as a reaction to the current AI hype: claims of inevitable dominance, agent-driven development replacing engineers, and the idea that competence is optional. This project tests that belief against reality.
How the site works
The site runs the same non-trivial prompts across popular models and agents, then tracks results over time. The main pages are:
- Home — See live leaderboard, comparison grid, and last run time. Quick view of which models pass or fail which challenges.
- Models — Browse providers and model cards; drill into per-model results.
- Challenges — Catalog of prompt gauntlets sorted by kill rate. Each challenge has a prompt, expected result, and explanation of the trick.
- Benchmarks — Full failure-rate matrix: every challenge × every model. Updated as runs complete.
- Submit Challenge — Propose your own edge case. If it breaks major models, we add it to the gauntlet and credit you.
Results are driven by automated test runs. The leaderboard and grids reflect the latest pass/fail state and kill rates.
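As a rough illustration of how pass/fail results could roll up into kill rates, here is a minimal Python sketch. It assumes a challenge's kill rate is the fraction of tested models that fail it. The site does not spell out the formula, and the record layout, model names, and function name below are hypothetical.

```python
from collections import defaultdict

# Hypothetical run records: (challenge, model, passed).
# The challenge and model names are placeholders, not real results.
results = [
    ("strawberry", "model-a", False),
    ("strawberry", "model-b", True),
    ("alices-brother", "model-a", False),
    ("alices-brother", "model-b", False),
]


def kill_rates(records):
    """Return {challenge: fraction of models that failed it}.

    Assumed definition of "kill rate"; the real site may compute it differently.
    """
    failed = defaultdict(int)
    total = defaultdict(int)
    for challenge, _model, passed in records:
        total[challenge] += 1
        if not passed:
            failed[challenge] += 1
    return {challenge: failed[challenge] / total[challenge] for challenge in total}


rates = kill_rates(results)
# Sort challenges from deadliest to easiest, as a catalog sorted by kill rate might.
ranked = sorted(rates, key=rates.get, reverse=True)
print(ranked)
```

Under this assumption, a challenge that every model fails has a kill rate of 1.0 and sorts to the top of the catalog.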
Methodology
We run the same prompts across many models and track which ones pass or fail. We focus on edge cases and meme-style problems (Strawberry Problem, Alice's Brother, fabricated citations, etc.) — not academic benchmarks. The goal is to surface failures that benchmark averages hide.
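To make the pass/fail idea concrete, here is a minimal sketch of one challenge and a naive grader. The `Challenge` fields mirror the prompt, expected result, and explanation described above. The class name, the prompt wording, and the matching rule are assumptions for illustration, not the site's actual grading code.

```python
import re
from dataclasses import dataclass


@dataclass
class Challenge:
    name: str
    prompt: str
    expected: str      # expected result, as plain text
    explanation: str   # why the prompt tends to trip models


def passes(challenge, reply):
    """Assumed rule: pass if the expected answer appears as a standalone token."""
    pattern = r"(?<!\w)" + re.escape(challenge.expected.lower()) + r"(?!\w)"
    return re.search(pattern, reply.lower()) is not None


# Hypothetical wording of the Strawberry Problem.
strawberry = Challenge(
    name="Strawberry Problem",
    prompt='How many times does the letter "r" appear in "strawberry"?',
    expected="3",
    explanation="A letter-counting question that models are known to miscount.",
)

# Sanity check on the expected result itself.
assert "strawberry".count("r") == int(strawberry.expected)

print(passes(strawberry, "There are 3 r's in strawberry."))  # a correct reply
print(passes(strawberry, "There are 2 r's in strawberry."))  # a wrong reply
```

Token matching this simple will accept a hedged reply such as "2 or 3", so a real grader would likely need stricter answer extraction.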
Tracking AI "dominance"
The site shows where models break. The measure of progress is simple: we keep tracking until agents solve simple edge-case problems at least as reliably as humans do. When that happens, the data will show it. Until then, the results are here, with no fluff and no hype.
This is not a scientific benchmark or a definitive ranking, and it is not an anti-AI statement. AI agents are powerful tools that can increase productivity, but they don't replace competence. Before you trust them with architecture, decisions, or jobs, ReAIty Check lets you see how they behave on simple, tricky problems.
Support this project
ReAIty Check is free and the code is public.
But someone still has to pay the API bills ☕
Your support keeps the daily tests running.
If you find this useful — ☕ buy me a coffee.