Why can the world’s most advanced AI models solve Olympiad-level mathematics but fail to reliably extract a total from an invoice? This isn’t an abstract question—it’s a real-world challenge I’ve confronted for decades.
For twenty years, I’ve built automation software and processed billions of documents for some of the largest enterprises globally. My company’s experience with real enterprise data, not benchmarks, reveals a stark truth: when AI models can’t handle simple tasks, the consequences are immediate and costly.
The conventional response—math is reasoning, invoices are perception, and better models will solve it—is incomplete. Let’s break it down.
How AI Handles Math vs. Real-World Data
At first glance, AI’s ability to solve complex math problems appears to demonstrate reasoning. But competitive mathematics relies on a finite set of proof techniques—perhaps a few hundred—that are repeatedly recombined. A ‘novel’ problem is often just a new arrangement of familiar blocks. Models trained on tens of thousands of proofs excel at remixing these patterns, a process I call composable pattern matching.
Chess presents the opposite challenge. Every serious middlegame position is genuinely novel in a way that matters. Even with deep knowledge of patterns and tactics, predicting whether a sacrifice will succeed requires concrete calculation. Chess engines solved this not by making neural networks larger, but by building systems around them: a learned evaluator embedded in a search that calculates candidate lines concretely.
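To see what "a system around the model" means, here is a minimal sketch under stated assumptions: classical alpha-beta search that only consults an evaluator at the leaf nodes. In engines like Stockfish, that leaf evaluator is a small neural network (NNUE); a plain material count stands in for it here, and the search depth is illustrative. The sketch uses the python-chess library.

```python
# Minimal sketch: search around an evaluator. In a real engine the
# evaluate() below would be a learned (neural) evaluation; a material
# count stands in for it here. Requires: pip install chess
import chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def evaluate(board: chess.Board) -> float:
    """Stand-in for a learned evaluator: material from White's view."""
    score = 0.0
    for piece in board.piece_map().values():
        value = PIECE_VALUES[piece.piece_type]
        score += value if piece.color == chess.WHITE else -value
    return score

def alphabeta(board: chess.Board, depth: int,
              alpha: float, beta: float) -> float:
    # Concrete calculation: the search verifies what the evaluator
    # can only guess about a position.
    if depth == 0 or board.is_game_over():
        return evaluate(board)
    if board.turn == chess.WHITE:
        best = -float("inf")
        for move in board.legal_moves:
            board.push(move)
            best = max(best, alphabeta(board, depth - 1, alpha, beta))
            board.pop()
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # prune: the opponent will avoid this line
        return best
    else:
        best = float("inf")
        for move in board.legal_moves:
            board.push(move)
            best = min(best, alphabeta(board, depth - 1, alpha, beta))
            board.pop()
            beta = min(beta, best)
            if alpha >= beta:
                break
        return best

board = chess.Board()
print(alphabeta(board, depth=3, alpha=-float("inf"), beta=float("inf")))
```

The division of labor is the point: the evaluator supplies pattern knowledge, and the search does the concrete calculation, verifying whether a sacrifice actually works rather than guessing.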
The distinction is critical: most clerical work resembles math, not chess. Claims processing, compliance checks, and loan document reviews apply known rules to new instances. Here, AI can handle 85% to 95% of cases—an impressive feat. But the remaining 5% to 15% is where the real risk lies.
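In practice, that split becomes a routing decision: auto-process whatever clears a confidence bar and queue the rest for a person. Here is a minimal sketch; the threshold, the field names, and the idea of a model-reported confidence score are assumptions for illustration, not a real system.

```python
# Sketch of the 85-95% / 5-15% split as a routing gate: auto-accept
# high-confidence extractions, send the rest to human review.
# The Extraction shape and the 0.98 threshold are illustrative.
from dataclasses import dataclass

@dataclass
class Extraction:
    field: str
    value: str
    confidence: float  # model-reported confidence in [0, 1]

def route(extractions: list[Extraction], threshold: float = 0.98):
    """Split extractions into auto-accepted and human-review queues."""
    auto, review = [], []
    for e in extractions:
        (auto if e.confidence >= threshold else review).append(e)
    return auto, review

batch = [
    Extraction("invoice_total", "1,240.00", 0.99),
    Extraction("po_number", "PO-7731", 0.62),
]
auto, review = route(batch)
print(len(auto), "auto-accepted;", len(review), "sent to a human")
```

The weakness of this gate is exactly the subject of the next section: the confidence score comes from the same model that makes the mistakes.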
The Danger of Overconfident Mistakes
These edge cases aren't statistical noise to be averaged away; they're precisely the scenarios where the pattern breaks. The dangerous part? The model doesn't recognize it's stuck. It delivers a confident answer anyway.
We've spent years testing AI models on document extraction, not on edge cases but on everyday invoices. The task seems simple: read a value, place it in the right field. No reasoning. No judgment. Just extraction. Yet even the best models can't achieve 100% accuracy. An inexperienced human can.
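One way to surface those silent failures is mechanical validation: the arithmetic of an invoice must hold no matter how confident the extractor was. Here is a minimal sketch, assuming a simple subtotal/tax/total schema; the field names are illustrative, not a real pipeline's.

```python
# Sketch of a mechanical check layered on top of model output: an
# invoice's arithmetic must hold regardless of extractor confidence.
# Field names and the exact-match rule are assumptions for illustration.
from decimal import Decimal

def check_invoice(fields: dict[str, str]) -> list[str]:
    """Return a list of violated invariants; empty means it passed."""
    try:
        subtotal = Decimal(fields["subtotal"])
        tax = Decimal(fields["tax"])
        total = Decimal(fields["total"])
    except (KeyError, ArithmeticError) as exc:
        return [f"missing or unparseable field: {exc!r}"]
    errors = []
    if subtotal + tax != total:
        errors.append(f"subtotal + tax = {subtotal + tax}, "
                      f"but total = {total}")
    return errors

# A confidently wrong extraction fails the check and goes to a human.
print(check_invoice({"subtotal": "100.00", "tax": "8.25",
                     "total": "108.52"}))
```

A check like this doesn't make the model more accurate; it makes the model's errors detectable, which is what routes them to a human instead of into a ledger.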
I remember the moment this became undeniable. I assumed our pipeline was flawed. It wasn’t. We tested multiple models. The results were consistent. And that’s when it struck me: you don’t need to reach the hard parts of the process—judgment calls or exceptions—to expose AI’s limitations.