We benchmarked 18 LLMs on OCR (7k+ calls): cheaper and older models often win. Full dataset + framework open-sourced. [R]
TL;DR: We were overpaying for OCR, so we compared flagship models against cheaper and older ones. New mini-bench + leaderboard, plus a free tool to test your own documents. All open source.
We’ve been looking at OCR / document extraction workflows and kept seeing the same pattern:
Too many teams are either stuck in legacy OCR pipelines or badly overpaying for LLM calls by defaulting to the newest/biggest model.
We put together a curated set of 42 standard documents and ran every model 10 times per document under identical conditions: 18 models × 42 documents × 10 runs = 7,560 total calls. Main takeaway: for standard OCR, smaller and older models match premium accuracy at a fraction of the cost.
We track pass^n (reliability at scale), cost-per-success, latency, and critical field accuracy.
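For anyone who wants to reproduce two of these numbers on their own logs, here is a minimal sketch. It rests on assumptions not spelled out in this post: pass^n is taken to mean the fraction of documents for which all n repeated runs succeeded, and cost-per-success is total spend divided by the number of successful calls. The `Run` record and both function names are hypothetical, and the repo's actual definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One model call on one document (hypothetical record layout)."""
    doc_id: str      # which document was processed
    success: bool    # whether the run counted as a pass
    cost_usd: float  # what the call cost


def pass_hat_n(runs: list[Run]) -> float:
    """Fraction of documents for which every recorded run succeeded."""
    by_doc: dict[str, list[bool]] = {}
    for r in runs:
        by_doc.setdefault(r.doc_id, []).append(r.success)
    if not by_doc:
        return 0.0
    return sum(all(results) for results in by_doc.values()) / len(by_doc)


def cost_per_success(runs: list[Run]) -> float:
    """Total spend (failed calls included) divided by successful calls."""
    total = sum(r.cost_usd for r in runs)
    wins = sum(r.success for r in runs)
    return total / wins if wins else float("inf")


# Made-up sample: document "a" passes both runs, document "b" fails once.
sample_runs = [
    Run("a", True, 0.01),
    Run("a", True, 0.01),
    Run("b", True, 0.01),
    Run("b", False, 0.01),
]
print(pass_hat_n(sample_runs))        # 0.5
print(cost_per_success(sample_runs))
```

Under these definitions a single flaky run sinks the whole document, which is why a model with high average accuracy can still score poorly on pass^n, and failed calls still count toward the cost side of cost-per-success.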
Everything is open source: https://github.com/ArbitrHq/ocr-mini-bench
Leaderboard: https://arbitrhq.ai/leaderboards/
Curious whether this matches what others here are seeing.