I Spent Months Improving OCR Accuracy.
The Real Breaks Were Somewhere Else.
After building DocPipeline — a document extraction system handling PDFs and images across multiple channels — here are three principles that reshaped how I think about document AI. None of them are about the model.
Schema validation is where extraction actually breaks
When I built the extraction layer in DocPipeline, early results looked promising. Fields were being pulled. Values were appearing. Demos worked.
Then I started looking at what was actually landing in downstream systems.
The LLM was doing its job. It was finding the right fields on the right pages. But what came back was inconsistent in ways that broke everything downstream: wrong types, wrong formats, required fields coming back null. Not spectacular failures, but quiet ones. The kind that don't show up in accuracy benchmarks because the value was technically found.
Every one of these failures made it past OCR. Every one of them made it past extraction. They failed silently at the point of use — a broken lookup query, a miscalculated total, a mismatched record in the ERP system.
The part most teams skip: the schema must be per document type. A purchase order schema is not the same as a utility bill schema. In DocPipeline, each supported document type carries its own schema — field names, types, required flags, format rules, and confidence floor. That schema is the source of truth for what “extracted successfully” actually means.
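To make that concrete, here is a minimal sketch of what a per-document-type schema can look like. The FieldSpec structure and the specific fields are illustrative assumptions, not DocPipeline's actual definitions:

```python
from dataclasses import dataclass

@dataclass
class FieldSpec:
    name: str
    type: type             # the exact Python type the downstream system expects
    required: bool
    pattern: str | None    # format rule, e.g. an ISO 8601 shape for dates
    min_confidence: float  # confidence floor below which the field goes to review

# One schema per document type: a purchase order is not a utility bill.
PURCHASE_ORDER_SCHEMA = [
    FieldSpec("po_number", str, required=True,
              pattern=r"^PO-\d{6}$", min_confidence=0.90),
    FieldSpec("order_date", str, required=True,
              pattern=r"^\d{4}-\d{2}-\d{2}$", min_confidence=0.85),
    FieldSpec("total", float, required=True,
              pattern=None, min_confidence=0.95),
]
```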
The test to run: Take five real documents from production. Extract them. Then look not at whether fields were found — but at whether the values were in the exact type and format your downstream system expects. That gap is your schema problem.
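As a sketch of that test, building on the schema above: run each extracted value through its spec and count type and format mismatches, not missing fields. The audit function and the sample values are mine, for illustration:

```python
import re

def audit(extracted: dict, schema: list[FieldSpec]) -> list[str]:
    """Report values that were found but arrive in the wrong type or format."""
    problems = []
    for spec in schema:
        value = extracted.get(spec.name)
        if value is None:
            if spec.required:
                problems.append(f"{spec.name}: required but null")
            continue
        if not isinstance(value, spec.type):
            problems.append(f"{spec.name}: expected {spec.type.__name__}, "
                            f"got {type(value).__name__}")
        elif spec.pattern and not re.fullmatch(spec.pattern, str(value)):
            problems.append(f"{spec.name}: {value!r} fails the format rule")
    return problems

# A document where every value was "found" but two still break downstream:
extracted = {"po_number": "PO-123456", "order_date": "03/04/2024", "total": "1,234.56"}
print(audit(extracted, PURCHASE_ORDER_SCHEMA))
# order_date fails the format rule; total is a string where a float is expected
```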
The question to ask any document AI vendor: what happens when a field is extracted but in the wrong format? What happens when a required field comes back null? If the answer is “we return whatever the model gave us” — that is not production-ready. That is a demo.
Queue design determines your failure recovery story
A document processing job is not a single operation. By the time a PDF has been ingested, split into pages, converted to images, OCR'd, extracted, validated, and exported — there are 15 to 30 discrete steps that can each fail independently.
I learned this the hard way. An early version of DocPipeline would fail an entire job if any single page failed. A 20-page invoice with one unreadable page — discarded entirely. No partial results. No retry. No visibility into which page caused the failure. The job was just gone.
The architecture that fixed this was separating the pipeline into two independent queues, each with its own worker.
OCR is GPU-bound and fast. Extraction is LLM-bound and slower. Decoupling them means each stage scales and fails independently.
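A minimal sketch of that split, using Redis lists as the queues. The queue names ocr_queue and ocr_results come from the requeue paths described below; the worker loops and the run_ocr, run_extraction, and store_results calls are placeholders, not DocPipeline's code:

```python
import json
import redis

r = redis.Redis()

def ocr_worker():
    """GPU-bound stage: pull page images, push raw text to the extraction queue."""
    while True:
        _, raw = r.blpop("ocr_queue")         # blocks until a page job arrives
        job = json.loads(raw)
        text = run_ocr(job["image_path"])     # placeholder for the OCR engine call
        r.rpush("ocr_results", json.dumps({**job, "text": text}))

def extraction_worker():
    """LLM-bound stage: pull OCR text, extract fields, validate against the schema."""
    while True:
        _, raw = r.blpop("ocr_results")
        job = json.loads(raw)
        fields = run_extraction(job["text"])  # placeholder for the LLM call
        store_results(job, fields)            # placeholder: persist and validate
```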
Three requeue paths fall out of this design; a sketch of the routing logic follows below.

A field failed validation: wrong type, missing required value, confidence below threshold. The page is requeued to ocr_results with an updated prompt or a stricter extraction instruction. OCR is not re-run, because the raw text is already in the database.

A page image failed OCR: unreadable scan, GPU timeout, engine error. The page is requeued to ocr_queue with a fallback OCR engine (if PaddleOCR failed, retry with Azure Document Intelligence, and vice versa). The rest of the job is unaffected.

A reviewer corrects a field in the review UI and flags the page for re-extraction with updated context. The correction is logged, the page is requeued, and the audit trail records who triggered the reprocess and why.
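Here is the routing logic behind those three paths, continuing the Redis sketch above. The failure labels and the log_audit helper are my assumptions; only the two queue names come from the design itself:

```python
def handle_page_failure(job: dict, failure: str) -> None:
    """Route a failed page back to the right queue instead of killing the job."""
    if failure == "validation":
        # Raw OCR text is already in the database, so skip straight to
        # re-extraction with a stricter instruction.
        job["prompt_hint"] = "strict"
        r.rpush("ocr_results", json.dumps(job))
    elif failure == "ocr":
        # Swap engines and retry OCR; the other pages are untouched.
        job["engine"] = "azure" if job.get("engine") == "paddle" else "paddle"
        r.rpush("ocr_queue", json.dumps(job))
    elif failure == "review":
        # Reviewer-triggered reprocess: the audit trail records who and why.
        log_audit(job["reviewer"], job["page"], reason="manual re-extraction")
        r.rpush("ocr_results", json.dumps(job))
```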
A 20-page document where page 7 failed OCR does not become a lost job. Page 7 retries. Pages 1–6 and 8–20 are already done. The job completes as partial_complete with a clear record of which page needs attention — not a silent failure with no trail.
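Computing that job status is then a simple fold over per-page states. The state names other than partial_complete are assumptions:

```python
def job_status(pages: list[dict]) -> str:
    """A job is never just 'failed': it is complete, partial, or still running."""
    done = sum(1 for p in pages if p["state"] == "done")
    failed = sum(1 for p in pages if p["state"] == "failed")
    if done == len(pages):
        return "complete"
    if failed and done + failed == len(pages):
        # Finished, with a clear record of which pages still need attention.
        return "partial_complete"
    return "in_progress"
```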
Ask any document AI system: if page 7 of a 20-page document fails extraction, what happens to the other 19 pages? The answer tells you everything about how the system handles production reality.
Human-in-the-loop is a design decision, not a fallback — and the industry already knows this
Most teams treat human review as a fallback, a safety net bolted on for when the model fails. That assumption has a hidden cost. But before I share what DocPipeline does, it is worth looking at what the industry has already figured out, because the strategies teams reach for reveal a lot about what the real problem actually is.
HITL is not one thing. There are at least four distinct strategies in active use across document AI, healthcare, finance, and logistics. Each solves a different problem. Picking the wrong one does not just add friction — it adds the wrong kind of friction.
DocPipeline handles receipts, purchase orders, utility bills, insurance documents, and expense reports — across customers with different layouts, scan quality, and document conventions. No two vendors format an invoice the same way. No two customers send documents in the same condition.
A risk-based mandatory review approach would be too blunt — it would send everything to a human and defeat the purpose of automation. An active learning loop is the right long-term goal but requires labeled data infrastructure not yet justified at this stage. Exception-based routing assumes the exception categories are known in advance — which they are not, because document variety is one of the core problems.
That left confidence-threshold routing as the primary strategy — but with one critical addition that most implementations skip.
Teams evaluating document AI often frame the HITL question as: “How often does a human need to get involved?” That is the wrong question. It optimises for minimising human involvement rather than maximising the value of human involvement when it does happen.
The reframe that changed how I built this: HITL is not about how often humans intervene. It is about how useful that intervention is when it happens — and whether the system gets smarter because of it. A document AI system that routes 5% of fields to review and captures every correction is more valuable than one that routes 1% and discards them.
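Here is a minimal sketch of confidence-threshold routing with that addition: every correction is captured rather than discarded. The threshold value, the request_human_review call, and the storage are all illustrative:

```python
REVIEW_THRESHOLD = 0.85  # illustrative; in practice, tune per field, not globally

corrections: list[dict] = []  # in production: a table, not an in-memory list

def route_field(name: str, value, confidence: float):
    """Auto-accept confident fields; send the rest to review and keep every fix."""
    if confidence >= REVIEW_THRESHOLD:
        return value
    corrected = request_human_review(name, value)  # placeholder for the review UI
    if corrected != value:
        # The critical addition: capture the correction so the system gets
        # smarter from it (prompt tweaks, threshold tuning, future training data).
        corrections.append({"field": name, "model_value": value,
                            "human_value": corrected, "confidence": confidence})
    return corrected
```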
“OCR gets the credit. Schema validation, queue design, and human review do the work. The gap between a document AI demo and a document AI product lives in those three layers.”