Document AI in Production · Part 2 · April 2026 · 11 min read

I Spent Months Improving OCR Accuracy.
The Real Breaks Were Somewhere Else.

After building DocPipeline — a document extraction system handling PDFs and images across multiple channels — here are three principles that reshaped how I think about document AI. None of them are about the model.

3 principles
15–30 steps per job
95% is not good enough
01

Schema validation is where extraction actually breaks

Common assumption: “If the LLM extracted the field, we have the data.”

When I built the extraction layer in DocPipeline, early results looked promising. Fields were being pulled. Values were appearing. Demos worked.

Then I started looking at what was actually landing in downstream systems.

The LLM was doing its job. It was finding the right fields on the right pages. But what came back was inconsistent in ways that broke everything downstream — not spectacularly, but quietly. The kind of failures that don't show up in accuracy benchmarks because the value was technically found.

Real extraction failures — field present, output broken
Field          | What arrived                      | What was needed
invoice_date   | "15th March 2024"                 | "2024-03-15" (ISO 8601)
total_amount   | "$1,234.00" / "1234" / "1,234"    | 1234.00 (float)
invoice_number | "INV-0042 " (trailing space)      | "INV-0042"
vendor_name    | null (field existed on doc)       | required — flag for review
tax_rate       | "18%" / "0.18" / "18"             | 0.18 (normalised float)

Every one of these failures made it past OCR. Every one of them made it past extraction. They failed silently at the point of use — a broken lookup query, a miscalculated total, a mismatched record in the ERP system.

What I built
The schema validation layer runs immediately after extraction and enforces three things: type coercion (dates become ISO 8601, amounts become floats, strings are trimmed); required field presence (if a field is missing and required, the job is flagged — not silently passed); and confidence thresholds (if the model scored a value below the floor, it goes to human review, not downstream).
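To make that concrete, here is a minimal sketch of what such a layer can look like, assuming extraction returns a value and a confidence score per field. The helper names and schema shape are illustrative, not DocPipeline's actual code.

```python
import re
from datetime import datetime

def coerce_date(raw: str) -> str:
    """Normalise a free-form date string to ISO 8601 (YYYY-MM-DD)."""
    cleaned = re.sub(r"(\d{1,2})(st|nd|rd|th)", r"\1", raw.strip())
    for fmt in ("%d %B %Y", "%B %d %Y", "%d/%m/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(cleaned, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {raw!r}")

def coerce_amount(raw: str) -> float:
    """Strip currency symbols and thousands separators, return a float."""
    return float(re.sub(r"[^\d.\-]", "", raw.strip()))

def validate(extracted: dict, schema: dict) -> tuple[dict, list[str]]:
    """Coerce types, check required fields, enforce per-field confidence floors.

    `extracted` maps field -> {"value": ..., "confidence": ...};
    `schema` maps field -> {"coerce": fn, "required": bool, "min_confidence": float}.
    Returns (clean_values, review_flags).
    """
    clean, flags = {}, []
    for field, rules in schema.items():
        entry = extracted.get(field) or {}
        value, confidence = entry.get("value"), entry.get("confidence", 0.0)
        if value is None:
            if rules["required"]:
                flags.append(f"{field}: required field missing")
            continue
        if confidence < rules["min_confidence"]:
            flags.append(f"{field}: confidence {confidence:.2f} below floor")
            continue
        try:
            clean[field] = rules["coerce"](value)
        except ValueError as exc:
            flags.append(f"{field}: {exc}")
    return clean, flags
```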

The part most teams skip: the schema must be per document type. A purchase order schema is not the same as a utility bill schema. In DocPipeline, each supported document type carries its own schema — field names, types, required flags, format rules, and confidence floor. That schema is the source of truth for what “extracted successfully” actually means.
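For illustration, a per-document-type schema can be as simple as a config mapping each field to its coercion rule, required flag, and confidence floor. The document types, fields, and numbers below are examples wired to the helpers sketched above, not DocPipeline's real config.

```python
# Illustrative per-document-type schemas for the validate() sketch above.
INVOICE_SCHEMA = {
    "invoice_date":   {"coerce": coerce_date,   "required": True,  "min_confidence": 0.80},
    "total_amount":   {"coerce": coerce_amount, "required": True,  "min_confidence": 0.85},
    "invoice_number": {"coerce": str.strip,     "required": True,  "min_confidence": 0.75},
    "vendor_name":    {"coerce": str.strip,     "required": True,  "min_confidence": 0.70},
    "tax_rate":       {"coerce": coerce_amount, "required": False, "min_confidence": 0.70},
}

# A utility bill carries a different schema: different fields, different floors.
UTILITY_BILL_SCHEMA = {
    "account_number": {"coerce": str.strip,     "required": True,  "min_confidence": 0.85},
    "billing_period": {"coerce": str.strip,     "required": True,  "min_confidence": 0.70},
    "amount_due":     {"coerce": coerce_amount, "required": True,  "min_confidence": 0.85},
}

SCHEMAS = {"invoice": INVOICE_SCHEMA, "utility_bill": UTILITY_BILL_SCHEMA}
```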

The test to run: Take five real documents from production. Extract them. Then look not at whether fields were found — but at whether the values were in the exact type and format your downstream system expects. That gap is your schema problem.

The question to ask any document AI vendor: what happens when a field is extracted but in the wrong format? What happens when a required field comes back null? If the answer is “we return whatever the model gave us” — that is not production-ready. That is a demo.

02

Queue design determines your failure recovery story

Common assumption: “The queue is just infrastructure. We'll figure it out later.”

A document processing job is not a single operation. By the time a PDF has been ingested, split into pages, converted to images, OCR'd, extracted, validated, and exported — there are 15 to 30 discrete steps that can each fail independently.

I learned this the hard way. An early version of DocPipeline would fail an entire job if any single page failed. A 20-page invoice with one unreadable page — discarded entirely. No partial results. No retry. No visibility into which page caused the failure. The job was just gone.

The architecture that fixed this was separating the pipeline into two independent queues, each with its own worker.

Two-queue GPU worker architecture
ocr_queue
Handles raw page images. The GPU worker picks up each page independently, runs OCR, and publishes the raw text result back to the queue. If a page fails OCR — bad scan, skewed image, corrupted file — only that page is retried. The rest of the document continues processing. The GPU worker is stateless: it reads a page, returns text, moves on.
ocr_results
Receives completed OCR output page by page. A separate extraction worker picks up each result, runs LLM field extraction, applies schema validation, and writes the structured output to the database. This worker operates independently of the OCR worker — if extraction is slow or a downstream system is down, OCR continues unblocked.

OCR is GPU-bound and fast. Extraction is LLM-bound and slower. Decoupling them means each stage scales and fails independently.
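The shape of that decoupling fits in a few lines. This is a minimal sketch using in-memory queues and threads as stand-ins for the durable queues (Redis, RabbitMQ, SQS, or similar) a production system would use; the worker hooks are stubs, not DocPipeline's real OCR or extraction calls.

```python
import queue
import threading

# In-memory stand-ins for what would be durable queues in production.
ocr_queue: queue.Queue = queue.Queue()
ocr_results: queue.Queue = queue.Queue()

# Placeholder hooks: in a real system these call the OCR engine, the LLM,
# and the database. Stubbed here so the structure is runnable on its own.
def run_ocr(image: bytes) -> str: return "raw page text"
def extract_and_validate(text: str, doc_type: str) -> dict: return {}
def save_page_result(page_id: str, result: dict) -> None: pass
def mark_page_failed(page: dict, stage: str) -> None: pass

def ocr_worker() -> None:
    """Stateless GPU worker: read one page image, emit raw text, move on."""
    while True:
        page = ocr_queue.get()
        try:
            text = run_ocr(page["image"])
            ocr_results.put({**page, "text": text})
        except Exception:
            mark_page_failed(page, stage="ocr")   # only this page retries
        finally:
            ocr_queue.task_done()

def extraction_worker() -> None:
    """Reads OCR output page by page, runs extraction plus schema validation."""
    while True:
        result = ocr_results.get()
        try:
            fields = extract_and_validate(result["text"], result["doc_type"])
            save_page_result(result["page_id"], fields)
        except Exception:
            mark_page_failed(result, stage="extraction")
        finally:
            ocr_results.task_done()

# Each stage scales and fails independently: add OCR workers when pages back
# up, add extraction workers when the LLM is the bottleneck.
threading.Thread(target=ocr_worker, daemon=True).start()
threading.Thread(target=extraction_worker, daemon=True).start()
```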

How a job actually moves through the system
1. PDF arrives via upload, webhook, or cloud folder watch. It is split into individual page images. Each page becomes its own job record in the database with status pending_ocr.

2. Each page image is pushed to ocr_queue. The GPU worker processes pages concurrently. On completion, raw OCR text is written to the database and the page status moves to ocr_complete. The result is published to ocr_results.

3. The extraction worker reads from ocr_results. It runs LLM extraction on the OCR text, applies the document-type schema, scores field confidence, and writes structured output. Page status moves to extraction_complete or review_required.

4. Once all pages in a job reach a terminal status, the job is assembled. Results are exported to the configured output channel — webhook, Google Sheets, Slack, cloud storage, or UI review. Job status moves to complete or partial_complete if some pages failed.

5. Failed pages sit in a dead-letter state with full event history. They can be retried individually, reprocessed with a different OCR engine, or escalated to human review — without touching the pages that already succeeded.
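Underneath those five steps is a small status model. A rough sketch, using the status names from the pipeline above; the record shapes are assumptions, not DocPipeline's actual tables.

```python
from dataclasses import dataclass, field
from enum import Enum

class PageStatus(str, Enum):
    PENDING_OCR = "pending_ocr"
    OCR_COMPLETE = "ocr_complete"
    EXTRACTION_COMPLETE = "extraction_complete"
    REVIEW_REQUIRED = "review_required"
    FAILED = "failed"   # dead-letter state with full event history

TERMINAL = {PageStatus.EXTRACTION_COMPLETE, PageStatus.REVIEW_REQUIRED, PageStatus.FAILED}

@dataclass
class Page:
    page_number: int
    status: PageStatus = PageStatus.PENDING_OCR
    events: list[str] = field(default_factory=list)   # per-page audit trail

@dataclass
class Job:
    job_id: str
    pages: list[Page]

    def assemble(self) -> str:
        """Once every page is terminal, the job completes or partially completes."""
        if any(p.status not in TERMINAL for p in self.pages):
            return "in_progress"
        if any(p.status is PageStatus.FAILED for p in self.pages):
            return "partial_complete"
        return "complete"
```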

Reprocessing triggers
Schema failure

A field failed validation — wrong type, missing required value, confidence below threshold. The page is requeued to ocr_results with an updated prompt or stricter extraction instruction. OCR is not re-run — the raw text is already in the database.

OCR failure

A page image failed OCR — unreadable scan, GPU timeout, engine error. The page is requeued to ocr_queue with a fallback OCR engine (if PaddleOCR failed, retry with Azure Document Intelligence, and vice versa). The rest of the job is unaffected.

Manual trigger

A reviewer corrects a field in the review UI and flags the page for re-extraction with updated context. The correction is logged, the page is requeued, and the audit trail records who triggered the reprocess and why.
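A sketch of how those three triggers map onto requeue targets, reusing the in-memory queue stand-ins from the worker sketch earlier. The dispatcher and the engine keys are illustrative only, not DocPipeline's actual code.

```python
# Illustrative dispatcher for the three reprocessing triggers above.
FALLBACK_ENGINE = {
    "paddleocr": "azure_document_intelligence",
    "azure_document_intelligence": "paddleocr",
}

def reprocess(page: dict, trigger: str, reviewer: str | None = None) -> None:
    if trigger == "schema_failure":
        # Raw OCR text is already in the database: re-run extraction only,
        # with a stricter prompt, by requeuing to ocr_results.
        ocr_results.put({**page, "prompt_variant": "strict"})
    elif trigger == "ocr_failure":
        # Re-run OCR for this page only, on the fallback engine.
        ocr_queue.put({**page, "engine": FALLBACK_ENGINE[page["engine"]]})
    elif trigger == "manual":
        # Reviewer correction: record who triggered the reprocess and why,
        # then re-extract with the corrected context attached.
        page.setdefault("events", []).append(f"reprocess requested by {reviewer}")
        ocr_results.put({**page, "context": page.get("corrections", {})})
```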

A 20-page document where page 7 failed OCR does not become a lost job. Page 7 retries. Pages 1–6 and 8–20 are already done. The job completes as partial_complete with a clear record of which page needs attention — not a silent failure with no trail.

Ask any document AI system: if page 7 of a 20-page document fails extraction, what happens to the other 19 pages? The answer tells you everything about how the system handles production reality.

03

Human-in-the-loop is a design decision, not a fallback — and the industry already knows this

Common assumption: “If the AI is good enough, we won't need humans in the loop.”

This assumption has a hidden cost. But before I share what DocPipeline does, it is worth looking at what the industry has already figured out — because the strategies teams reach for reveal a lot about what the real problem actually is.

HITL is not one thing. There are at least four distinct strategies in active use across document AI, healthcare, finance, and logistics. Each solves a different problem. Picking the wrong one does not just add friction — it adds the wrong kind of friction.

The four HITL strategies the industry uses
Confidence-threshold routing (most common)

The model scores its own output. Anything below a set confidence threshold — typically 0.7 to 0.8 — is routed to a human reviewer. Above threshold, the output passes automatically. This is the dominant pattern in document processing. It is efficient and measurable: you know exactly what percentage of documents are touching a human, and you can tune that percentage by adjusting the threshold.

Used by: Google Document AI, AWS Textract, most IDP vendors
Exception-based routing (for complex docs)

Instead of routing on confidence, the system routes on document characteristics — handwritten content, non-standard layouts, missing required fields, or document types outside the trained distribution. The AI handles clean, structured documents autonomously. Exceptions are defined in advance by the engineering or ops team. Common in logistics and insurance where document variety is high but exception categories are known and stable.

Used by: Logistics platforms, insurance IDP, customs document processing
Risk-based mandatory review (regulated industries)

Certain fields or document types always go to a human — regardless of model confidence. The AI is an accelerant, not a decision-maker. In healthcare, a diagnosis or medication field always requires a clinician sign-off. In finance, any transaction above a value threshold requires analyst review. Confidence is irrelevant — the decision carries enough risk that human accountability is legally or operationally required. The EU AI Act now codifies this in some verticals.

Used by: Healthcare AI, financial compliance, legal document review
Active learning loops (maturing fast)

Human corrections do not just fix the current output — they feed back into model retraining. The system actively selects the most uncertain or highest-value samples for human review, using those corrections to improve the model over time. This is the strategy that closes the gap between production accuracy and benchmark accuracy. It requires infrastructure — a feedback pipeline, annotation tooling, a retraining schedule — but it is the only strategy that makes the system get better the longer it runs.

Used by: Mature ML teams at scale, Scale AI, Labelbox
99.9% extraction accuracy with HITL vs 92% AI-only
90% accuracy increase in loan processing with human oversight
39% of AI bots pulled back or reworked in 2024 due to errors without HITL

DocPipeline handles receipts, purchase orders, utility bills, insurance documents, and expense reports — across customers with different layouts, scan quality, and document conventions. No two vendors format an invoice the same way. No two customers send documents in the same condition.

A risk-based mandatory review approach would be too blunt — it would send everything to a human and defeat the purpose of automation. An active learning loop is the right long-term goal, but it requires labeled-data infrastructure that is not yet justified at this stage. Exception-based routing assumes the exception categories are known in advance — which they are not, because document variety is one of the core problems.

That left confidence-threshold routing as the primary strategy — but with one critical addition that most implementations skip.

What DocPipeline does — confidence routing with field-level evidence
Field-level confidence, not document-level

Most systems flag an entire document for review if confidence is low. DocPipeline flags individual fields. A 20-field invoice where 18 fields extracted cleanly and 2 are uncertain does not block the entire document — only those 2 fields go to review. The reviewer sees exactly which fields need attention and why.
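Field-level routing is a small function, not a big system. A sketch, assuming extraction returns a confidence alongside each value; the thresholds and field names are illustrative.

```python
def route_fields(extracted: dict, thresholds: dict, default: float = 0.75):
    """Split extracted fields into auto-accepted values and review items.

    `extracted` maps field -> {"value": ..., "confidence": ...};
    `thresholds` maps field -> per-field confidence floor.
    """
    accepted, review = {}, {}
    for name, entry in extracted.items():
        floor = thresholds.get(name, default)
        if entry["confidence"] >= floor:
            accepted[name] = entry["value"]
        else:
            review[name] = entry   # reviewer sees value, confidence, and page image
    return accepted, review

# A 20-field invoice with 2 uncertain fields sends only those 2 to review.
accepted, review = route_fields(
    {"total_amount": {"value": "1234.00", "confidence": 0.97},
     "vendor_name":  {"value": "ACME C0rp", "confidence": 0.41}},
    thresholds={"total_amount": 0.85, "vendor_name": 0.70},
)
```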

Page image as evidence

When a field is flagged, the reviewer sees the original page image alongside the extracted value — not just the value in isolation. They can see what the model saw. This turns review from guesswork into confirmation, and reduces review time significantly.

Corrections feed back into the job record

Every reviewer correction is logged against the field, the page, and the document type. This is not yet a full active learning loop — but it is the data layer that makes one possible. The audit trail exists. The corrections are captured. When the volume justifies retraining, the labeled data is already there.
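The record itself does not need to be complicated. A sketch of what each logged correction might carry, with illustrative field names rather than DocPipeline's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Correction:
    job_id: str
    page_number: int
    doc_type: str
    field_name: str
    model_value: str        # what the model extracted
    corrected_value: str    # what the reviewer entered
    model_confidence: float
    reviewer: str
    corrected_at: datetime

def log_correction(record: Correction, store: list) -> None:
    """Append-only audit trail; later, these become labeled training examples."""
    store.append(record)

audit_log: list[Correction] = []
log_correction(Correction("job-17", 3, "invoice", "vendor_name",
                          "ACME C0rp", "ACME Corp", 0.41, "reviewer@example.com",
                          datetime.now(timezone.utc)), audit_log)
```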

Schema-driven thresholds per document type

A confidence floor for a purchase order total is not the same as a confidence floor for a utility bill reference number. Each document type carries its own per-field threshold. A field that is high-risk in one document type can be auto-accepted in another where the consequences of an error are lower.

Teams evaluating document AI often frame the HITL question as: “How often does a human need to get involved?” That is the wrong question. It optimises for minimising human involvement rather than maximising the value of human involvement when it does happen.

Questions that reveal HITL maturity
When a field is flagged for review, does the reviewer see the original document page or just the extracted value?
Are thresholds set at the document level or the field level?
Are reviewer corrections logged and attributed? Is there an audit trail?
Do corrections feed back into anything — or are they one-off fixes that disappear?
Can you tune the confidence threshold per document type, or is it a single global setting?

The reframe that changed how I built this: HITL is not about how often humans intervene. It is about how useful that intervention is when it happens — and whether the system gets smarter because of it. A document AI system that routes 5% of fields to review and captures every correction is more valuable than one that routes 1% and discards them.

“OCR gets the credit. Schema validation, queue design, and human review do the work. The gap between a document AI demo and a document AI product lives in those three layers.”

Naveen Karakavalasa
Engineering Leader · nkspace.dev