Research · 2026

DocPipeline

DocPipeline is a cloud-ready document extraction platform that lets users upload PDFs or images, run OCR and field extraction, and receive structured outputs such as JSON, CSV, or text. It is designed as a modern full-stack pipeline with a React/Next.js frontend, Python backend services, pluggable OCR, async job processing, storage, authentication, and usage-based billing.

Document AI · OCR · Research Prototype

How the automation works

01

Upload or Drop

Drag a PDF or image into the secure upload area, paste a URL, or route files automatically from a watched folder or API endpoint.

02

OCR + Extraction

Each page is run through OCR, and an AI model then extracts structured fields according to your document schema. Results are validated before they leave the engine.

03

Output Delivery

Structured data is pushed directly to Google Sheets, Slack, Teams, or a webhook endpoint, or delivered as JSON or CSV files, so your existing workflow stays unchanged.
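
The validation mentioned in step 02 could look roughly like the sketch below, which checks extracted fields against a per-document schema before anything is delivered. The `RECEIPT_SCHEMA` mapping and the `validate_fields` helper are illustrative assumptions, not DocPipeline's actual schema or API.

```python
from datetime import date

# Hypothetical schema for a receipt: field name -> expected Python type.
RECEIPT_SCHEMA = {"merchant": str, "total": float, "date": date}


def validate_fields(fields: dict, schema: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for name, expected in schema.items():
        value = fields.get(name)
        if value is None:
            errors.append(f"{name}: missing")
        elif not isinstance(value, expected):
            errors.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


good = {"merchant": "ACME", "total": 12.5, "date": date(2026, 1, 5)}
bad = {"merchant": "ACME", "total": "12.50"}
print(validate_fields(good, RECEIPT_SCHEMA))
print(validate_fields(bad, RECEIPT_SCHEMA))
```

Returning a list of problems rather than raising on the first one lets a review UI show every failing field at once.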

Supported document types

Receipts, Utility Bills, Purchase Orders, Shipping / Packing Slips, Quote / Estimate, Service Invoice, Product Catalog, Expense Report

Integrations

Input channels

Direct Upload

Drag and drop a PDF or image into the app — processing starts immediately.

Webhook / API

POST a file to the ingest endpoint from any system or automation pipeline.

OneDrive Watch

Connect a OneDrive folder and any file dropped there is picked up automatically.

Google Drive Watch

Monitor a Drive folder — new uploads trigger extraction without any manual step.
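
As a rough illustration of the Webhook / API channel above, the sketch below builds a POST request that carries a file to an ingest endpoint. The URL, the bearer-token scheme, and the `X-Filename` header are all assumptions made for the example; this page does not document the real endpoint contract.

```python
import urllib.request

# Hypothetical endpoint; the real ingest URL is not given on this page.
INGEST_URL = "https://example.invalid/api/ingest"


def build_ingest_request(
    file_bytes: bytes,
    filename: str,
    api_key: str,
    content_type: str = "application/pdf",
) -> urllib.request.Request:
    """Build (but do not send) a POST request with the raw file as the body."""
    headers = {
        "Content-Type": content_type,
        "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        "X-Filename": filename,                # assumed header name
    }
    return urllib.request.Request(
        INGEST_URL, data=file_bytes, headers=headers, method="POST"
    )


request = build_ingest_request(b"%PDF-1.7 ...", "invoice.pdf", "demo-key")
# Sending would be: urllib.request.urlopen(request, timeout=30)
```

The request is built separately from sending so the same helper can be reused from scripts, schedulers, or other automation.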

Output channels

Google Sheets

Extracted fields land as new rows in any spreadsheet automatically.

Slack

Post a structured summary to any channel the moment extraction completes.

Microsoft Teams

Deliver results as a Teams message with key fields highlighted.

Webhook

POST structured JSON to any endpoint — connect any downstream system.

Email

Send extracted data as an email body or CSV attachment to any address.

JSON / CSV / TXT

Download clean output files directly from the results view.
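
A minimal sketch of how extracted rows might be serialized to the JSON and CSV download formats, using only the Python standard library. The field names and the `to_json` / `to_csv` helpers are hypothetical and not taken from the DocPipeline codebase.

```python
import csv
import io
import json


def to_json(rows: list[dict]) -> str:
    """Serialize extracted rows as pretty-printed JSON."""
    return json.dumps(rows, indent=2)


def to_csv(rows: list[dict]) -> str:
    """Serialize rows as CSV; columns follow first-seen field order."""
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # fields missing from a row are written as empty cells
    return buffer.getvalue()


rows = [
    {"merchant": "ACME", "total": 12.5},
    {"merchant": "Globex", "tax": 1.0},
]
print(to_csv(rows))
```

Deriving the header from the union of keys keeps the CSV rectangular even when different pages yield different fields.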

How it works

Frontend

Built with Next.js 14 + TypeScript, the frontend handles document upload, document type selection, job status polling, result viewing, history, authentication, and billing. It acts as the user-facing control plane for the platform.
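
The job status polling described above follows a common pattern: ask for the status, stop on a terminal state, otherwise wait and retry with a growing delay. The sketch below shows that pattern in Python for consistency with the other examples on this page; the actual frontend is TypeScript, and the status names and `poll_job` helper are assumptions.

```python
import time
from typing import Callable

TERMINAL_STATES = {"done", "failed"}  # assumed status names


def poll_job(
    fetch_status: Callable[[], str],
    interval: float = 1.0,
    max_interval: float = 10.0,
    timeout: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal state, doubling the delay each time."""
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout:
            raise TimeoutError(f"job still '{status}' after {waited:.0f}s")
        sleep(interval)
        waited += interval
        interval = min(interval * 2, max_interval)


# Usage with a fake status source and a recording sleep function.
statuses = iter(["queued", "processing", "done"])
delays: list[float] = []
final = poll_job(lambda: next(statuses), sleep=delays.append)
```

Injecting `fetch_status` and `sleep` keeps the loop testable without a network or real waiting.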

API / Backend

The backend is written in Python and runs on Azure Functions, exposing endpoints for upload, job creation, status tracking, results retrieval, downloads, and billing. This keeps the platform serverless-friendly and easy to deploy on Azure while still supporting local development.

Document processing

When a file is uploaded, the system validates it, checks credits, detects protected PDFs, splits multi-page PDFs into page images, stores assets, and queues each page for processing. OCR is performed per page, then extraction logic converts raw text into structured document fields.
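
The intake steps above can be sketched as a single `submit` function that rejects bad uploads and then queues one work item per page. This is a simplified model under stated assumptions: the size limit, the allowed types, the one-credit-per-page rule, and the `page_count` / `encrypted` fields (stand-ins for real PDF inspection) are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field

MAX_BYTES = 20 * 1024 * 1024  # assumed upload limit
ALLOWED_TYPES = {"application/pdf", "image/png", "image/jpeg"}  # assumed


class RejectedUpload(Exception):
    """Raised when an upload fails a pre-processing check."""


@dataclass
class Upload:
    user_id: str
    content_type: str
    data: bytes
    page_count: int = 1      # stand-in for real PDF page counting
    encrypted: bool = False  # stand-in for real password detection


@dataclass
class Intake:
    credits: dict
    queue: deque = field(default_factory=deque)

    def submit(self, job_id: str, upload: Upload) -> int:
        """Validate the upload, then queue one (job_id, page_no) item per page."""
        if upload.content_type not in ALLOWED_TYPES:
            raise RejectedUpload("unsupported file type")
        if len(upload.data) > MAX_BYTES:
            raise RejectedUpload("file too large")
        # Assumption: each page costs one credit.
        if self.credits.get(upload.user_id, 0) < upload.page_count:
            raise RejectedUpload("insufficient credits")
        if upload.encrypted:
            raise RejectedUpload("password-protected PDF")
        for page_no in range(1, upload.page_count + 1):
            self.queue.append((job_id, page_no))
        return upload.page_count


intake = Intake(credits={"u1": 3})
queued = intake.submit("job-1", Upload("u1", "application/pdf", b"%PDF", page_count=3))
```

Queuing per page rather than per document lets workers process pages in parallel and retry a single failed page.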

OCR and extraction

The OCR layer is pluggable. The system can switch between OCR providers without changing the core business logic. Extracted text is passed through typed field extraction logic so each document type returns a predictable schema.
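
One way to get this kind of pluggability is to have extraction code depend on a small interface rather than on a specific OCR vendor. The sketch below uses a `typing.Protocol` for the provider and a dataclass for the typed result; `OcrProvider`, `FakeOcr`, `ReceiptFields`, and the regex-based extraction are illustrative assumptions, not the project's real classes or extraction logic.

```python
import re
from dataclasses import dataclass
from typing import Optional, Protocol


class OcrProvider(Protocol):
    """Anything with a matching recognize() method can act as an OCR backend."""

    def recognize(self, page_image: bytes) -> str: ...


class FakeOcr:
    """Stand-in provider returning canned text; a real one would call an OCR engine."""

    def __init__(self, text: str) -> None:
        self._text = text

    def recognize(self, page_image: bytes) -> str:
        return self._text


@dataclass
class ReceiptFields:
    merchant: Optional[str]
    total: Optional[float]


# Matches "Total: $12.50" or "TOTAL 12.50"; \b keeps "Subtotal" from matching.
TOTAL_RE = re.compile(r"\btotal\b\s*:?\s*\$?(\d+(?:\.\d{2})?)", re.IGNORECASE)


def extract_receipt(ocr: OcrProvider, page_image: bytes) -> ReceiptFields:
    """Run OCR through whichever provider is supplied, then shape typed fields."""
    text = ocr.recognize(page_image)
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    match = TOTAL_RE.search(text)
    return ReceiptFields(
        merchant=lines[0] if lines else None,  # naive: first non-empty line
        total=float(match.group(1)) if match else None,
    )


sample = "ACME Hardware\nSubtotal 11.00\nTOTAL: $12.50"
fields = extract_receipt(FakeOcr(sample), b"")
```

Swapping providers then means passing a different object to `extract_receipt`; the extraction logic and the returned schema stay the same.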

Storage and data

The platform stores original files, page images, and output artifacts in Blob Storage, while job state, results, billing, and transactions are stored in PostgreSQL in cloud deployments or SQLite in local mode.
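
For the local SQLite mode, job state could be tracked with two small tables, one for jobs and one for pages. The schema, column names, and status values below are assumptions chosen for illustration; the actual DocPipeline schema is not shown on this page.

```python
import sqlite3

# Hypothetical schema: one row per job, one row per page of that job.
SCHEMA = """
CREATE TABLE jobs (
    id         TEXT PRIMARY KEY,
    status     TEXT NOT NULL DEFAULT 'queued',
    page_count INTEGER NOT NULL
);
CREATE TABLE pages (
    job_id      TEXT NOT NULL REFERENCES jobs(id),
    page_no     INTEGER NOT NULL,
    status      TEXT NOT NULL DEFAULT 'queued',
    result_json TEXT,
    PRIMARY KEY (job_id, page_no)
);
"""


def open_local_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def create_job(conn: sqlite3.Connection, job_id: str, page_count: int) -> None:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO jobs (id, page_count) VALUES (?, ?)", (job_id, page_count)
        )
        conn.executemany(
            "INSERT INTO pages (job_id, page_no) VALUES (?, ?)",
            [(job_id, n) for n in range(1, page_count + 1)],
        )


def complete_page(conn: sqlite3.Connection, job_id: str, page_no: int, result_json: str) -> None:
    """Store a page result and mark the job done once no pages remain."""
    with conn:
        conn.execute(
            "UPDATE pages SET status = 'done', result_json = ? "
            "WHERE job_id = ? AND page_no = ?",
            (result_json, job_id, page_no),
        )
        (remaining,) = conn.execute(
            "SELECT COUNT(*) FROM pages WHERE job_id = ? AND status != 'done'",
            (job_id,),
        ).fetchone()
        if remaining == 0:
            conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))


def job_status(conn: sqlite3.Connection, job_id: str) -> str:
    (status,) = conn.execute(
        "SELECT status FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()
    return status
```

A short usage example: after `create_job(conn, "j1", 2)`, the job stays `queued` until `complete_page` has been called for both pages, at which point `job_status` returns `done`.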

Auth and billing

Authentication uses Google OAuth via NextAuth, with backend token validation for secure access. Billing is implemented with a credit-based Stripe workflow, including free starter credits, credit balance enforcement, and webhook-based payment handling.
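
Credit enforcement and webhook-based top-ups can be modelled with a small ledger, sketched below. It deliberately avoids any real Stripe API calls: the starter amount, the `CreditLedger` class, and the event IDs are assumptions. The duplicate-event check reflects the general practice of making payment webhook handlers idempotent, since a provider may deliver the same event more than once.

```python
STARTER_CREDITS = 10  # assumed free starter allowance


class InsufficientCredits(Exception):
    """Raised when a user tries to spend more credits than they hold."""


class CreditLedger:
    def __init__(self) -> None:
        self._balances: dict = {}
        self._seen_events: set = set()

    def balance(self, user_id: str) -> int:
        # New users start with the free starter credits.
        return self._balances.setdefault(user_id, STARTER_CREDITS)

    def charge(self, user_id: str, pages: int) -> None:
        """Deduct credits before processing; refuse if the balance is too low."""
        if self.balance(user_id) < pages:
            raise InsufficientCredits(f"{user_id} needs {pages} credits")
        self._balances[user_id] -= pages

    def apply_payment_event(self, event_id: str, user_id: str, credits: int) -> bool:
        """Apply a payment webhook once; repeated deliveries are ignored."""
        if event_id in self._seen_events:
            return False
        self._seen_events.add(event_id)
        self._balances[user_id] = self.balance(user_id) + credits
        return True


ledger = CreditLedger()
ledger.charge("u1", 4)                          # 10 starter credits -> 6 left
ledger.apply_payment_event("evt_1", "u1", 50)   # 6 -> 56
ledger.apply_payment_event("evt_1", "u1", 50)   # duplicate delivery, ignored
```

In a real deployment the balance and the set of processed event IDs would live in the database so that the check survives restarts.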

Architecture

                                      ┌──────────────────────────────┐
                                      │         Input Channels       │
                                      └──────────────┬───────────────┘
                                                     │
          ┌─────────────────────────────┬────────────┼───────────────┬─────────────────────────────┐
          │                             │            │               │                             │
          ▼                             ▼            ▼               ▼                             ▼
┌──────────────────┐         ┌────────────────┐  ┌──────────────┐  ┌────────────────┐   ┌─────────────────┐
│ User Upload UI   │         │ Webhook Input  │  │ Google Drive │  │ OneDrive Watch │   │ Future Sources  │
│ drag/drop PDF or │         │ external apps  │  │ folder watch │  │ folder watch   │   │ email/API/etc.  │
│ image            │         │ send docs in   │  │ ingest docs  │  │ ingest docs    │   │                 │
└────────┬─────────┘         └───────┬────────┘  └──────┬───────┘  └──────┬─────────┘   └────────┬────────┘
         │                           │                  │                  │                       │
         └───────────────────────────┴──────────────────┴──────────────────┴───────────────────────┘
                                                      │
                                                      ▼
                                      ┌──────────────────────────────┐
                                      │   Ingestion / API Layer      │
                                      │ Next.js + Python Functions   │
                                      └──────────────┬───────────────┘
                                                     │
                                                     ▼
                                      ┌──────────────────────────────┐
                                      │ Job Creation / Validation    │
                                      │ - auth / source validation   │
                                      │ - file type / size checks    │
                                      │ - credit checks              │
                                      │ - PDF password detection     │
                                      │ - document type selection or │
                                      │   auto-detect                │
                                      └──────────────┬───────────────┘
                                                     │
                                                     ▼
                                      ┌──────────────────────────────┐
                                      │   Preprocessing Layer        │
                                      │ - split PDFs into pages      │
                                      │ - convert pages to images    │
                                      │ - store original + page imgs │
                                      └──────────────┬───────────────┘
                                                     │
                                                     ▼
                                      ┌──────────────────────────────┐
                                      │ Queue / Orchestration Layer  │
                                      │ local queue or Service Bus   │
                                      └──────────────┬───────────────┘
                                                     │
                                                     ▼
                                      ┌──────────────────────────────┐
                                      │ OCR + Extraction Pipeline    │
                                      │ - OCR engine                 │
                                      │ - field extraction           │
                                      │ - normalization / validation │
                                      │ - schema shaping             │
                                      └──────────────┬───────────────┘
                                                     │
                          ┌──────────────────────────┼──────────────────────────┐
                          │                          │                          │
                          ▼                          ▼                          ▼
              ┌────────────────────┐     ┌────────────────────┐     ┌────────────────────┐
              │ Blob / File Store  │     │ Database           │     │ Export / Delivery  │
              │ originals, pages,  │     │ jobs, pages,       │     │ UI download,       │
              │ outputs            │     │ results, events,   │     │ webhook push,      │
              │                    │     │ credits            │     │ downstream systems │
              └────────────────────┘     └────────────────────┘     └────────────────────┘
                                                     │
                                                     ▼
                                      ┌──────────────────────────────┐
                                      │ Results & History UI         │
                                      │ review, export, audit trail  │
                                      └──────────────┬───────────────┘
                                                     │
                                                     ▼
                                      ┌──────────────────────────────┐
                                      │       Output Channels        │
                                      └──────────────┬───────────────┘
                                                     │
       ┌─────────────────────┬──────────────────────┬┴─────────────────────┬─────────────────────┐
       │                     │                      │                      │                     │
       ▼                     ▼                      ▼                      ▼                     ▼
┌─────────────────┐  ┌───────────────┐  ┌────────────────────┐  ┌───────────────┐  ┌─────────────────┐
│ User UI Review  │  │ Webhook Push  │  │ Google Sheets /    │  │ Slack / Teams │  │ Cloud Storage   │
│ - field-level   │  │ - automation  │  │ Excel              │  │ - alerts      │  │ - Google Drive  │
│   view          │  │               │  │ - structured       │  │ - notif-      │  │ - OneDrive      │
│ - page evidence │  │               │  │   export           │  │   ications    │  │ - Dropbox       │
└─────────────────┘  └───────────────┘  └────────────────────┘  └───────────────┘  └─────────────────┘

┌─────────────────────────────────────┐       ┌──────────────────────────────────────┐
│ Email Delivery                      │       │ Manual Export / Future Integrations  │
│ - send results as email body        │       │ - JSON / CSV / TXT download          │
│ - attachments                       │       │ - QuickBooks / ERP / APIs            │
└─────────────────────────────────────┘       └──────────────────────────────────────┘