How to Extract Data from a PDF Invoice (Without Building an OCR Pipeline)

Blog → How to Extract Data from a PDF Invoice (Without Building an OCR Pipeline)

You have a PDF invoice. You need the data inside it — vendor name, invoice number, total, line items — in a format your code can use. Here's the fastest path from PDF to structured data.

API call

~3s

processing time

regex needed

Free

to start

The Problem

The Two Approaches (and Why One Is Better)

Approach 1 — Build it yourself: Use pdfplumber, PyMuPDF, or Camelot to extract raw text, then write regex patterns to find the invoice number, dates, and totals. Handle the edge cases when the text extraction fails on scanned PDFs. Write different patterns for different vendor formats. Maintain all of it when vendors update their invoice templates.

This takes 1–3 weeks for a working version, and it never fully works — there's always another vendor format that breaks it.

Approach 2 — Use an API: Send the PDF to an endpoint. Receive structured JSON. Done.

This guide covers Approach 2.

Build it yourself — weeks of work

import pdfplumber
import re

# regex for every vendor format
# breaks when template changes
# no scanned PDF support
# 300+ lines of brittle code

→API call

Use an API — one afternoon

response = requests.post(
  "https://docuparseapi.com/api/v1/extract",
  files={"file": f}
)
data = response.json()
# merchant, total, tax, date
# line_items — all ready

The build-vs-buy decision for invoice parsing.

Quick Start

The Request

bash · 3 lines

curl -X POST https://docuparseapi.com/api/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@invoice.pdf"

That's the complete integration. The API returns this:

response — invoice extracted✓ success · ~3s

merchant"Riverside Consulting LLC"

invoice_id"INV-2026-0091"

date"2026-05-01"

due_date"2026-05-31"

total"3850.00"

tax"350.00"

currency"USD"

line_items[{ description, quantity, unit_price, total }]

✓ Every field named and typed. No parsing, no regex, no bounding boxes.

json · 23 lines

{
  "success": true,
  "document_type": "invoice",
  "merchant": "Riverside Consulting LLC",
  "invoice_id": "INV-2026-0091",
  "date": "2026-05-01",
  "due_date": "2026-05-31",
  "currency": "USD",
  "subtotal": "3500.00",
  "tax": "350.00",
  "tax_rate": "10%",
  "total": "3850.00",
  "payment_method": null,
  "line_items": [
    {
      "description": "Strategy Consulting — April",
      "quantity": 35,
      "unit_price": "100.00",
      "total": "3500.00"
    }
  ],
  "processing_time_ms": 2980
}

Python

Python Implementation

Get your API key at docuparseapi.com/signup — free, no credit card.

python · 47 lines

import os
import requests

def extract_invoice(pdf_path: str) -> dict:
    """
    Extract structured data from a PDF invoice.
    Returns a dict with: merchant, invoice_id, date, due_date,
    currency, subtotal, tax, total, line_items
    """
    with open(pdf_path, "rb") as f:
        response = requests.post(
            "https://docuparseapi.com/api/v1/extract",
            headers={"Authorization": f"Bearer {os.environ['DOCUPARSE_API_KEY']}"},
            files={"file": (os.path.basename(pdf_path), f)},
            timeout=30,
        )
    
    response.raise_for_status()
    data = response.json()
    
    if not data.get("success"):
        code = data.get("error", {}).get("code", "UNKNOWN")
        msg = data.get("error", {}).get("message", "")
        raise RuntimeError(f"Extraction failed [{code}]: {msg}")
    
    return data


# Extract an invoice
invoice = extract_invoice("invoice.pdf")

# Use the data
print(f"From:     {invoice['merchant']}")
print(f"Invoice:  {invoice['invoice_id']}")
print(f"Date:     {invoice['date']}")
print(f"Due:      {invoice['due_date']}")
print(f"Subtotal: {invoice['currency']} {invoice['subtotal']}")
print(f"Tax:      {invoice['currency']} {invoice['tax']}")
print(f"Total:    {invoice['currency']} {invoice['total']}")
print()
print("Line items:")
for item in invoice.get("line_items", []):
    qty = item.get("quantity", 1)
    desc = item.get("description", "")
    price = item.get("unit_price", item.get("amount", "?"))
    total = item.get("total", "?")
    print(f"  {qty}× {desc} @ {price} = {total}")

Output:

text · 10 lines

From:     Riverside Consulting LLC
Invoice:  INV-2026-0091
Date:     2026-05-01
Due:      2026-05-31
Subtotal: USD 3500.00
Tax:      USD 350.00
Total:    USD 3850.00

Line items:
  35× Strategy Consulting — April @ 100.00 = 3500.00

This code is ready to run.

Get your free API key — 20 documents/month — free forever, no credit card, no expiry.

Get Free API Key →Try Without Signing Up

Next Steps

Working With the Extracted Data

Map to a database model

python · 32 lines

import sqlite3
import json

def store_invoice(conn: sqlite3.Connection, data: dict):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS invoices (
            document_id TEXT PRIMARY KEY,
            vendor TEXT,
            invoice_number TEXT,
            invoice_date TEXT,
            due_date TEXT,
            currency TEXT,
            total REAL,
            tax REAL,
            line_items TEXT
        )
    """)
    
    conn.execute("""
        INSERT OR IGNORE INTO invoices VALUES (?,?,?,?,?,?,?,?,?)
    """, (
        data["document_id"],
        data.get("merchant"),
        data.get("invoice_id"),
        data.get("date"),
        data.get("due_date"),
        data.get("currency"),
        float(data.get("total") or 0),
        float(data.get("tax") or 0),
        json.dumps(data.get("line_items", []))
    ))
    conn.commit()

Push to QuickBooks or Xero

The extracted fields map directly to the bill/invoice fields in accounting APIs:

python · 18 lines

# QuickBooks bill creation (using quickbooks-online SDK)
def create_quickbooks_bill(invoice_data: dict, vendor_id: str):
    from quickbooks.objects.bill import Bill
    from quickbooks.objects.billline import BillLine
    
    bill = Bill()
    bill.VendorRef = {"value": vendor_id}
    bill.TxnDate = invoice_data["date"]
    bill.DueDate = invoice_data.get("due_date")
    bill.TotalAmt = float(invoice_data["total"])
    
    for item in invoice_data.get("line_items", []):
        line = BillLine()
        line.Amount = float(item.get("total", item.get("amount", 0)))
        line.Description = item.get("description")
        bill.Line.append(line)
    
    return bill.save(qb=client)

Send to a webhook or downstream service

python · 17 lines

import httpx

async def forward_invoice(invoice_data: dict, webhook_url: str):
    """Forward extracted invoice data to a webhook endpoint."""
    async with httpx.AsyncClient() as client:
        response = await client.post(webhook_url, json={
            "event": "invoice.extracted",
            "invoice": {
                "vendor": invoice_data.get("merchant"),
                "number": invoice_data.get("invoice_id"),
                "date": invoice_data.get("date"),
                "amount": invoice_data.get("total"),
                "currency": invoice_data.get("currency"),
                "items": invoice_data.get("line_items", []),
            }
        })
    return response.status_code == 200

Handling Scanned PDFs

If the invoice was created by scanning a paper document, the PDF contains an image, not machine-readable text. Regular PDF parsers return nothing from these files. DocuParseAPI detects scanned PDFs automatically and applies OCR — your request is identical:

python · 2 lines

# Works for both digital PDFs and scanned PDFs
invoice = extract_invoice("scanned_supplier_invoice.pdf")

Common Issues and Fixes

Issue: EXTRACTION_FAILED error The document may be too degraded to extract cleanly (heavy compression, very low scan resolution, or a non-invoice document). Try with a higher quality scan, or submit a digital PDF if available.

Issue: Missing fields return null Some invoices don't have due dates; some don't have line items. Check for None before using a field:

python · 3 lines

due_date = invoice.get("due_date")  # May be None
if due_date:
    # use it

Issue: Total is a string, not a number The API returns monetary values as strings to preserve decimal precision. Convert before arithmetic:

python · 1 line

total = float(invoice["total"])

Issue: Line items list is empty Not all invoices have extractable line items — some use image-embedded tables or unusual formats. The total field is always extracted when possible, even if line items are missing.

Pricing

Free tier: 20 PDF invoices/month, no credit card required
Starter: $14.99/month for 3,000 documents
Pro: $22.99/month for 5,000 documents

View full pricing