Blog"How to Extract Data from a PDF Invoice (Without Building an OCR Pipeline)"

"How to Extract Data from a PDF Invoice (Without Building an OCR Pipeline)"

2026-05-22 · 5 min read

How to Extract Data from a PDF Invoice (Without Building an OCR Pipeline)

You have a PDF invoice. You need the data inside it — vendor name, invoice number, total, line items — in a format your code can use. Here's the fastest path from PDF to structured data.

The Two Approaches (and Why One Is Better)

Approach 1 — Build it yourself: Use pdfplumber, PyMuPDF, or Camelot to extract raw text, then write regex patterns to find the invoice number, dates, and totals. Handle the edge cases when the text extraction fails on scanned PDFs. Write different patterns for different vendor formats. Maintain all of it when vendors update their invoice templates.

This takes 1–3 weeks for a working version, and it never fully works — there's always another vendor format that breaks it.

Approach 2 — Use an API: Send the PDF to an endpoint. Receive structured JSON. Done.

This guide covers Approach 2.

The Request

curl -X POST https://docuparseapi.com/api/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@invoice.pdf"

That's the complete integration. The API returns this:

{
  "success": true,
  "document_type": "invoice",
  "merchant": "Riverside Consulting LLC",
  "invoice_id": "INV-2026-0091",
  "date": "2026-05-01",
  "due_date": "2026-05-31",
  "currency": "USD",
  "subtotal": "3500.00",
  "tax": "350.00",
  "tax_rate": "10%",
  "total": "3850.00",
  "payment_method": null,
  "line_items": [
    {
      "description": "Strategy Consulting — April",
      "quantity": 35,
      "unit_price": "100.00",
      "total": "3500.00"
    }
  ],
  "processing_time_ms": 1050
}

Python Implementation

Get your API key at docuparseapi.com/signup — free, no credit card.

import os
import requests

def extract_invoice(pdf_path: str) -> dict:
    """
    Extract structured data from a PDF invoice.
    Returns a dict with: merchant, invoice_id, date, due_date,
    currency, subtotal, tax, total, line_items
    """
    with open(pdf_path, "rb") as f:
        response = requests.post(
            "https://docuparseapi.com/api/v1/extract",
            headers={"Authorization": f"Bearer {os.environ['DOCUPARSE_API_KEY']}"},
            files={"file": (os.path.basename(pdf_path), f)},
            timeout=30,
        )
    
    response.raise_for_status()
    data = response.json()
    
    if not data.get("success"):
        code = data.get("error", {}).get("code", "UNKNOWN")
        msg = data.get("error", {}).get("message", "")
        raise RuntimeError(f"Extraction failed [{code}]: {msg}")
    
    return data


# Extract an invoice
invoice = extract_invoice("invoice.pdf")

# Use the data
print(f"From:     {invoice['merchant']}")
print(f"Invoice:  {invoice['invoice_id']}")
print(f"Date:     {invoice['date']}")
print(f"Due:      {invoice['due_date']}")
print(f"Subtotal: {invoice['currency']} {invoice['subtotal']}")
print(f"Tax:      {invoice['currency']} {invoice['tax']}")
print(f"Total:    {invoice['currency']} {invoice['total']}")
print()
print("Line items:")
for item in invoice.get("line_items", []):
    qty = item.get("quantity", 1)
    desc = item.get("description", "")
    price = item.get("unit_price", item.get("amount", "?"))
    total = item.get("total", "?")
    print(f"  {qty}× {desc} @ {price} = {total}")

Output:

From:     Riverside Consulting LLC
Invoice:  INV-2026-0091
Date:     2026-05-01
Due:      2026-05-31
Subtotal: USD 3500.00
Tax:      USD 350.00
Total:    USD 3850.00

Line items:
  35× Strategy Consulting — April @ 100.00 = 3500.00

Working With the Extracted Data

Map to a database model

import sqlite3
import json

def store_invoice(conn: sqlite3.Connection, data: dict):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS invoices (
            document_id TEXT PRIMARY KEY,
            vendor TEXT,
            invoice_number TEXT,
            invoice_date TEXT,
            due_date TEXT,
            currency TEXT,
            total REAL,
            tax REAL,
            line_items TEXT
        )
    """)
    
    conn.execute("""
        INSERT OR IGNORE INTO invoices VALUES (?,?,?,?,?,?,?,?,?)
    """, (
        data["document_id"],
        data.get("merchant"),
        data.get("invoice_id"),
        data.get("date"),
        data.get("due_date"),
        data.get("currency"),
        float(data.get("total") or 0),
        float(data.get("tax") or 0),
        json.dumps(data.get("line_items", []))
    ))
    conn.commit()

Push to QuickBooks or Xero

The extracted fields map directly to the bill/invoice fields in accounting APIs:

# QuickBooks bill creation (using quickbooks-online SDK)
def create_quickbooks_bill(invoice_data: dict, vendor_id: str):
    from quickbooks.objects.bill import Bill
    from quickbooks.objects.billline import BillLine
    
    bill = Bill()
    bill.VendorRef = {"value": vendor_id}
    bill.TxnDate = invoice_data["date"]
    bill.DueDate = invoice_data.get("due_date")
    bill.TotalAmt = float(invoice_data["total"])
    
    for item in invoice_data.get("line_items", []):
        line = BillLine()
        line.Amount = float(item.get("total", item.get("amount", 0)))
        line.Description = item.get("description")
        bill.Line.append(line)
    
    return bill.save(qb=client)

Send to a webhook or downstream service

import httpx

async def forward_invoice(invoice_data: dict, webhook_url: str):
    """Forward extracted invoice data to a webhook endpoint."""
    async with httpx.AsyncClient() as client:
        response = await client.post(webhook_url, json={
            "event": "invoice.extracted",
            "invoice": {
                "vendor": invoice_data.get("merchant"),
                "number": invoice_data.get("invoice_id"),
                "date": invoice_data.get("date"),
                "amount": invoice_data.get("total"),
                "currency": invoice_data.get("currency"),
                "items": invoice_data.get("line_items", []),
            }
        })
    return response.status_code == 200

Handling Scanned PDFs

If the invoice was created by scanning a paper document, the PDF contains an image, not machine-readable text. Regular PDF parsers return nothing from these files. DocuParseAPI detects scanned PDFs automatically and applies OCR — your request is identical:

# Works for both digital PDFs and scanned PDFs
invoice = extract_invoice("scanned_supplier_invoice.pdf")

Common Issues and Fixes

Issue: EXTRACTION_FAILED error The document may be too degraded to extract cleanly (heavy compression, very low scan resolution, or a non-invoice document). Try with a higher quality scan, or submit a digital PDF if available.

Issue: Missing fields return null Some invoices don't have due dates; some don't have line items. Check for None before using a field:

due_date = invoice.get("due_date")  # May be None
if due_date:
    # use it

Issue: Total is a string, not a number The API returns monetary values as strings to preserve decimal precision. Convert before arithmetic:

total = float(invoice["total"])

Issue: Line items list is empty Not all invoices have extractable line items — some use image-embedded tables or unusual formats. The total field is always extracted when possible, even if line items are missing.

Pricing

  • Free tier: 20 PDF invoices/month, no credit card required
  • Starter: $14.99/month for 3,000 documents
  • Pro: $22.99/month for 5,000 documents

View full pricing

Next Steps

Ready to start parsing documents?

More from the blog