How to Extract Data from a PDF Invoice (Without Building an OCR Pipeline)
You have a PDF invoice. You need the data inside it — vendor name, invoice number, total, line items — in a format your code can use. Here's the fastest path from PDF to structured data.
The Two Approaches (and Why One Is Better)
Approach 1 — Build it yourself: Use pdfplumber, PyMuPDF, or Camelot to extract raw text, then write regex patterns to find the invoice number, dates, and totals. Handle the edge cases when the text extraction fails on scanned PDFs. Write different patterns for different vendor formats. Maintain all of it when vendors update their invoice templates.
This takes 1–3 weeks for a working version, and it never fully works — there's always another vendor format that breaks it.
Approach 2 — Use an API: Send the PDF to an endpoint. Receive structured JSON. Done.
This guide covers Approach 2.
The Request
curl -X POST https://docuparseapi.com/api/v1/extract \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@invoice.pdf"
That's the complete integration. The API returns this:
{
"success": true,
"document_type": "invoice",
"merchant": "Riverside Consulting LLC",
"invoice_id": "INV-2026-0091",
"date": "2026-05-01",
"due_date": "2026-05-31",
"currency": "USD",
"subtotal": "3500.00",
"tax": "350.00",
"tax_rate": "10%",
"total": "3850.00",
"payment_method": null,
"line_items": [
{
"description": "Strategy Consulting — April",
"quantity": 35,
"unit_price": "100.00",
"total": "3500.00"
}
],
"processing_time_ms": 1050
}
Python Implementation
Get your API key at docuparseapi.com/signup — free, no credit card.
import os
import requests
def extract_invoice(pdf_path: str) -> dict:
"""
Extract structured data from a PDF invoice.
Returns a dict with: merchant, invoice_id, date, due_date,
currency, subtotal, tax, total, line_items
"""
with open(pdf_path, "rb") as f:
response = requests.post(
"https://docuparseapi.com/api/v1/extract",
headers={"Authorization": f"Bearer {os.environ['DOCUPARSE_API_KEY']}"},
files={"file": (os.path.basename(pdf_path), f)},
timeout=30,
)
response.raise_for_status()
data = response.json()
if not data.get("success"):
code = data.get("error", {}).get("code", "UNKNOWN")
msg = data.get("error", {}).get("message", "")
raise RuntimeError(f"Extraction failed [{code}]: {msg}")
return data
# Extract an invoice
invoice = extract_invoice("invoice.pdf")
# Use the data
print(f"From: {invoice['merchant']}")
print(f"Invoice: {invoice['invoice_id']}")
print(f"Date: {invoice['date']}")
print(f"Due: {invoice['due_date']}")
print(f"Subtotal: {invoice['currency']} {invoice['subtotal']}")
print(f"Tax: {invoice['currency']} {invoice['tax']}")
print(f"Total: {invoice['currency']} {invoice['total']}")
print()
print("Line items:")
for item in invoice.get("line_items", []):
qty = item.get("quantity", 1)
desc = item.get("description", "")
price = item.get("unit_price", item.get("amount", "?"))
total = item.get("total", "?")
print(f" {qty}× {desc} @ {price} = {total}")
Output:
From: Riverside Consulting LLC
Invoice: INV-2026-0091
Date: 2026-05-01
Due: 2026-05-31
Subtotal: USD 3500.00
Tax: USD 350.00
Total: USD 3850.00
Line items:
35× Strategy Consulting — April @ 100.00 = 3500.00
Working With the Extracted Data
Map to a database model
import sqlite3
import json
def store_invoice(conn: sqlite3.Connection, data: dict):
conn.execute("""
CREATE TABLE IF NOT EXISTS invoices (
document_id TEXT PRIMARY KEY,
vendor TEXT,
invoice_number TEXT,
invoice_date TEXT,
due_date TEXT,
currency TEXT,
total REAL,
tax REAL,
line_items TEXT
)
""")
conn.execute("""
INSERT OR IGNORE INTO invoices VALUES (?,?,?,?,?,?,?,?,?)
""", (
data["document_id"],
data.get("merchant"),
data.get("invoice_id"),
data.get("date"),
data.get("due_date"),
data.get("currency"),
float(data.get("total") or 0),
float(data.get("tax") or 0),
json.dumps(data.get("line_items", []))
))
conn.commit()
Push to QuickBooks or Xero
The extracted fields map directly to the bill/invoice fields in accounting APIs:
# QuickBooks bill creation (using quickbooks-online SDK)
def create_quickbooks_bill(invoice_data: dict, vendor_id: str):
from quickbooks.objects.bill import Bill
from quickbooks.objects.billline import BillLine
bill = Bill()
bill.VendorRef = {"value": vendor_id}
bill.TxnDate = invoice_data["date"]
bill.DueDate = invoice_data.get("due_date")
bill.TotalAmt = float(invoice_data["total"])
for item in invoice_data.get("line_items", []):
line = BillLine()
line.Amount = float(item.get("total", item.get("amount", 0)))
line.Description = item.get("description")
bill.Line.append(line)
return bill.save(qb=client)
Send to a webhook or downstream service
import httpx
async def forward_invoice(invoice_data: dict, webhook_url: str):
"""Forward extracted invoice data to a webhook endpoint."""
async with httpx.AsyncClient() as client:
response = await client.post(webhook_url, json={
"event": "invoice.extracted",
"invoice": {
"vendor": invoice_data.get("merchant"),
"number": invoice_data.get("invoice_id"),
"date": invoice_data.get("date"),
"amount": invoice_data.get("total"),
"currency": invoice_data.get("currency"),
"items": invoice_data.get("line_items", []),
}
})
return response.status_code == 200
Handling Scanned PDFs
If the invoice was created by scanning a paper document, the PDF contains an image, not machine-readable text. Regular PDF parsers return nothing from these files. DocuParseAPI detects scanned PDFs automatically and applies OCR — your request is identical:
# Works for both digital PDFs and scanned PDFs
invoice = extract_invoice("scanned_supplier_invoice.pdf")
Common Issues and Fixes
Issue: EXTRACTION_FAILED error
The document may be too degraded to extract cleanly (heavy compression, very low scan resolution, or a non-invoice document). Try with a higher quality scan, or submit a digital PDF if available.
Issue: Missing fields return null
Some invoices don't have due dates; some don't have line items. Check for None before using a field:
due_date = invoice.get("due_date") # May be None
if due_date:
# use it
Issue: Total is a string, not a number The API returns monetary values as strings to preserve decimal precision. Convert before arithmetic:
total = float(invoice["total"])
Issue: Line items list is empty
Not all invoices have extractable line items — some use image-embedded tables or unusual formats. The total field is always extracted when possible, even if line items are missing.
Pricing
- Free tier: 20 PDF invoices/month, no credit card required
- Starter: $14.99/month for 3,000 documents
- Pro: $22.99/month for 5,000 documents