BlogHow to Parse Invoices in Python Using an API (2026 Guide)

How to Parse Invoices in Python Using an API (2026 Guide)

2026-05-19 · 6 min read

10
lines of code
~3s
extraction time
0
regex needed
Free
20 documents/month — free forever

Every developer building a finance app eventually hits the same afternoon: you need to extract structured data from PDF invoices, and what looks like a two-hour task turns into two weeks of fighting PDF parsers, OCR libraries, and regex patterns that break the moment a vendor changes their template.

This guide shows you a faster path. You'll have working invoice extraction in Python in under 10 minutes, returning clean JSON with every financial field already named and normalized.

Raw text from PDF
ACME CORP
Invoice No: INV-0042
Date: 05/10/2026
Cloud Server x3  $1,200.00
Total: $3,600.00
API call
DocuParseAPI — named JSON fields
{
  "merchant": "Acme Corp",
  "invoice_id": "INV-0042",
  "date": "2026-05-10",
  "total": "3600.00",
  "line_items": [...]
}

Same document. Regex approach takes weeks. API approach takes one call.

Prerequisites

What You'll Need

  • Python 3.8+
  • requests library (pip install requests)
  • A DocuParseAPI key — get one free (20 documents/month, no credit card)
  • A PDF invoice to test with
Quick Start

The One-Call Pattern

DocuParseAPI works as a single POST request. You send a file; you receive structured JSON. There's no pipeline to configure, no template to define per vendor, no model to train.

python · 44 lines
import os
import requests

def parse_invoice(file_path: str) -> dict:
    """
    Parse an invoice PDF and return structured JSON data.
    
    Args:
        file_path: Path to the PDF, JPG, or PNG invoice file
        
    Returns:
        dict with fields: merchant, total, subtotal, tax, date, 
                          due_date, invoice_id, currency, line_items
    """
    api_key = os.environ["DOCUPARSE_API_KEY"]
    
    with open(file_path, "rb") as f:
        response = requests.post(
            "https://docuparseapi.com/api/v1/extract",
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": (os.path.basename(file_path), f)},
        )
    
    response.raise_for_status()
    data = response.json()
    
    if not data.get("success"):
        raise RuntimeError(
            f"Extraction failed: [{data['error']['code']}] {data['error']['message']}"
        )
    
    return data

# Usage
result = parse_invoice("invoice.pdf")

print(f"Vendor: {result['merchant']}")
print(f"Total:  {result['total']} {result['currency']}")
print(f"Tax:    {result['tax']}")
print(f"Date:   {result['date']}")
print(f"Due:    {result['due_date']}")

for item in result.get("line_items", []):
    print(f"  - {item['description']}: {item['quantity']} × {item['unit_price']}")

Set your API key as an environment variable before running:

bash · 2 lines
export DOCUPARSE_API_KEY="dex_your_key_here"
python parse_invoice.py
Don't have an invoice handy? Try it on a sample.
Upload any PDF invoice or receipt. See the exact JSON your code would receive. No signup required.
Open Live Demo →
what your code looks like after this tutorial
result = parse_invoice("invoice.pdf")
print(result['merchant']) # "Acme Supplies Ltd"
print(result['total']) # "1320.00"
Free tier: 20 documents/month — free forever · No credit card · No account needed for the demo
API Response

What the Response Looks Like

json response — invoice extracted✓ success · 2.98s
merchant"Acme Supplies Ltd"Vendor name
invoice_id"INV-2026-0042"Invoice number
date"2026-05-10"ISO 8601
due_date"2026-06-10"Payment due
total"1320.00"Final amount
tax"120.00"Tax charged
currency"USD"ISO 4217
line_items[{ description, qty, unit_price, total }]Item array
✓ Named fields. Normalized values. Access data['merchant'] directly — no parsing.
json · 24 lines
{
  "success": true,
  "document_id": "doc_clx7abc123",
  "document_type": "invoice",
  "merchant": "Acme Supplies Ltd",
  "date": "2026-05-10",
  "due_date": "2026-06-10",
  "currency": "USD",
  "subtotal": "1200.00",
  "tax": "120.00",
  "tax_rate": "10%",
  "total": "1320.00",
  "invoice_id": "INV-2026-0042",
  "payment_method": null,
  "line_items": [
    {
      "description": "Cloud Server - Monthly",
      "quantity": 3,
      "unit_price": "400.00",
      "total": "1200.00"
    }
  ],
  "processing_time_ms": 2980
}

Every field is already named, typed, and normalized. No bounding boxes. No confidence scores to interpret. No post-processing required.

Batch Processing

Processing Multiple Invoices in Batch

For processing a folder of invoices — a common accounts payable use case — here's a clean batch pattern:

python · 61 lines
import os
import json
import requests
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed

API_KEY = os.environ["DOCUPARSE_API_KEY"]
BASE_URL = "https://docuparseapi.com/api/v1/extract"

def extract_single(file_path: Path) -> dict:
    """Extract one invoice. Returns result dict or error dict."""
    try:
        with open(file_path, "rb") as f:
            response = requests.post(
                BASE_URL,
                headers={"Authorization": f"Bearer {API_KEY}"},
                files={"file": (file_path.name, f)},
                timeout=30,
            )
        data = response.json()
        return {"file": file_path.name, "status": "ok", "data": data}
    except Exception as e:
        return {"file": file_path.name, "status": "error", "error": str(e)}


def batch_extract(folder_path: str, max_workers: int = 5) -> list[dict]:
    """
    Extract all invoices in a folder concurrently.
    
    Args:
        folder_path: Directory containing PDF/JPG/PNG invoices
        max_workers: Concurrent requests (keep ≤ 10 to stay within rate limits)
    """
    invoice_dir = Path(folder_path)
    files = list(invoice_dir.glob("*.pdf")) + list(invoice_dir.glob("*.jpg")) + list(invoice_dir.glob("*.png"))
    
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(extract_single, f): f for f in files}
        for future in as_completed(futures):
            result = future.result()
            results.append(result)
            status = "✓" if result["status"] == "ok" else "✗"
            print(f"{status} {result['file']}")
    
    return results


# Run batch extraction
results = batch_extract("./invoices/")

# Summarize
successful = [r for r in results if r["status"] == "ok" and r["data"].get("success")]
failed = [r for r in results if r["status"] == "error" or not r["data"].get("success")]

print(f"\nExtracted: {len(successful)}/{len(results)}")
print(f"Failed:    {len(failed)}")

# Save results
with open("invoice_results.json", "w") as f:
    json.dump(results, f, indent=2)
Ready to run this on your invoices?
Get your API key in 60 seconds. 20 documents/month — free forever, no credit card, no expiry.
Error Handling

Error Handling: What Can Go Wrong

DocuParseAPI returns typed error codes so your error handling is explicit:

python · 40 lines
def parse_invoice_safe(file_path: str) -> dict | None:
    """Returns result dict, or None on unrecoverable error."""
    try:
        api_key = os.environ.get("DOCUPARSE_API_KEY")
        if not api_key:
            raise EnvironmentError("DOCUPARSE_API_KEY not set")
        
        with open(file_path, "rb") as f:
            response = requests.post(
                "https://docuparseapi.com/api/v1/extract",
                headers={"Authorization": f"Bearer {api_key}"},
                files={"file": (os.path.basename(file_path), f)},
                timeout=30,
            )
        
        data = response.json()
        
        if not data.get("success"):
            code = data.get("error", {}).get("code", "UNKNOWN")
            
            if code == "LIMIT_EXCEEDED":
                print("Monthly document limit reached. Upgrade at docuparseapi.com/pricing")
            elif code == "UNSUPPORTED_FILE_TYPE":
                print(f"File type not supported: {file_path}. Use PDF, JPG, or PNG.")
            elif code == "FILE_TOO_LARGE":
                print(f"File exceeds 10MB: {file_path}")
            elif code == "EXTRACTION_FAILED":
                print(f"Extraction failed for {file_path}. Try a cleaner scan.")
            else:
                print(f"API error [{code}]: {data['error']['message']}")
            return None
        
        return data
        
    except requests.exceptions.Timeout:
        print(f"Request timed out for {file_path}")
        return None
    except FileNotFoundError:
        print(f"File not found: {file_path}")
        return None
Database Storage

Storing Extracted Invoice Data

Once you have the structured JSON, storing it is straightforward. Here's a pattern for SQLite — easily adapted to PostgreSQL or any other database:

python · 54 lines
import sqlite3
import json
from datetime import datetime

def store_invoice(conn: sqlite3.Connection, result: dict) -> int:
    """Insert extracted invoice data into the database."""
    cursor = conn.cursor()
    
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS invoices (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            document_id TEXT UNIQUE,
            merchant TEXT,
            invoice_id TEXT,
            date TEXT,
            due_date TEXT,
            currency TEXT,
            subtotal REAL,
            tax REAL,
            total REAL,
            line_items TEXT,  -- JSON string
            created_at TEXT
        )
    """)
    
    cursor.execute("""
        INSERT OR IGNORE INTO invoices 
        (document_id, merchant, invoice_id, date, due_date, 
         currency, subtotal, tax, total, line_items, created_at)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, (
        result.get("document_id"),
        result.get("merchant"),
        result.get("invoice_id"),
        result.get("date"),
        result.get("due_date"),
        result.get("currency"),
        float(result.get("subtotal") or 0),
        float(result.get("tax") or 0),
        float(result.get("total") or 0),
        json.dumps(result.get("line_items", [])),
        datetime.utcnow().isoformat(),
    ))
    
    conn.commit()
    return cursor.lastrowid


# Usage
conn = sqlite3.connect("invoices.db")
result = parse_invoice("invoice.pdf")
if result:
    row_id = store_invoice(conn, result)
    print(f"Stored invoice #{row_id}: {result['merchant']} — {result['total']} {result['currency']}")
Async / FastAPI

Using an Async Client (httpx)

For async Python applications (FastAPI, aiohttp, etc.):

python · 41 lines
import os
import httpx

async def parse_invoice_async(file_path: str) -> dict:
    """Async version using httpx."""
    api_key = os.environ["DOCUPARSE_API_KEY"]
    
    async with httpx.AsyncClient(timeout=30) as client:
        with open(file_path, "rb") as f:
            response = await client.post(
                "https://docuparseapi.com/api/v1/extract",
                headers={"Authorization": f"Bearer {api_key}"},
                files={"file": (os.path.basename(file_path), f.read())},
            )
    
    response.raise_for_status()
    data = response.json()
    
    if not data.get("success"):
        raise RuntimeError(f"Extraction failed: {data['error']['code']}")
    
    return data


# In a FastAPI route
from fastapi import FastAPI, UploadFile
app = FastAPI()

@app.post("/invoices/parse")
async def parse_uploaded_invoice(file: UploadFile):
    import tempfile, shutil
    
    # Save upload to temp file
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
        shutil.copyfileobj(file.file, tmp)
        tmp_path = tmp.name
    
    result = await parse_invoice_async(tmp_path)
    os.unlink(tmp_path)
    
    return {"invoice": result}
Watch Out

Common Mistakes to Avoid

Never put your API key in source code. Always use environment variables or a secrets manager. The key prefix dex_ makes it easy for secret scanners to catch accidental commits.

Don't call the API from browser JavaScript. The API is designed for server-side use. If you need browser-side invoice upload, build a backend route that proxies the request (like the FastAPI example above).

Handle None fields gracefully. Not every invoice has a due date. Not every receipt has an invoice ID. Check for None before parsing or storing.

Use document_id for deduplication. Every successful extraction returns a unique document_id. Store it and check before re-processing to avoid counting the same document twice.

What's Next

Next Steps

Where Python developers typically send their extracted invoice data:
🐍pip install requests → done

Your Python environment is ready.
Your API key isn't.

20 documents/month — free forever. No credit card. No expiry.
The code in this guide works immediately after you get your key.

✓ Python 3.8+✓ requests or httpx✓ sync + async✓ any PDF format

More from the blog