PDF to JSON API — Convert Business Documents to Structured Data

PDF is the universal format for business documents — invoices, receipts, purchase orders, contracts, statements. The problem is that PDFs are designed to be read by humans, not processed by code. A PDF isn't structured data; it's a visual layout. To use the information inside it, you need to extract it.

This guide explains how PDF-to-JSON APIs work, when to use them, and how to extract structured data from business PDFs with a single API call.

How It Works

What a PDF to JSON API Does

Real document → real output

PDF input: any format, any vendor
JSON output: named fields, ready to use (response in ~3 seconds)

{
  "merchant": "Grove Market",
  "date": "2026-05-14",
  "total": "37.59",
  "tax": "2.86",
  "currency": "USD",
  "receipt_id": "GVR-20260514-0391",
  "payment_method": "Visa **** 3812",
  "line_items": [{ ... }]
}

No parsing. No regex. Access data['merchant'] directly.

Any invoice or receipt format → structured JSON in ~3 seconds. No templates. No vendor setup.

PDF — designed for humans

Invoice
Riverside Consulting LLC
INV-2026-0091  May 1, 2026

Strategy Consulting — April
35hrs × $100.00 = $3,500.00

Subtotal: $3,500.00
Tax (10%): $350.00
Total: $3,850.00

→ API call

Structured JSON — ready for code

{
  "merchant": "Riverside Consulting LLC",
  "invoice_id": "INV-2026-0091",
  "date": "2026-05-01",
  "total": "3850.00",
  "tax": "350.00",
  "currency": "USD"
}

PDFs are visual layouts. One API call turns them into structured data.

A PDF-to-JSON API accepts a PDF file and returns its data as a structured JSON object with named fields. Instead of:

text · 10 lines

INVOICE
Acme Corp — Invoice #INV-2026-0042
Date: May 10, 2026
Due: June 10, 2026

Cloud Server - Monthly    3    $400.00    $1,200.00
                                          
Subtotal: $1,200.00
Tax (10%): $120.00
Total Due: $1,320.00

You get:

json · 21 lines

{
  "success": true,
  "document_type": "invoice",
  "merchant": "Acme Corp",
  "invoice_id": "INV-2026-0042",
  "date": "2026-05-10",
  "due_date": "2026-06-10",
  "subtotal": "1200.00",
  "tax": "120.00",
  "tax_rate": "10%",
  "total": "1320.00",
  "currency": "USD",
  "line_items": [
    {
      "description": "Cloud Server - Monthly",
      "quantity": 3,
      "unit_price": "400.00",
      "total": "1200.00"
    }
  ]
}

The JSON is directly usable in your application — no text parsing, no regex, no interpretation required.
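One practical detail: in the sample response above, amounts and dates arrive as strings. Here is a minimal sketch of converting them before doing arithmetic, assuming the field names shown in that sample. The helper name to_typed is illustrative and not part of the API.

```python
from datetime import date
from decimal import Decimal

# Field names follow the sample response above; to_typed is a hypothetical helper.
MONEY_FIELDS = ("subtotal", "tax", "total")
DATE_FIELDS = ("date", "due_date")

def to_typed(doc: dict) -> dict:
    """Return a copy of doc with amount strings as Decimal and ISO dates as date."""
    typed = dict(doc)
    for key in MONEY_FIELDS:
        if typed.get(key) is not None:
            typed[key] = Decimal(typed[key])
    for key in DATE_FIELDS:
        if typed.get(key) is not None:
            typed[key] = date.fromisoformat(typed[key])
    return typed

sample = {
    "subtotal": "1200.00",
    "tax": "120.00",
    "total": "1320.00",
    "date": "2026-05-10",
    "due_date": "2026-06-10",
}
invoice = to_typed(sample)
# Decimal keeps cents exact, which float arithmetic does not guarantee.
assert invoice["subtotal"] + invoice["tax"] == invoice["total"]
print((invoice["due_date"] - invoice["date"]).days)  # days between invoice and due date
```

Decimal is used here instead of float so that sums of currency values compare exactly.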

The One-Request Pattern

DocuParseAPI accepts PDFs (and JPG, PNG, CSV files) via a single multipart POST request:

bash · 3 lines

curl -X POST https://docuparseapi.com/api/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@invoice.pdf"

That's the complete integration. The API handles:

PDF text extraction
Scanned PDF OCR
Field identification and normalization
Currency and date normalization
Line item parsing

You don't configure any of this. You send the file; you receive the structured JSON.

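In Python, the same request looks like this:

import os
import requests

API_URL = "https://docuparseapi.com/api/v1/extract"

def pdf_to_json(pdf_path: str) -> dict:
    """Upload one PDF and return the extracted fields as a dict."""
    with open(pdf_path, "rb") as f:
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {os.environ['DOCUPARSE_API_KEY']}"},
            files={"file": f},  # sent as multipart/form-data
            timeout=30,  # fail instead of hanging on a stalled connection
        )
    data = response.json()
    # The response carries a "success" flag; failures include an error message.
    if not data.get("success"):
        raise RuntimeError(data.get("error", {}).get("message", "Extraction failed"))
    return data

result = pdf_to_json("invoice.pdf")
print(f"Invoice {result['invoice_id']}: {result['total']} {result['currency']}")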

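The code above is ready to run once the requests package is installed and your key is set in the DOCUPARSE_API_KEY environment variable.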

Use Cases

Accounts payable automation: Suppliers email PDF invoices. Your system downloads the attachments, calls the API, and pushes extracted fields into your ERP or accounting software, with no human data entry.

Expense management: Employees submit PDF receipts. Your expense app calls the API and pre-fills the expense form with merchant, amount, date, and line items.

Bookkeeping: An accountant handling clients' supplier invoices calls the API for each PDF and writes the extracted data to QuickBooks or Xero, which saves 10–15 minutes per invoice.

No-code automation: Connect the API to n8n, Make, or Zapier. When an email arrives, the extracted data appears in your spreadsheet.

Audit and compliance: Legal and finance teams need to process large batches of PDF invoices. Batch-calling the API and storing the JSON creates a searchable, queryable record of every invoice, with no manual transcription.

Line item analytics: Once invoices are structured data, you can analyze spending by vendor, category, or line item. None of that analysis is possible while the data is locked in PDFs.
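As a rough illustration of the analytics case, the sketch below totals spend per vendor from a list of extracted documents. It assumes the response shape shown earlier (success, merchant, total) and that all documents share one currency. The function name and the sample data are made up for the example.

```python
from collections import defaultdict
from decimal import Decimal

def spend_by_vendor(documents: list[dict]) -> dict[str, Decimal]:
    """Sum the "total" field per merchant, skipping failed extractions."""
    totals: dict[str, Decimal] = defaultdict(Decimal)  # Decimal() starts at 0
    for doc in documents:
        if not doc.get("success") or doc.get("total") is None:
            continue
        totals[doc.get("merchant", "Unknown")] += Decimal(doc["total"])
    return dict(totals)

# Hand-written sample data in the shape of the responses shown earlier.
docs = [
    {"success": True, "merchant": "Acme Corp", "total": "1320.00"},
    {"success": True, "merchant": "Acme Corp", "total": "80.00"},
    {"success": True, "merchant": "Grove Market", "total": "37.59"},
    {"success": False, "error": {"code": "UNKNOWN"}},
]
for vendor, amount in sorted(spend_by_vendor(docs).items()):
    print(f"{vendor}: {amount}")
```

If your documents span several currencies, group by the currency field as well before summing.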

OCR Support

Handling Scanned PDFs

Not all PDFs contain machine-readable text. Scanned PDFs — created by scanning a paper document — are images embedded in a PDF wrapper. A regular PDF text extraction library returns nothing from these.

DocuParseAPI handles scanned PDFs automatically. The extraction pipeline detects whether a PDF is text-based or image-based and applies OCR when needed. You don't change anything in your request:

bash · 4 lines

# Same request — works for both digital PDFs and scanned PDFs
curl -X POST https://docuparseapi.com/api/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@scanned_invoice.pdf"

The response format is identical regardless of the source document type.

Processing a Folder of PDFs

For batch processing — converting an entire inbox of PDF invoices to JSON at once:

python · 43 lines

import os
import json
import requests
from pathlib import Path

def batch_pdf_to_json(folder: str, output_file: str = "results.json"):
    api_key = os.environ["DOCUPARSE_API_KEY"]
    pdfs = list(Path(folder).glob("*.pdf"))
    results = []
    
    print(f"Processing {len(pdfs)} PDFs...")
    
    for i, pdf_path in enumerate(pdfs, 1):
        print(f"[{i}/{len(pdfs)}] {pdf_path.name}", end=" ")
        
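        try:
            with open(pdf_path, "rb") as f:
                response = requests.post(
                    "https://docuparseapi.com/api/v1/extract",
                    headers={"Authorization": f"Bearer {api_key}"},
                    files={"file": f},
                    timeout=30,
                )
            data = response.json()
        except (requests.RequestException, ValueError) as exc:
            # A network failure or non-JSON body should not abort the whole batch.
            # REQUEST_FAILED is a local marker, not an API error code.
            data = {"success": False, "error": {"code": "REQUEST_FAILED", "message": str(exc)}}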
        results.append({
            "file": pdf_path.name,
            "data": data,
            "success": data.get("success", False)
        })
        
        if data.get("success"):
            print(f"✓ {data.get('merchant', 'Unknown')} — {data.get('total', '?')} {data.get('currency', '')}")
        else:
            print(f"✗ {data.get('error', {}).get('code', 'UNKNOWN')}")
    
    with open(output_file, "w") as f:
        json.dump(results, f, indent=2)
    
    successful = sum(1 for r in results if r["success"])
    print(f"\nDone: {successful}/{len(results)} extracted successfully → {output_file}")
    return results

batch_pdf_to_json("./supplier_invoices/")
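If the results are headed for a spreadsheet, one option is to flatten the line items into CSV. The sketch below assumes the results structure written by the batch script above (file, data, success) and the line item fields from the sample response. The function name and the sample entries are illustrative.

```python
import csv
import io

def line_items_to_csv(results: list[dict]) -> str:
    """Flatten line items from batch results into CSV text, one row per item."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["file", "merchant", "description", "quantity", "unit_price", "total"])
    for entry in results:
        if not entry.get("success"):
            continue  # failed extractions have no fields to export
        data = entry["data"]
        for item in data.get("line_items", []):
            writer.writerow([
                entry["file"],
                data.get("merchant", ""),
                item.get("description", ""),
                item.get("quantity", ""),
                item.get("unit_price", ""),
                item.get("total", ""),
            ])
    return buffer.getvalue()

# Hand-written entries in the shape produced by the batch script.
sample_results = [
    {
        "file": "acme.pdf",
        "success": True,
        "data": {
            "success": True,
            "merchant": "Acme Corp",
            "line_items": [
                {"description": "Cloud Server - Monthly", "quantity": 3,
                 "unit_price": "400.00", "total": "1200.00"},
            ],
        },
    },
    {"file": "broken.pdf", "success": False, "data": {"success": False}},
]
print(line_items_to_csv(sample_results))
```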

What Fields Are Extracted from Business PDFs

From invoices:

merchant — vendor/supplier name
invoice_id — invoice number
date — invoice date (ISO 8601)
due_date — payment due date
currency — ISO 4217 code
subtotal, tax, tax_rate, total — financial fields
line_items — array of items with description, quantity, unit_price, total

From receipts:

merchant — store name
receipt_id — receipt number
date — transaction date
payment_method — card type, cash, etc.
currency, subtotal, tax, total
line_items — individual purchases

The document_type field in the response tells you whether the API classified the input as a receipt or an invoice.
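A small sketch of branching on that field follows. The value "invoice" appears in the sample response; the value "receipt" and the queue names are assumptions made for illustration, so check them against real responses before relying on them.

```python
def route_document(doc: dict) -> str:
    """Pick a downstream queue name based on the document_type field."""
    doc_type = doc.get("document_type")
    if doc_type == "invoice":
        return "accounts_payable"  # hypothetical queue name
    if doc_type == "receipt":      # assumed value, not confirmed by the sample
        return "expenses"          # hypothetical queue name
    # Anything unclassified goes to a person rather than being guessed at.
    return "manual_review"

print(route_document({"document_type": "invoice", "merchant": "Acme Corp"}))
```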

Limitations to Know

File size: Maximum 10 MB per file. Most business PDFs are well under this limit.

Supported formats: PDF, JPG, PNG, CSV. If you have files in other formats (TIFF, BMP, DOCX), convert them to PDF first.
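Both limits can be checked locally before uploading. This is a minimal sketch: the function name is hypothetical, the 10 MB limit is read conservatively as 10,000,000 bytes, and treating .jpeg as equivalent to .jpg is an assumption.

```python
from pathlib import PurePath

MAX_BYTES = 10 * 1000 * 1000  # conservative reading of the 10 MB limit
ALLOWED_SUFFIXES = {".pdf", ".jpg", ".jpeg", ".png", ".csv"}  # .jpeg assumed same as .jpg

def preflight(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    suffix = PurePath(filename).suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        problems.append(f"unsupported format {suffix or '(no extension)'}: convert to PDF first")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, over the {MAX_BYTES}-byte limit")
    return problems

print(preflight("invoice.pdf", 250_000))
print(preflight("scan.tiff", 12_000_000))
```

For a file on disk, pass path.name and path.stat().st_size from a pathlib.Path.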

Multi-document PDFs: A PDF containing multiple invoices (a batch scan) is processed as one document. The API will attempt to extract the first invoice's fields. If you need to process each invoice separately, split the PDF before calling the API.
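One way to split a batch scan is with the third-party pypdf library. The sketch below assumes every invoice in the scan has the same number of pages, which real batches often do not, so treat pages_per_doc as a starting point. The function names are illustrative.

```python
from pathlib import Path

def page_ranges(total_pages: int, pages_per_doc: int) -> list[range]:
    """Split page indices 0..total_pages-1 into consecutive fixed-size chunks."""
    if pages_per_doc < 1:
        raise ValueError("pages_per_doc must be at least 1")
    return [
        range(start, min(start + pages_per_doc, total_pages))
        for start in range(0, total_pages, pages_per_doc)
    ]

def split_pdf(src: str, out_dir: str, pages_per_doc: int = 1) -> list[Path]:
    """Write each chunk of pages to its own PDF and return the new paths."""
    # pypdf is third-party (pip install pypdf); imported here so that
    # page_ranges stays usable without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for n, pages in enumerate(page_ranges(len(reader.pages), pages_per_doc), 1):
        writer = PdfWriter()
        for index in pages:
            writer.add_page(reader.pages[index])
        target = out / f"{Path(src).stem}_{n:03d}.pdf"
        with open(target, "wb") as f:
            writer.write(f)
        written.append(target)
    return written

# A 5-page scan with 2 pages per invoice yields three chunks.
print([list(r) for r in page_ranges(5, 2)])
```

Each file returned by split_pdf can then be sent to the API as its own request.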

Handwritten documents: The extraction pipeline is trained on typed/printed documents. Fully handwritten invoices may not extract cleanly.

Pricing

Free: 20 PDFs/month, no credit card
Starter: $14.99/month — 3,000 documents
Pro: $22.99/month — 5,000 documents

At $14.99/month for 3,000 documents, the cost per document is about $0.005, a fraction of the time cost of manual data entry.
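The per-document figure can be reproduced for both paid plans with a few lines of arithmetic. This sketch uses the prices and quotas listed above and assumes the full monthly quota is used.

```python
from decimal import Decimal

# Plan prices and document quotas as listed above.
PLANS = {
    "Starter": (Decimal("14.99"), 3000),
    "Pro": (Decimal("22.99"), 5000),
}

def cost_per_document(price: Decimal, documents: int) -> Decimal:
    """Monthly price divided by the monthly quota, rounded to four decimal places."""
    return (price / documents).quantize(Decimal("0.0001"))

for name, (price, quota) in PLANS.items():
    print(f"{name}: ${cost_per_document(price, quota)} per document")
```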

See full pricing →