PDF to JSON API: Convert Business Documents to Structured Data
PDF is the universal format for business documents — invoices, receipts, purchase orders, contracts, statements. The problem is that PDFs are designed to be read by humans, not processed by code. A PDF isn't structured data; it's a visual layout. To use the information inside it, you need to extract it.
This guide explains how PDF-to-JSON APIs work, when to use them, and how to extract structured data from business PDFs with a single API call.
What a PDF to JSON API Does
A PDF-to-JSON API accepts a PDF file and returns its data as a structured JSON object with named fields. Instead of:
INVOICE
Acme Corp — Invoice #INV-2026-0042
Date: May 10, 2026
Due: June 10, 2026
Cloud Server - Monthly 3 $400.00 $1,200.00
Subtotal: $1,200.00
Tax (10%): $120.00
Total Due: $1,320.00
You get:
{
  "success": true,
  "document_type": "invoice",
  "merchant": "Acme Corp",
  "invoice_id": "INV-2026-0042",
  "date": "2026-05-10",
  "due_date": "2026-06-10",
  "subtotal": "1200.00",
  "tax": "120.00",
  "tax_rate": "10%",
  "total": "1320.00",
  "currency": "USD",
  "line_items": [
    {
      "description": "Cloud Server - Monthly",
      "quantity": 3,
      "unit_price": "400.00",
      "total": "1200.00"
    }
  ]
}
The JSON is directly usable in your application — no text parsing, no regex, no interpretation required.
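As a rough illustration of what "directly usable" can look like, the sketch below loads a response shaped like the sample above, converts the string amounts to `Decimal` and the ISO dates to `date` objects, and checks that the totals add up. The helper name `summarize_invoice` is hypothetical and not part of any SDK; the field names come from the sample response, and the snippet assumes amounts always arrive as plain decimal strings.

```python
import json
from datetime import date
from decimal import Decimal

# Sample response copied from the example above, trimmed to the fields used here.
SAMPLE = """
{
  "success": true,
  "invoice_id": "INV-2026-0042",
  "date": "2026-05-10",
  "due_date": "2026-06-10",
  "subtotal": "1200.00",
  "tax": "120.00",
  "total": "1320.00",
  "currency": "USD"
}
"""


def summarize_invoice(payload: dict) -> dict:
    """Hypothetical helper: convert string fields to typed values and sanity-check totals."""
    subtotal = Decimal(payload["subtotal"])
    tax = Decimal(payload["tax"])
    total = Decimal(payload["total"])
    issued = date.fromisoformat(payload["date"])
    due = date.fromisoformat(payload["due_date"])
    return {
        "invoice_id": payload["invoice_id"],
        "total": total,
        # Decimal avoids the rounding surprises of float arithmetic on money.
        "totals_consistent": subtotal + tax == total,
        "payment_terms_days": (due - issued).days,
    }


summary = summarize_invoice(json.loads(SAMPLE))
print(summary)
```

Using `Decimal` rather than `float` is a design choice for money fields; if your application stores amounts as integer cents instead, the same conversion point is where you would do it.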
The One-Request Pattern
DocuParseAPI accepts PDFs (and JPG, PNG, CSV files) via a single multipart POST request:
curl -X POST https://docuparseapi.com/api/v1/extract \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@invoice.pdf"
That's the complete integration. The API handles:
- PDF text extraction
- Scanned PDF OCR
- Field identification and normalization
- Currency and date normalization
- Line item parsing
You don't configure any of this. You send the file; you receive the structured JSON.
Python: PDF to JSON in 10 Lines
import os
import requests

def pdf_to_json(pdf_path: str) -> dict:
    # Upload the PDF as multipart form data; the API key is read from the environment.
    with open(pdf_path, "rb") as f:
        response = requests.post(
            "https://docuparseapi.com/api/v1/extract",
            headers={"Authorization": f"Bearer {os.environ['DOCUPARSE_API_KEY']}"},
            files={"file": f},
            timeout=30,  # avoid hanging indefinitely on a stalled connection
        )
    data = response.json()
    # Failures are reported in the JSON body via the "success" flag.
    if not data["success"]:
        raise RuntimeError(data["error"]["message"])
    return data

result = pdf_to_json("invoice.pdf")
print(f"Invoice {result['invoice_id']}: {result['total']} {result['currency']}")
JavaScript/Node.js: PDF to JSON
// Uses CommonJS require(), so install node-fetch v2 (v3 is ESM-only) and form-data.
const fs = require("fs");
const fetch = require("node-fetch");
const FormData = require("form-data");

async function pdfToJson(pdfPath) {
  // Build a multipart body with the PDF attached under the "file" field.
  const form = new FormData();
  form.append("file", fs.createReadStream(pdfPath));
  const response = await fetch("https://docuparseapi.com/api/v1/extract", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.DOCUPARSE_API_KEY}`,
      ...form.getHeaders(),
    },
    body: form,
  });
  const data = await response.json();
  if (!data.success) throw new Error(data.error.message);
  return data;
}

// Top-level await is not available in CommonJS, so handle the promise explicitly.
pdfToJson("invoice.pdf")
  .then((result) => console.log(`${result.merchant}: $${result.total}`))
  .catch((err) => console.error(err.message));
Use Cases
Accounts payable automation: Suppliers email PDF invoices. Your system downloads the attachments, calls the API, and pushes extracted fields into your ERP or accounting software — no human data entry.
Expense management: Employees submit PDF receipts. Your expense app calls the API and pre-fills the expense form with merchant, amount, date, and line items.
Bookkeeping: An accountant handling clients' supplier invoices calls the API for each PDF and writes the extracted data to QuickBooks or Xero — saving 10–15 minutes per invoice.
Audit and compliance: Legal and finance teams need to process large batches of PDF invoices. Batch-calling the API and storing the JSON creates a searchable, queryable record of every invoice — no manual transcription.
Line item analytics: Once invoices are structured data, you can analyze spending by vendor, category, or line item. None of that analysis is possible while the data is locked in PDFs.
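To make the analytics use case concrete, here is a minimal sketch that totals spend per vendor from a list of extracted responses. The function name `spend_by_vendor` and the sample documents are made up for illustration; only the field names (`success`, `merchant`, `total`, `currency`) come from the response format shown earlier. Totals are keyed by vendor and currency so that amounts in different currencies are never added together.

```python
from collections import defaultdict
from decimal import Decimal


def spend_by_vendor(documents: list[dict]) -> dict[tuple[str, str], Decimal]:
    """Hypothetical helper: sum invoice totals per (merchant, currency)."""
    totals: dict[tuple[str, str], Decimal] = defaultdict(Decimal)
    for doc in documents:
        # Skip failed extractions; they carry no financial fields.
        if not doc.get("success"):
            continue
        key = (doc["merchant"], doc["currency"])
        totals[key] += Decimal(doc["total"])
    return dict(totals)


# Illustrative data only, shaped like the API responses described above.
documents = [
    {"success": True, "merchant": "Acme Corp", "total": "1320.00", "currency": "USD"},
    {"success": True, "merchant": "Acme Corp", "total": "250.50", "currency": "USD"},
    {"success": True, "merchant": "Globex", "total": "99.99", "currency": "EUR"},
    {"success": False, "error": {"code": "UNKNOWN"}},
]

spend = spend_by_vendor(documents)
for (merchant, currency), amount in sorted(spend.items()):
    print(f"{merchant}: {amount} {currency}")
```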
Handling Scanned PDFs
Not all PDFs contain machine-readable text. Scanned PDFs — created by scanning a paper document — are images embedded in a PDF wrapper. A regular PDF text extraction library returns nothing from these.
DocuParseAPI handles scanned PDFs automatically. The extraction pipeline detects whether a PDF is text-based or image-based and applies OCR when needed. You don't change anything in your request:
# Same request — works for both digital PDFs and scanned PDFs
curl -X POST https://docuparseapi.com/api/v1/extract \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@scanned_invoice.pdf"
The response format is identical regardless of the source document type.
Processing a Folder of PDFs
For batch processing, such as converting an entire inbox of PDF invoices to JSON in one run, loop over the folder and call the API once per file:
import os
import json
import requests
from pathlib import Path

def batch_pdf_to_json(folder: str, output_file: str = "results.json"):
    api_key = os.environ["DOCUPARSE_API_KEY"]
    pdfs = list(Path(folder).glob("*.pdf"))
    results = []
    print(f"Processing {len(pdfs)} PDFs...")
    for i, pdf_path in enumerate(pdfs, 1):
        print(f"[{i}/{len(pdfs)}] {pdf_path.name}", end=" ")
        try:
            with open(pdf_path, "rb") as f:
                response = requests.post(
                    "https://docuparseapi.com/api/v1/extract",
                    headers={"Authorization": f"Bearer {api_key}"},
                    files={"file": f},
                    timeout=30,
                )
            data = response.json()
        except (requests.RequestException, ValueError) as exc:
            # Network failure or non-JSON body: record it and move on to the next file.
            # "REQUEST_FAILED" is a local placeholder code, not one returned by the API.
            data = {"success": False, "error": {"code": "REQUEST_FAILED", "message": str(exc)}}
        results.append({
            "file": pdf_path.name,
            "data": data,
            "success": data.get("success", False)
        })
        if data.get("success"):
            print(f"✓ {data.get('merchant', 'Unknown')} - {data.get('total', '?')} {data.get('currency', '')}")
        else:
            print(f"✗ {data.get('error', {}).get('code', 'UNKNOWN')}")
    # Write every result, including failures, so nothing is silently dropped.
    with open(output_file, "w") as f:
        json.dump(results, f, indent=2)
    successful = sum(1 for r in results if r["success"])
    print(f"\nDone: {successful}/{len(results)} extracted successfully → {output_file}")
    return results

batch_pdf_to_json("./supplier_invoices/")
What Fields Are Extracted from Business PDFs
From invoices:
- merchant: vendor/supplier name
- invoice_id: invoice number
- date: invoice date (ISO 8601)
- due_date: payment due date
- currency: ISO 4217 code
- subtotal, tax, tax_rate, total: financial fields
- line_items: array of items with description, quantity, unit_price, total
From receipts:
- merchant: store name
- receipt_id: receipt number
- date: transaction date
- payment_method: card type, cash, etc.
- currency, subtotal, tax, total
- line_items: individual purchases
The document_type field in the response tells you whether the API classified the input as a receipt or an invoice.
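Because invoices and receipts carry different identifier fields, downstream code usually needs to branch on `document_type`. The sketch below shows one way to do that. The helper `document_reference` is hypothetical, and the value `"receipt"` is an assumption: the sample response only shows `"invoice"`, so check the actual values your responses contain.

```python
def document_reference(payload: dict) -> str:
    """Hypothetical helper: build a stable reference string from the type-specific ID field."""
    doc_type = payload.get("document_type")
    if doc_type == "invoice":
        return f"invoice:{payload['invoice_id']}"
    if doc_type == "receipt":  # assumed value, not shown in the sample response
        return f"receipt:{payload['receipt_id']}"
    # Fail loudly on anything unexpected rather than guessing which ID field to read.
    raise ValueError(f"Unexpected document_type: {doc_type!r}")


print(document_reference({"document_type": "invoice", "invoice_id": "INV-2026-0042"}))
```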
Limitations to Know
File size: Maximum 10MB per file. Most business PDFs are well under this limit.
Supported formats: PDF, JPG, PNG, CSV. If you have files in other formats (TIFF, BMP, DOCX), convert them to PDF first.
Multi-document PDFs: A PDF containing multiple invoices (a batch scan) is processed as one document. The API will attempt to extract the first invoice's fields. If you need to process each invoice separately, split the PDF before calling the API.
Handwritten documents: The extraction pipeline is trained on typed/printed documents. Fully handwritten invoices may not extract cleanly.
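The size and format limits above can be checked locally before uploading, which saves a round trip for files that would be rejected anyway. This is a minimal sketch under two assumptions: that "10MB" means 10 × 1024 × 1024 bytes, and that the format is judged by file extension using exactly the four formats listed. The function name `upload_problems` is hypothetical.

```python
from pathlib import Path

MAX_BYTES = 10 * 1024 * 1024  # assumption: "10MB" interpreted as 10 MiB
SUPPORTED_SUFFIXES = {".pdf", ".jpg", ".png", ".csv"}  # formats listed above


def upload_problems(filename: str, size_bytes: int) -> list[str]:
    """Hypothetical pre-flight check: return a list of reasons the file may be rejected."""
    problems = []
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        problems.append(f"unsupported format: {suffix or 'no extension'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, over the {MAX_BYTES}-byte limit")
    return problems


print(upload_problems("invoice.pdf", 250_000))
print(upload_problems("scan.tiff", 12 * 1024 * 1024))
```

For a file on disk, the size can be read with `Path(path).stat().st_size` and passed in as `size_bytes`.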
Pricing
- Free: 20 PDFs/month, no credit card
- Starter: $14.99/month — 3,000 documents
- Pro: $22.99/month — 5,000 documents
At $14.99/month for 3,000 documents, the cost per document is about $0.005, a fraction of the time cost of manual data entry.
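The per-document figure is simple division, and the same arithmetic applies to the other paid tier. The snippet below reproduces it from the prices listed above; it assumes the full monthly allowance is used, so the effective cost per document is higher at lower volumes.

```python
# Monthly price in USD and included documents, taken from the pricing list above.
PLANS = {
    "Starter": (14.99, 3000),
    "Pro": (22.99, 5000),
}


def cost_per_document(monthly_price: float, documents: int) -> float:
    """Price divided by allowance, assuming every included document is used."""
    return monthly_price / documents


for name, (price, docs) in PLANS.items():
    print(f"{name}: ${cost_per_document(price, docs):.4f} per document")
# Starter: $0.0050 per document
# Pro: $0.0046 per document
```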