How can I reliably convert invoices from multiple formats into a standardized CSV?
04:54 11 Jun 2026

I'm building an invoice-processing pipeline and currently handle PDFs with tabula-py and XLSX files with openpyxl.

import tabula
from openpyxl import load_workbook

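# PDF extraction: read_pdf returns a list of pandas DataFrames, one per detected table
tables = tabula.read_pdf("invoice.pdf", pages="all")

# XLSX extraction: wb.active is the worksheet that was active when the file was saved
wb = load_workbook("invoice.xlsx")
sheet = wb.active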

The problem is HTML invoices. Different vendors use completely different HTML structures, making it difficult to create a generic parser. Some use tables, others use nested divs, and field names vary significantly.
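
To make the variation concrete, here is a minimal sketch of a structure-agnostic, rule-based pass: it ignores whether the markup uses tables or divs, flattens the page to text chunks, and matches label aliases. It uses only the standard library. The alias table, `TextCollector`, and `extract_fields` are hypothetical names for illustration, and the aliases are guesses rather than labels taken from real vendor invoices.

```python
from html.parser import HTMLParser

# Hypothetical label aliases per target field; real vendors will need more.
LABEL_ALIASES = {
    "invoice_number": ["invoice number", "invoice no", "invoice #"],
    "date": ["invoice date", "date"],
    "vendor": ["vendor", "supplier"],
    "amount": ["total due", "amount due", "total"],
}


class TextCollector(HTMLParser):
    """Collects non-empty text chunks and ignores the tag structure."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_fields(html):
    """Return a dict with one value per field, or "" when no label matched."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    chunks = parser.chunks

    record = {field: "" for field in LABEL_ALIASES}
    for i, chunk in enumerate(chunks):
        # "Label: value" in one chunk, or a bare label followed by its value.
        label, _, value = chunk.partition(":")
        label = label.strip().lower()
        value = value.strip()
        if not value and i + 1 < len(chunks):
            value = chunks[i + 1]
        for field, aliases in LABEL_ALIASES.items():
            # Keep the first match for each field.
            if not record[field] and label in aliases:
                record[field] = value
    return record
```

A pass like this handles both `<td>Total</td><td>120.50</td>` and `<div>Amount due: 75.00</div>`, but it breaks as soon as a vendor uses a label that is not in the alias table, which is the scaling problem I'm asking about.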

My goal is to normalize all invoice formats into one record per invoice with the following fields, which then become the columns of the output CSV:

{
    "invoice_number": "",
    "date": "",
    "vendor": "",
    "amount": ""
}
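
For concreteness, here is a minimal sketch of how records in this shape could be serialized, assuming Python's standard `csv` module. `FIELDNAMES` and `records_to_csv` are illustrative names, and the choice to drop unknown keys and blank out missing ones is an assumption, not a requirement.

```python
import csv
import io

# Column order of the target CSV, matching the schema above.
FIELDNAMES = ["invoice_number", "date", "vendor", "amount"]


def records_to_csv(records):
    """Serialize an iterable of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=FIELDNAMES,
        extrasaction="ignore",  # silently drop keys outside the schema
        restval="",             # fill missing fields with an empty string
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Every extractor (PDF, XLSX, HTML) would then only need to produce dicts with these four keys, and the CSV layer stays the same.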

Has anyone implemented a reliable approach for handling HTML invoices at scale? Would you recommend rule-based extraction, template matching, machine learning, or another strategy?

python