I'm building an invoice-processing pipeline and currently handle PDFs with tabula-py and XLSX files with openpyxl.
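import tabula
from openpyxl import load_workbook
# PDF extraction: read_pdf returns a list of pandas DataFrames, one per detected table
tables = tabula.read_pdf("invoice.pdf", pages="all")
# XLSX extraction: load the workbook and take the sheet that was active when it was saved
wb = load_workbook("invoice.xlsx")
sheet = wb.active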
The problem is HTML invoices. Different vendors use completely different HTML structures, making it difficult to create a generic parser. Some use tables, others use nested divs, and field names vary significantly.
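To make the problem concrete, here is a minimal sketch of one structure-agnostic, rule-based approach: flatten the HTML into visible text chunks (so tables and nested divs look the same), then look for known field labels and take the text that follows them. It uses only the standard library. The `LABELS` synonym table and the helper names `TextChunks`, `match_label`, and `extract_fields` are assumptions made up for this illustration, not part of any library, and real vendor layouts will likely need more rules than this.

```python
import re
from html.parser import HTMLParser

# Hypothetical label synonyms (illustrative only). Longer labels are listed
# first so that "total due" is tried before "total".
LABELS = {
    "invoice_number": ["invoice number", "invoice no", "invoice #"],
    "date": ["invoice date", "date"],
    "vendor": ["vendor", "supplier"],
    "amount": ["total due", "amount due", "total", "amount"],
}


class TextChunks(HTMLParser):
    """Collect visible text chunks in document order, ignoring tag structure."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self._skip:
            self.chunks.append(text)


def match_label(chunk, wanted):
    """Return (field, inline_value) if the chunk starts with a known label."""
    for field in wanted:
        for label in LABELS[field]:
            m = re.match(re.escape(label) + r"\s*:?\s*(.*)", chunk, re.IGNORECASE)
            if m:
                return field, m.group(1).strip()
    return None


def extract_fields(html):
    """Map one HTML invoice onto the four target fields ("" when not found)."""
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    chunks = parser.chunks
    record = {field: "" for field in LABELS}
    for i, chunk in enumerate(chunks):
        missing = [f for f in record if not record[f]]
        hit = match_label(chunk, missing)
        if hit is None:
            continue
        field, value = hit
        # "Label: value" in one chunk, otherwise the value is the next chunk
        # (e.g. the neighbouring <td> or <span>).
        if not value and i + 1 < len(chunks):
            value = chunks[i + 1]
        record[field] = value
    return record
```

Calling `extract_fields(html_string)` returns a dict with the four keys, leaving a field empty when no label matched. The first match per field wins, so this is best treated as a baseline that vendor-specific templates or a learned model could override.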
My goal is to normalize every invoice format into one common record, which is then written out as a CSV row with these fields:
{
"invoice_number": "",
"date": "",
"vendor": "",
"amount": ""
}
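For reference, writing records of that shape to CSV could look roughly like the sketch below, using `csv.DictWriter` from the standard library. The function name `write_invoices_csv` and the `FIELDNAMES` constant are hypothetical names chosen for this example. Missing fields are written as empty strings and unexpected keys are dropped.

```python
import csv

# Column order of the common schema described above.
FIELDNAMES = ["invoice_number", "date", "vendor", "amount"]


def write_invoices_csv(records, out):
    """Write an iterable of record dicts to the open text file `out`."""
    writer = csv.DictWriter(
        out,
        fieldnames=FIELDNAMES,
        restval="",             # fill fields missing from a record
        extrasaction="ignore",  # drop keys outside the schema
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
```

When writing to disk, open the file with `newline=""` (for example `open("invoices.csv", "w", newline="")`) so the `csv` module controls line endings itself.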
Has anyone implemented a reliable approach for handling HTML invoices at scale? Would you recommend rule-based extraction, template matching, machine learning, or another strategy?