go-hocr — hOCR 1.2 Parser for Tesseract
Parse Tesseract hOCR 1.2 output — YAML/HTML export, bounding boxes, OCR confidence.
go-hocr is a Go library that parses hOCR 1.2 files produced by Tesseract. It exposes bounding boxes and per-word OCR confidence, and can export the parsed result as YAML or HTML.
Repository: github.com/eSlider/go-hocr · last push 2026-06-15
Why hOCR
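Tesseract emits structured OCR results as hOCR: HTML whose elements carry microformat classes such as ocr_page, ocr_line, and ocrx_word, with layout properties stored in each element's title attribute. Downstream pipelines need typed Go structs, not fragile regular expressions over HTML. go-hocr provides a stable parser for document workflows.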
Features
- Parse hOCR 1.2 pages, paragraphs, lines, and words
- Export to YAML or HTML for inspection and tooling
- Bounding box coordinates and confidence scores per token
- Foundation for content servers and PDF pipelines
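In hOCR, the bounding box and word confidence live in an element's title attribute, for example "bbox 36 92 96 116; x_wconf 95". The following is a minimal sketch of turning that string into typed values using only the standard library. The WordProps type and ParseTitle function are hypothetical names for this illustration and are not taken from go-hocr.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// WordProps is a hypothetical type for this sketch. It holds the
// properties carried by an hOCR word's title attribute.
type WordProps struct {
	X0, Y0, X1, Y1 int     // bounding box corners in image pixels
	Confidence     float64 // x_wconf value, when present
	HasConfidence  bool    // false if the title had no x_wconf
}

// ParseTitle parses a title such as "bbox 36 92 96 116; x_wconf 95".
// Properties other than bbox and x_wconf are ignored.
func ParseTitle(title string) (WordProps, error) {
	var p WordProps
	seenBBox := false
	for _, prop := range strings.Split(title, ";") {
		fields := strings.Fields(prop)
		if len(fields) == 0 {
			continue // empty segment, e.g. a trailing semicolon
		}
		switch fields[0] {
		case "bbox":
			if len(fields) != 5 {
				return p, fmt.Errorf("bbox: want 4 coordinates, got %d", len(fields)-1)
			}
			var c [4]int
			for i, f := range fields[1:] {
				n, err := strconv.Atoi(f)
				if err != nil {
					return p, fmt.Errorf("bbox: %w", err)
				}
				c[i] = n
			}
			p.X0, p.Y0, p.X1, p.Y1 = c[0], c[1], c[2], c[3]
			seenBBox = true
		case "x_wconf":
			if len(fields) != 2 {
				return p, fmt.Errorf("x_wconf: want 1 value, got %d", len(fields)-1)
			}
			v, err := strconv.ParseFloat(fields[1], 64)
			if err != nil {
				return p, fmt.Errorf("x_wconf: %w", err)
			}
			p.Confidence = v
			p.HasConfidence = true
		}
	}
	if !seenBBox {
		return p, fmt.Errorf("no bbox in title %q", title)
	}
	return p, nil
}

func main() {
	p, err := ParseTitle("bbox 36 92 96 116; x_wconf 95")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", p)
}
```

Splitting on semicolons and then on whitespace avoids regular expressions and makes malformed input an explicit error rather than a silent mismatch.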
Lineage
go-hocr evolved from Dreamteam client work (2022–2023) on hOCR content servers (200+ commits across hocr and content-serve-hocr). The public library was extracted for reuse in go-second-brain document ingestion and in future OCR stacks.
Related
- Dreamteam HOCR pipelines — client engagement
- mail-archive — archival search stack
Tech stack
Go · Tesseract hOCR · YAML · HTML
This post is licensed under CC BY 4.0 by the author.