Post

go-hocr — hOCR 1.2 Parser for Tesseract

Parse Tesseract hOCR 1.2 output — YAML/HTML export, bounding boxes, OCR confidence.

go-hocr — hOCR 1.2 Parser for Tesseract

go-hocr is a Go library for parsing hOCR 1.2 files from Tesseract — with YAML/HTML export, bounding boxes, and per-word OCR confidence.

Repository: github.com/eSlider/go-hocr · last push 2026-06-15

Why hOCR

Tesseract outputs structured OCR as hOCR (HTML + microformat classes). Downstream pipelines need typed Go structs — not fragile regex over HTML. go-hocr provides a stable parser for document workflows.

Features

  • Parse hOCR 1.2 pages, paragraphs, lines, and words
  • Export to YAML or HTML for inspection and tooling
  • Bounding box coordinates and confidence scores per token
  • Foundation for content servers and PDF pipelines

Lineage

Evolved from Dreamteam client work (2022–2023) on HOCR content servers (~200+ commits on hocr and content-serve-hocr). Public library extracted for reuse in go-second-brain document ingest and future OCR stacks.

Tech stack

Go · Tesseract hOCR · YAML · HTML

This post is licensed under CC BY 4.0 by the author.