🔡 Synthetic data for OCR

How good is current OCR and how can synthetic data improve it?

There are many open source machine learning models for Optical Character Recognition (OCR) and related document-parsing tasks, but their accuracy varies with the domain each model was trained on, which determines what counts as 'in distribution'. End-to-end Vision-Language Models (VLMs) have not solved the problem, despite pretraining on so-called 'web scale' datasets.

In this series, I survey prior work on synthetic data for OCR and outline some ideas for generating structured documents, specifically table of contents pages, with particular emphasis on minimising manual annotation and copyright-compliant use of inputs.

Note: 'synthetic data' here means the creation of realistic documents from scratch (e.g. SynthTIGER, SynthDoG), rather than geometric transformations such as cropping or warping applied to existing images. The goal is training data that is 'in domain', i.e. similar to what a model will encounter in the real world.
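
To make the contrast concrete, here's a minimal sketch of what 'creating' a document looks like, as opposed to transforming an existing image. It assumes Pillow is installed and is nothing like a full pipeline such as SynthTIGER (which samples fonts, textures, noise, and layouts); the point it illustrates is that the ground-truth label is known by construction, so no manual annotation is needed.

```python
# Minimal sketch: render a random text line onto a blank image.
# The (image, label) pair is produced together, so the label is free.
import random
import string

from PIL import Image, ImageDraw, ImageFont


def render_synthetic_line(text=None, size=(400, 64)):
    """Render a text-line image plus its ground-truth transcription."""
    if text is None:
        text = "".join(random.choices(string.ascii_letters + " ", k=20))
    img = Image.new("RGB", size, color="white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # real pipelines sample many fonts
    draw.text((8, 20), text, fill="black", font=font)
    return img, text


img, label = render_synthetic_line()
img.save("sample.png")
print(label)  # the annotation comes for free
```

Everything downstream of this sketch (backgrounds, degradation, realistic layouts) is where the real work of the systems surveyed in this series lives.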