A primer on synthetic OCR data

Surveying the landscape of synthetic text generation for OCR training.

So I want to train a model to extract the table of contents from scanned books. A ToC is a nested hierarchy spanning multiple pages, and I want that hierarchy captured explicitly, not recovered by post-processing. I don't have training data for this, so I went looking at what exists for synthetic document generation, and found a field that isn't much discussed and is perhaps even a little stuck in its ways.

The lineage

The OCR synthetic data story starts with two Oxford VGG datasets from 2014-2016 that everyone is still using a decade on:

MJSynth (Jaderberg et al., NIPS 2014 workshop) contains 9 million grayscale word images rendered from a ~90k-word English vocabulary in ~1,400 fonts. It's focused on scene text recognition and is available from the VGG data page.

SynthText (Gupta, Vedaldi & Zisserman, CVPR 2016) comprises 800k images with 8 million word instances, composited into natural scenes with depth-aware placement. These are more colourful, and 'harder' than MJSynth.

The fact that these remain the default training mixture tells you either how good they were or how slowly the field has moved (or perhaps both).

The generators

SynthTIGER (NAVER/Clova, ICDAR 2021) analysed what makes MJSynth and SynthText effective and combined the best bits of both. Its key contribution was handling the long-tail problem in character distributions: rare characters get undersampled in naive generation, so models struggle with uncommon inputs. SynthTIGER balances this deliberately, and models trained on it outperform the MJ+ST combination on STR benchmarks.
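To make the long-tail idea concrete, here's a rough sketch of frequency-balanced word sampling. This is my own illustration of the concept, not SynthTIGER's actual sampler: each word is upweighted by the rarity of its rarest character, so words containing uncommon glyphs show up more often in the rendered data.

```python
import random
from collections import Counter

def balanced_sample(vocab, n_samples):
    """Sample words, upweighting those that contain rare characters.

    Uniform sampling reproduces the corpus character distribution, so
    rare glyphs barely appear in the rendered images. Weighting each
    word by the inverse frequency of its rarest character flattens
    the character distribution of the output.
    """
    char_freq = Counter(c for word in vocab for c in word)

    def weight(word):
        # inverse frequency of the word's least common character
        return 1.0 / min(char_freq[c] for c in word)

    weights = [weight(w) for w in vocab]
    return random.choices(vocab, weights=weights, k=n_samples)

# e.g. balanced_sample(open("wordlist.txt").read().split(), 10_000)
```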

SynthDoG (NAVER/Clova, ECCV 2022) came packaged with Donut and generates full document pages rather than word crops. It composites Wikipedia text, ImageNet backgrounds, and paper textures, laying the text out by "randomly stacking grids". It's multilingual (EN/CN/JP/KR), and ByteDance notably used it for Seed1.5-VL pretraining in 2025.
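My reading of "randomly stacking grids" is something like the sketch below: carve the page into random rows and columns, then drop an unrelated paragraph into each cell. This is a guess at the idea, not SynthDoG's actual code, but it shows why the output is boxes of text with no relationship between them.

```python
import random

def stack_grids(page_w, page_h, max_rows=4, max_cols=3):
    """Split a page into a random grid and return cell rectangles.

    Each cell would then be filled with a random paragraph. There is
    no hierarchy here, just independent boxes, which is the point of
    the critique later in this post.
    """
    cells = []
    n_rows = random.randint(1, max_rows)
    row_bounds = [0] + sorted(random.sample(range(1, page_h), n_rows - 1)) + [page_h]
    for top, bottom in zip(row_bounds, row_bounds[1:]):
        n_cols = random.randint(1, max_cols)
        col_bounds = [0] + sorted(random.sample(range(1, page_w), n_cols - 1)) + [page_w]
        for left, right in zip(col_bounds, col_bounds[1:]):
            cells.append((left, top, right, bottom))
    return cells

# stack_grids(1240, 1754)  # A4 at ~150 dpi, in pixels
```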

TRDG (TextRecognitionDataGenerator) is a Swiss army knife for skewing, blur, distortion, backgrounds, and handwriting. It's less sophisticated than SynthTIGER but good for quick experiments.
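TRDG is pip-installable and quick to script. From memory the generator API looks roughly like this; the parameter names are my recollection, so check the repo for the exact signature:

```python
from trdg.generators import GeneratorFromStrings

# Render a few strings with mild skew and blur; each iteration yields
# a (PIL.Image, label) pair ready to dump to disk.
generator = GeneratorFromStrings(
    ["Chapter 1  Introduction", "1.1  Background", "1.2  Related work"],
    count=100,          # total images to produce
    skewing_angle=3,
    random_skew=True,
    blur=1,
    random_blur=True,
)

for i, (image, label) in enumerate(generator):
    image.save(f"word_{i:04d}.png")
```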

Others exist (Text Renderer for PaddleOCR compatibility, various SynthText forks for other languages) but these three cover the main approaches. (I've personally found PaddleOCR difficult to work with and prefer to avoid it at this point.)

The degradation problem

Making synthetic documents look scanned requires a separate pipeline:

DocCreator (2017) targets historical documents, using graphical operations to simulate ink degradation, bleed-through, blur, and 3D paper deformation. Unlike older binary-only degradation models, it works on grayscale images.

Augraphy (ICDAR 2023) targets modern office documents: printing, faxing, scanning, copy machines, ink degradation, and handwritten markings ("scribble"). Written in Python with a pipeline-style API, it's designed to integrate neatly with training code.
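In practice the pipeline is something you call inside a data loader. A minimal sketch using the library's pre-built default pipeline, with the interface from memory (augment() returns a dict whose "output" key holds the degraded image; check the Augraphy docs for the current API):

```python
import cv2
from augraphy import default_augraphy_pipeline

# Push a clean rendered page through the default ink/paper/post phases
# to get a "scanned office document" look.
pipeline = default_augraphy_pipeline()
clean = cv2.imread("rendered_page.png")
result = pipeline.augment(clean)
cv2.imwrite("degraded_page.png", result["output"])
```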

DocCreator is used for historical documents, and Augraphy for everything else.

The elephant in the room

What bothers me is obvious from a glance at any of the images in the SynthDoG documentation. The outputs are random Wikipedia sentences in random layouts (the "randomly stacked grids").

This is fine for pretraining visual robustness, where the model learns to recognise characters across fonts, backgrounds, degradation, and so on. But the content is nonsense, and it lacks the structure, hierarchy, and logical relationships between blocks that matter for many downstream tasks.

For table of contents extraction this is a major deficiency. A ToC needs nested levels, page references, and entries that continue across pages; you cannot learn to parse that structure from randomly stacked blocks of Wikipedia text.
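To be concrete about what "explicit structure" means, here is a minimal sketch of the kind of target I'd want a generator to emit alongside the rendered page. The schema is purely illustrative; pinning it down is a Part 2 question.

```python
from dataclasses import dataclass, field

@dataclass
class TocEntry:
    """One line of a table of contents, with children nested inside.

    The hierarchy lives in the target itself, not in indentation or
    dot leaders that have to be re-inferred after OCR.
    """
    title: str
    page: int
    level: int
    children: list["TocEntry"] = field(default_factory=list)

toc = TocEntry("Part I: Foundations", page=1, level=0, children=[
    TocEntry("1. Introduction", page=3, level=1, children=[
        TocEntry("1.1 Motivation", page=4, level=2),
        TocEntry("1.2 Outline", page=7, level=2),
    ]),
    TocEntry("2. Background", page=9, level=1),
])
```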

Document layout datasets

I looked at layout analysis datasets to see if any capture hierarchy:

PubLayNet (IBM, ICDAR 2019 Best Paper): 360k+ images from PubMed Central. Great for academic papers, but I'd expect models trained on it to struggle with anything else.

DocLayNet (IBM Research Zurich, KDD 2022): 80k human-annotated pages across financial reports, scientific articles, laws, manuals, patents. More diverse, but still detection-focused rather than hierarchy-focused.

HRDoc (AAAI 2023): 2.5k multi-page documents with line-level annotations, including relations between lines. It's the first dataset I found that targets hierarchical reconstruction explicitly, and the associated Comp-HRDoc benchmark evaluates page object detection, reading order, ToC extraction, and hierarchy reconstruction together.

HRDoc is closest to what I need but it's for evaluation, not generation.

Where this leaves me

So the tools exist for rendering text and degrading it, but what's missing is structured content generation. The thing that makes a ToC a ToC rather than random text in a vaguely ToC-shaped layout is still a gap in the market.

Part 2 covers what a ToC generator actually needs.