Existing generators produce random layouts with random text. You can't learn hierarchy from gibberish. So what does a ToC generator actually need?
The task
Input: Scanned book pages (warped, noisy, multi-page spreads)
Output: Structured hierarchy of entries with page numbers
This decomposes into three problems: layout analysis (find the ToC regions), OCR (read the text), and structure parsing (recover the nesting). Most work treats these separately, but a hierarchy that can continue across a page break means the three stages interact, and the pipeline has to be treated as more than the sum of its parts.
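For concreteness, one plausible target representation is sketched below (the names are mine, not a settled schema): a flat list of entries, each carrying its title, the page number it points to, and a nesting level from which the tree can be rebuilt.

```python
from dataclasses import dataclass

@dataclass
class TocEntry:
    title: str   # e.g. "1.1 Background"
    page: int    # the page number printed next to the entry
    level: int   # nesting depth: 0 = part, 1 = chapter, 2 = section, ...
```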
What makes ToCs special
Real ToCs have structure that random generation can't produce:
```
Part I: Foundations
  Chapter 1: Introduction .................. 1
    1.1 Background ......................... 3
    1.2 Motivation ......................... 7
  Chapter 2: Related Work ................. 15
```
Formatting conventions vary: indentation, numbering styles, dot leaders, page numbers, and fonts (where a change of typeface or style is itself the marker of a level).
Multi-page ToCs add further complexity: page breaks can fall mid-hierarchy (typically without continuation markers), ToC pages carry their own headers and footers, and the page numbers of the ToC pages themselves must not be confused with the page numbers listed in the entries.
The Penn State ICDAR 2013 paper on ToC extraction notes that different books follow different styles, so you need adaptive parsing rather than a single pattern matcher. Their heuristics are sensible, but rule-based rather than learned.
No ToC-specific dataset exists, as far as I can tell.
Generator requirements
The generator will have three components:
1. Structured content source
Rather than random text, it will:
Mine Wikidata hierarchies. Use P31 (instance of) and P279 (subclass of) as a taxonomic backbone: extract category subgraphs, use them as ToC skeletons, and assign synthetic page numbers.
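A rough sketch of what that mining could look like against Wikidata's public SPARQL endpoint; the query shape is minimal, and the root QID in the comment is only an illustrative starting point.

```python
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

def direct_subclasses(qid: str, limit: int = 50) -> list[tuple[str, str]]:
    """Return (QID, English label) pairs for direct P279 children of `qid`."""
    query = f"""
    SELECT ?child ?childLabel WHERE {{
      ?child wdt:P279 wd:{qid} .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }} LIMIT {limit}
    """
    resp = requests.get(
        ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "toc-generator-prototype/0.1"},
    )
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    return [
        (row["child"]["value"].rsplit("/", 1)[-1], row["childLabel"]["value"])
        for row in rows
    ]

# Recursing a couple of levels from a root class, e.g.
# direct_subclasses("Q336")  # assuming Q336 is "science"
# yields a part/chapter/section skeleton to hang titles and page numbers on.
```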
Programmatic generation. Constraints like 3-8 parts, 4-12 chapters per part, and 2-6 sections per chapter keep the hierarchies realistic. Titles could come from paper titles, Wikipedia article titles, or a text generation model trained on real ToC entries, and page numbers would be assigned with realistic distributions (chapters: 15-50 pages, sections: 3-15, and so on).
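A minimal sketch of that generator, with (title, page, level) tuples standing in for the TocEntry fields above; the tiny title pool is a placeholder for a real title source.

```python
import random

TITLE_POOL = ["Foundations", "Background", "Methods", "Evaluation", "Related Work"]

def generate_toc(seed: int = 0) -> list[tuple[str, int, int]]:
    """Return (title, page, level) tuples with monotonically increasing pages."""
    rng = random.Random(seed)
    entries, page, chapter_no = [], 1, 0
    for part_no in range(1, rng.randint(3, 8) + 1):
        entries.append((f"Part {part_no}: {rng.choice(TITLE_POOL)}", page, 0))
        for _ in range(rng.randint(4, 12)):
            chapter_no += 1
            entries.append((f"Chapter {chapter_no}: {rng.choice(TITLE_POOL)}", page, 1))
            for sec_no in range(1, rng.randint(2, 6) + 1):
                entries.append((f"{chapter_no}.{sec_no} {rng.choice(TITLE_POOL)}", page, 2))
                page += rng.randint(3, 15)  # section length in pages
    return entries
```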
Real ToC extraction + augmentation. If we can acquire some real ToCs: extract the structure, swap the titles, reassign page numbers, and re-render in different styles. This is augmentation rather than synthesis.
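A hedged sketch of that augmentation step, assuming the real ToCs have already been parsed into the same (title, page, level) form; the page increments are placeholders.

```python
import random

def augment(entries: list[tuple[str, int, int]],
            title_pool: list[str],
            seed: int = 0) -> list[tuple[str, int, int]]:
    """Keep the hierarchy shape; swap titles and reassign plausible pages."""
    rng = random.Random(seed)
    page, out = rng.randint(1, 5), []
    for _, _, level in entries:
        out.append((rng.choice(title_pool), page, level))
        page += rng.randint(2, 20)  # keep page numbers monotonically increasing
    return out
```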
2. Layout engine
This must render:
- Indentation per nesting level
- Dot leaders to right-aligned page numbers
- Consistent fonts/sizes per hierarchy level
- Page breaks mid-hierarchy
- Numbering schemes
We could adapt SynthDoG's rendering, go HTML→image via a headless browser, typeset with LaTeX, or draw directly with Pillow/Cairo.
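As one example of the direct-drawing route, here is a minimal Pillow sketch of the core primitive: an indented entry with a dot leader running out to a right-aligned page number. The page geometry and the default font are placeholders.

```python
from PIL import Image, ImageDraw, ImageFont

PAGE_W, PAGE_H, MARGIN = 800, 200, 60
INDENT_PER_LEVEL = 30

def draw_entry(draw, y, title, page_no, level, font):
    x = MARGIN + level * INDENT_PER_LEVEL
    page_text = str(page_no)
    page_x = PAGE_W - MARGIN - draw.textlength(page_text, font=font)
    draw.text((x, y), title, font=font, fill="black")
    draw.text((page_x, y), page_text, font=font, fill="black")
    # Fill the gap between the title and the page number with a dot leader.
    dot_w = draw.textlength(".", font=font)
    start = x + draw.textlength(title + " ", font=font)
    n_dots = max(0, int((page_x - start - dot_w) / dot_w))
    draw.text((start, y), "." * n_dots, font=font, fill="black")

img = Image.new("RGB", (PAGE_W, PAGE_H), "white")
draw = ImageDraw.Draw(img)
font = ImageFont.load_default()  # swap in ImageFont.truetype(...) per style
draw_entry(draw, 40, "Chapter 1: Introduction", 1, level=0, font=font)
draw_entry(draw, 70, "1.1 Background", 3, level=1, font=font)
img.save("toc_sample.png")
```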
3. Degradation pipeline
Augraphy handles most of the standard degradations if we want them, and honestly I'm not sure how much they matter. But there's an interesting symmetry with the dewarping task we want to train on this dataset.
I maintain page-dewarp, which fits cubic splines to text contours and solves for camera projection. The inverse — applying known warping to flat images — gives perfect training data for the forward problem.
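A sketch of that inverse direction: warp a flat rendered page with a known backward map, and the map itself becomes the training label. The sinusoidal displacement is an illustrative stand-in, not the cubic-spline-plus-projection model page-dewarp actually fits.

```python
import cv2
import numpy as np

def warp_with_known_field(flat_img: np.ndarray,
                          amplitude: float = 12.0,
                          wavelength: float = 250.0):
    """Apply a known synthetic warp; return (warped image, backward map)."""
    h, w = flat_img.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Vertical displacement varying along x: a crude stand-in for page curl.
    dy = amplitude * np.sin(2 * np.pi * xs / wavelength)
    map_x = xs.astype(np.float32)
    map_y = (ys + dy).astype(np.float32)
    warped = cv2.remap(flat_img, map_x, map_y, cv2.INTER_LINEAR,
                       borderMode=cv2.BORDER_REPLICATE)
    # (warped, (map_x, map_y)) is an (input, ground-truth backward map) pair.
    return warped, (map_x, map_y)
```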
DewarpNet (Stony Brook, ICCV 2019) took this approach with Doc3D: 100k images with ground truth 3D shape. They model geometry explicitly because "the 3D geometry of the document not only determines the warping of its texture but also causes illumination effects."
Synthetic warped ToCs then serve a dual purpose: train the structure extractor on the flat versions, train the dewarper on the warped versions, and compose the two for an end-to-end solution.
Next steps
- Look at SynthDoG's layout code for what's adaptable
- Extract Wikidata category subgraphs as hierarchy skeletons
- Scan real ToC pages and document style variation
- Prototype HTML→image with proper dot leaders and indentation
The goal is semantically coherent ToCs with valid hierarchies, consistent formatting, and realistic page numbers rather than just visually plausible ones.