⚖️ Neurobiber

Fast, interpretable stylometric features at scale

  1. Linguistic Style at Scale
  2. From Rules to Representations
  3. Recovering Classic Dimensions

Linguistic style, the patterns of syntactic and lexical choice that persist when content varies, carries information that survives topic changes, genre shifts, and deliberate obfuscation. A forensic analyst examining messages relies on this persistence; so does the researcher building style transfer systems or machine translation pipelines that preserve authorial voice across languages.

Analysts had identified the atoms of style, but high-throughput stylometry remained an unsolved problem. Biber's Multidimensional Analysis, the dominant framework in corpus linguistics for quantifying style, requires parsing every token and matching hand-crafted patterns against syntactic trees. At roughly two thousand tokens per second, the approach that revolutionised register studies in the 1980s cannot scale to corpora measured in trillions of tokens.
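To make the bottleneck concrete, here is a minimal sketch of what rule-based feature extraction looks like. The patterns below are illustrative stand-ins, not Biber's actual feature definitions, and real implementations match against full syntactic parses rather than regexes, which is where the cost comes from.

```python
import re

# Illustrative hand-crafted patterns in the spirit of Biber-style features.
# Real systems parse each sentence and match patterns against the syntactic
# tree; these regexes are simplified stand-ins for exposition only.
PATTERNS = {
    "first_person_pronoun": re.compile(r"\b(I|we|me|us|my|our)\b", re.IGNORECASE),
    "contraction": re.compile(r"\b\w+'(s|re|ve|ll|d|t)\b"),
    "agentless_passive": re.compile(r"\b(is|was|were|been|being)\s+\w+ed\b(?!\s+by\b)"),
}

def count_features(text: str) -> dict[str, int]:
    """Count pattern hits per feature (counts would normally be
    normalised, e.g. per 1,000 tokens)."""
    return {name: len(pat.findall(text)) for name, pat in PATTERNS.items()}

counts = count_features(
    "We were told that I'd left. The door was opened. It was painted by her."
)
print(counts)
# → {'first_person_pronoun': 2, 'contraction': 1, 'agentless_passive': 1}
```

Note how even the toy passive pattern needs a lookahead to exclude by-phrases; the genuine feature set multiplies this kind of care across dozens of constructions, each checked against a parse of every sentence.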

NEUROBIBER, introduced by the Blablablab at U. Michigan, offers an alternative: it trains a transformer to predict whether a parser would find certain constructions, reaching 117,000 tokens per second while maintaining a macro-F1 of 0.97 against ground-truth labels.

The ingenious move builds on the same group's earlier work: a rule-based parser, BiberPlus, which let them generate their own training data at scale.

This series examines the theoretical framework that makes such features meaningful, the technical implementation that makes them fast, and the empirical validation that confirms the acceleration preserves what mattered about the features in the first place.