From Rules to Representations

Two complementary tools for Biber feature extraction

Blablalab's contribution comes in the form of two instruments designed for complementary purposes. BIBERPLUS, a rule-based Python implementation, provides ground truth labels and handles applications requiring exact feature counts or integrated statistical analysis. NEUROBIBER, a transformer trained to predict those labels, sacrifices some precision for throughput gains that shift what is computationally feasible. Understanding when each tool is appropriate requires first examining what each actually does.

BIBERPLUS: Rule-Based Precision

The architecture of BIBERPLUS will be familiar to anyone who has built feature extractors atop modern NLP pipelines: parse the input text using spaCy's optimised dependency parser, then traverse the resulting syntactic trees matching hand-crafted patterns that identify the constructions Biber defined. A passive construction requires a past participle governed by a form of be; a stranded preposition requires a preposition whose object has been extracted; a that-complement clause requires specific structural configurations distinguishing it from relative clauses and other uses of that. Each of the ninety-six features corresponds to a pattern or set of patterns, encoded in rules that the authors validated against 225 unit tests spanning diverse text types.
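The pattern-matching recipe is easy to illustrate. The sketch below is not taken from the BIBERPLUS rule set; it is a simplified example of the general approach, using spaCy to flag agentless passives (a past participle governed by a passive auxiliary with no by-agent attached):

import spacy

nlp = spacy.load("en_core_web_sm")

def count_agentless_passives(text):
    doc = nlp(text)
    count = 0
    for token in doc:
        # Past participle with a passive auxiliary among its children
        is_passive = token.tag_ == "VBN" and any(
            child.dep_ == "auxpass" for child in token.children)
        # A "by"-agent would attach with the dep label "agent"
        has_agent = any(child.dep_ == "agent" for child in token.children)
        if is_passive and not has_agent:
            count += 1
    return count

print(count_agentless_passives("The results were published yesterday."))  # expected: 1

The real rules are considerably more elaborate, but the shape is the same: one parse, then many pattern checks over the resulting tree.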

Expanded Feature Coverage

The feature inventory extends Biber's original sixty-seven tags with twenty-nine additions drawn from subsequent research on digital registers. Beyond the classical repertoire—past tense markers, nominalisations, agentless passives, stance adverbs, subordination markers—the expanded set captures constructions that Biber, writing before the web existed, could not have anticipated:

Emoji and emoticons receive separate tags (EMOJ and EMOT), recognising that :) and 😊 may distribute differently across registers and authors. Hashtags and @-mentions (HASH and AT) track the metacommentary conventions of social platforms. URLs mark the interpolation of hyperlinks into running text. Laughter tokens (lol, lmao, haha and their variants) receive their own tag (LAUGH), capturing a register marker that distinguishes casual digital communication from its formal counterpart.

These additions will prove useful for researchers working with social media, forum posts, or other born-digital text. A feature set that cannot distinguish between a tweet dense with emoji and one that eschews them misses stylistic variation that any reader immediately perceives.

The Problem of Text Length

A subtler design consideration concerns how features should be counted when texts vary dramatically in length. Frequency normalisation—reporting features per thousand tokens—works well when texts contain enough tokens for stable estimates. A ten-thousand-word academic article that uses the passive voice forty times yields a rate of four per thousand, a measurement precise enough to support meaningful comparison.

The same approach founders on short texts, where estimates become erratic. A fifty-token tweet containing a single contraction produces a normalised rate of twenty per thousand; a hundred-token tweet with the same single contraction produces ten per thousand. The apparent twofold difference in contraction rate reflects text length, not stylistic variation.

BIBERPLUS offers binary counting alongside frequency measurement to address this instability. In binary mode, the system tracks only whether each feature appears within a chunk (typically one hundred tokens), then averages across chunks. A 500-token text divided into five chunks, with contractions appearing in two of them, yields a binary score of 0.4—a measurement more robust to length variation than the frequency count that would fluctuate wildly depending on exactly where those contractions fell.
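A few lines of illustrative Python make the arithmetic concrete; this mirrors the idea rather than the biberplus internals:

def binary_score(chunk_has_feature):
    """chunk_has_feature: one boolean per 100-token chunk, True if the feature appears there."""
    return sum(chunk_has_feature) / len(chunk_has_feature)

# The example above: five chunks, contractions present in two of them.
print(binary_score([True, False, True, False, False]))  # 0.4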

The choice between counting modes depends on the corpus and the research question. Frequency counting suits long-form text where rates are stable; binary counting suits heterogeneous collections mixing tweets with articles with forum posts. The paper's own experiments, training on a corpus spanning all these genres, employ binary counting throughout.

Implementation and Performance

The software is installed with pip install biberplus and a spaCy model download. The package exposes a single primary function:

from biberplus.tagger import calculate_tag_frequencies

text = "It doesn't seem likely that we'll finish on time."
freq = calculate_tag_frequencies(text)  # counts for all 96 features

The returned dictionary maps feature codes to counts: PIT (the pronoun it) at 1, CONT (contractions) at 1, SMP (seem/appear verbs) at 1, and so on through the ninety-six features. The full inventory occupies several pages of documentation.

Even without neural acceleration, BIBERPLUS processes 4,631 tokens per second, roughly 2.2x faster than Nini's reference implementation. The speedup derives from engineering rather than algorithmic innovation: spaCy's optimised parsing, careful memory management, vectorised operations where the computation permits. Useful for moderate-scale work, but insufficient to bridge the gap between dissertation corpora and web-scale collections.

NEUROBIBER: Reframing Extraction as Prediction

The insight underlying NEUROBIBER is that feature extraction can be reconceived as classification. Given a chunk of text, the question "which Biber features are present?" admits a fixed set of answers—the 96 features, each either present or absent. A model trained to predict these binary labels need not parse; it need only recognise patterns associated with each feature's presence in training data. Parsing happens once, during training data generation; inference requires only a forward pass through the trained model.

Architecture and Training

The model is RoBERTa-base fitted with a classification head producing 96 binary outputs, one per feature. The architecture carries no novelty; the contribution lies in demonstrating that this straightforward setup, trained on appropriate data, reproduces Biber feature extraction with high fidelity at dramatically increased speed.
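A minimal sketch of that setup, assuming the HuggingFace transformers library; the model identifier and 0.5 threshold here are illustrative stand-ins, not the released NEUROBIBER weights:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=96,                              # one output per Biber feature
    problem_type="multi_label_classification",  # independent sigmoid per label
)

inputs = tokenizer("It doesn't seem likely that we'll finish on time.",
                   return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits             # shape (1, 96)
present = torch.sigmoid(logits)[0] > 0.5        # predicted presence per feature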

Training data came from BIBERPLUS itself, applied to a corpus of approximately forty-two million text chunks drawn from seven domains:

The largest single source, Reddit comments spanning fifteen years, contributed 12.3 million samples averaging 117 tokens each. News articles from Common Crawl's Realnews subset added 15.9 million samples of substantially greater length, averaging 609 tokens. Gmane mailing list archives, Wikipedia discussion pages, Wikipedia edit histories, Amazon product reviews, and novels from the Book3Corpus filled out the distribution, ensuring exposure to registers ranging from casual conversation to literary prose.

This domain diversity is deliberate and necessary. A model trained only on formal prose would fail to recognise the features that characterise Reddit; one trained only on social media would miss the syntactic complexity that distinguishes academic writing. The seven-domain mixture forces the model to learn feature detectors that generalise across registers.

For each chunk, BIBERPLUS provided binary labels indicating which features were present. The training objective was simply to predict these labels—knowledge distillation, with the neural student learning to approximate the rule-based teacher at speeds the teacher cannot match.
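In code, the objective is nothing more exotic than multi-label binary cross-entropy; the sketch below uses placeholder tensors rather than real BIBERPLUS labels:

import torch

loss_fn = torch.nn.BCEWithLogitsLoss()
logits = torch.randn(8, 96, requires_grad=True)  # stand-in for the classification head's output
labels = torch.zeros(8, 96)                      # BIBERPLUS-derived binary labels for a batch of chunks
labels[0, [3, 17, 42]] = 1.0                     # feature indices chosen arbitrarily for illustration
loss = loss_fn(logits, labels)                   # student learns to reproduce the teacher's labels
loss.backward()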

Texts exceeding RoBERTa's 512-token context window are processed span by span, with features marked present if any span predicts them. This aggregation strategy may occasionally overcount features concentrated in one section of a long document, but it preserves the interpretability that makes Biber features useful: each prediction corresponds to the same linguistic definition that BIBERPLUS employs.
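One plausible way to implement that aggregation, reusing the tokenizer and model from the earlier sketch (the exact mechanics of the released system may differ):

import torch

def predict_long_text(text, model, tokenizer, window=512):
    # Cut the text into window-sized spans via the tokenizer's overflow handling.
    enc = tokenizer(text, return_tensors="pt", truncation=True, padding=True,
                    max_length=window, return_overflowing_tokens=True)
    with torch.no_grad():
        logits = model(input_ids=enc["input_ids"],
                       attention_mask=enc["attention_mask"]).logits
    span_preds = torch.sigmoid(logits) > 0.5   # (num_spans, 96) presence per span
    return span_preds.any(dim=0)               # a feature counts if any span predicts it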

Fidelity and Throughput

Evaluation on a held-out test set of four million chunks yields macro-F1 of 0.97 and micro-F1 of 0.98. The vast majority of features score above 0.95; the hardest cases involve rare constructions like discourse particles (F1 0.83) and emoji (F1 0.84), where training examples are sparse and surface realisation varies widely. For common features—nouns, prepositions, determiners, pronouns—the model achieves near-perfect agreement with the rule-based tagger.

The throughput comparison clarifies what this fidelity enables. On a benchmark of 2.4 million tokens drawn evenly from the seven training domains, running on a single NVIDIA A6000 GPU with batch size 128:

System          Tokens per second
Nini's MAT                 2,105
BIBERPLUS                  4,631
NEUROBIBER               117,478

NEUROBIBER processes tokens 56x faster than the standard reference implementation. The Common Crawl ballpark of 756 days falls to 13.6 days on eight GPUs, which is still substantial, but within reach of academic research budgets.

Choosing Between Tools

NEUROBIBER excels when the task involves processing large corpora at speed and binary presence/absence signals suffice for downstream analysis. Applications that benefit from throughput and tolerate the minor fidelity loss include characterising the stylistic distribution of a web crawl, filtering documents by register, and generating features for machine learning pipelines that will learn their own weightings.

BIBERPLUS remains preferable when exact feature counts matter, when texts are long enough for frequency estimates to stabilise, when the integrated analytical tools (PCA, factor analysis) that ship with the package are needed, or when the research question demands the certainty that comes from explicit pattern matching rather than neural approximation. Generating training data for downstream models, auditing neural predictions against ground truth, analysing individual texts where every feature instance may matter—these applications warrant the slower, more precise tool.

The HuggingFace model for NEUROBIBER produces a 96-dimensional binary vector whose dimensions simply correspond to the documented feature codes. In the next section we'll consider how to verify that these rapidly extracted features preserve the stylistic information that made the framework worth accelerating to begin with.