Recovering Classic Dimensions

Empirical validation through PCA and authorship verification

A neural model that predicts Biber features rapidly is useful only if those predictions preserve the linguistic information that justified the framework's prominence. The acceleration would be hollow if it came at the cost of the interpretability and empirical grounding that distinguish Biber features from arbitrary text statistics. Two demonstrations address this concern: one examining whether classic dimensions of variation emerge when principal component analysis is applied to neural features, another testing whether those features support a practical natural language processing task without task-specific retraining.

Principal Components and Biber's Dimensions

The first validation applies PCA to NEUROBIBER-extracted features across the CORE corpus, a register-balanced collection assembled specifically to support research on linguistic variation. The corpus spans conversation, fiction, news, academic prose, and multiple genres of web text, offering exactly the diversity needed to test whether extracted features track meaningful stylistic differences. If NEUROBIBER captures true variation, principal components computed over its feature vectors should yield interpretable dimensions resembling those that emerged from Biber's original factor analyses on manually coded data.
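A minimal sketch of that analysis, assuming the per-document NEUROBIBER vectors for CORE have already been extracted and saved; the file names and feature labels below are placeholders, not artifacts released with the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical inputs: one 96-dimensional NEUROBIBER vector per CORE document,
# plus the corresponding feature names.
X = np.load("core_neurobiber_features.npy")                  # shape: (n_docs, 96)
feature_names = np.load("feature_names.npy", allow_pickle=True)

# Standardise so that high-frequency features do not dominate the components.
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=5)
scores = pca.fit_transform(X_std)       # per-document positions on each component

# Inspect which features load most heavily on a chosen component (here the second).
pc2 = pca.components_[1]
for idx in np.argsort(np.abs(pc2))[::-1][:10]:
    print(f"{feature_names[idx]:<30} {pc2[idx]:+.3f}")
```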

The Involved/Informational Axis

The second principal component recovers Biber's Dimension 1 with striking fidelity. When registers are plotted along this axis, they cluster exactly as the framework predicts.

Texts loading high on PC2 (e.g. discussion forums, Q&A sites, personal blogs) share the properties Biber associated with 'involved' production: elevated rates of first- and second-person pronouns, frequent contractions, abundant private verbs encoding mental states and subjective assessments. Texts loading low—technical reports, encyclopaedia articles, academic prose—exhibit the informational profile: heavy nominalisation, dense prepositional phrases, attributive adjectives stacked before nouns that themselves derive from verbs.

The features driving this separation are precisely those Biber identified four decades ago, applied now to digital registers he could not have anticipated. The dimension captures communicative orientation—whether a text addresses a present interlocutor or presents information to an absent reader—through the same syntactic and lexical markers across media that differ in every surface respect. A Reddit comment and a face-to-face conversation share properties that distinguish both from a Wikipedia article, and the neural features recover this alignment.

This is not circular validation. NEUROBIBER was trained only to predict feature presence; no supervision encouraged the model to reproduce Biber's factor structure. That the classic dimensions emerge from PCA over neural predictions, without any signal pointing toward them, suggests the features capture underlying variation rather than artifacts of the training procedure.

Where Dimensions Merge

Not every principal component maps cleanly onto a single Biber dimension. The first component groups short stories, speeches, and interviews alongside opinion pieces—genres that serve different communicative functions but share elevated rates of first-person stance expression. Biber's original analyses might have separated narrative from persuasive registers; the PCA conflates them because both load on overlapping features: personal pronouns, stance verbs, hedging constructions that mark subjective assessment.

Similarly, the third and fourth components blend explicit reference (frequent nouns, proper nouns anchoring discourse to specific entities) with what Biber called online elaboration (parenthetical insertions, the syntactic repairs characteristic of speech and informal writing). Magazine features and opinion columns cluster together despite serving distinct purposes, united by discourse-level patterns that cross-cut genre boundaries.

These partial mergings do not indicate failure. Factor analysis always involves judgment about how many dimensions to extract and how to interpret the loadings that emerge; different rotation methods and extraction criteria yield different structures from identical correlation matrices. The PCA on NEUROBIBER features produces a valid dimensional structure, not the unique correct structure; an alternative decomposition such as independent component analysis (ICA) would yield another defensible one.

Authorship Verification Without Retraining

The second demonstration tests whether NEUROBIBER features can support a practical task: authorship verification on the PAN 2020 dataset, a benchmark widely used to evaluate stylometric methods.

The Verification Problem

PAN 2020 presents pairs of texts, typically fanfiction, and poses a binary question: do these texts share an author? The dataset's difficulty derives from fanfiction's conventions. Authors deliberately modulate style depending on the fictional universe they write in: a story set in the Harry Potter universe may adopt formal British prose, while a story in the same author's original setting may use casual American idiom. Because surface stylistic markers shift by design, authorship signals must reside in patterns deeper than vocabulary or obvious register markers: the stable syntactic fingerprint that persists when an author consciously varies their voice.

Experimental Configuration

Three systems provide comparison:

A random baseline assigns labels without examining the texts, establishing the floor at F1 0.50 that any useful system must exceed.

A RoBERTa bi-encoder, trained on 42,000 text pairs with a contrastive objective that pulls same-author pairs together in embedding space while pushing different-author pairs apart, represents the contemporary neural approach: let the transformer discover whatever features prove useful, without specifying what those features should be.

The NEUROBIBER system extracts a 96-dimensional feature vector from each text, concatenates the two vectors into a 192-dimensional representation of the pair, and trains a Random Forest classifier to predict same-author versus different-author. There is no fine-tuning of the feature extractor and no contrastive learning: just off-the-shelf features passed to an off-the-shelf classifier.
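The following sketch shows what that feature-based pipeline could look like in scikit-learn, assuming the NEUROBIBER vectors for both texts in each PAN pair have already been extracted; the file names and train/test split are illustrative, not the paper's exact setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical inputs: precomputed NEUROBIBER vectors for the two texts in each
# PAN 2020 pair, plus the same-author labels.
A = np.load("pan20_text_a_features.npy")     # shape: (n_pairs, 96)
B = np.load("pan20_text_b_features.npy")     # shape: (n_pairs, 96)
y = np.load("pan20_same_author.npy")         # shape: (n_pairs,), 1 = same author

# Concatenate the two 96-dimensional vectors into one 192-dimensional pair representation.
X = np.hstack([A, B])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)
print("F1:", f1_score(y_test, clf.predict(X_test)))
```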

Results and Implications

| Model | F1 |
| --- | --- |
| Random baseline | 0.50 |
| RoBERTa bi-encoder | 0.78 |
| NEUROBIBER + Random Forest | 0.77 |

NEUROBIBER achieves F1 of 0.77, one point below the fine-tuned bi-encoder and twenty-seven points above chance. The gap between neural and feature-based approaches narrows to a margin that many applications would consider negligible, while the resource differential remains substantial. Generating NEUROBIBER vectors for the entire PAN dataset required approximately thirty minutes; training and tuning the bi-encoder requires further GPU-hours and hyperparameter search.

The comparison illuminates a deeper point about what interpretable features offer. For applications where explanation matters, such as forensic proceedings, legal disputes over document authorship, or any context where a decision must be justified to a sceptical audience, the bi-encoder's 768-dimensional embedding provides nothing to point at. The dimensions encode whatever proved useful for contrastive training; no one can say what dimension 417 represents, or why two texts that differ on it might nonetheless share an author.

The NEUROBIBER vector, by contrast, decomposes into named features with documented linguistic definitions. An analyst can report that both texts show elevated rates of split infinitives and stranded prepositions, that both avoid nominalisation in favour of verbal constructions, that both deploy second-person pronouns at three times the register baseline. Whether such an explanation satisfies a particular legal standard is a separate question, but the explanation is at least possible to construct: a property the neural embedding lacks entirely.
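As a purely illustrative sketch of how such an explanation might be assembled, the snippet below compares two texts' per-feature rates against a register baseline and reports the features on which both deviate in the same direction; the feature names and rates are invented for the example and are not values from the paper.

```python
# Toy per-feature rates, made up purely for illustration.
text_a_rates = {"second_person_pronouns": 0.9, "nominalisations": 0.2, "contractions": 1.4}
text_b_rates = {"second_person_pronouns": 0.8, "nominalisations": 0.3, "contractions": 1.5}
register_baseline = {"second_person_pronouns": 0.3, "nominalisations": 0.6, "contractions": 0.7}

def shared_deviations(a, b, baseline, top_k=5):
    """Features on which both texts deviate from the baseline in the same direction."""
    shared = []
    for name, base in baseline.items():
        da, db = a[name] - base, b[name] - base
        if da * db > 0:                       # same direction of deviation
            shared.append((name, da, db))
    return sorted(shared, key=lambda t: abs(t[1]) + abs(t[2]), reverse=True)[:top_k]

for name, da, db in shared_deviations(text_a_rates, text_b_rates, register_baseline):
    print(f"{name}: text A {da:+.2f}, text B {db:+.2f} relative to register baseline")
```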

Stable Fingerprints and Their Applications

The PAN results illustrate what Biber features provide that content-based signals cannot: stability across topic and register shifts. An author who changes subjects changes their vocabulary; an author who deliberately disguises their writing may adopt unfamiliar lexical items and avoid characteristic phrases. Syntactic patterns prove harder to manipulate consciously: subordination rates, passive voice preferences, the distribution of pronouns across grammatical positions. The fingerprint resides in how sentences are constructed, not in what they say, and those construction patterns persist even as the content being expressed varies.

Forensic and Literary Contexts

This stability constitutes the core value proposition for forensic linguistics. Threatening messages, disputed wills, and anonymous online accounts linked to criminal activity all call for features the analyst can treat as objective diagnostics even when the author knows they are being analysed. Deep syntactic patterns resist conscious control in ways that vocabulary choice does not; most writers cannot reliably alter their subordination rates on demand, even when they know such rates might identify them.

Literary scholarship poses different questions with similar requirements. Tracing influence, identifying imitation, tracking stylistic development across an author's career: each requires features that capture something more systematic than topic or period vocabulary. Reporting that two novels occupy nearby regions of embedding space answers a question no one asked. By contrast, reporting that both nominalise at twice the rate of their contemporaries, that both show an unusual preference for sentence-initial adverbials, that both avoid the agentless passive their contemporaries favour opens interpretive possibilities.

Hybrid Architectures

Nothing prevents combining the 96 Biber features with dense neural embeddings in hybrid architectures. The Biber vector captures low-level syntactic patterns that the features were designed to track; the transformer embedding captures semantics, pragmatics, and whatever else the model learned during pretraining. Concatenation produces a representation that draws on both, and downstream tasks can weight each signal according to its utility.
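A minimal sketch of such a concatenation, using a generic sentence encoder as a stand-in for whatever dense embedding a project actually uses; the model name and dimensionalities below are assumptions, not the paper's configuration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any sentence encoder works as the dense half; this model name is just an example.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def hybrid_vector(text: str, biber_features: np.ndarray) -> np.ndarray:
    """Concatenate a 96-dim Biber-style vector with a dense semantic embedding."""
    dense = encoder.encode(text)                     # e.g. 384-dim for this encoder
    return np.concatenate([biber_features, dense])   # 96 + 384 = 480-dim hybrid
```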

This hybrid strategy proves particularly valuable for domain adaptation in language model fine-tuning. Suppose a research group wants to specialise a foundation model for a target domain—say legal writing, or nineteenth-century fiction, or medical case reports. Rather than parsing every candidate document with full syntactic analysis, NEUROBIBER can scan millions of samples rapidly and flag those whose stylistic profile matches the target distribution. The coarse filter reduces what must be parsed or manually reviewed, focusing annotation effort on documents likely to be relevant.
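One way such a coarse filter could be implemented, assuming a small seed set of target-domain documents with NEUROBIBER profiles already extracted; the distance measure and threshold below are illustrative choices, not a method described in the paper.

```python
import numpy as np

def stylistic_filter(candidates, seed_profiles, keep_fraction=0.05):
    """Keep candidates whose NEUROBIBER profile is closest to the target domain.

    candidates:    (n_candidates, 96) feature matrix for the unlabelled pool
    seed_profiles: (n_seed, 96) feature matrix for known target-domain documents
    """
    target = seed_profiles.mean(axis=0)
    scale = seed_profiles.std(axis=0) + 1e-8
    # Distance from the target profile, measured in per-feature standard deviations.
    dists = np.linalg.norm((candidates - target) / scale, axis=1)
    cutoff = np.quantile(dists, keep_fraction)
    return np.where(dists <= cutoff)[0]    # indices of the stylistically closest documents
```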

High Throughput Unlocked

The practical consequence of this 56-fold acceleration is that research programmes previously infeasible can become routine. Tracking stylistic change across decades of archived journalism, identifying register outliers in web crawls, monitoring social media streams for the stylistic anomalies that might indicate coordinated inauthentic behaviour: each requires stylistic characterisation at scales that parser-based methods cannot approach. At 117,000 tokens per second, the bottleneck shifts from feature extraction to whatever analysis the features feed.

Dual-Use Considerations

The same features that help verify threatening messages can compromise the anonymity of whistleblowers; the same throughput that enables research enables surveillance. Stylometric methods cut both ways, and the interpretability that makes NEUROBIBER useful for forensic explanation also makes it useful for identifying writers who prefer not to be identified.

The paper's authors acknowledge this tension explicitly. Deployment in sensitive contexts requires attention to consent, data governance, and the potential for misuse. The interpretability that distinguishes NEUROBIBER from black-box alternatives is, counterintuitively, an asset for oversight: because the features driving any classification decision can be inspected, audits for bias or discriminatory patterns become possible in ways that opaque neural systems resist.

None of this resolves the dual-use problem inherent to capability research. Useful tools can be misused; power that enables also endangers. What interpretability offers is not a solution but a precondition for solutions, the transparency that makes accountability possible when sought.


What NEUROBIBER demonstrates, across these validations, is that the old dichotomy between interpretable features and neural scale was false. Biber's carefully defined linguistic constructions can be extracted at transformer speeds. Common Crawl can be processed in weeks rather than years. Fine-tuned authorship systems can be matched with concatenated feature vectors and a Random Forest.

The features Biber enumerated in 1988, extended for digital registers and accelerated through knowledge distillation, still capture meaningful variation in how language is constructed. NEUROBIBER makes extracting them fast enough to matter at the scales contemporary research demands.