Linguistic Style at Scale

Why style matters and why it's hard to extract

Style as Stable Signal

The residue left when content is subtracted from a text, the systematic patterns of construction that persist across topics, constitutes what computational linguists mean by style. Two restaurant reviews discussing identical meals may diverge entirely in how they construct that discussion: one deploying hedged formulations and elaborate subordination, the other favouring blunt declaratives with minimal embedding. Two academic papers arguing the same thesis may load their sentences with nominalisations or prefer verbal constructions, may scatter stance markers liberally or maintain studied neutrality. These patterns persist when authors change subjects, making style a more reliable signal of authorship than the vocabulary that tracks topic.

The distinction between style and register repays careful attention. Register denotes the configuration of linguistic forms suited to particular communicative contexts: news articles exhibit one register, text messages another, academic prose a third. Each carries expectations about formality, sentence architecture, and lexical choice that competent writers internalise and reproduce. Style, by contrast, captures the systematic ways individuals or groups deviate from those contextual expectations. A journalist whose news copy features unusually elevated rates of first-person pronouns has not violated the register of journalism; she has expressed a distinctive style within it.

This framing, elaborated in Biber and Conrad's work on register, genre, and style, renders style analysis complementary to register analysis rather than redundant with it. Where register analysis reveals the linguistic forms a context demands, style analysis illuminates how writers navigate those demands idiosyncratically—the choices they make when the register permits latitude.

The Range of Applications

The practical applications extend well beyond determining who wrote what, though authorship verification remains central. Style representations feed into systems for transferring stylistic properties while preserving semantic content, for producing machine translations that maintain authorial voice rather than flattening it into translationese, for identifying the authors of threatening messages or verifying disputed documents in forensic contexts, and for detecting coordinated inauthentic behaviour through the stylistic clustering that emerges when multiple accounts share an operator. Each application presupposes some method for representing style computationally and comparing those representations across texts, which raises the question of what such representations should contain.

Biber's Multidimensional Analysis

The framework that has dominated corpus linguistics for four decades originated in Douglas Biber's 1988 monograph Variation Across Speech and Writing. Multidimensional Analysis enumerates a set of lexical and syntactic features (pronoun distributions, passive constructions, stance markers, subordination patterns, nominalisation rates) and tracks their frequencies across texts. Factor analysis then combines correlated features into interpretable dimensions of variation, each capturing some axis along which texts systematically differ.
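
A minimal sketch of that pipeline's shape, assuming a toy two-feature inventory (the regex "features" and the corpus below are illustrative stand-ins, not Biber's actual definitions): count normalised frequencies per text, stack them into a matrix, and let factor analysis recover the latent dimensions.

```python
# Sketch of the MDA pipeline: per-text feature frequencies, normalised
# per 1,000 tokens, then factor-analysed into latent dimensions.
import re
import numpy as np
from sklearn.decomposition import FactorAnalysis

FEATURES = {
    "first_person_pronouns": re.compile(r"\b(i|we|me|us)\b", re.I),
    "contractions": re.compile(r"\b\w+'(s|re|ve|ll|d|t|m)\b", re.I),
}

def feature_vector(text: str) -> np.ndarray:
    n_tokens = max(len(text.split()), 1)
    counts = [len(pat.findall(text)) for pat in FEATURES.values()]
    return np.array(counts, dtype=float) / n_tokens * 1000

texts = [
    "I think we'll manage, don't you?",
    "We've seen it; I'd say we're fine.",
    "The committee issued its final determination.",
    "Implementation of the directive remains pending.",
]
X = np.stack([feature_vector(t) for t in texts])

# Correlated features collapse into interpretable dimensions;
# fa.components_ holds the loadings used to name each dimension.
fa = FactorAnalysis(n_components=1, random_state=0)
dimension_scores = fa.fit_transform(X)
```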

The most celebrated of these dimensions, labelled Involved versus Informational production, separates texts by communicative orientation. Conversations and personal letters load positively, featuring elevated rates of first- and second-person pronouns, contractions, and private verbs like think and feel. Academic prose and official documents load negatively, their sentences dense with nominalisations, prepositional phrases, and attributive adjectives. The dimension captures something genuine about communicative purpose that transcends any single feature—a latent variable that the individual measurements jointly indicate.
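
In Biber's procedure, a text's score on such a dimension is the sum of its standardised frequencies over the positively loading features, minus the sum over the negatively loading ones. A sketch of that computation, with an illustrative (not exhaustive) feature split:

```python
# Dimension scores: z-score each feature against the corpus, sum the
# positive-loading features, subtract the negative-loading ones.
# The split below is illustrative, not the full Dimension 1 inventory.
POSITIVE = ["first_second_pronouns", "contractions", "private_verbs"]
NEGATIVE = ["nominalisations", "prepositions", "attributive_adjectives"]

def dimension_score(freqs: dict[str, float],
                    mean: dict[str, float],
                    std: dict[str, float]) -> float:
    z = {f: (freqs[f] - mean[f]) / std[f] for f in POSITIVE + NEGATIVE}
    return sum(z[f] for f in POSITIVE) - sum(z[f] for f in NEGATIVE)
```

Conversations then score high and academic prose low, exactly as the dimension's label predicts.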

Biber's original framework tracked sixty-seven features. Subsequent research, particularly Clarke and Grieve's work on abusive language and digital registers, extended the inventory to cover constructions absent from 1980s corpora: emoji, hashtags, URLs, laughter tokens. The expanded set now numbers ninety-six features, adequate for characterising text from social media platforms as well as traditional written registers.

Prior Implementations

Three open-source tools have implemented variants of Biber tagging, each demonstrating the utility of explicit, interpretable stylistic features for corpus linguistics and natural language processing:

The Multidimensional Analysis Tagger, developed by Andrea Nini in Perl, remains the standard reference implementation, widely cited in forensic and corpus-linguistic research. biberpy, Serge Sharoff's Python port, brought the framework into contemporary NLP workflows. profiling-ud, a web app built by Brunato and colleagues on Universal Dependencies parsing, connected Biber features to the cross-linguistic annotation standard that now underlies most multilingual NLP.

What these implementations share, beyond their intellectual debt to Biber, is a common computational recipe: parse the input text, construct syntactic trees, then match hand-crafted patterns against those trees to identify and count features. The architecture is principled, mirroring how linguists conceptualise the features, but it inherits the computational costs of parsing.
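
A schematic of that recipe, using spaCy's dependency parse in place of each tool's own grammar; the two rules shown are simplified stand-ins for the hand-crafted patterns:

```python
# Schematic of the shared architecture: parse, then match patterns
# against the resulting structure. Parsing dominates the runtime.
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

def count_features(text: str) -> Counter:
    counts = Counter()
    for tok in nlp(text):
        # Passive construction: a passive subject in the parse tree.
        if tok.dep_ == "nsubjpass":
            counts["passive"] += 1
        # First-person pronoun, by POS tag plus surface form.
        if tok.tag_ == "PRP" and tok.lower_ in {"i", "we", "me", "us"}:
            counts["first_person_pronoun"] += 1
    return counts

print(count_features("The report was written hastily. I disagreed."))
# e.g. Counter({'passive': 1, 'first_person_pronoun': 1})
```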

The Throughput Problem

Nini's Multidimensional Analysis Tagger processes approximately 2,100 tokens per second on a single CPU core. For corpora of the scale that stylistics traditionally examined (a few million words, the kind of collection a graduate student might assemble for a dissertation), this throughput suffices. Processing completes overnight; the researcher returns in the morning to find features extracted and ready for analysis.

Contemporary corpora, however, operate at web scale, and there this approach hits a wall. Common Crawl, the standard pretraining corpus for large language models, contains roughly 1.1 trillion tokens. At 2,100 tokens per second with eight-way CPU parallelism, complete processing would require 756 days: more than two years of continuous computation merely to extract features that, in principle, might inform pretraining data curation, register classification, or quality filtering.
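
The arithmetic behind that figure is worth spelling out; a back-of-envelope check (the exact day count shifts by a day or two with rounding of the inputs):

```python
# Back-of-envelope cost of tagging Common Crawl at MAT throughput.
corpus_tokens = 1.1e12       # ~1.1 trillion tokens
tokens_per_second = 2_100    # single-core MAT throughput
cores = 8                    # eight-way CPU parallelism

days = corpus_tokens / (tokens_per_second * cores) / 86_400
print(f"{days:.0f} days")    # ~758 days, i.e. more than two years
```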

This is not a minor inconvenience that faster hardware will absorb. The gap between what traditional methods can process and what contemporary corpora contain spans orders of magnitude. Research questions that require stylistic characterisation of web-scale data (how does linguistic style vary across the crawled web? can stylistically anomalous documents be identified and filtered?) become computationally intractable.

Neural Approaches to Stylometry

One response to the throughput problem abandons interpretable features entirely. Rather than specifying which linguistic properties to track, the neural approach learns style representations end-to-end, letting the model discover whatever proves useful for downstream tasks.

LUAR, developed by Rivera-Soto and colleagues for authorship verification, exemplifies this strategy: self-attention combined with contrastive training produces embeddings that cluster same-author texts while separating different-author texts, scaling to datasets with hundreds of thousands of authors. The representations perform well on verification tasks but offer no purchase on what stylistic properties they encode. The 768-dimensional embedding is a black box that happens to work.
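
The training signal behind such models is easy to state even if the learned representation is not. A minimal sketch of an InfoNCE-style contrastive objective of the kind such systems use (not LUAR's actual code): embeddings of two texts by the same author are pulled together, and every other pairing in the batch serves as a negative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(a: torch.Tensor, b: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """a[i] and b[i] embed two texts by the same author; every
    cross-pairing in the batch is treated as a negative."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = (a @ b.T) / temperature    # cosine-similarity matrix
    targets = torch.arange(a.size(0))   # the diagonal is the positive
    return F.cross_entropy(logits, targets)

# Toy batch: 4 authors, 768-dimensional embeddings.
loss = contrastive_loss(torch.randn(4, 768), torch.randn(4, 768))
```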

LISA, from Patel and colleagues, attempts to recover interpretability by using GPT-3 to generate annotations for 768 style attributes and training models to predict these labels. The attributes are human-readable, but they operate at a different granularity than Biber features.
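
The distillation step this describes can be sketched as a multi-label prediction problem; the layer sizes, the binary treatment of attributes, and the bare linear head are assumptions for illustration, not LISA's actual architecture.

```python
import torch
import torch.nn as nn

N_ATTRIBUTES = 768  # LISA's attribute inventory size

# A linear head over some text encoder's output, trained against the
# LLM-generated annotations, here treated as binary labels.
head = nn.Linear(768, N_ATTRIBUTES)
criterion = nn.BCEWithLogitsLoss()

encodings = torch.randn(16, 768)                        # encoder outputs
llm_labels = torch.randint(0, 2, (16, N_ATTRIBUTES)).float()

loss = criterion(head(encodings), llm_labels)
loss.backward()
```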

NEUROBIBER aims to resolve this tension: preserving the interpretability of explicit Biber features while extracting them at Transformer inference speeds.