Routes as Token Sequences

An autoregressive transformer predicts paths edge by edge, conditioned on natural language queries

Classically Inexpressible Problems

The algorithms surveyed in part 1 share an assumption: that the user's objective can be formalised as a function over the components of a route. Minimise arrival time. Minimise transfers. Minimise some weighted combination. This assumption is what makes the algorithms possible. Dijkstra requires edge weights; RAPTOR requires a criterion that can be tracked across rounds; Transfer Patterns requires that optimality be stable enough to precompute.

Some requests break this assumption entirely. "Route me via a grocery store to pick up dinner ingredients." This is not an edge-weight problem. No individual edge is more or less grocery-adjacent; the constraint applies to the route as a whole. "Plan a kid-friendly day trip with all stops within a twenty-minute drive." This is a knapsack problem embedded inside a routing query. "Find me a route with a coffee shop and a post box along the way, avoiding highways." This combines waypoint satisfaction with a preference constraint, neither of which decomposes over edges.

Current production systems handle these through what Zhao et al. (2024) call "cost modifiers": rules that rescale edge weights based on attributes. Want to avoid highways? Multiply highway edge weights by a penalty. The paper identifies three reasons this is inadequate. First, cost modifiers can only express preferences that decompose linearly over edges. "Avoid highways" decomposes; "stop at a coffee shop" does not. Second, production routing systems have become complex ensembles of hand-tuned combinatorial algorithms, each supporting a narrow class of query. Adding new capabilities means adding new algorithms into an already intricate codebase. Third, cost modifiers cannot exploit the rich metadata available in modern road networks -- colocated businesses, surface types, points of interest -- because engineering a modifier for each attribute is impractical.
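To make the first point concrete, a cost modifier is nothing more than a per-edge reweighting, as in the minimal sketch below; the attribute names and penalty value are illustrative, not taken from the paper.

    # A cost modifier rescales each edge weight using only that edge's own
    # attributes. The attribute names and the penalty factor are placeholders.
    def apply_cost_modifier(edges, highway_penalty=3.0):
        reweighted = {}
        for edge_id, edge in edges.items():
            weight = edge["travel_time"]
            if edge["road_type"] == "highway":
                weight *= highway_penalty   # "avoid highways" decomposes per edge
            reweighted[edge_id] = weight
        return reweighted

    # A request like "stop at a coffee shop" cannot be written this way: no
    # per-edge rescaling expresses a predicate over the route as a whole.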

The paper states the limitation concretely: Google Maps can plan a route with a specified intermediate stop if you give it an address, but it cannot automatically plan a stop to buy dental floss. It can provide transit routes, but not when an intermediate stop is also requested. These are not hypothetical gaps.

Why Prior Graph Net Approaches Failed to Scale

The idea of applying neural networks to routing problems is not new. Several threads of prior work are relevant, and the paper is careful to distinguish its approach from them.

Graves et al. (2016) used a Differentiable Neural Computer -- a recurrent network augmented with external read-write memory -- to route on the London Underground. The DNC is an intricate architecture designed for tasks requiring complex memory manipulation. It produced results on a small transit graph but was never demonstrated at the scale of a real road network and required a highly specialised implementation.

The algorithmic reasoning line of work, including Veličković et al. (2019) and Ibarz et al. (2022), trains graph neural networks to simulate the intermediate computations of classical algorithms like Dijkstra's. These "neural algorithmic learners" are intellectually interesting but have been tested only on graphs of roughly 100 nodes. Road networks are two to three orders of magnitude larger.

Pointer networks (Vinyals et al., 2015) and attention-based methods for the Travelling Salesman Problem (Kool et al., 2019) apply learned models to combinatorial optimisation on graphs, but they abstract away from road geometry entirely. They solve mathematical graph problems, not navigation problems with real-world metadata.

The common thread is that none of these approaches has been demonstrated on graphs of the size that routing applications require. A 10-kilometre stretch of urban road involves, on average, 600 road segments. The benchmark networks in this paper contain 10,000 vertices and 25,000 edges. Naively scaling a GNN or attention mechanism to this size is prohibitive: transformer memory is quadratic in sequence length, and GNN message-passing at this scale requires either extensive subsampling or distributed computation.

Autoregressive Prediction

The paper's key contribution is not the transformer architecture. It is the observation that routing can be decomposed into a sequence of local decisions, each of which can be made by attending only to a small neighbourhood of the graph.

A route is a sequence of edges. Predicting a route means predicting one edge at a time, conditioned on the query, the destination, and the edges chosen so far. This is exactly the autoregressive paradigm of language modelling: predict the next token given the preceding tokens. The analogy is not merely suggestive; the correspondence is literal. The model is a decoder-only transformer, the same architecture family as GPT. It processes a sequence of "tokens" representing the query (as text), the destination edge, the partial route so far, a "receptive field" of nearby edges, and candidate next edges, and it outputs a probability distribution over possible next edges.
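In Python-flavoured pseudocode, the decoding loop might look like the sketch below. The helper names (build_receptive_field, graph.outgoing_edges) and the greedy selection are illustrative assumptions; only the overall structure, predicting one edge at a time conditioned on the query, destination, partial route, receptive field, and candidates, comes from the paper.

    # Greedy decoding sketch; the model and graph helpers are hypothetical stand-ins.
    def decode_route(model, query_tokens, destination_edge, start_edge, graph,
                     max_steps=600):
        route = [start_edge]
        for _ in range(max_steps):
            current = route[-1]
            receptive_field = build_receptive_field(graph, current)  # ~500 nearby edges
            candidates = graph.outgoing_edges(current)                # possible next edges
            # The input sequence mirrors the paper's five components.
            logits = model(query_tokens, destination_edge, route,
                           receptive_field, candidates)
            next_edge = candidates[int(logits.argmax())]
            route.append(next_edge)
            if next_edge == destination_edge:
                break
        return route

The paper's actual inference replaces the greedy selection with beam search, described below.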

The receptive field is the mechanism that makes the approach tractable. Rather than inputting the entire road network, the model sees only the 500 or so edges surrounding the current position. This reduces memory by a factor of over 2,000 compared to processing the full graph. The paper argues that this locality is not merely a computational convenience but reflects the actual structure of routing decisions. Road networks are designed for humans making local navigational choices. Businesses cluster in predictable patterns. A model attending only to its immediate surroundings can nonetheless produce globally sensible routes, much as a language model producing tokens one at a time can generate coherent paragraphs.
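One plausible construction of the receptive field is a breadth-first expansion over edge adjacency, truncated at a fixed budget. The sketch below assumes that construction, which the paper does not spell out in these terms, and borrows only the roughly 500-edge budget.

    from collections import deque

    # Collect up to `budget` edges around the current edge by breadth-first
    # expansion. `graph.adjacent_edges` is a hypothetical helper.
    def build_receptive_field(graph, current_edge, budget=500):
        seen = {current_edge}
        queue = deque([current_edge])
        field = []
        while queue and len(field) < budget:
            edge = queue.popleft()
            field.append(edge)
            for neighbour in graph.adjacent_edges(edge):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        return field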

This claim is testable, and the paper tests it. An ablation study on a synthetic grid-world benchmark varies the receptive field from 8 edges to the full network (3,516 edges). Performance improves substantially from 8 to 256 edges, then plateaus. There are no statistically significant gains from seeing the full graph versus seeing 256 local edges. This is a strong result. It means the autoregressive decomposition is not losing important global information, at least for routes of the lengths tested.

Architecture

The model's input sequence consists of five components: the natural language query (as text tokens), the destination edge, the partial route taken so far (as a sequence of edges), the receptive field (edges in the local neighbourhood), and candidate next edges. Each edge is featurised through a multi-layer perceptron that processes its metadata: road type, length, speed limit, colocated points of interest.

This last point deserves emphasis. Language models learn an embedding table that maps each token in a fixed vocabulary to a vector. This model does not do that. The vocabulary of edges varies across cities. Instead of memorising specific edges, the model processes edge features through an MLP, producing representations that generalise across networks. This is why the model can route in a city it has never seen during training. Berkeley, the evaluation city, is held out entirely from the training data.
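A minimal sketch of the difference, in PyTorch-style code. The feature layout and layer sizes are illustrative assumptions; the point is that an edge's representation is computed from its metadata rather than looked up in a city-specific vocabulary.

    import torch.nn as nn

    # Instead of nn.Embedding(num_edges, dim), which would tie the model to one
    # network's edge vocabulary, each edge is encoded from its attributes, so the
    # same weights apply to edges the model has never seen (e.g. in Berkeley).
    class EdgeFeaturiser(nn.Module):
        def __init__(self, num_features=32, dim=1024):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(num_features, dim),
                nn.ReLU(),
                nn.Linear(dim, dim),
            )

        def forward(self, edge_features):   # (num_edges, num_features)
            # edge_features packs road type (one-hot), length, speed limit, and
            # colocated points of interest into a fixed-width vector.
            return self.mlp(edge_features)  # (num_edges, dim)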

The transformer has 8 blocks, embedding dimension 1024, and intermediate dimension 2048. The first position's output serves as the "problem embedding." Candidate next edges are scored by dot product with this embedding, yielding logits for a softmax distribution. Training uses cross-entropy loss against ground-truth next edges, on a dataset of 2 million (query, route) pairs. The ground-truth routes were computed by an expensive offline procedure: peeking at the query labels and applying an ensemble of combinatorial algorithms, at a cost of 0.3 CPU hours per route.
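The scoring and training step can be sketched as follows; beyond the dot product against the first position's output and the cross-entropy objective, the shapes and names are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    # `hidden` is the transformer output over the whole input sequence; position 0
    # is treated as the problem embedding. `candidate_embs` holds the featurised
    # candidate next edges; `target_index` marks the ground-truth next edge.
    def next_edge_loss(hidden, candidate_embs, target_index):
        problem_embedding = hidden[0]                    # (dim,)
        logits = candidate_embs @ problem_embedding      # (num_candidates,)
        return F.cross_entropy(logits.unsqueeze(0),
                               torch.tensor([target_index]))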

Inference uses beam search with width 10, plus a secondary scoring model trained contrastively to select among the beam's final candidates. The ablation studies show that this scoring model provides a significant boost over selecting by cumulative log-probability alone.
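Sketched in the same style, inference is ordinary beam search with the learned scorer applied at the end; `step_log_probs` and `scoring_model` are hypothetical stand-ins for the model's per-step distribution and the contrastively trained selector.

    # Beam search with width 10, re-ranked by a secondary scoring model.
    def beam_search(step_log_probs, scoring_model, start_edge, destination_edge,
                    beam_width=10, max_steps=600):
        beams = [([start_edge], 0.0)]                     # (route, cumulative log-prob)
        finished = []
        for _ in range(max_steps):
            expansions = []
            for route, score in beams:
                for edge, logp in step_log_probs(route):  # candidate next edges
                    expansions.append((route + [edge], score + logp))
            expansions.sort(key=lambda item: item[1], reverse=True)
            beams = []
            for route, score in expansions[:beam_width]:
                if route[-1] == destination_edge:
                    finished.append((route, score))
                else:
                    beams.append((route, score))
            if not beams:
                break
        # Selecting with the contrastive scorer beats cumulative log-probability alone.
        return max(finished, key=lambda item: scoring_model(item[0]))[0]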

Results

The benchmark covers over one million natural language queries on road networks of 23 US cities. Queries span waypoint routing ("guide me to a grocery store, a gas station, and a store selling energy drinks"), trip planning ("plan a kid-friendly adventure, all stops within 20 minutes"), and preference constraints ("include a bakery and a car repair shop, avoiding highways"). A single query communicates 1.8 tasks on average, with a maximum of 5.

Evaluation is on held-out queries in a held-out city. Success rates:

Task                                       Success rate
Waypoint routing (errands)                 92.0% +/- 3.7%
Waypoint routing (specific locations)      38.0% +/- 6.6%
Waypoint routing (mixed)                   34.0% +/- 6.9%
Trip planning                              68.0% +/- 4%

"Errands" are waypoint requests satisfiable by multiple point-of-interest categories (e.g. "buy a water bottle" can be fulfilled at grocery stores, convenience stores, or gas stations). "Locations" require a specific category (e.g. "visit a pharmacy"). The flexibility of errands makes them much easier. Among successfully completed waypoint routes, the median excess travel time over the optimal is 0.8 seconds, which is negligible.

The baselines are the important context. Cost modifiers achieve 0% success. Electrical flows (Sinop et al., 2021), generating 4,096 diverse candidate routes and checking each against the constraints, achieve 1.3% +/- 1.3%. These are not strawmen; they represent the full capability of classical methods on this class of problem, implemented with deliberate advantages (the cost modifier baseline tries every possible modifier and checks if any works). The learned model's 34-92% success rates, while far from perfect, represent a qualitative change in capability.

Ablations

Several ablation results are worth noting for what they reveal about the model's behaviour.

The receptive field plateau at 256 edges, already mentioned, suggests the model learns to make good decisions from local information. But the point-of-interest density experiment adds nuance. On the grid-world benchmark, increasing POI density makes waypoint routing easier (more options to satisfy constraints) but makes next-edge prediction harder (more plausible next edges, higher entropy over solutions). This dissociation between prediction accuracy and task success means that next-edge accuracy is a poor proxy for routing quality. The model can be uncertain about which specific route to take while still reliably finding some satisfactory route.

The scaling study shows gains from increasing model capacity that saturate relatively quickly. A 6-layer, 256-dimension model captures most of the benefit; scaling to 10 layers and 1024 dimensions provides only modest additional improvement. The authors interpret this as evidence that the simple transformer architecture may not be optimal. This is a reasonable reading. More expressive architectures, graph-structured attention mechanisms, or methods that incorporate CRP-like hierarchical information (as the alternative routes paper does) might extract more from the same data.

Early Days

The paper frames itself explicitly as a proof of concept. The benchmark networks are small by production standards (10,000 vertices versus the millions in a metropolitan road network). The 600-segment route length cap is acknowledged as a limitation imposed by training efficiency; the authors note that handling longer routes would require integrating hierarchical coarsening techniques like those in CRP. The receptive field constraint may cause problems on long routes where a locally reasonable decision leads to a globally suboptimal outcome that cannot be corrected.

The training data is generated synthetically: natural language queries produced by a language model from a manually curated taxonomy, labelled with structured annotations, and solved by brute-force combinatorial methods. This is a legitimate approach for a benchmark but leaves open the question of how well the model would handle the distribution of queries that real users actually issue.

Perhaps most importantly, the paper does not claim to have replaced classical routing. The query types it handles — waypoint routing, trip planning, preference constraints expressed in natural language — are precisely those that classical algorithms cannot address. For a standard A-to-B query, running Dijkstra or RAPTOR remains faster, more reliable, and exact. The paper's contribution is demonstrating that there exists a class of routing problems for which learned models substantially outperform classical methods, and that autoregressive transformers provide a workable architecture for this class.

The Hybrid Future

Read together, the two papers suggest a layered architecture rather than a wholesale replacement. Classical algorithms and preprocessing frameworks (CRP, Transfer Patterns, RAPTOR) provide the infrastructure: graph decomposition, fast point-to-point queries, hierarchical structure. Learned components operate within and on top of this infrastructure, handling tasks that classical methods handle poorly: scoring alternative routes, satisfying complex constraints, interpreting natural language preferences.

The alternative routes paper demonstrates this division of labour explicitly. CRP provides the partition hierarchy; the neural network provides fast uniform-stretch prediction. The semantic routing paper is more radical in aspiration, but even it relies on classical algorithms to generate its training data and acknowledges the need for hierarchical graph coarsening to scale beyond its current benchmark.

Whether this hybrid architecture will prove sufficient for Google's needs is an open question. The semantic routing paper identifies natural extensions (richer receptive field representations, integration of graph neural networks or graph transformers, retrieval of business reviews during route finding) that could push the learned component further. The alternative routes paper's suggestion of learning path-overlap embeddings for fast clustering could improve the selection step of any alternative generation pipeline.

What seems clear is the direction of travel. The researchers who invented the classical algorithms are now publishing papers about replacing parts of those algorithms with learned components. The purely combinatorial era of routing research may not be ending, but it is acquiring a learned layer that its creators evidently believe is necessary.

References