Game Design and Framework

The adversarial roles and environment for SPAG

(arXiv:2404.10642v3 [cs.CL])

Part 2 of a series on the SPAG paper.

In this section, we cover:

  • The Adversarial Taboo-style mechanics for hidden words
  • Attacker vs. Defender roles and objectives
  • State, action, and reward design under a zero-sum setup

In this section, the paper formalizes how a language game can be built and why it serves as a suitable environment for self-play. The authors propose an adversarial variant of the classic “Taboo” game, where players have conflicting objectives. This design establishes clear win/loss conditions, enabling a zero-sum competitive scenario that encourages the emergence of complex language strategies and improved reasoning.

2.1 Adversarial Language Games

The Rationale for Game-Based Training

Natural language is high-dimensional: any utterance can contain a broad mix of semantics, syntax, and contextual cues. Traditional supervised learning, where each utterance is labeled correct or incorrect, provides limited feedback and little impetus for creativity or strategic thinking. By contrast, adversarial language games can reward or penalize entire sequences of actions (conversational turns), mimicking the multi-step decision process that arises in real dialogues.

Researchers have experimented with multi-agent setups in NLP, ranging from negotiation tasks to collaborative puzzle-solving. Many efforts, however, do not strongly enforce a single winner–loser outcome or require the full breadth of natural language. Instead, they may focus on truncated dialogues, single-turn interactions, or assume cooperative agents. Here, the adversarial angle is especially critical, because it tends to stress-test the model’s reasoning in ways that cooperative tasks might not.

2.2 The Adversarial Taboo Game

Basic Mechanics

Inspired by the classic board/card game Taboo, the Adversarial Taboo game (as introduced in the paper) revolves around a hidden target word $w$. One player, called the Attacker, knows $w$ but is forbidden from speaking it directly. Meanwhile, the Defender, who does not know $w$, tries to guess it before accidentally saying it.

  1. Hidden Target Word: Drawn from a set $V_{\text{target}}$ (often a large vocabulary of common words).
  2. Attacker’s Role: Induce the Defender to inadvertently say the target word. The Attacker can describe or hint at related concepts without uttering $w$ itself.
  3. Defender’s Role: Infer the hidden word without blurting it out by mistake. When confident, the Defender can formally guess by saying, “I know the word! It is $w$!”
  4. Turns & Termination: The conversation proceeds for a pre-set maximum number of turns $T_0$. If no one wins by then, the game is a tie.

Because each player’s objective strictly opposes that of the other, the game is naturally zero-sum. Crucially, each side must engage in strategic reasoning: the Attacker to trick or mislead, and the Defender to deduce, all while neither can overtly break the rules.
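To make these termination rules concrete, here is a minimal Python sketch of a referee that applies the win conditions above. The function names, the guess-phrase matching, and the whole-word regex are illustrative assumptions, not the paper’s implementation:

```python
import re

# Sketch of a referee for the Adversarial Taboo win conditions; the guess
# phrase and whole-word matching are assumptions, not the paper's code.
GUESS_PREFIX = "i know the word!"

def said_word(text: str, target: str) -> bool:
    """True if `target` appears as a whole word in `text` (case-insensitive)."""
    return re.search(rf"\b{re.escape(target)}\b", text.lower()) is not None

def judge_turn(role: str, utterance: str, target: str) -> str | None:
    """Return 'attacker' or 'defender' if this turn ends the game, else None."""
    text = utterance.lower()
    if role == "defender":
        if text.startswith(GUESS_PREFIX):
            # Formal guess: Defender wins iff the guess names the target word.
            return "defender" if said_word(text, target) else "attacker"
        if said_word(text, target):
            # Defender uttered the hidden word inadvertently: Attacker wins.
            return "attacker"
    elif role == "attacker" and said_word(text, target):
        # The Attacker is forbidden from saying the word, so it forfeits.
        return "defender"
    return None  # game continues; after T0 turns with no winner, it is a tie
```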

Figure 2: Examples of Adversarial Taboo dialogues with the same target word “conversation.” One dialogue illustrates an attacker-winning scenario, while the other shows a defender-winning scenario.

Why Taboo?

Unlike simpler tasks such as “guessing a single entity with yes/no questions,” the Taboo game fosters richer conversations. The Attacker can talk about peripheral contexts—stories, analogies, tangential hints—trying to steer the Defender into uttering the forbidden word. The Defender, meanwhile, must figure out the word from context but avoid stating it thoughtlessly. This interplay demands semantic flexibility and high-level reasoning.

2.3 Modeling the Game as a Markov Game

In formalizing Adversarial Taboo, the paper adopts a Markov game (also known as a stochastic game) perspective, which generalizes single-agent Markov decision processes (MDPs) to multi-agent settings. A Markov game is defined by a tuple $(S, A, F, r)$:

  1. State Space ($S$):
    Each state contains the conversation history and, for the Attacker’s internal state, the hidden target word $w$. Specifically, the game’s states alternate between states $s_{t-1}$ from which the Attacker speaks and intermediate states $s'_t$ from which the Defender replies (the same states that appear in the trajectory probability below).

  2. Action Space ($A$):
    The set of all possible natural-language outputs. Formally, an action $a \in A$ may be any token sequence over a vocabulary $V_{\text{token}}$. This space is enormous compared with the action sets of board games, which is why language-based RL is substantially harder.

  3. Transition Function ($F$):
    Since the conversation is essentially deterministic, $F$ appends the chosen utterance to the dialogue state. After the Attacker’s turn, the game transitions to an intermediate state where the Defender moves next.

  4. Reward Function ($r$):
    A zero-sum structure is imposed: at the terminal state, the Attacker’s reward is positive when it wins (the Defender utters $w$), negative when the Defender wins (a correct formal guess), and zero on a tie. The Defender’s reward is the exact negative of the Attacker’s, so the two always sum to zero.

In a zero-sum setting, the Attacker aims to maximize $R(\tau)$, the sum of rewards over the trajectory $\tau$, whereas the Defender strives to minimize $R(\tau)$ (equivalently, to maximize $-R(\tau)$).
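A minimal sketch of this terminal reward, assuming win/loss rewards of $\pm 1$ (the zero-sum description fixes only the signs; the exact magnitudes are an assumption here):

```python
from enum import Enum

class Outcome(Enum):
    ATTACKER_WIN = "attacker_win"   # Defender uttered the target word
    DEFENDER_WIN = "defender_win"   # Defender's formal guess was correct
    TIE = "tie"                     # T0 turns elapsed with no winner

def terminal_rewards(outcome: Outcome) -> tuple[float, float]:
    """Zero-sum (attacker, defender) terminal rewards; they always sum to 0."""
    if outcome is Outcome.ATTACKER_WIN:
        return +1.0, -1.0
    if outcome is Outcome.DEFENDER_WIN:
        return -1.0, +1.0
    return 0.0, 0.0
```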

Trajectory Probability

A single episode of the game, or trajectory, can be denoted $\tau = (s_0, a_1, s_1, a_2, s_2, \ldots)$. The probability of $\tau$ under policies $\mu$ (Attacker) and $\nu$ (Defender) is:

$$P(\tau) \;=\; P(s_0) \prod_{t=1}^{T} \mu\!\left(a_t^{\text{attack}} \mid s_{t-1}\right) \, \nu\!\left(a_t^{\text{defend}} \mid s'_t\right),$$

where $s'_t$ is the intermediate state after the Attacker’s move at turn $t$ but before the Defender’s reply. Training the model then involves finding $\mu$ and $\nu$ that optimize the expected return under the reward function $r$.
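In implementations one typically works with the logarithm of this quantity, summing per-utterance log-probabilities over turns. A sketch under assumed interfaces (the `episode` structure and the two scoring callables are illustrative, not from the paper):

```python
def trajectory_log_prob(episode, attacker_logprob, defender_logprob) -> float:
    """Compute log P(tau) = sum_t [log mu(a_t | s_{t-1}) + log nu(a'_t | s'_t)].

    `episode` is a list of (role, state, utterance) triples in turn order.
    The initial-state term log P(s0) is dropped: the target word is sampled
    independently of both policies, so it does not affect optimization.
    """
    total = 0.0
    for role, state, utterance in episode:
        if role == "attacker":
            total += attacker_logprob(utterance, state)   # log mu(a | s_{t-1})
        else:
            total += defender_logprob(utterance, state)   # log nu(a | s'_t)
    return total
```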

2.4 The Dual Roles: Attacker vs. Defender

The notion of two specialized policies is pivotal:

  1. Attacker’s Policy ($\mu$):
    Conditions on the full dialogue state, including the hidden word $w$, and produces utterances designed to induce the Defender to say $w$, without ever uttering it itself.
  2. Defender’s Policy ($\nu$):
    Conditions only on the visible conversation history and produces replies or a formal guess, inferring $w$ while avoiding saying it inadvertently.

Because the same LLM is used to implement both policies, the Attacker and Defender effectively share core language knowledge. However, at each step, they use different prompts or role instructions to shape their objectives (one to induce, the other to evade and infer). This design is crucial: it ensures that improvements in one role push the other role to become more robust, fueling a cycle of mutual adaptation.
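A sketch of this shared-model, two-prompt setup follows; the prompt wording and the `llm.generate` interface are hypothetical stand-ins for whatever role instructions and decoding API are actually used:

```python
# One shared LLM plays both roles via different system prompts (sketch).
ATTACKER_PROMPT = (
    "You are the Attacker in Adversarial Taboo. The hidden word is '{word}'. "
    "Lead the Defender into saying it, but never say the word yourself."
)
DEFENDER_PROMPT = (
    "You are the Defender in Adversarial Taboo. Infer the hidden word from the "
    "conversation, avoid saying it by accident, and when confident declare: "
    "'I know the word! It is <word>!'"
)

def next_utterance(llm, role: str, history: list[str], word: str | None = None) -> str:
    """One decoding call; only the Attacker's prompt reveals the target word."""
    if role == "attacker":
        system = ATTACKER_PROMPT.format(word=word)
    else:
        system = DEFENDER_PROMPT
    return llm.generate(system=system, dialogue=history)  # hypothetical API
```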

Advantages of a Zero-Sum Setting

Because one player’s gain is exactly the other’s loss, every episode yields an unambiguous outcome signal: any exploitable weakness in one role is immediately punished by the other, sustaining the cycle of mutual adaptation described above.

Tie vs. Forced Conclusion

In principle, the game can end in a tie if neither player wins within $T_0$ turns. Ties, though not a typical outcome in standard Taboo, motivate each side to act sooner: the Attacker must steer the conversation toward $w$ before the turn limit expires, and the Defender must commit to a formal guess rather than stall indefinitely.

Thus, even a tie offers a learning signal (neither side overcame the other), and the training process uses these episodes to adjust both policies accordingly.

Summary of Section 2

  1. Adversarial Language Games were introduced as a powerful conceptual framework for driving emergent behavior, especially under conditions of partial or hidden information.
  2. The Adversarial Taboo game establishes a hidden-word scenario that requires nuanced strategic reasoning, building upon the general idea of Taboo but making it zero-sum.
  3. Markov Game Modeling formalizes the game’s states, actions, transitions, and rewards, yielding a structure amenable to reinforcement learning.
  4. The two-role system (Attacker vs. Defender) underscores the central tension: each agent’s best move depends on the other’s future strategy. This interplay encourages stronger general reasoning abilities in an LLM when self-play is applied.

Having introduced the game framework, Section 3 will detail how the LLM is trained for this two-player environment. It covers imitation learning from GPT-4, followed by self-play to refine policies via reinforcement learning.