Learning Strategies and Training

A closer look at imitation learning and self-play reinforcement learning in SPAG

(arXiv:2404.10642v3 [cs.CL])

Part 3 of a series on the SPAG paper.

In this section, we cover:

  • Adapting LLMs to game rules via imitation learning
  • Transitioning from GPT-4 demonstrations to self-play
  • Offline RL approaches and reward shaping
  • Balancing specialized training with broader language abilities

3. Learning Strategies & Training

Having established the adversarial game in Section 2, the paper now describes how the LLM is trained to act as both Attacker and Defender, and how it ultimately refines its reasoning capabilities. The overall process occurs in two main stages:

  1. Imitation Learning – where the LLM learns from GPT-4’s demonstrations of the game, ensuring it follows the rules and develops basic proficiency in both roles.
  2. Self-Play Reinforcement Learning – where the model competes against itself (a copy of its own policy), generating game episodes offline and updating its parameters to maximize (or minimize) the adversarial reward.

Beyond these broad stages, the authors detail policy optimization methods, offline RL considerations, and additional stability techniques that prevent the model from catastrophically forgetting general language capabilities. This section provides a comprehensive explanation of each component.

3.1 Imitation Learning with GPT-4

Motivations for an Initial Imitation Phase

While LLMs are highly capable language generators, they do not a priori know the specific rules of Adversarial Taboo nor how to conform to them. If the model is simply thrown into self-play from scratch, it might violate the game instructions (e.g., the Attacker might explicitly reveal the target word, or the Defender might guess without any evidence). Therefore, the paper introduces an imitation learning (IL) phase:

Data Collection & Setup

The authors enlist GPT-4 to play the Taboo game against itself, using system prompts that specify who is the Attacker and who is the Defender. For each target word $w$, GPT-4 generates a full multi-turn episode. This yields a dataset $\mathcal{T}_{\text{im}}$ of (state, action) pairs:

  1. Attacker Episodes: If GPT-4’s Attacker eventually wins, that game’s action sequences are assigned to an “attacker-winning” subset, $\mathcal{T}_{\text{im}}^{\text{attack}}$.
  2. Defender Episodes: Conversely, if the GPT-4 Defender wins, the sequences go into $\mathcal{T}_{\text{im}}^{\text{defend}}$.
  3. Ties or Invalid Rounds: Typically discarded or handled separately, as they provide less straightforward examples of “winning moves.”

Because GPT-4 is a powerful model, the authors trust that these demonstrations illustrate reasonable strategies without systematically exploiting obvious shortcuts or ignoring instructions.
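
To make the data layout concrete, here is a minimal sketch of how the collected demonstrations could be partitioned by outcome. The `Episode` structure, its field names, and the outcome labels are illustrative assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    """One multi-turn Adversarial Taboo game (hypothetical record format)."""
    target_word: str
    turns: list       # alternating (role, utterance) pairs
    outcome: str      # "attacker_win", "defender_win", or "tie"

def partition_imitation_data(episodes):
    """Split GPT-4 self-play episodes into the two winning subsets.

    Ties and invalid rounds are simply dropped, mirroring the paper's choice
    to imitate only clearly winning behavior.
    """
    t_im_attack = [ep for ep in episodes if ep.outcome == "attacker_win"]
    t_im_defend = [ep for ep in episodes if ep.outcome == "defender_win"]
    return t_im_attack, t_im_defend

# Tiny example with two toy episodes.
demos = [
    Episode("rain", [("attacker", "What usually ruins an outdoor picnic?"),
                     ("defender", "Rain, I suppose.")], "attacker_win"),
    Episode("piano", [("attacker", "Do you play any instrument?"),
                      ("defender", "I think your word is 'piano'.")], "defender_win"),
]
attack_set, defend_set = partition_imitation_data(demos)
print(len(attack_set), len(defend_set))  # 1 1
```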

Imitation Loss

For the LLM parameterized by $\theta$, the IL objective is to match the probability distribution of winning demonstrations from GPT-4. Let $\pi_{\text{ref}}$ be the base pretrained model (e.g., LLaMA-2, Baichuan-2) before any fine-tuning. Then, for the attacker-winning trajectories, the paper defines:

$$ L_{\text{im}}^{\text{attack}}(\pi_\theta) \;=\; -\,\mathbb{E}_{\tau \in \mathcal{T}_{\text{im}}^{\text{attack}}} \Bigg[ \frac{1}{T} \sum_{t=1}^{T} \log \pi_\theta\bigl(u_t \,\big|\, f_{\text{attack}}(s_{t-1})\bigr) \;-\; \beta_1 \,\text{KL}\bigl[\pi_\theta \,\big\|\, \pi_{\text{ref}}\bigr] \Bigg], $$

where $u_t$ is the attacker’s chosen utterance at turn $t$, $s_{t-1}$ is the prior state, and $f_{\text{attack}}(\cdot)$ is the prompt template that instructs the model to respond as Attacker. The first term enforces maximum-likelihood matching of GPT-4’s moves, while the second term is a KL regularizer with coefficient $\beta_1 > 0$, discouraging the model from drifting too far from its pretrained linguistic knowledge.

Similarly, for defender-winning trajectories:

$$ L_{\text{im}}^{\text{defend}}(\pi_\theta) \;=\; -\,\mathbb{E}_{\tau \in \mathcal{T}_{\text{im}}^{\text{defend}}} \Bigg[ \frac{1}{T} \sum_{t=1}^{T} \log \pi_\theta\bigl(v_t \,\big|\, f_{\text{defend}}(s'_t)\bigr) \;-\; \beta_1 \,\text{KL}\bigl[\pi_\theta \,\big\|\, \pi_{\text{ref}}\bigr] \Bigg]. $$

The total IL loss is typically an average:

$$ L_{\text{im}}(\pi_\theta) \;=\; \frac{1}{2}\, L_{\text{im}}^{\text{attack}}(\pi_\theta) \;+\; \frac{1}{2}\, L_{\text{im}}^{\text{defend}}(\pi_\theta). $$

In practice, the model is trained on attacker-winning and defender-winning episodes separately, ensuring it sees both perspectives and inherits GPT-4’s well-structured gameplay.
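
As a rough PyTorch-style sketch of the per-role imitation term (assuming per-token log-probabilities under $\pi_\theta$ have already been computed for the demonstrated utterances; the function and argument names are mine, not the paper's), the loss is a length-normalized negative log-likelihood plus the KL penalty:

```python
import torch

def imitation_loss(token_logps, turn_lengths, kl_estimate, beta1=0.1):
    """Length-normalized NLL over winning demonstration turns, plus the KL penalty.

    token_logps : 1-D tensor of log pi_theta(token | role prompt, dialogue prefix)
                  for every token of every demonstrated utterance in one trajectory
    turn_lengths: number of tokens in each of the T demonstrated turns
    kl_estimate : scalar estimate of KL[pi_theta || pi_ref] on the same data
    """
    nll_terms, start = [], 0
    for length in turn_lengths:                       # one term per turn t = 1..T
        turn_logp = token_logps[start:start + length].sum()
        nll_terms.append(-turn_logp)                  # -log pi_theta(u_t | f_role(s))
        start += length
    nll = torch.stack(nll_terms).mean()               # (1/T) sum over turns
    return nll + beta1 * kl_estimate

# Toy check: a 2-turn trajectory with made-up token probabilities.
logps = torch.log(torch.tensor([0.5, 0.25, 0.9, 0.8, 0.7]))
print(float(imitation_loss(logps, turn_lengths=[2, 3], kl_estimate=torch.tensor(0.05))))
```

The same routine would be applied to the defender-winning subset, and the two results averaged as in $L_{\text{im}}$ above.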

Outcome of Imitation Learning

By the end of this phase, the LLM can:

  • follow the rules of Adversarial Taboo in both roles (the Attacker avoids stating the target word outright, and the Defender does not guess without evidence);
  • produce coherent, multi-turn gameplay in the style of GPT-4’s winning demonstrations.

It is not yet fully optimized, however: GPT-4’s moves are strong but not necessarily exhaustive of all strategies. This is why the paper proceeds to a self-play regime next.

3.2 Transitioning to Self-Play

Motivation for Self-Play

Post-imitation, the model has learned “good enough” gameplay but might still exhibit suboptimal or predictable moves. Human data or GPT-4 data alone can be expensive to scale. Self-play offers a powerful alternative: the model generates new episodes on its own, effectively creating an unlimited supply of training data as it iteratively improves.

Moreover, the adversarial nature of Taboo means any deficiency in the Attacker’s cunning or the Defender’s inference will be exploited by the other role. This mutual feedback loop is what drives more sophisticated behavior—analogous to how, in board games, a strategy that once sufficed is later recognized and countered by an improving opponent.

Practical Mechanism

The paper details how a copy of the model, $\pi_{\theta'}$, is made at each iteration. One copy is assigned the Attacker role, another copy the Defender role. A large set of target words $\{w_i\}$ is sampled. For each $w_i$, the Attacker–Defender pair engages in a multi-turn game until it reaches a conclusion: attacker win, defender win, or tie.

This generates a dataset of self-play trajectories, denoted $\mathcal{T}_{\theta'}$. The next step is to apply reinforcement learning updates to $\pi_\theta$ using $\mathcal{T}_{\theta'}$. Crucially, the entire process can iterate multiple times, producing increasingly difficult or clever dialogues.
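
A simplified view of this collection loop might look like the following sketch; the `frozen_policy`, `judge_outcome`, and `format_prompt` callables are placeholders for the actual inference and rule-checking code, which the paper does not spell out at this level.

```python
def collect_self_play_episodes(frozen_policy, judge_outcome, format_prompt,
                               target_words, max_turns=10):
    """Let a frozen copy of the model play both roles and record trajectories.

    frozen_policy(prompt) -> next utterance (string)
    judge_outcome(word, history) -> "attacker_win" | "defender_win" | None
    format_prompt(role, word, history) -> role-conditioned prompt string
    All three are stand-ins for the real inference / rule-checking code.
    """
    trajectories = []
    for word in target_words:
        history, outcome = [], None
        for turn in range(max_turns):
            role = "attacker" if turn % 2 == 0 else "defender"
            # Only the Attacker's prompt reveals the target word.
            prompt = format_prompt(role, word if role == "attacker" else None, history)
            history.append((role, frozen_policy(prompt)))
            outcome = judge_outcome(word, history)   # None while the game is undecided
            if outcome is not None:
                break
        trajectories.append({"word": word, "history": history,
                             "outcome": outcome or "tie"})
    return trajectories

# Toy usage with stand-in callables (a real run would wrap the frozen LLM).
toy = collect_self_play_episodes(
    frozen_policy=lambda prompt: "a generic utterance",
    judge_outcome=lambda word, history: None,        # never decided -> tie
    format_prompt=lambda role, word, history: f"[{role}] target={word} turns={len(history)}",
    target_words=["piano", "rain"],
)
print([t["outcome"] for t in toy])  # ['tie', 'tie']
```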

3.3 Reinforcement Learning from Self-Play

Zero-Sum Objective

Recall from Section 2.3 that the game has a zero-sum reward structure: whatever return the Attacker gains, the Defender loses, and vice versa.

For a trajectory $\tau$, let $\mu_\theta$ denote the Attacker policy and $\nu_\theta$ the Defender policy, both instantiated by the same LLM with different prompts. The training goal is to maximize the Attacker’s expected return when optimizing $\mu_\theta$, and to minimize that same return when optimizing $\nu_\theta$. Symbolically,

$$ \max_{\mu} \min_{\nu} \; \mathbb{E}_{\tau \sim (\mu \times \nu)} \bigl[ R(\tau) \bigr]. $$

When using a single model $\pi_\theta$ for both policies, the authors effectively do separate RL updates for the Attacker component and the Defender component on their respective winning trajectories.

Offline RL and the Importance of Stability

A naive approach might attempt online RL, sampling a trajectory, updating the model, then sampling more episodes from the updated policy. However, online RL is computationally expensive for LLMs, since each iteration requires tens of thousands of dialogues. Hence, offline RL is used: a batch of self-play episodes is collected once from $\pi_{\theta'}$; then $\pi_\theta$ is improved offline on that fixed batch.

Key Stability Concern: LLMs are prone to catastrophic forgetting or degenerate optimization if large policy updates are made from static data. The authors incorporate:

  1. KL Regularization w.r.t. $\pi_{\theta'}$ to constrain the updated policy from straying too far in a single step (a per-token estimate is sketched below).
  2. SFT Data Mixing to preserve general language capabilities, blending in supervised instruction data with the RL objectives.
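
For the first point, one concrete way to compute the KL penalty is an exact per-position KL between the two models' next-token distributions. The sketch below assumes access to both sets of logits and is an illustration, not necessarily the paper's exact estimator.

```python
import torch
import torch.nn.functional as F

def per_token_kl(logits_new, logits_old):
    """Exact KL[pi_theta || pi_theta'] at each position, given full next-token logits.

    logits_new, logits_old: shape (seq_len, vocab_size)
    Returns the mean KL over positions (a scalar suitable for the KL penalty).
    """
    logp_new = F.log_softmax(logits_new, dim=-1)
    logp_old = F.log_softmax(logits_old, dim=-1)
    # KL(p || q) = sum_x p(x) * (log p(x) - log q(x)), computed per position.
    kl = (logp_new.exp() * (logp_new - logp_old)).sum(dim=-1)
    return kl.mean()

# Toy example: 4 positions, vocabulary of 8 tokens.
torch.manual_seed(0)
logits_new = torch.randn(4, 8)
logits_old = logits_new + 0.1 * torch.randn(4, 8)   # nearby "frozen" policy
print(float(per_token_kl(logits_new, logits_old)))   # small positive number
```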

3.4 Policy Optimization and Loss Functions

Advantage-Based Offline Updates

Let $\mathcal{T}_{\theta'}$ be the self-play dataset collected from the “frozen” policy $\pi_{\theta'}$. For each trajectory $\tau$, define an advantage function $A^{\pi_{\theta'}}(s, a)$ that measures the relative value of action $a$ in state $s$ compared to a baseline. In practice, the authors approximate:

$$ A^{\pi_{\theta'}}(s, a) \;\approx\; r(s, a) + \gamma\, V^{\pi_{\theta'}}(s') - V^{\pi_{\theta'}}(s), $$

where $\gamma$ is a discount factor, $r(s,a)$ is the immediate reward, and $V^{\pi_{\theta'}}$ is the value function. They then use importance sampling to update $\pi_\theta$:

$$ \Delta\theta \;\propto\; \mathbb{E}_{\tau \in \mathcal{T}_{\theta'}} \Biggl[ \sum_{t=1}^{T} \frac{\pi_\theta(a_t \mid s_{t-1})}{\pi_{\theta'}(a_t \mid s_{t-1})}\, A^{\pi_{\theta'}}(s_{t-1}, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_{t-1}) \Biggr]. $$
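
In code, this update is usually implemented as a surrogate loss whose gradient matches the expression above. The sketch below assumes per-action log-probabilities and advantages have already been gathered; the names and shapes are illustrative, not taken from the SPAG codebase.

```python
import torch

def offline_pg_loss(logp_new, logp_old, advantages):
    """Surrogate loss whose gradient matches the importance-sampled update above.

    logp_new   : log pi_theta(a_t | s_{t-1}) for each sampled action (requires grad)
    logp_old   : log pi_theta'(a_t | s_{t-1}) under the frozen sampling policy
    advantages : A^{pi_theta'}(s_{t-1}, a_t), treated as constants
    """
    # Importance ratio pi_theta / pi_theta'; detached so the gradient is exactly
    # ratio * A * grad log pi_theta, as in the update rule.
    ratio = (logp_new - logp_old).exp().detach()
    return -(ratio * advantages * logp_new).mean()

# Toy batch of three actions.
logp_old = torch.log(torch.tensor([0.20, 0.40, 0.10]))
logp_new = torch.tensor([-1.386, -1.050, -1.897], requires_grad=True)  # ~log(0.25, 0.35, 0.15)
adv = torch.tensor([1.0, -0.5, 2.0])

loss = offline_pg_loss(logp_new, logp_old, adv)
loss.backward()
print(float(loss), logp_new.grad)
```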

ReST (Reinforced Self-Training) Approach

A notable simplification is the adoption of a threshold-based approach akin to ReST, where only the winning episodes for each role are used for that role’s policy update. Specifically:

  • Attacker updates are computed only on trajectories the Attacker won ($\mathcal{T}_{\theta'}^{\text{attack-win}}$).
  • Defender updates are computed only on trajectories the Defender won ($\mathcal{T}_{\theta'}^{\text{defend-win}}$).
  • Ties, and episodes a role lost, are excluded from that role’s gradient.

Such “self-imitation” techniques (also explored in prior RL literature) help the policy reinforce only successful trajectories, stabilizing training. Episodes with negative returns for a given role effectively do not feed into the gradient for that role.
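
Concretely, the role-conditional filtering can be as simple as the following sketch, reusing the hypothetical trajectory dictionaries from the collection sketch in Section 3.2.

```python
def rest_style_filter(trajectories):
    """Keep, for each role, only the self-play episodes that the role won,
    and retain just that role's utterances as training targets.

    Ties and a role's losing episodes contribute nothing to that role's
    gradient, mirroring the ReST-like "reinforce only successes" rule.
    """
    attacker_batch, defender_batch = [], []
    for traj in trajectories:
        if traj["outcome"] == "attacker_win":
            moves = [(i, u) for i, (role, u) in enumerate(traj["history"]) if role == "attacker"]
            attacker_batch.append({"word": traj["word"], "moves": moves})
        elif traj["outcome"] == "defender_win":
            moves = [(i, u) for i, (role, u) in enumerate(traj["history"]) if role == "defender"]
            defender_batch.append({"word": traj["word"], "moves": moves})
    return attacker_batch, defender_batch
```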

KL Regularization & Multi-Objective Loss

To prevent the updated policy from diverging from $\pi_{\theta'}$ or forgetting crucial language skills, the final self-play training objective is typically expressed as:

$$ L_{\text{SPAG}}(\pi_\theta) \;=\; -\frac{1}{2}\, \mathbb{E}_{\tau \in \mathcal{T}_{\theta'}^{\text{attack-win}}} \Bigl[ \sum_{t} \frac{\pi_\theta(u_t \mid f_{\text{attack}}(s_{t-1}))}{\pi_{\theta'}(u_t \mid f_{\text{attack}}(s_{t-1}))}\, A^{\mu_{\theta'}}(s_{t-1}, u_t) \;-\; \beta_2\, \text{KL}\bigl[\pi_\theta \,\|\, \pi_{\theta'}\bigr] \Bigr] $$

$$ \qquad -\frac{1}{2}\, \mathbb{E}_{\tau \in \mathcal{T}_{\theta'}^{\text{defend-win}}} \Bigl[ \sum_{t} \frac{\pi_\theta(v_t \mid f_{\text{defend}}(s'_t))}{\pi_{\theta'}(v_t \mid f_{\text{defend}}(s'_t))}\, A^{\nu_{\theta'}}(s'_t, v_t) \;-\; \beta_2\, \text{KL}\bigl[\pi_\theta \,\|\, \pi_{\theta'}\bigr] \Bigr] \;-\; \alpha\, \mathbb{E}_{(x,y)\sim D_{\text{SFT}}} \bigl[ \log \pi_\theta(y \mid x) \bigr]. $$

Here:

  • $\mathcal{T}_{\theta'}^{\text{attack-win}}$ and $\mathcal{T}_{\theta'}^{\text{defend-win}}$ are the subsets of self-play trajectories won by the Attacker and the Defender, respectively;
  • $A^{\mu_{\theta'}}$ and $A^{\nu_{\theta'}}$ are the advantage estimates for the Attacker and Defender policies;
  • $\beta_2 > 0$ controls the KL penalty that keeps $\pi_\theta$ close to the frozen sampling policy $\pi_{\theta'}$;
  • $\alpha > 0$ weights the supervised (SFT) log-likelihood term on a general instruction dataset $D_{\text{SFT}}$.

The authors note that this mixture of RL on winning episodes plus supervised data offers the best of both worlds: it capitalizes on emergent strategies from self-play while anchoring the model to broader language skills.
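
Putting the pieces together, a schematic training-step loss for this objective could combine the two role-specific surrogate terms, their KL penalties, and the SFT negative log-likelihood. All tensors are assumed to be precomputed for the batch, and the weights $\beta_2$ and $\alpha$ are illustrative defaults rather than the paper's values.

```python
import torch

def spag_loss(attack, defend, sft_logp, beta2=0.2, alpha=0.5):
    """Schematic version of L_SPAG: RL on winning episodes + KL penalty + SFT anchor.

    attack / defend hold per-token tensors for that role's *winning* episodes:
      'logp_new' : log pi_theta of the sampled tokens (requires grad)
      'logp_old' : log pi_theta' of the same tokens (frozen sampling policy)
      'adv'      : advantage estimates, treated as constants
      'kl'       : scalar estimate of KL[pi_theta || pi_theta'] on that data
    sft_logp : log pi_theta(y | x) for tokens of general instruction data.
    """
    def role_term(batch):
        # Importance ratio pi_theta / pi_theta', detached as in the surrogate loss.
        ratio = (batch["logp_new"] - batch["logp_old"]).exp().detach()
        surrogate = (ratio * batch["adv"] * batch["logp_new"]).mean()
        # Matches -1/2 * E[ sum_t ratio * A - beta2 * KL ] for this role.
        return -0.5 * (surrogate - beta2 * batch["kl"])

    sft_nll = -sft_logp.mean()          # -E[log pi_theta(y|x)], weighted by alpha
    return role_term(attack) + role_term(defend) + alpha * sft_nll

# Toy batches with random numbers, only to confirm the pieces fit together.
torch.manual_seed(0)
def toy_batch(n):
    return {"logp_new": torch.randn(n, requires_grad=True),
            "logp_old": torch.randn(n),
            "adv": torch.rand(n),
            "kl": torch.tensor(0.02)}

loss = spag_loss(toy_batch(6), toy_batch(4), sft_logp=torch.randn(8, requires_grad=True))
loss.backward()
print(float(loss))
```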

3.5 Offline RL and Episode Collection

Sampling Process (Algorithmic View)

The paper provides Algorithm 1 (for data collection) and Algorithm 2 (for iterative self-play epochs). In essence:

  1. Copy Model: Create $\pi_{\theta'}$ as a snapshot of the current policy.
  2. Generate Episodes: Sample a batch of target words, have $\pi_{\theta'}$ play both Attacker and Defender for each word, and record the resulting trajectories and outcomes into $\mathcal{T}_{\theta'}$.
  3. Update Offline: Optimize $\pi_\theta$ on the winning subsets of $\mathcal{T}_{\theta'}$ with the SPAG objective, then repeat from step 1 in the next epoch (a control-flow sketch follows below).
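
At the highest level, the iterative procedure then reduces to a short loop. The sketch below is a paraphrase of the control flow, with callables standing in for the collection, filtering, and optimization routines sketched earlier; it is not the authors' Algorithm 1/2 verbatim.

```python
def self_play_training(policy, target_words, snapshot, collect, filter_wins,
                       optimize, num_epochs=3):
    """Outer loop of iterative self-play training (schematic, not the paper's code).

    snapshot(policy)                       -> frozen copy pi_theta'
    collect(frozen, words)                 -> self-play trajectories T_theta'
    filter_wins(trajectories)              -> (attacker wins, defender wins)
    optimize(policy, frozen, a_wins, d_wins) -> updated policy pi_theta
    """
    for _ in range(num_epochs):
        frozen = snapshot(policy)                     # 1. copy the current model
        episodes = collect(frozen, target_words)      # 2. generate self-play episodes
        attack_wins, defend_wins = filter_wins(episodes)
        policy = optimize(policy, frozen, attack_wins, defend_wins)  # 3. offline update
    return policy

# Degenerate usage with identity stand-ins, just to show the control flow.
final = self_play_training(
    policy="pi_theta_0", target_words=["piano", "rain"],
    snapshot=lambda p: p,
    collect=lambda frozen, words: [{"word": w, "outcome": "tie", "history": []} for w in words],
    filter_wins=lambda eps: ([e for e in eps if e["outcome"] == "attacker_win"],
                             [e for e in eps if e["outcome"] == "defender_win"]),
    optimize=lambda p, frozen, a, d: p,
)
print(final)  # pi_theta_0
```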

Convergence & Practical Considerations

Figure 3 presents ablation studies on hyperparameters such as episode size, KL coefficients, and SFT mixing ratios, indicating sample-efficiency and performance trends on reasoning tasks; the authors use it to support the effectiveness of the offline RL approach.

Summary of Section 3

  1. Imitation Learning: GPT-4 data is used to warm-start the LLM, ensuring adherence to game rules and basic strategic proficiency.
  2. Transition to Self-Play: The model then generates its own adversarial dialogues, greatly expanding the variety of training examples.
  3. Reinforcement Learning: A zero-sum reward structure, offline dataset of self-play episodes, and advantage-based policy updates converge to more refined gameplay.
  4. Stability Mechanisms: KL regularization and a persistent SFT term safeguard against overfitting on the game at the expense of overall language quality.

Section 4 will turn to experimental results, showing how these learning strategies boost both in-game performance (win rates, compliance with Taboo rules) and general reasoning as measured on external NLP benchmarks.