Experimental Evaluation

Performance gains and in-game analyses for SPAG

(arXiv:2404.10642v3 [cs.CL])

Part 4 of a series on the SPAG paper.

In this section, we cover:

  • Setup and metrics for evaluating LLM reasoning
  • Win rates and gameplay outcomes against various baselines
  • Gains on reasoning benchmarks beyond the game
  • Emergent trends in dialogue length and strategy

This section presents the empirical results of training large language models (LLMs) through Self-Play of Adversarial Taboo (SPAG). The paper's goal is twofold: (1) to demonstrate that self-play in a language game can enhance an LLM's in-game performance, reflected by higher win rates and better adherence to the rules, and (2) to show that the same training regime yields broader improvements in reasoning tasks outside the game. The authors evaluate on a suite of standard NLP benchmarks and compare their method against both supervised fine-tuning (SFT) baselines and alternative multi-agent approaches.

4.1 Experimental Setup

Model Backbones

The experiments use two open-source models:

  1. LLaMA-2-7B
  2. Baichuan-2-13B

These serve as the base pretrained checkpoints (π_ref) before any Taboo-related training. Both models have demonstrated solid capabilities in general language tasks but are not inherently specialized for multi-turn adversarial reasoning.

Data Sources

  1. Imitation Data (GPT-4 Episodes):
    The authors collect a substantial set of game episodes where GPT-4 plays both Attacker and Defender. Each target word w in a subset of the vocabulary prompts a new self-play round, ensuring coverage of diverse semantic fields. These dialogues provide high-quality demonstrations for the initial Imitation Learning phase.

  2. Self-Play Episodes:
    After the LLM has completed imitation learning, it plays Taboo against itself to generate new trajectories (see the sketch after this list). The full target vocabulary V_target can reach 50,000 words (excluding stop words). In each iteration (or epoch) of self-play, thousands of dialogues are gathered for reinforcement learning (RL) updates.

  3. Supervised Fine-Tuning (SFT) Data:
    To avoid catastrophic forgetting of general language skills, the authors periodically blend in standard instruction-following data (e.g., Alpaca). This ensures the model remains well-rounded, rather than overfitting solely to the adversarial game.
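
Below is a minimal sketch of how the self-play episodes referenced in item 2 might be gathered. The `model(role, history, target)` callable, the utterance formats, and the simplified win conditions are all assumptions for illustration; the paper's actual prompting, decoding, and rule-checking are more involved.

```python
def play_taboo_episode(model, target_word, max_turns=5):
    """Sketch of one Adversarial Taboo self-play episode (simplified rules).

    `model(role, history, target)` is a hypothetical callable returning the
    next utterance for the given role; only the Attacker sees the target.
    """
    history = []
    for _ in range(max_turns):
        attack = model("attacker", history, target_word)
        history.append(("attacker", attack))

        defense = model("defender", history, None)
        history.append(("defender", defense))

        # Simplified win checks: the Defender wins by explicitly guessing the
        # target, and loses by uttering it inadvertently (formats assumed here).
        if defense.lower().startswith("i know the word:"):
            guess = defense.lower().split(":")[-1].strip()
            return history, "defender" if guess == target_word else "attacker"
        if target_word in defense.lower():
            return history, "attacker"
    return history, "tie"  # turn limit reached with no winner


def collect_selfplay_episodes(model, target_vocabulary):
    """Gather (target, dialogue, winner) records for offline RL updates."""
    episodes = []
    for word in target_vocabulary:
        dialogue, winner = play_taboo_episode(model, word)
        episodes.append({"target": word, "dialogue": dialogue, "winner": winner})
    return episodes
```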

Training Stages

  1. Imitation Learning (IL):
    The model is first fine-tuned on the GPT-4 game episodes described above, learning the conventions and basic strategies of both Attacker and Defender (a loss sketch follows this list).

  2. SPAG Self-Play (Offline RL):
    The imitation-learned model then plays Taboo against itself, and the resulting episodes are used for offline reinforcement learning updates, regularized toward the reference model and blended with SFT data (see Section 4.5).
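
As a rough illustration of the imitation stage, the sketch below applies a masked next-token cross-entropy loss to GPT-4 episodes, so the model only imitates the tokens of the role being cloned. The `lm` interface, the `loss_mask` convention, and the masking granularity are assumptions for this example and may differ from the paper's implementation.

```python
import torch.nn.functional as F

def imitation_loss(lm, input_ids, loss_mask):
    """Masked next-token cross-entropy over GPT-4 game episodes (sketch).

    `lm(input_ids)` is assumed to return logits of shape [batch, seq, vocab];
    `loss_mask` marks the tokens spoken by the imitated role, so the model
    learns to reproduce only that role's utterances.
    """
    logits = lm(input_ids)                              # [B, T, V]
    targets = input_ids[:, 1:]                          # next-token targets
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)    # predictions for t+1
    token_nll = -logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    mask = loss_mask[:, 1:].float()
    return (token_nll * mask).sum() / mask.sum().clamp(min=1)
```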

Evaluation Metrics

  1. Reasoning Benchmarks:
    Zero- and few-shot accuracy on standard suites such as MMLU, BBH, ARC, and WinoGrande, measuring whether game training transfers to general reasoning (Section 4.2).

  2. Game Win Rates:
    How often the trained Attacker and Defender win when pitted against GPT-4 on a curated test set of target words (Section 4.6).

  3. Dialogue Statistics:
    Properties of the generated games, such as the average number of turns and average utterance length (see the sketch after this list).
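
For the game-oriented metrics, a small helper over the collected episodes suffices; the sketch below assumes the episode records produced in the Section 4.1 data-collection sketch and is illustrative rather than the paper's exact evaluation script.

```python
from statistics import mean

def game_metrics(episodes):
    """Win rates and dialogue statistics over a list of episode records.

    Each record is assumed to hold "winner" ("attacker"/"defender"/"tie")
    and "dialogue" (a list of (role, utterance) pairs).
    """
    n = len(episodes)
    attacker_wins = sum(e["winner"] == "attacker" for e in episodes)
    defender_wins = sum(e["winner"] == "defender" for e in episodes)
    return {
        "attacker_win_rate": attacker_wins / n,
        "defender_win_rate": defender_wins / n,
        "tie_rate": 1 - (attacker_wins + defender_wins) / n,
        # Two utterances (Attacker + Defender) per turn in the sketch above.
        "avg_turns": mean(len(e["dialogue"]) / 2 for e in episodes),
        "avg_utterance_len": mean(
            len(text.split()) for e in episodes for _, text in e["dialogue"]
        ),
    }
```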

Figure 1: A radial (or polygonal) chart of reasoning improvements across multiple benchmarks (e.g., MMLU, ARC, BBH). Each axis is normalized to highlight the continuous gains with self-play epochs.

4.2 Evaluation on Reasoning Benchmarks

One of the paper's most striking claims is that training on Adversarial Taboo benefits general LLM reasoning, not merely in-game skill. To test this, the authors evaluate each checkpoint on standard benchmarks:

  1. Initial (Base) Performance:
    LLaMA-2 and Baichuan-2, before any Taboo-related fine-tuning, achieve baseline scores on tasks like MMLU, BBH, ARC, etc. These results serve as reference points to quantify how much Taboo training shifts the model's zero/few-shot performance.

  2. Imitation-Learned (IL) Models:
    After ingesting GPT-4's gameplay, both LLaMA-2 and Baichuan-2 show uniform improvements in reasoning accuracy. This occurs because the IL data itself covers a range of semantic fields and enforces strategic communication, effectively teaching certain patterns of multi-turn reasoning.

  3. Self-Play Reinforcement (SPAG):
    Each epoch of adversarial self-play pushes benchmark accuracy further beyond the IL checkpoint, with the third epoch (SPAG-3) achieving the strongest overall results.

  4. Comparison to Other Methods:
    The SPAG checkpoints are also contrasted with SFT-only, chain-of-thought, and non-adversarial-game baselines (Section 4.4), none of which match these broad gains.

Quantitatively, the paper reports 2–4% absolute improvements on tasks like ARC Challenge and BBH from IL to SPAG. While the percentage might seem modest, these tasks are notoriously difficult, and incremental gains often indicate significantly improved reasoning patterns.

Uniform Gains vs. Specific Tasks

A key observation is that the SPAG approach tends to boost performance across a broad spectrum of benchmarks. Tasks that emphasize logical consistency, inference, and multi-step reasoning see the most benefit. On the other hand, tasks demanding pure factual knowledge (like certain MMLU subsets) may not see as large a jump, since those rely more on memorized facts than on strategic or adversarial thinking.

4.3 Continuous Improvement Over Epochs

The authors conduct three self-play epochs (SPAG-1 through SPAG-3). Rather than plateauing after the first iteration, benchmark performance continues to improve with each additional epoch.

Figure 1 (revisited): In the paper, the radial chart shows how, by SPAG-3, the LLM approaches or surpasses the best-known baselines on tasks like ARC Challenge and WinoGrande.

4.4 Comparisons with Baselines

To demonstrate that adversarial self-play is crucial, the authors consider multiple baselines:

  1. Alpaca SFT Only
  2. Chain-of-Thought (CoT)
  3. Non-Adversarial Games

Overall, SPAG emerges as uniquely powerful due to the zero-sum tension and the requirement for each role to out-reason the other. By forcing strategic trade-offs (e.g., "Can I hint at this concept without giving the entire secret away?"), it triggers deeper cognitive patterns than simpler or cooperative tasks.

4.5 Ablation Studies & Hyperparameter Analysis

Data Size & Sample Efficiency

The authors vary the number of GPT-4 imitation demonstrations and the number of self-play epochs. They find that the largest gains from imitation arrive at roughly 5k–10k demonstrations, after which performance saturates. Conversely, repeated self-play continues to provide diminishing but notable increments, illustrating that each new batch of adversarial dialogues can reveal fresh strategies.

KL Coefficients & Reward Thresholds

As mentioned in Section 3, the combination of a moderate KL-divergence penalty (keeping the policy close to the reference model) and a reward threshold applied to self-play episodes leads to more stable training. Aggressive KL constraints can stifle improvement, while overly lenient constraints risk overfitting the language model to Taboo strategies, undermining broader language performance.
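
The sketch below illustrates where the KL coefficient enters a reward-weighted offline update. The per-token log-probabilities, the advantage signal, and the exact penalty form are assumptions for this example; the paper's full objective (summarized in Section 3) differs in its details.

```python
def kl_regularized_loss(policy_logprobs, ref_logprobs, advantages, kl_coef=0.1):
    """Illustrative offline policy loss with a KL-style penalty (torch tensors).

    policy_logprobs, ref_logprobs: per-token log-probs of the sampled tokens
    under the current policy and the frozen reference model (same shape).
    advantages: per-token reward signal, e.g. +1 on tokens of the winning role
    in episodes that pass the reward threshold. kl_coef is an illustrative value.
    """
    # Reward-weighted likelihood: raise the probability of rewarded moves.
    policy_term = -(advantages * policy_logprobs).mean()
    # Penalty keeping the updated policy close to the reference model; too large
    # a coefficient stifles learning, too small drifts toward Taboo-only skills.
    kl_term = (policy_logprobs - ref_logprobs).mean()
    return policy_term + kl_coef * kl_term
```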

Mixing SFT Data

A balancing term α ensures that each RL update also includes some supervised fine-tuning data. The paper shows that setting α ≈ 0.5 is effective in preserving overall fluency and preventing the model from becoming too "Taboo-centric."
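
In code, the mixing can be as simple as a weighted sum; the additive form below is an assumption for illustration, with α ≈ 0.5 taken from the ablation just described.

```python
def mixed_update_loss(rl_loss, sft_loss, alpha=0.5):
    """Blend the self-play RL objective with a standard SFT loss (sketch).

    alpha ≈ 0.5 (per the ablation above) preserves general instruction-following
    ability; the paper's exact weighting scheme may differ from this form.
    """
    return rl_loss + alpha * sft_loss
```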

Figure 3: Ablation overview, displaying plots of geometric mean scores vs. KL coefficient, vs. dataset size, and so forth.

4.6 Game-Play Dynamics

Win Rates Against GPT-4

One direct measure of success is how often the trained Attacker can fool GPT-4, or how often the trained Defender can resist GPT-4's Attacker. The authors run the trained model as Attacker and Defender across a curated test set of target words (e.g., 168 objects or typical nouns). Over successive self-play epochs, both roles become increasingly effective, as evidenced by rising win rates in both directions (Figure 4).

Figure 4: Depicts the model's win rate (Attacker or Defender) against GPT-4 across successive epochs. The Attacker line typically starts lower (before imitation learning) then rises significantly, while the Defender line improves in tandem.

Dialogue Length & Interaction Patterns

The paper notes a fascinating emergent behavior: with repeated adversarial training, dialogues become more succinct. Both Attacker and Defender learn that prolonging the conversation is risky, since the more they speak, the higher the chance of accidentally losing. The trend is visible over the course of training:

Figure 6: Illustrates how the average number of turns and average utterance length decrease after each epoch of self-play, indicating greater efficiency or wariness in the conversational exchange.

Qualitative Examples

Beyond raw statistics, the paper includes several short sample dialogues (in tables or text form) showing how the Attacker tries increasingly subtle strategies, such as steering the conversation toward concepts adjacent to the target without ever naming it outright. Similarly, advanced Defenders adopt evasive language and do not "fill in" implied words unless they are confident they can guess the actual target in a single shot.


Summary of Section 4

  1. Experimental Setup: The authors combine GPT-4-based imitation data, a large target vocabulary, and iterative self-play with offline RL on LLaMA-2-7B and Baichuan-2-13B.
  2. Reasoning Benchmark Gains: Through multiple evaluations (MMLU, BBH, ARC, etc.), SPAG yields consistent improvements over baseline SFT and other multi-agent setups.
  3. Comparisons & Ablations: Chain-of-thought and pure SFT help to a degree, but do not match the broad or stable gains from adversarial self-play.
  4. Game-Play Analysis: Measured by win rates against GPT-4 and changes in dialogue dynamics, the self-play leads to more cunning Attacker moves and more defensive, inference-oriented Defender turns.
  5. Efficiency & Stability: Careful hyperparameter tuning (KL coefficients, mixing SFT data) is crucial. Without it, the model might overfit or lose general language skill.

Collectively, these findings reinforce the paper's central claim: adversarial self-play in a carefully designed language game can effectively enhance both task-specific performance (the Taboo game) and general LLM reasoning. The final section (Section 5) addresses the broader discussion of limitations, ethical considerations, and future directions for scaling these insights.