The AI Agent

A reinforcement learning agent that discovered a strategy no human designed

The Question

The tournament analysis identified 30 hand-crafted strategies and tested them exhaustively. But every one of those strategies was designed by a human with assumptions about how the game should be played. What if we removed the human from the loop entirely?

We trained a reinforcement learning agent to play darts cricket from scratch — no strategy rules, no opening book, no notion of “scoring mode” or “covering mode.” The agent sees only the board state and learns from wins and losses. After millions of games against the strongest hand-crafted bots, it discovered something none of them do: adapt its opening based on turn order.

What the Agent Learned

Turn-Order-Dependent Opening

Going first → always open 20 (100% consistency).
Going second → open 18 (93% consistency).

No Frongello strategy, and none of the experimental strategies (E1–E12), adapts to turn order. All 28 hand-crafted bots play identically regardless of who throws first. The agent discovered this distinction entirely on its own.

The logic is intuitive once you see it. Going first, you have the tempo advantage — secure the highest-value target (20) immediately. Going second, your opponent likely started on 20, so pivot to 18 and avoid direct competition on the same number. After the opening, the agent follows a consistent closing order: 19 next, then fill the remaining targets.

Adaptive Mid-Game Behavior

Beyond the opening, the agent dynamically balances scoring and closing based on game state: early in the game about 76% of its decisions favor closing, shifting to 69–72% scoring once its foundations are built.

The agent also shows opponent-specific adaptation. Against S2 (the strongest Frongello strategy), it flips to 57% closing when behind — respecting S2’s sophisticated covering logic by racing to close rather than trying to outscore it.

Chasing: Almost Never

When the opponent threatens a target (2 marks toward closing), the agent chases only 31% of the time. It mostly ignores the opponent’s progress and follows its own plan — independently confirming Frongello’s “never chase” finding through pure trial and error.

The Journey: Three Approaches

Getting here required three fundamentally different approaches and 19 iterations. Each failure taught something essential about what it takes to learn strategy in a complex game.

Attempt 1: Tabular Q-Learning

The simplest approach: maintain a lookup table mapping every game state to the value of each action. After each game, update the table using the Bellman equation. This works beautifully for small games, but darts cricket’s state space — two scores plus 14 mark counts — creates millions of unique states. The table grows unbounded, most entries are visited only once, and the agent plateaus around 45% win rate against basic bots.
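For concreteness, here is the tabular update in miniature. The state encoding and hyperparameters are illustrative, not the project's actual values; the point is that every distinct (scores, marks) tuple becomes its own table row, which is exactly why the table explodes.

```python
# Minimal tabular Q-learning sketch (hypothetical state encoding and constants).
from collections import defaultdict
import random

ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1
N_ACTIONS = 21                                     # 7 targets x 3 hit types
q_table = defaultdict(lambda: [0.0] * N_ACTIONS)   # state -> action values

def choose_action(state) -> int:
    """Epsilon-greedy selection over the lookup table."""
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    values = q_table[state]
    return max(range(N_ACTIONS), key=values.__getitem__)

def update(state, action: int, reward: float, next_state, done: bool) -> None:
    """One Bellman backup: Q(s,a) += alpha * (target - Q(s,a))."""
    target = reward if done else reward + GAMMA * max(q_table[next_state])
    q_table[state][action] += ALPHA * (target - q_table[state][action])
```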

Attempt 2: Deep Q-Network (DQN)

Replace the table with a neural network that generalizes across similar states. A standard fully-connected network (128×128) with experience replay and a frozen target network. This should scale better — but it struggled for a subtle reason.

DQN treats the 21 possible actions (7 targets × 3 hit types) as independent, unrelated choices. The network has no structural understanding that “aim at 20” and “aim at 19” are similar decisions, or that “single” and “triple” differ only in magnitude. Every relationship must be learned from scratch, and gradient updates for one action interfere with others. DQN never broke past 45% win rate.
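A quick sketch of what "flat" means here. The index encoding (target × 3 + hit type) is an illustrative assumption, not the project's actual scheme:

```python
# A flat action space: one output per (target, hit type) pair, indexed 0..20.
TARGETS = [20, 19, 18, 17, 16, 15, "bull"]   # 7 cricket targets
HIT_TYPES = ["single", "double", "triple"]   # 3 hit types

def decode_action(index: int):
    """Map a flat DQN action index back to (target, hit_type)."""
    return TARGETS[index // 3], HIT_TYPES[index % 3]

# The network sees 21 unrelated outputs: nothing tells it that indices
# 0, 1, 2 all aim at 20, or that a triple is a scaled single.
```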

Attempt 3: Branching Actor-Critic (A2C)

The breakthrough architecture. Instead of one flat output, the network splits into three specialized heads:

Input: 19 features (score differential, 14 marks, 4 aggregate counts)
                               ↓
               Shared Trunk: 128 → ReLU → 128 → ReLU
                               ↓
         ┌─────────────────────┼─────────────────────┐
    Target Head           HitType Head            Value Head
  64 → ReLU → 7          64 → ReLU → 3          64 → ReLU → 1
 (which number?)       (single/dbl/tri?)   (how good is this state?)

Each head learns its own domain. The target head figures out which number to aim at based on board state; the hit-type head learns how aggressively to aim based on skill level. The value head provides the temporal-difference signal that drives learning. Two critical mechanisms keep the agent honest: dead-target action masking, which zeroes out the probability of aiming at closed numbers, and an entropy bonus, which keeps the policy stochastic enough to keep exploring.
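A PyTorch sketch of the branching network, following the layer sizes in the diagram above. The class and argument names are illustrative, not the project's actual code:

```python
import torch
import torch.nn as nn

class BranchingA2C(nn.Module):
    def __init__(self, n_features: int = 19, trunk: int = 128, head: int = 64):
        super().__init__()
        # Shared trunk: 128 -> ReLU -> 128 -> ReLU
        self.trunk = nn.Sequential(
            nn.Linear(n_features, trunk), nn.ReLU(),
            nn.Linear(trunk, trunk), nn.ReLU(),
        )
        self.target_head = nn.Sequential(  # which number? (7 targets)
            nn.Linear(trunk, head), nn.ReLU(), nn.Linear(head, 7))
        self.hit_head = nn.Sequential(     # single / double / triple?
            nn.Linear(trunk, head), nn.ReLU(), nn.Linear(head, 3))
        self.value_head = nn.Sequential(   # how good is this state?
            nn.Linear(trunk, head), nn.ReLU(), nn.Linear(head, 1))

    def forward(self, state: torch.Tensor, target_mask: torch.Tensor = None):
        z = self.trunk(state)
        target_logits = self.target_head(z)
        if target_mask is not None:
            # Dead-target masking (see Bugs 4 and 8): closed numbers get -inf
            target_logits = target_logits.masked_fill(~target_mask, float("-inf"))
        return target_logits, self.hit_head(z), self.value_head(z).squeeze(-1)
```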

Training: 12 Bugs and 3 Breakthroughs

The architecture was only the beginning. Making it actually learn took 19 versions and the discovery and resolution of 12 distinct training bugs. Three breakthroughs shaped the final result.

Breakthrough 1: Supervised Pre-Training

A randomly initialized network explores blindly — throwing darts at every number equally, learning nothing for thousands of games. The solution: bootstrap the network by watching expert strategy bots play. We generated training data from four Frongello strategies (S1, S2, S6, S10), recording every dart decision, and trained the network via supervised learning with label smoothing (0.1) to prevent the output probabilities from collapsing to near-certainty.

After pre-training, the network opens with 20-triple ~91% of the time — sensible but not locked in. It has a starting intuition but remains open to discovering something better.
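One supervised pre-training step might look like the sketch below, reusing the branching network above. The batch variable names and helper are hypothetical; label_smoothing=0.1 is the value from the text:

```python
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)

def pretrain_step(model, optimizer, states, expert_targets, expert_hits) -> float:
    """Imitate the bots' recorded (target, hit type) choices for one batch."""
    target_logits, hit_logits, _ = model(states)
    loss = loss_fn(target_logits, expert_targets) + loss_fn(hit_logits, expert_hits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```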

Breakthrough 2: Prune the Easy Opponents

This was the single most important discovery. Version 16 trained against all 28 bots (S1–S17 plus E1–E12). The agent won 48.5% overall — but against the 13 hardest bots, it won only 43%, worse than the simple sequential closer (S1 at 49.1%).

The problem: easy opponents like E6 (99.7% win rate) and E7 (86.7%) produced a “keep doing what you’re doing” gradient signal. Hard opponents produced a “you need to improve” signal. These contradicted each other, and the easy wins drowned out the hard losses.

The fix was simple: remove every bot the agent could beat more than 45% of the time. With only 13 hard bots remaining, every game produced useful learning signal. Win rate climbed from 5% to 55% over 2 million games, and the novel 20/18 opening split emerged.
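The pruning rule itself is a one-liner. This sketch assumes a dict of the agent's measured win rate against each bot, keyed by bot name:

```python
def prune_bot_pool(win_rates: dict, threshold: float = 0.45) -> list:
    """Keep only opponents the agent beats at most 45% of the time,
    so every remaining game carries a useful learning signal."""
    return sorted(bot for bot, wr in win_rates.items() if wr <= threshold)

# e.g. prune_bot_pool({"E6": 0.997, "E7": 0.867, "S2": 0.40}) -> ["S2"]
```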

Breakthrough 3: The Entropy Floor

Version 17 peaked at 55% win rate after ~4.4 million games — then started declining. The agent was overtraining: its entropy coefficient had annealed to 0.005, making the policy nearly deterministic. It could no longer adapt to slight variations in opponent behavior.

Raising the entropy floor from 0.005 to 0.01 solved the problem completely. Version 19, trained with the higher floor, held stable at 49.3% win rate with zero decline over 2 million additional games. The lower absolute number reflects a tighter bot pool (the two easiest remaining bots were also removed), not a weaker agent.
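A sketch of an entropy schedule with a floor. Only the two floor values (0.005 caused overtraining; 0.01 is stable) come from the text; the starting coefficient and the linear decay shape are assumptions:

```python
ENTROPY_START = 0.05      # illustrative starting coefficient
ENTROPY_FLOOR = 0.01      # v19 floor; v17's 0.005 let the policy go deterministic
ANNEAL_GAMES = 2_000_000  # illustrative annealing horizon

def entropy_coef(games_played: int) -> float:
    """Linear decay from ENTROPY_START toward the floor, never below it."""
    frac = min(games_played / ANNEAL_GAMES, 1.0)
    return max(ENTROPY_START + frac * (ENTROPY_FLOOR - ENTROPY_START), ENTROPY_FLOOR)
```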

The 12 Bugs

Beyond the three breakthroughs, the agent’s development required solving 12 distinct technical problems. Each one caused either training collapse, reward hacking, or silently invalid results.

1. Bull Reward Bias

Agent throws bull 100% of the time. Closing bonus × face value (25) makes bull the highest-reward target. Fix: terminal-only rewards.

2. Peaked Pre-training

After supervised pre-training, softmax outputs collapse to 99.99% on one action. Entropy bonus can’t recover. Fix: label smoothing (0.1), fewer epochs.

3. Infinite Games

Degenerate policy never closes all 7 targets. Trajectory buffer grows to 4.4GB. Fix: 100-turn maximum per game.

4. Masking Mismatch

Bull-triple masking applied during action selection but not during gradient computation. Distribution mismatch corrupts training. Fix: consistent masking in both paths.
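In miniature, the fix is to build the masked distribution in one place and use it for both sampling and the loss (names illustrative):

```python
import torch
from torch.distributions import Categorical

def masked_policy(logits: torch.Tensor, legal: torch.Tensor) -> Categorical:
    """Shared by BOTH the action-selection path and the gradient path,
    so the policy that acts is the policy that gets the gradient."""
    return Categorical(logits=logits.masked_fill(~legal, float("-inf")))

# sampling:  action = masked_policy(logits, legal).sample()
# gradient:  logp   = masked_policy(logits, legal).log_prob(action)
```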

5. Self-Play Collapse

Both players updating simultaneously chase each other into degenerate strategies. Fix: freeze Player 2, train only Player 1.

6. Reward Hacking

Intermediate rewards for closing targets cause the agent to loop on 20-triple forever. Fix: remove all intermediate rewards entirely.

7. Entropy Annealing Bug

Constructor passes custom initial entropy, but annealing function hardcodes the default. Fix: store per-instance initial value.
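The bug pattern in miniature (class and attribute names hypothetical):

```python
DEFAULT_INITIAL_ENTROPY = 0.05

class A2CTrainer:
    def __init__(self, initial_entropy: float = DEFAULT_INITIAL_ENTROPY):
        self.initial_entropy = initial_entropy  # the fix: store per instance

    def annealed_entropy(self, frac_done: float, floor: float = 0.01) -> float:
        # Buggy version read the module-level DEFAULT_INITIAL_ENTROPY here,
        # silently discarding whatever the constructor was given.
        start = self.initial_entropy
        return max(start + frac_done * (floor - start), floor)
```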

8. Dead-Target Blindness

Without masking, the network can’t learn to avoid closed targets from terminal rewards alone — the signal is too diluted across 60 darts. Fix: dead-target action masking.

9. Double-Counted Rewards

Win reward applied during gameplay AND at game end = +200 instead of +100. Fix: compute_reward() always returns 0; terminal rewards applied exclusively at game end.
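The fix in miniature. The +/-100 magnitudes follow the figure above; function names are illustrative:

```python
WIN_REWARD, LOSS_REWARD = 100.0, -100.0

def compute_reward(state, action, next_state) -> float:
    """Per-dart shaping reward: always 0 (also the fix for Bug 6)."""
    return 0.0

def terminal_reward(agent_won: bool) -> float:
    """Applied exclusively once, at game end, never during play."""
    return WIN_REWARD if agent_won else LOSS_REWARD
```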

10. Deterministic Games

With miss_enabled=False, every triple lands. First player always wins by one turn. All v4–v8 results were invalidated. Fix: enable realistic miss probabilities.

11. Easy Bot Pollution

Easy opponents drown out hard-opponent learning signals. Agent converges to S1 behavior (43% vs hard bots). Fix: prune to hard-only bot pool.

12. Overtraining

Entropy floor too low (0.005). Policy becomes deterministic, loses adaptability. Win rate declines 3% past 4.4M games. Fix: entropy floor 0.01.

Version History

v1–v3 · Tabular Q-Learning

Bellman updates on a state-action lookup table. Plateaus at ~45% WR against basic bots. State space too large for tabular methods.

v4–v8 · Deep Q-Network

Double DQN with experience replay. Fails at ~45% WR. Flat action space prevents structured learning. Results later invalidated — deterministic game mechanics (Bug 10) made first-mover always win.

v9–v15 · A2C Development

Branching architecture implemented. Bugs 1–10 discovered and fixed iteratively. Supervised pre-training pipeline built. Agent begins learning meaningful strategies but can’t break past 50%.

v16 · First Full Run (All 28 Bots)

First end-to-end A2C training. 48.5% overall WR, but only 43% against hard bots — worse than the trivial S1 strategy. Easy opponents poison the learning signal.

v17 · Hard Bots Only (THE BREAKTHROUGH)

Pruned to 13 hard bots. WR climbs 5% → 55% over 2M games. Novel 20/18 turn-order-dependent opening discovered. But overtraining begins after 4.4M games (entropy floor too low).

v18 · Progressive Pruning

Removed S4 (62% WR) and S8 (66% WR) as the agent began farming them. 11 remaining bots all in the 47–57% range. WR stabilizes at ~50% with tighter competition.

v19 · Entropy Floor (STABILITY)

Entropy floor raised to 0.01. 49.3% WR with zero decline over 2M games. Agent maintains adaptability indefinitely. Final stable model.

Final Results

The v19 agent plays at pro skill level (MPR ~5.3) against the 11 hardest hand-crafted bots. Its 49.3% average win rate means it holds its own against a curated field where every opponent is a strong strategy — not a single easy win in the pool.

Opponent                             Win Rate   Notes
E11 (Kitchen Sink)                   53.2%      Best matchup; complexity works against E11
S6 (Lead + Extra Darts)              52.1%      Extra darts disrupt S6's closing tempo
S16 (Chase + Extra + High Thresh.)   50.3%
S14 (Chase + Extra + Low Thresh.)    50.2%
E9 (Phase Shift, 3 phases)           49.2%
S10 (Chase + Low Thresh.)            49.0%
E5 (Smart Aim)                       48.3%
E1 (Early Bull)                      48.3%
E3 (Greedy Close-and-Score)          44.9%
E10 (Score Surge)                    44.8%
S2 (Lead Then Cover)                 42.9%      Toughest opponent; S2's sophisticated covering logic

All games played at pro skill level with realistic miss probabilities. Win rates measured over 2 million games with entropy floor 0.01.

What It Means

Independent Confirmation of Frongello

The agent rediscovered several Frongello principles without being told them. It learned to score before covering (76% closing early, shifting to 69–72% scoring once foundations are built). It learned not to chase (31% chase rate when threatened). These principles emerged from pure win/loss feedback — strong independent confirmation that they reflect genuine strategic truth, not just human intuition.

Something Genuinely New

The turn-order-dependent opening is not a refinement of existing strategy. It’s a wholly new dimension of play that no hand-crafted strategy explores. All 28 bots in the tournament play identically regardless of who goes first. The agent discovered that this context matters — and adapted accordingly.

The Limits of Hand-Crafted Rules

Hand-crafted strategies parameterize three dimensions: threshold, extra darts, and chase. The agent found value in a fourth dimension, turn order, that wasn't part of the original parameter space at all. This suggests there may be other unconsidered dimensions (opponent modeling, dart-within-turn sequencing, adaptive threshold adjustment) that human-designed strategies have yet to explore.

Training Insight: Curriculum Matters More Than Architecture

The single most impactful change across all 19 versions wasn’t the network architecture, the reward function, or the learning rate. It was removing easy opponents from the training pool. When every game produces useful signal, the agent learns. When easy wins dilute hard losses, it doesn’t. This principle — that opponent curriculum is the dominant training variable — likely generalizes to other game-playing RL systems.