The Question
The tournament analysis identified 30 hand-crafted strategies and tested them exhaustively. But every one of those strategies was designed by a human with assumptions about how the game should be played. What if we removed the human from the loop entirely?
We trained a reinforcement learning agent to play darts cricket from scratch — no strategy rules, no opening book, no notion of “scoring mode” or “covering mode.” The agent sees only the board state and learns from wins and losses. After millions of games against the strongest hand-crafted bots, it discovered something none of them do: adapt its opening based on turn order.
What the Agent Learned
Turn-Order-Dependent Opening
- Going first → always open 20 (100% consistency).
- Going second → always open 18 (93% consistency).
Neither the Frongello strategies nor the experimental strategies (E1–E12) adapt to turn order. All 28 hand-crafted bots play identically regardless of who throws first. The agent discovered this distinction entirely on its own.
The logic is intuitive once you see it. Going first, you have the tempo advantage — secure the highest-value target (20) immediately. Going second, your opponent likely started on 20, so pivot to 18 and avoid direct competition on the same number. After the opening, the agent follows a consistent closing order: 19 next, then fill the remaining targets.
Adaptive Mid-Game Behavior
Beyond the opening, the agent dynamically balances scoring and closing based on game state:
- Early game (1–2 targets closed): 63–76% closing — build the foundation first.
- Mid game (3+ targets closed): 69–72% scoring — capitalize on open scoring lanes.
- Losing by a large margin: 69% scoring — catch up on points before closing out.
- Winning by a large margin: 51% closing — shut the game down.
The agent also shows opponent-specific adaptation. Against S2 (the strongest Frongello strategy), it flips to 57% closing when behind — respecting S2’s sophisticated covering logic by racing to close rather than trying to outscore it.
Chasing: Almost Never
When the opponent threatens a target (2 marks toward closing), the agent chases only 31% of the time. It mostly ignores the opponent’s progress and follows its own plan — independently confirming Frongello’s “never chase” finding through pure trial and error.
The Journey: Three Approaches
Getting here required three fundamentally different approaches and 19 iterations. Each failure taught something essential about what it takes to learn strategy in a complex game.
Attempt 1: Tabular Q-Learning
The simplest approach: maintain a lookup table mapping every game state to the value of each action. After each game, update the table using the Bellman equation. This works beautifully for small games, but darts cricket’s state space — two scores plus 14 mark counts — creates millions of unique states. The table grows unbounded, most entries are visited only once, and the agent plateaus around 45% win rate against basic bots.
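The update rule can be sketched in a few lines. This is illustrative only: the state/action encodings and hyperparameters are assumed, not taken from the project.

```python
from collections import defaultdict

# Minimal tabular Q-learning sketch. States and actions are hashable keys;
# the table grows with every new state seen, which is exactly what blows up
# for darts cricket's millions of unique states.
ALPHA, GAMMA = 0.1, 0.99
Q = defaultdict(lambda: defaultdict(float))

def update(state, action, reward, next_state, done):
    """One Bellman backup: Q(s,a) += alpha * (target - Q(s,a))."""
    best_next = 0.0 if done else max(Q[next_state].values(), default=0.0)
    target = reward + GAMMA * best_next
    Q[state][action] += ALPHA * (target - Q[state][action])

# Toy two-step chain: s0 -> s1 -> terminal win (+100)
for _ in range(200):
    update("s1", "a", 100.0, "end", done=True)
    update("s0", "a", 0.0, "s1", done=False)
```

After 200 sweeps the values converge toward 100 for `s1` and the discounted ~99 for `s0`; in the real game, most table entries never get a second visit, so this convergence never happens.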
Attempt 2: Deep Q-Network (DQN)
Replace the table with a neural network that generalizes across similar states. A standard fully-connected network (128×128) with experience replay and a frozen target network. This should scale better — but it struggled for a subtle reason.
DQN treats the 21 possible actions (7 targets × 3 hit types) as independent, unrelated choices. The network has no structural understanding that “aim at 20” and “aim at 19” are similar decisions, or that “single” and “triple” differ only in magnitude. Every relationship must be learned from scratch, and gradient updates for one action interfere with others. DQN never broke past 45% win rate.
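The flattening can be made concrete. The target list and hit-type ordering below are assumed for illustration; the point is that adjacent indices carry no meaning the network can exploit.

```python
# Illustrative only: how a flat 21-way action space hides structure.
# 7 targets x 3 hit types flatten into indices 0..20, so the network sees
# e.g. (20, triple) and (20, single) as unrelated outputs.
TARGETS = [20, 19, 18, 17, 16, 15, 25]   # 25 = bull
HITS = ["single", "double", "triple"]

def flat_index(target, hit):
    return TARGETS.index(target) * len(HITS) + HITS.index(hit)

def unflatten(idx):
    return TARGETS[idx // len(HITS)], HITS[idx % len(HITS)]

assert flat_index(20, "single") == 0
assert unflatten(20) == (25, "triple")   # last index: bull-triple, an illegal aim
```

A branching architecture instead factors the choice into a 7-way and a 3-way decision, so "which number" and "how aggressively" can be learned separately.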
Attempt 3: Branching Actor-Critic (A2C)
The breakthrough architecture. Instead of one flat output, the network splits into three specialized heads:
- Target head — which of the 7 numbers (15–20 plus bull) to aim at.
- Hit-type head — single, double, or triple.
- Value head — an estimate of the expected outcome from the current state.
Each head learns its own domain. The target head figures out which number to aim at based on board state; the hit-type head learns how aggressively to aim based on skill level. The value head provides the temporal-difference signal that drives learning. Two critical mechanisms keep the agent honest:
- Action masking — Targets that both players have closed are masked with −∞ logits, preventing the network from wasting darts on dead numbers. Similarly, the “triple” option is masked when aiming at bull (which has no triple).
- Terminal-only rewards — The agent receives no intermediate feedback for closing a number or scoring points. Only the final outcome matters: +100 for a win, −100 for a loss. This prevents reward hacking (earlier versions learned to throw at bull every time because it gave the highest per-dart closing bonus).
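The masking mechanism can be sketched in plain NumPy. Shapes, logit values, and the dead-target example are assumed; this is not the project's network code.

```python
import numpy as np

# Masked softmax: illegal actions get -inf logits before normalization,
# so their probability is exactly zero and no gradient flows to them.
TARGETS = [20, 19, 18, 17, 16, 15, 25]            # 25 = bull

def masked_softmax(logits, legal):
    z = np.where(legal, logits, -np.inf)
    z = z - z.max()                                # numerical stability
    p = np.exp(z)
    return p / p.sum()

# Target head: mask numbers that BOTH players have closed (dead targets).
target_logits = np.zeros(7)
dead = {20}                                        # e.g. 20 closed by both sides
legal_targets = np.array([t not in dead for t in TARGETS])
p_target = masked_softmax(target_logits, legal_targets)

# Hit-type head: mask "triple" when aiming at bull (no triple exists).
hit_logits = np.zeros(3)                           # [single, double, triple]
chosen = 25
legal_hits = np.array([True, True, chosen != 25])
p_hit = masked_softmax(hit_logits, legal_hits)

assert p_target[0] == 0.0                          # dead 20 gets zero mass
assert p_hit[2] == 0.0                             # bull-triple gets zero mass
```

The same mask must be applied at action-selection time and again when computing the policy gradient, a detail that produced Bug 4 below.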
Training: 12 Bugs and 3 Breakthroughs
The architecture was only the beginning. Making it actually learn took 19 versions and the discovery and resolution of 12 distinct training bugs. Three breakthroughs shaped the final result.
Breakthrough 1: Supervised Pre-Training
A randomly initialized network explores blindly — throwing darts at every number equally, learning nothing for thousands of games. The solution: bootstrap the network by watching expert strategy bots play. We generated training data from four Frongello strategies (S1, S2, S6, S10), recording every dart decision, and trained the network via supervised learning with label smoothing (0.1) to prevent the output probabilities from collapsing to near-certainty.
After pre-training, the network opens with 20-triple ~91% of the time — sensible but not locked in. It has a starting intuition but remains open to discovering something better.
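Label smoothing with eps = 0.1 works as follows. This is a minimal NumPy sketch; the 21-class action space matches the count given earlier, everything else is illustrative.

```python
import numpy as np

# Label smoothing: instead of a one-hot target, every one of K classes gets
# eps/K probability mass, which keeps cross-entropy from driving a single
# softmax output toward ~100% during supervised pre-training.
def smooth_labels(class_idx, num_classes, eps=0.1):
    t = np.full(num_classes, eps / num_classes)
    t[class_idx] += 1.0 - eps
    return t

def cross_entropy(probs, target):
    return -np.sum(target * np.log(probs))

t = smooth_labels(0, 21)             # e.g. "aim 20-triple" as class 0 of 21
loss = cross_entropy(np.full(21, 1.0 / 21), t)   # loss vs. a uniform policy

assert abs(t.sum() - 1.0) < 1e-9
assert abs(t[0] - (0.9 + 0.1 / 21)) < 1e-9
```

Because the smoothed target never reaches 1.0, the optimal softmax output also stays below certainty, which is what leaves the ~91% opening preference soft rather than locked in.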
Breakthrough 2: Prune the Easy Opponents
This was the single most important discovery. Version 16 trained against all 28 bots (S1–S17 plus E1–E12). The agent won 48.5% overall — but against the 13 hardest bots, it won only 43%, worse than the simple sequential closer (S1 at 49.1%).
The problem: easy opponents like E6 (99.7% win rate) and E7 (86.7%) produced a “keep doing what you’re doing” gradient signal. Hard opponents produced a “you need to improve” signal. These contradicted each other, and the easy wins drowned out the hard losses.
The fix was simple: remove every bot the agent could beat more than 45% of the time. With only 13 hard bots remaining, every game produced useful learning signal. Win rate climbed from 5% to 55% over 2 million games, and the novel 20/18 opening split emerged.
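The pruning rule itself is a one-line filter. The E6 and E7 win rates below are the figures quoted above; the S1 and S2 values are placeholders for illustration.

```python
# Curriculum pruning sketch: drop any bot the agent already beats more than
# 45% of the time, so every remaining game carries a useful learning signal.
def prune_pool(win_rates, threshold=0.45):
    """Keep only opponents the agent beats at or below `threshold`."""
    return {bot: wr for bot, wr in win_rates.items() if wr <= threshold}

pool = {"E6": 0.997, "E7": 0.867, "S2": 0.40, "S1": 0.44}  # agent's WR per bot
hard = prune_pool(pool)
assert set(hard) == {"S1", "S2"}     # the easy farms E6 and E7 are gone
```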
Breakthrough 3: The Entropy Floor
Version 17 peaked at 55% win rate after ~4.4 million games — then started declining. The agent was overtraining: its entropy coefficient had annealed to 0.005, making the policy nearly deterministic. It could no longer adapt to slight variations in opponent behavior.
Raising the entropy floor from 0.005 to 0.01 solved the problem completely. Version 19, trained with the higher floor, held stable at 49.3% win rate with zero decline over 2 million additional games. The lower absolute number reflects a tighter bot pool (the two easiest remaining bots were also removed), not a weaker agent.
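A schedule with a floor can be sketched as follows. The initial value and decay rate are assumed; only the 0.01 floor comes from the training runs described above.

```python
# Entropy-coefficient schedule with a floor. Annealing lets the policy
# sharpen over training, but the max() clamp guarantees it never becomes
# deterministic enough to lose adaptability.
def entropy_coef(step, initial=0.05, decay=0.999_999, floor=0.01):
    return max(floor, initial * decay ** step)

assert entropy_coef(0) == 0.05
assert entropy_coef(10_000_000) == 0.01   # clamped at the floor late in training
```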
The 12 Bugs
Beyond the three breakthroughs, the agent’s development required solving 12 distinct technical problems. Each one caused either training collapse, reward hacking, or silently invalid results.
1. Bull Reward Bias
Agent throws bull 100% of the time. Closing bonus × face value (25) makes bull the highest-reward target. Fix: terminal-only rewards.
2. Peaked Pre-training
After supervised pre-training, softmax outputs collapse to 99.99% on one action. Entropy bonus can’t recover. Fix: label smoothing (0.1), fewer epochs.
3. Infinite Games
Degenerate policy never closes all 7 targets. Trajectory buffer grows to 4.4GB. Fix: 100-turn maximum per game.
4. Masking Mismatch
Bull-triple masking applied during action selection but not during gradient computation. Distribution mismatch corrupts training. Fix: consistent masking in both paths.
5. Self-Play Collapse
Both players updating simultaneously chase each other into degenerate strategies. Fix: freeze Player 2, train only Player 1.
6. Reward Hacking
Intermediate rewards for closing targets cause the agent to loop on 20-triple forever. Fix: remove all intermediate rewards entirely.
7. Entropy Annealing Bug
Constructor passes custom initial entropy, but annealing function hardcodes the default. Fix: store per-instance initial value.
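In miniature, the bug and its fix look like this. Class and constant names are hypothetical, chosen only to show the shape of the mistake.

```python
DEFAULT_INITIAL_ENTROPY = 0.05

class BuggySchedule:
    def __init__(self, initial=DEFAULT_INITIAL_ENTROPY):
        self.initial = initial
    def coef(self, frac_done):
        # BUG: reads the module-level default, silently ignoring the custom
        # value passed to the constructor.
        return DEFAULT_INITIAL_ENTROPY * (1.0 - frac_done)

class FixedSchedule(BuggySchedule):
    def coef(self, frac_done):
        # Fix: anneal from the per-instance initial value.
        return self.initial * (1.0 - frac_done)

assert BuggySchedule(initial=0.1).coef(0.0) == 0.05   # custom value ignored
assert FixedSchedule(initial=0.1).coef(0.0) == 0.1    # custom value respected
```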
8. Dead-Target Blindness
Without masking, the network can’t learn to avoid closed targets from terminal rewards alone — the signal is too diluted across 60 darts. Fix: dead-target action masking.
9. Double-Counted Rewards
Win reward applied during gameplay AND at game end = +200 instead of +100. Fix: compute_reward() always returns 0; terminal rewards applied exclusively at game end.
10. Deterministic Games
With miss_enabled=False, every triple lands. First player always wins by one turn. All v4–v8 results were invalidated. Fix: enable realistic miss probabilities.
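A minimal throw model shows why this matters. The 80/10/10 hit/drift split is an assumed example, not the project's calibrated probabilities; the neighbor pairs are the real dartboard layout.

```python
import random

# With deterministic throws, every aimed triple lands and the first mover
# always wins by a turn. Even a simple miss model breaks that symmetry:
# a dart aimed at a number sometimes drifts into an adjacent wedge.
NEIGHBORS = {20: (1, 5), 19: (3, 7), 18: (1, 4),
             17: (2, 3), 16: (7, 8), 15: (2, 10)}

def throw(target, p_hit=0.8, rng=random):
    """Return the number actually struck when aiming at `target`."""
    r = rng.random()
    if r < p_hit:
        return target                             # clean hit
    left, right = NEIGHBORS[target]
    return left if r < p_hit + 0.1 else right     # drift into a neighbor

random.seed(0)
hits = sum(throw(20) == 20 for _ in range(10_000))
assert 7_700 < hits < 8_300                       # roughly 80% land on target
```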
11. Easy Bot Pollution
Easy opponents drown out hard-opponent learning signals. Agent converges to S1 behavior (43% vs hard bots). Fix: prune to hard-only bot pool.
12. Overtraining
Entropy floor too low (0.005). Policy becomes deterministic, loses adaptability. Win rate declines 3% past 4.4M games. Fix: entropy floor 0.01.
Version History
| Phase | Summary |
|---|---|
| Tabular Q-learning | Bellman updates on a state-action lookup table. Plateaus at ~45% WR against basic bots. State space too large for tabular methods. |
| DQN | Double DQN with experience replay. Fails at ~45% WR. Flat action space prevents structured learning. Results later invalidated — deterministic game mechanics (Bug 10) made the first mover always win. |
| Branching A2C (early versions) | Branching architecture implemented. Bugs 1–10 discovered and fixed iteratively. Supervised pre-training pipeline built. Agent begins learning meaningful strategies but can't break past 50%. |
| v16 | First end-to-end A2C training. 48.5% overall WR, but only 43% against hard bots — worse than the trivial S1 strategy. Easy opponents poison the learning signal. |
| v17 | Pruned to 13 hard bots. WR climbs 5% → 55% over 2M games. Novel 20/18 turn-order-dependent opening discovered. But overtraining begins after 4.4M games (entropy floor too low). |
| v18 | Removed S4 (62% WR) and S8 (66% WR) as the agent began farming them. 11 remaining bots all in the 47–57% range. WR stabilizes at ~50% with tighter competition. |
| v19 | Entropy floor raised to 0.01. 49.3% WR with zero decline over 2M games. Agent maintains adaptability indefinitely. Final stable model. |
Final Results
The v19 agent plays at pro skill level (MPR ~5.3) against the 11 hardest hand-crafted bots. Its 49.3% average win rate means it holds its own against a curated field where every opponent is a strong strategy — not a single easy win in the pool.
| Opponent | Win Rate | Notes |
|---|---|---|
| E11 (Kitchen Sink) | 53.2% | Best matchup — complexity works against E11 |
| S6 (Lead + Extra Darts) | 52.1% | Extra darts disrupt S6’s closing tempo |
| S16 (Chase + Extra + High Thresh.) | 50.3% | |
| S14 (Chase + Extra + Low Thresh.) | 50.2% | |
| E9 (Phase Shift — 3 phases) | 49.2% | |
| S10 (Chase + Low Thresh.) | 49.0% | |
| E5 (Smart Aim) | 48.3% | |
| E1 (Early Bull) | 48.3% | |
| E3 (Greedy Close-and-Score) | 44.9% | |
| E10 (Score Surge) | 44.8% | |
| S2 (Lead Then Cover) | 42.9% | Toughest opponent — S2’s sophisticated covering logic |
All games played at pro skill level with realistic miss probabilities. Win rates measured over 2 million games with entropy floor 0.01.
What It Means
Independent Confirmation of Frongello
The agent rediscovered several Frongello principles without being told them. It learned to secure its closes before chasing points (up to 76% closing early, shifting to 69–72% scoring once foundations are built). It learned not to chase (31% chase rate when threatened). These principles emerged from pure win/loss feedback — strong independent confirmation that they reflect genuine strategic truth, not just human intuition.
Something Genuinely New
The turn-order-dependent opening is not a refinement of existing strategy. It’s a wholly new dimension of play that no hand-crafted strategy explores. All 28 bots in the tournament play identically regardless of who goes first. The agent discovered that this context matters — and adapted accordingly.
The Limits of Hand-Crafted Rules
Hand-crafted strategies parameterize three dimensions: threshold, extra darts, and chase. The agent found value in a fourth dimension — turn order — that wasn't part of the original parameter space at all. This suggests there may be other unconsidered dimensions (opponent modeling, dart-within-turn sequencing, adaptive threshold adjustment) that human-designed strategies haven't yet explored.
Training Insight: Curriculum Matters More Than Architecture
The single most impactful change across all 19 versions wasn’t the network architecture, the reward function, or the learning rate. It was removing easy opponents from the training pool. When every game produces useful signal, the agent learns. When easy wins dilute hard losses, it doesn’t. This principle — that opponent curriculum is the dominant training variable — likely generalizes to other game-playing RL systems.