Slides: html pdf

The game of Go

The Goban.

Go is an ancient two-opponents board game, where each player successively places stones on a 19x19 grid. When a stone is surrounded by four opponents, it dies. The goal is to ensure strategical position in order to cover the biggest territory. There are around 10^{170} possible states and 250 actions available at each turn (10^{761} possible games), making it a much harder game than chess for a computer (35 possible actions, 10^{120} possible games). A game lasts 150 moves on average (80 in chess). Up until 2015 and AlphaGo, Go AIs could not compete with world-class experts, and people usually considered AI would need at least another 20 years to solve it.

Minimax and Alpha-Beta

Example of a game tree.

Minimax algorithm expand the whole game tree, simulating the moves of the MAX (you) and MIN (your opponent) players. The final outcome (win or lose) is assigned to the leaves. It allows to solve zero sum games: what MAX wins is lost by MIN, and vice-versa. We suppose MIN plays optimally (i.e. in his own interest).

Minimax iteratively backpropagates the value of the leaves according to who plays.

The value of the leaves is propagated backwards to the starting position: MAX chooses the action leading to the state with the highest value, MIN does the opposite. For most games, the tree becomes too huge for such a systematic search: * The value of all states further than a couple of moves away are approximated by a heuristic function: the value V(s) of these states. * Obviously useless parts of the tree are pruned: Alpha-Beta algorithm.

Alpha-Beta methods work well for simple problems where the complete game tree can be manipulated: Tic-Tac-Toe has only a couple of possible states and actions (3^9 = 19000 states).

Game tree of tic-tac-toe.

It also works when precise heuristics can be derived in a reasonable time. This is the principle of IBM DeepBlue which was the first Chess AI to beat a world champion (Garry Kasparov) in 1995. Carefully engineered heuristics (with the help of chess masters) allowed DeepBlue to search 6 moves away what is the best situation it can arrive in.

But it does not work in Go because its branching factor (250 actions possible from each state) is to huge: the tree explodes very soon. 250^{6} \approx 10^{15}, so even if your processor evaluates 1 billion nodes per second, it would need 11 days to evaluate a single position 6 moves away…



AlphaGo (Silver et al., 2016) uses four different neural networks:

  • The rollout policy and the SL policy network use supervised learning to predict expert human moves in any state.
  • The RL policy network uses self-play and reinforcement learning to learn new strategies.
  • The value network learns to predict the outcome of a game (win/lose) from the current state.

The rollout policy and the value network are used to guide stochastic tree exploration in Monte-Carlo Tree Search (MCTS) (MPC-like planning algorithm).

Architecture of AlphaGo (Silver et al., 2016).

Supervised learning is used for bootstrapping the policy network.

A policy network \rho_\sigma is trained to predict human expert moves:

  • 30M expert games have been gathered: input is board configuration, output is the move played by the expert.
  • The CNN has 13 convolutional layers (5x5) and no max-pooling.
  • The accuracy at the end of learning is 57% (not bad, but not sufficient to beat experts).

A faster rollout policy network \rho_\pi is also trained:

  • Only one layer, views only part of the state (around the last opponent’s move).
  • Prediction accuracy of 24%.
  • Inference time is only 2 \mus, instead of 3 ms for the policy network \rho_\sigma.

The SL policy network \rho_\sigma is used to initialize the weights of the RL policy network \rho_\rho, so it can start exploring from a decent policy.

The RL policy network then plays against an older version of itself (\approx target network) to improve its policy, updating the weights using Policy Gradient (REINFORCE):

\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{s \sim \rho_\theta, a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, R ]

where R = +1 when the game is won, -1 otherwise.

The idea of playing against an older version of the same network (self-play) allows to learn offline. The RL policy network already wins 85% of the time against the strongest AI at the time (Pachi), but not against expert humans. A value network \nu_\theta finally learns to predict the outcome of a game (+1 when winning, -1 when losing) based on the self-play positions generated by the RL policy network.

The policy network learns the probability of selecting different moves. Source: (Silver et al., 2016).

The value network learns to predict the value of any possible state under the learned policy. Source: (Silver et al., 2016).


AlphaZero (Silver et al., 2018) totally skips the supervised learning part: the RL policy network starts self-play from scratch!

Architecture of AlphaZero (Silver et al., 2018).

The RL policy network uses MCTS to select moves, not a softmax-like selection as in AlphaGo. The policy and value networks are merged into a two-headed monster: the convolutional residual layers are shared to predict both:

  • The policy \pi_\theta(s), which is only used to guide MCTS (prior of UCB).

a_t = \text{argmax}_a \, Q(s, a) + K \cdot \frac{\pi_\theta(s, a)}{1 + N(s, a)}

  • The state value V_\varphi(s) for the value of the leaves (no fast rollout).

The loss function used to train the network is a compound loss:

\mathcal{L}(\theta) = (R − V_\varphi(s))^2 - \pi_\text{MCTS}(s) \, \log \pi_\theta(s) + c ||\theta||^2

The policy head \pi_\theta(s) learns to mimic the actions selected by MCTS by minimizing the cross-entropy (or KL). The value network V_\varphi(s) learns to predict the return by minimizing the mse.

  1. Initialize neural network.
  2. Play self-play games, using 1,600 MCTS simulations per move (which takes about 0.4 seconds).
  3. Sample 2,048 positions from the most recent 500,000 games, along with whether the game was won or lost.
  4. Train the neural network, using both A) the move evaluations produced by the MCTS lookahead search and B) whether the current player won or lost.
  5. Finally, every 1,000 iterations of steps 3-4, evaluate the current neural network against the previous best version; if it wins at least 55% of the games, begin using it to generate self-play games instead of the prior version.

Repeat steps 3-4 700,000 times, while the self-play games are continuously being played .


By using a single network instead of four and learning faster, AlphaZero also greatly reduces the energy consumption.

The same algorithm can also play Chess and Shogi! The network weights are reset for each game, but it uses the same architecture and hyperparameters. After only 8 hours of training, AlphaZero beats Stockfish with 28-72-00, the best Chess AI at the time, which itself beats any human. This proves the algorithm is generic and can be applied to any board game.

AlphaZero can reach sota performance on Go, Cjess and Shogi. Source:


MuZero (Schrittwieser et al., 2019) is the latest extension of AlphaZero (but see EfficientZero (Ye et al., 2021)). Instead of relying on a perfect simulator for the MCTS, it learns the dynamics model instead.

s_{t+1}, r_{t+1} = f(s_t, a_t)

Architecture of MuZero (Schrittwieser et al., 2019).

MuZero is composed of three neural networks:

  • The representation network s= h(o_1, \ldots, o_t) (encoder) transforming the history of observations into a state representation (latent space).
  • The dynamics model s', r = g(s, a) used to generate rollouts for MCTS.
  • The policy and value network \pi, V = f(s) learning the policy with PG.

The dynamics model s', r = g(s, a) replaces the perfect simulator in MCTS. It is used in the expansion phase of MCTS to add new nodes. Importantly, nodes are latent representations of the observations, not observations directly. This is a similar idea to World Models and PlaNet/Dreamer, which plan in the latent space of a VAE. Selection in MCTS still follows an upper confidence bound using the learned policy \pi:

The actions taking during self-play are taken from the MCTS search as in AlphaZero. Note that the network plays each turn: there is additional information about whether the network is playing white or black. Self-played games are stored in a huge experience replay memory.

Finally, complete games sampled from the ERM are used to learn simultaneously the three networks f, g and h:

MuZero beats AlphaZero on Chess, Go and Shogi, but also R2D2 on Atari games. The representation network h allows to encode the Atari frames in a compressed manner that allows planning over raw images.

Performance of MuZero (Schrittwieser et al., 2019).