Between 9 and 15 March 2016, a five-game match took place between Lee Sedol, the second-highest-ranked professional Go player, and AlphaGo, a computer program created by Google's DeepMind subsidiary. The stakes were high: Google put up a prize of one million dollars. AlphaGo won 4-1.
How exactly did AlphaGo manage to do it? All I could figure out was that machine learning was involved. Having a PhD in machine learning myself, I decided to go to the trouble of reading the paper that DeepMind published on the subject. I will do my best to explain how it works in this blog post. I have also read differing opinions on how big a deal this win is, and I will have some things to say about that myself (spoiler: I think it's a pretty big deal).
Go vs. chess
Go and chess are very popular board games, which are similar in some respects: both are played by two players taking turns, and there is no random element involved (no dice rolling, like in backgammon).
Source: Wired
In 1997, Garry Kasparov was defeated by Deep Blue, a computer program written by IBM, running on a supercomputer. This was the first time a reigning world chess champion was defeated by a computer program under tournament conditions. Superficially, AlphaGo's win against Lee Sedol can be compared to Deep Blue's win against Garry Kasparov, except that AlphaGo's win came almost 20 years later. So what's the big deal? To answer that, we have to understand the differences between chess and Go.
In chess, each player begins with 16 pieces of six different types. Each piece type moves differently. The goal of the game is to capture the opponent's king. Go starts with an empty board. At each turn, a player places a stone (the equivalent of a piece in chess) on the board. Stones all obey the same rules. The goal of the game is to capture as much territory as possible. It can therefore be argued that Go has simpler rules than chess.
Although the rules of Go might appear simpler than those of chess, Go is the more complex game. At each game state, a player faces a choice between a greater number of possible moves than in chess (about 250 in Go vs. 35 in chess). Games also usually last longer: a typical game of Go lasts about 150 moves vs. 80 in chess.
Because of this, the total number of possible games of Go has been estimated at 10^761, compared to 10^120 for chess. Both are very large numbers: the entire universe is estimated to contain "only" about 10^80 atoms. But Go is the more complex of the two games, which is also why it has been such a challenge for computers to play, until now.
Game AI: Why Go is challenging
To understand how AIs are capable of playing games such as chess and Go, we have to understand what a game tree is. A game tree represents game states (positions) as nodes in the tree, and possible actions as edges. The root of the tree represents the state at the beginning of the game. The next level represents the possible states after the first move, etc... For simple games such as tic-tac-toe, it is possible to represent all possible game states (the complete game tree) visually:
Source: Wikimedia
For more complex games, this quickly becomes impossible. For chess, the tree would contain 10^120 nodes, which is totally impossible to store on a computer (remember: the universe has only ~10^80 atoms).
Knowing the complete game tree is useful for a game playing AI, because it allows the program to pick the best possible move at a given game state. This can be done with the minimax algorithm: At each game turn, the AI figures out which move would minimize the worst-case scenario. To do that, it finds the node in the tree corresponding to the current state of the game. It then picks the action that minimizes the worst possible loss it might suffer. This requires traversing the whole game tree down to nodes representing end-of-game states. The minimax algorithm therefore requires the complete game tree. Great for tic-tac-toe, but not useful for chess, and even less so for Go.
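To make the idea concrete, here is a minimal sketch of minimax in Python. It assumes a hypothetical game-state interface (`is_over`, `score`, `legal_moves`, `play`) that is not part of any real library; it illustrates the principle, not Deep Blue's or AlphaGo's actual code.

```python
def minimax(state, maximizing_player):
    """Return the best achievable outcome for the maximizing player,
    assuming both players play optimally. Outcomes are scored from the
    maximizing player's point of view: +1 win, 0 draw, -1 loss.
    Note that this expands the game tree all the way down to
    end-of-game states, which is only feasible for tiny games."""
    if state.is_over():
        return state.score()  # +1, 0 or -1 from the maximizing player's view
    scores = []
    for move in state.legal_moves():
        child = state.play(move)
        scores.append(minimax(child, not maximizing_player))
    # The maximizing player picks the best outcome for themselves;
    # the opponent is assumed to pick the worst outcome for us.
    return max(scores) if maximizing_player else min(scores)
```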
How did Deep Blue beat Kasparov? The basic principle is that Deep Blue searched the game tree as far as possible, usually to a depth of six moves or more. It would then use an evaluation function to evaluate the quality of the nodes at that depth. Essentially, the evaluation function replaces the subtree below a node with a single value summarizing that subtree. Then, Deep Blue proceeded similarly to the minimax algorithm: it chose the move leading to the least bad worst-case scenario at this maximum depth.
The evaluation function relies on some form of heuristic. It becomes easier to design good evaluation functions for moves that are closer to the end of the game. This makes intuitive sense: At the beginning of a game, it is hard to tell who is going to win, whereas toward the end of the game, it is sometimes very easy to tell who is going to win (e.g. just before a checkmate). It is probably impossible to design a perfect evaluation function, but better evaluation functions lead to better game play.
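A depth-limited variant of the sketch above captures the spirit of this approach. Here `evaluate` stands in for a hand-crafted heuristic and is my own placeholder, not IBM's actual evaluation function:

```python
def depth_limited_minimax(state, depth, maximizing_player, evaluate):
    """Like minimax, but stops at a fixed depth and falls back on a
    heuristic evaluation function instead of searching to the end."""
    if state.is_over():
        return state.score()
    if depth == 0:
        # The evaluation function summarizes the whole subtree below
        # this node with a single number.
        return evaluate(state)
    scores = [depth_limited_minimax(state.play(m), depth - 1,
                                    not maximizing_player, evaluate)
              for m in state.legal_moves()]
    return max(scores) if maximizing_player else min(scores)
```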
Two factors determine the strength of the AI:
Raw computing power. More computing power means the game tree can be searched to a greater depth, which leads to better estimates by the evaluation function. Deep Blue ran on a supercomputer (i.e. it had massive computing power).
Quality of the evaluation function. IBM put an enormous amount of effort into the design of the evaluation function. According to Wikipedia:
The evaluation function had been split into 8,000 parts, many of them designed for special positions. In the opening book there were over 4,000 positions and 700,000 grandmaster games. The endgame database contained many six piece endgames and five or fewer piece positions. Before the second match, the chess knowledge of the program was fine tuned by grandmaster Joel Benjamin. The opening library was provided by grandmasters Miguel Illescas, John Fedorowicz, and Nick de Firmian.
In summary: In spite of the large complexity of chess, Deep Blue relied largely on brute force, plus some well-designed heuristics.
Go cannot be tackled effectively with the same approach. Go has a wider branching factor (more possible moves at each state) than chess, and games tend to be longer. Hence, it is more difficult to search the game tree to a sufficient depth. In addition, it turns out that it is more difficult to design evaluation functions for Go than for chess. The endgame in Go is sometimes said to be especially complex. At the time of writing (March 15, 2016), Wikipedia notes:
Thus, it is very unlikely that it will be possible to program a reasonably fast algorithm for playing the Go endgame flawlessly, let alone the whole Go game.
In light of the recent win of AlphaGo, this prediction now seems needlessly pessimistic (and also wrong).
Monte Carlo Tree Search to the rescue
Monte Carlo Tree Search (MCTS) is an alternative approach to searching the game tree. The idea is to run many game simulations. Each simulation starts at the current game state and stops when the game is won by one of the two players. At first, the simulations are completely random: actions are chosen randomly at each state, for both players. During each simulation, statistics are recorded, such as how often each node has been visited and how often a visit led to a win. These numbers guide the later simulations in selecting actions (simulations thus become less and less random). The more simulations are executed, the more accurately these numbers identify winning moves. It can be shown that as the number of simulations grows, MCTS indeed converges to optimal play.
MCTS faces an exploration/exploitation trade-off: it can tend to focus too early (after too few simulations) on actions that seem to lead to wins. It turns out to be better to include an exploration component in the search, which keeps a degree of randomness in the action selection. We talked about the exploration/exploitation trade-off in a previous blog article, but in a different context.
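To give an idea of how the stored visit and win counts steer the simulations, here is a sketch of the generic UCB1 selection rule that many MCTS implementations use (the constant `c` balances exploration and exploitation; this is the textbook formula, not necessarily the exact variant used by any particular Go program):

```python
import math

def ucb1_select(children, c=1.4):
    """Pick the child node maximizing wins/visits plus an exploration
    bonus that grows for rarely visited children. `children` is a list
    of objects with `wins` and `visits` counters."""
    total_visits = sum(child.visits for child in children)

    def ucb1(child):
        if child.visits == 0:
            return float("inf")  # always try unvisited moves first
        exploitation = child.wins / child.visits
        exploration = c * math.sqrt(math.log(total_visits) / child.visits)
        return exploitation + exploration

    return max(children, key=ucb1)
```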
What is interesting about MCTS is that no domain knowledge or expert input about the game is required. Whereas Deep Blue used a complex evaluation function which was designed with the help of expert chess players, MCTS merely requires traversing a tree and keeping track of some numbers. Also, it is convenient that the whole game tree does not have to be expanded, as this would be impossible. However, it is necessary to run a large number of simulations in order to achieve good results.
The strongest Go AIs (Fuego, Pachi, Zen, and Crazy Stone) all rely on MCTS. They also rely on domain knowledge (hand-crafted rules designed by experts) to better select actions during the Monte Carlo simulations. All four programs achieve strong amateur play levels.
Beyond amateur play with learning
All previously mentioned methods for Go-playing AI rely on some kind of tree search, combined with hand-crafted rules. AlphaGo however makes extensive use of machine learning to avoid using hand-crafted rules. Three different kinds of neural networks are combined with a tree search procedure. I will explain how the pieces fit together, but I first have to provide some background.
What is machine learning?
Machine learning is the art and science of designing computer algorithms that learn from data. In supervised learning (a standard setting in machine learning), an algorithm is repeatedly presented with training examples along with their corresponding labels. For example, a training example could be the game state of a game of Go, and the training label would be whether this state ultimately led to a win or a loss for the current player. The goal is to learn a model that generalizes well to previously unseen examples, i.e. one that is good at predicting the outcome of as-of-yet unseen Go games.
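In code, the supervised setting boils down to fitting a model on (example, label) pairs and then asking it about new examples. A purely illustrative sketch with scikit-learn follows; the "positions" here are random toy data, and a linear model like this is far too simple to actually play Go:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for a dataset of Go positions: each "position" is a
# flattened 19x19 board (1 = black stone, -1 = white stone, 0 = empty),
# each label says whether the player to move eventually won (1) or lost (0).
rng = np.random.default_rng(0)
X_train = rng.choice([-1, 0, 1], size=(1000, 19 * 19))
y_train = rng.integers(0, 2, size=1000)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Generalization is what matters: predict outcomes for unseen positions.
X_new = rng.choice([-1, 0, 1], size=(5, 19 * 19))
print(model.predict_proba(X_new)[:, 1])  # estimated win probabilities
```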
Neural networks
Source: Wikimedia. Each white dot represents an artificial neuron. Each colored line represents a trainable parameter.
Artificial neural networks are a class of models frequently used in machine learning, both in the supervised and the unsupervised setting, because of their ability to handle large amounts of training data. Neural networks consist of a number of layers, each of which contains a number of parameters whose values are unknown a priori and need to be trained (i.e. tuned on training data). Each layer in an artificial neural network contains artificial neurons. Each neuron receives as input the outputs of neurons in the previous layer. These inputs are summed together and passed through a non-linear "activation" function. This behavior is reminiscent of biological neurons, which is where the name "neural" network comes from.
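As a rough illustration, a single artificial neuron computes a weighted sum of its inputs followed by a non-linear activation. This is a generic sketch (using a ReLU activation), not a piece of AlphaGo's architecture:

```python
import numpy as np

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of the inputs, plus a bias,
    passed through a non-linear activation (here a ReLU).
    The weights and the bias are the trainable parameters."""
    pre_activation = np.dot(weights, inputs) + bias
    return max(0.0, pre_activation)  # ReLU activation

# Example: a neuron with three inputs.
print(neuron(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, -0.2]), 0.05))
```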
Convolutional networks
Source: Wikimedia
Convolutional networks are a sub-type of neural network especially well adapted to processing image data. A convolutional network takes an image as input. At each layer in the network, a series of filters is applied to the image. Convolutional networks are computationally efficient on images precisely because they restrict themselves to such filtering operations, which work very well on image data. These networks have been applied to all kinds of tasks that take images as input, such as digit, face, and licence plate recognition.
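To illustrate what "applying a filter" means, here is a naive 2D convolution in plain numpy. This is a sketch of the basic operation only; real networks apply many learned filters per layer and use heavily optimized implementations:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a small filter (kernel) over the image and compute a
    weighted sum at each position ('valid' mode, no padding). As in
    most deep-learning libraries, the kernel is not flipped."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A simple vertical-edge filter applied to a toy 5x5 "image".
image = np.arange(25, dtype=float).reshape(5, 5)
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)
print(convolve2d(image, edge_filter))
```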
Note that all operations are feed-forward: the output of the neural network is obtained after a series of filtering operations. No back-tracking or search procedure is applied. Intuitively, convolutional networks are well-suited for problems that can be solved quickly and intuitively by humans, such as recognizing objects in an image. They are not well suited for problems that require time and reflection, e.g. finding the exit of a maze, given an image of this maze.
Source: Google research. Object detection. A task well-suited for convolutional networks. (Most) humans can also do this very quickly and intuitively.
Source: Wikimedia. Maze path-finding. Not an easy problem for a convolutional network: finding the solution requires search. A human also needs some time and reflection to find the solution.
A note on deep learning
Deep learning has often been mentioned in media recently. The term usually refers to training neural networks in an unsupervised manner, and greedily, layer-by-layer. The convolutional networks used by AlphaGo are indeed deep (they contain 13 layers), but they are trained in a supervised manner, and not layer-by-layer, but all in one go. Hence, strictly speaking, AlphaGo does not use deep learning.
Edit (April 6, 2016): In the comments below, both Loïc Matthey and Gary Cottrell informed me that I'm wrong about this. While the term deep learning used to refer to networks that are trained layer-by-layer in an unsupervised fashion, the term now refers to any network with a lot of layers. AlphaGo therefore does indeed use deep learning.
AlphaGo
AlphaGo relies on two different components: a tree search procedure, and convolutional networks that guide the tree search procedure. The convolutional networks are conceptually somewhat similar to the evaluation function in Deep Blue, except that they are learned and not designed. The tree search procedure can be regarded as a brute-force approach, whereas the convolutional networks provide a level of intuition to the game-play.
In total, three convolutional networks are trained, of two different kinds: two policy networks and one value network. Both types of networks take as input the current game state, represented as an image (with some additional input features, which are not important to our discussion).
Via Nature
The value network provides an estimate of the value of the current state of the game: what is the probability that the black player will ultimately win the game, given the current state? The input to the value network is the whole game board, and the output is a single number, representing the probability of a win.
The policy networks provide guidance regarding which action to choose, given the current state of the game. The output is a probability value for each possible legal move (i.e. the output of the network is as large as the board). Actions (moves) with higher probability values correspond to actions that have a higher chance of leading to a win.
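A sketch of the two interfaces is shown below, with random toy stand-ins for the actual networks. The real networks are deep convolutional networks and take additional feature planes as input; the point here is only the shape of the inputs and outputs:

```python
import numpy as np

BOARD = 19
rng = np.random.default_rng(0)

def value_network(state):
    """Maps a board state to a single number: the estimated probability
    of ultimately winning from this state. (Toy stand-in that ignores
    the state and returns a random value in [0, 1].)"""
    score = np.tanh(rng.normal())
    return 0.5 * (score + 1.0)

def policy_network(state):
    """Maps a board state to one probability per board point: how
    promising each move looks. (Toy stand-in, random softmax output.)"""
    logits = rng.normal(size=BOARD * BOARD)
    probs = np.exp(logits) / np.sum(np.exp(logits))  # softmax
    return probs.reshape(BOARD, BOARD)

state = np.zeros((BOARD, BOARD))        # empty board
print(value_network(state))             # a single win probability
print(policy_network(state).shape)      # (19, 19): one value per point
```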
Training the policy and value networks
A policy network was trained on 30 million positions from games played by human experts, available from the KGS Go Server. It achieved an accuracy of 57% on a withheld test set. When I first read the paper, I was very surprised that this was possible. I would have thought that predicting human game moves is a problem that is too difficult to be solved with convolutional networks. The fact that a convolutional network was able to predict human game play with such high accuracy seems to suggest that a lot of human Go-play is rather intuitive, instead of deeply reflective.
A smaller policy network was trained as well. Its accuracy is much lower (24.2%), but it is much faster (2 microseconds instead of 3 milliseconds: 1,500 times faster).
Deep reinforcement learning
Until now, the policy networks were only trained to predict human moves. But the goal should not be to be as good as possible at predicting human moves. Rather, the goal should be to have networks that are optimized to win the game. The policy networks were therefore improved by letting them play against each other, using the outcome of these games as a training signal. This is called reinforcement learning, or even deep reinforcement learning (because the networks being trained are deep).
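The self-play outcome can be turned into a training signal with a policy-gradient update. The following is a deliberately simplified, REINFORCE-style sketch of the idea, not DeepMind's training code; `grad_log_prob` is an assumed helper that returns the gradient of the log-probability of a move under the current policy:

```python
def reinforce_update(policy_params, game_states, chosen_moves, outcome,
                     grad_log_prob, learning_rate=0.01):
    """After one self-play game, nudge the policy parameters so that the
    moves actually played become more likely if the game was won
    (outcome = +1) and less likely if it was lost (outcome = -1).
    `policy_params` is assumed to be a numpy array of parameters;
    `grad_log_prob(params, state, move)` is assumed to return the
    gradient of log pi(move | state) with respect to the parameters."""
    for state, move in zip(game_states, chosen_moves):
        policy_params = policy_params + learning_rate * outcome * \
            grad_log_prob(policy_params, state, move)
    return policy_params
```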
Letting the system play against itself is a useful trick to let the system improve itself, without the need for a training set of games played by humans. This trick is not new, since it was used in TD-Gammon in 1992, created by Gerald Tesauro at IBM. TD-Gammon was a backgammon-playing program that reached the performance of the best human players at the time.
AlphaGo's performance without search
The AlphaGo team then tested the performance of the policy networks. At each move, they chose the action that the policy network predicted to give the highest likelihood of a win. Using this strategy, each move took only 3 ms to compute. They tested their best-performing policy network against Pachi, the strongest open-source Go program, which relies on 100,000 MCTS simulations at each turn. AlphaGo's policy network won 85% of the games against Pachi! I find this result truly remarkable. A fast feed-forward architecture (a convolutional network) was able to outperform a system that relies extensively on search. This again suggests that intuition is very important in the game of Go. It also shows that it is possible to play well without relying on very long reflections.
Value network
Then, a value network was trained on 30 million game positions obtained while the policy network played against itself. As a reminder, the value network is supposed to predict the likelihood of a win, given the current game state. It is therefore similar to an evaluation function, with the difference that the value network is learned instead of designed.
We said that it is difficult to design evaluation functions that perform well at the beginning of a game, but that it becomes easier the closer the game approaches an end. The same effect is observed with the value network: the value network makes random predictions at the very beginning of the game, but becomes better at predicting the final game outcome the more moves have been played. The fact that both human-designed evaluation functions and learned value networks display the tendency to perform better towards the end of the game indicates that this tendency is not due to human limitations, but that it is something more fundamental.
Putting all pieces together: Tree search
AlphaGo's tree search procedure is somewhat similar to MCTS, but is guided by all three types of networks in an innovative manner. I will not go into too much detail here, as the full approach is somewhat involved.
Similarly to Deep Blue, AlphaGo uses an evaluation function to give a value estimate of a given state. AlphaGo uses a mixture of the output of the value network and the result of a self-play simulation of the fast policy network:
value of a state = ½ × (value network output) + ½ × (simulation result).
This is interesting because it suggests a mixture of intuition and reflection. The value network provides the intuition, whereas the simulation result provides the reflection. The AlphaGo team also tried to use only the value network output, or only the simulation result, but those provided worse results than the combination of the two. It is also interesting that the value network output and the simulation result seem to be equally important.
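A sketch of that combination is given below. It assumes a `fast_rollout` function that plays the position out with the fast policy network and returns the result as a number; equal weights are used here, in line with the observation that the two terms seem equally important (the paper controls this balance with a mixing constant):

```python
def evaluate_leaf(state, value_network, fast_rollout, mix=0.5):
    """Combine the value network's 'intuition' with the outcome of one
    fast rollout ('reflection'). mix = 0.5 weights them equally."""
    return (1.0 - mix) * value_network(state) + mix * fast_rollout(state)
```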
The slow policy network is also used to guide the tree search. Here, the exploration/exploitation trade-off we mentioned earlier comes into play. Given a game state, the policy network outputs a probability value for each legal move. This output is then divided by the number of times the move has already been tried in this state during the simulations. This encourages exploration by penalizing actions that have already been chosen often.
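A sketch of that selection rule: the policy network's prior probability acts as a bonus that shrinks as a move accumulates visits. This illustrates the idea described above rather than reproducing the paper's exact formula:

```python
def select_move(moves, prior, visit_count, action_value):
    """Pick the move maximizing its current value estimate plus an
    exploration bonus: the policy prior divided by (1 + visit count),
    so frequently tried moves are gradually penalized.
    `prior`, `visit_count` and `action_value` map moves to numbers."""
    def score(move):
        bonus = prior[move] / (1 + visit_count[move])
        return action_value[move] + bonus
    return max(moves, key=score)
```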
This concludes our description of the inner workings of the AlphaGo system.
How good is AlphaGo?
How strong is AlphaGo compared to human players and other AIs? The most commonly used system for comparing the strength of players is the Elo rating system. The difference in ratings between two players serves as a predictor of the outcome of a match: the higher-rated player has a higher chance of winning.
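For reference, the standard Elo formula for the expected score of player A against player B (a general property of the rating system, not specific to Go):

```python
def elo_expected_score(rating_a, rating_b):
    """Expected score (win probability, counting a draw as 0.5) of
    player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# Example: AlphaGo (3586) vs. Lee Sedol (3520), using the ratings quoted below.
print(elo_expected_score(3586, 3520))  # about 0.59
```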
In the paper, written in 2015, the strength of various AIs was estimated as follows:
| AI name | Elo rating |
|---|---|
| Distributed AlphaGo (2015) | 3140 |
| AlphaGo (2015) | 2890 |
| CrazyStone | 1929 |
| Zen | 1888 |
| Pachi | 1298 |
| Fuego | 1148 |
| GnuGo | 431 |
AlphaGo ran on 48 CPUs and 8 GPUs and the distributed version of AlphaGo ran on 1202 CPUs and 176 GPUs.
We see that additional computational resources lead to better game play, just like with chess. Unfortunately, no estimate of the Elo rating of AlphaGo running on a single CPU was provided.
At the time the paper was written, the distributed version of AlphaGo had defeated Fan Hui in a five-game tournament. Fan Hui is a professional 2 dan player whose Elo rating was estimated at 2,908 at the time.
On March 15, 2016, the distributed version of AlphaGo won 4-1 against Lee Sedol, whose Elo rating is now estimated at 3,520. The distributed version of AlphaGo is now estimated at 3,586. It is unlikely that AlphaGo would have won against Lee Sedol if it had not improved since 2015.
There is only one remaining human with a higher Elo rating than AlphaGo (Ke Jie, at 3,621).
How important are these results?
Superficially, both Go and chess seem to be representative of typical challenges faced by AI: The decision-making task is challenging, and the search space is intractable.
In chess, it was possible to beat the best human players with a relatively straightforward solution: brute force search, plus some heuristics hand-crafted (with great effort) by expert chess players. The fact that heuristics had to be hand-crafted is rather disappointing, because it does not immediately lead to other breakthroughs in AI: each new problem would need new hand-crafted rules. Also, it seems that it was a lucky coincidence that chess had a state space that was very large, but still small enough to be just barely tractable.
Go, due to its higher complexity, could not be tackled with Deep Blue's approach. Good progress was made with MCTS. This is also a somewhat disappointing solution: pure MCTS does not use any domain knowledge. This means that a Go-playing program using pure MCTS does not know anything about how to play Go at the beginning of each new game. There is no learning through experience.
What is very interesting about AlphaGo is that it involves a learning component instead of hand-crafted heuristics. By playing against itself, AlphaGo automatically gets better and better at playing Go. This provides hope that AlphaGo's approach might adapt well to other AI problems.
Also interesting is that in spite of massive computational power (probably much more than Deep Blue), AlphaGo evaluated thousands of times fewer positions than Deep Blue. This is because the policy and value networks are very expensive to evaluate. But this trade-off is worth it: positions are evaluated more precisely, which ultimately leads to better game play.
I think the fact that AlphaGo won against Lee Sedol is not very interesting in itself. After all, the win relied heavily on massive computational power: had fewer CPUs and GPUs been available, AlphaGo might have lost. What is more interesting is the approach employed, because it provides hope for the future.
Some perspectives for the future
A few things about AlphaGo's approach seem unsatisfying, perhaps leaving room for improvement for the future:
AlphaGo relies on a large dataset of games played by human players. Fortunately, a big dataset of Go games played by human experts was available, but this might not be the case for other AI problems. Also, it is possible that AlphaGo is somewhat biased toward imitating human play. If it relied only on self-play (becoming better by playing against itself), perhaps completely new strategies that are contrary to current orthodoxy might be discovered. This was the case for TD-Gammon, which relied only on self-play. According to Wikipedia:
[TD-Gammon] explored strategies that humans had not pursued and led to advances in the theory of correct backgammon play.
The system is not trained end-to-end. AlphaGo has two distinct parts: the neural networks, and the tree search phase. The neural networks are trained independently of the tree search phase. Intuitively it would seem that a joint learning procedure should be possible, which might lead to a better combination of the learning and search phase.
AlphaGo relies on convolutional networks, a very special type of neural network, for the policy and value networks. Using convolutional networks might not be possible for all future games or AI problems in general.
A number of hyper-parameters need to be set manually. For example, the system is set up to spend most of its thinking time in the middle game. It is not clear a priori why this is necessary, and it would be more satisfying if such behavior were learned automatically (if it indeed leads to better game play) instead of having to be set by hand.
What's next?
What AI problems will the DeepMind team try to tackle next? It's hard to say. Computer game AI would definitely be an interesting problem, both because of its complexity and because of its practical applications. In fact, DeepMind has already taught computers to learn to play Atari games by themselves. Computer game AI therefore looks like a natural next problem to tackle.