Monday, July 25, 2016

Monte Carlo tree search (MCTS)

Ever wondered what data structure lies at the core of engines for games like Othello, Go, and almost any other board game you can think of? Then I encourage you to read this blog post ;)

Recently, AI (Artificial Intelligence) and ML (Machine Learning) researchers built a system that defeated a human expert at Go, a game perceived for many years as one of the most difficult board games for computers. Deep RL (Reinforcement Learning) made this dream come true, with MCTS as its main data structure. Beyond the undeniable role of deep learning, MCTS has been the main data structure used to attack this problem and similar ones since 2006. The formalization by Kocsis and Szepesvári of upper confidence bounds applied to trees, which owes much to John von Neumann's 1928 minimax theorem, the breakthrough behind adversarial tree search methods, went all the way to exploring and exploiting a policy strong enough to win 4 out of 5 games against the professional player Lee Sedol, widely regarded as the human champion of Go.
The goal of Monte Carlo tree search is to analyse the most promising moves, growing the search tree based on random sampling of the search space. MCTS is limited to games that have a finite number of possible moves in each state, as well as a finite length. Representing the possible states of the game is the main aim of this data structure: the root node of the tree is the initial game state, and for every valid move from a node there is a child node representing the resulting game state. More generally, MCTS can be applied to any domain that can be described in terms of (state, action) pairs, with simulation used to forecast outcomes.
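The tree described above can be sketched as a small node class. This is a minimal illustration, and the class and field names are my own, not part of any standard API:

```python
# A minimal sketch of an MCTS tree node. All names here are illustrative.
class Node:
    def __init__(self, state, parent=None, move=None):
        self.state = state        # the game state this node represents
        self.parent = parent      # None for the root (initial game state)
        self.move = move          # the valid move that produced this state
        self.children = []        # one child per move expanded so far
        self.wins = 0             # simulation wins recorded through this node
        self.visits = 0           # simulations run through this node

# The root node is the initial game state; children are added as moves
# are explored, so the tree starts out containing only the root.
root = Node(state="initial position")
```

Each node also carries the win and visit counters that the selection formula below relies on.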

The process of MCTS algorithm can be broken down into the following steps:

  • Selection: Starting at the root node R, recursively select optimal child nodes until a leaf node L is reached. 
  • Expansion: If L is not a terminal node, create one or more child nodes and select one, C. 
  • Simulation: Run a simulated playout from C until a result is achieved. 
  • Backpropagation: Update the nodes along the current move sequence with the simulation result.
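The four steps above can be sketched end to end on a toy game. Here I use a Nim-like game (players alternately remove 1 or 2 stones; whoever takes the last stone wins); the game, the class, and all function names are illustrative choices of mine, not part of the algorithm's definition:

```python
import math
import random

# Sketch of the four MCTS steps on a toy Nim-like game (remove 1 or 2
# stones per turn; taking the last stone wins). Illustrative names only.
class NimNode:
    def __init__(self, stones, player, parent=None):
        self.stones = stones                            # game state: stones left
        self.player = player                            # player to move (0 or 1)
        self.parent = parent
        self.children = []
        self.untried = [m for m in (1, 2) if m <= stones]
        self.wins = 0    # wins for the player who moved into this node
        self.visits = 0

def uct_select(node, c=math.sqrt(2)):
    # Selection rule: maximize w_i/n_i + c*sqrt(ln t / n_i) over children.
    t = node.visits
    return max(node.children,
               key=lambda ch: ch.wins / ch.visits
                              + c * math.sqrt(math.log(t) / ch.visits))

def mcts(stones, iterations=2000):
    root = NimNode(stones, player=0)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while fully expanded and non-terminal.
        while not node.untried and node.children:
            node = uct_select(node)
        # 2. Expansion: add one child node for an untried move.
        if node.untried:
            move = node.untried.pop()
            child = NimNode(node.stones - move, 1 - node.player, parent=node)
            node.children.append(child)
            node = child
        # 3. Simulation: random playout until the pile is empty.
        s, p = node.stones, node.player
        while s > 0:
            s -= random.choice([m for m in (1, 2) if m <= s])
            p = 1 - p
        winner = 1 - p  # the player who took the last stone
        # 4. Backpropagation: update counts from the new node up to the root.
        while node is not None:
            node.visits += 1
            if winner == 1 - node.player:  # win for whoever moved into node
                node.wins += 1
            node = node.parent
    best = max(root.children, key=lambda ch: ch.visits)
    return root.stones - best.stones  # most-visited move from the root
```

In this game the losing positions are multiples of 3, so from 4 stones the winning move is to take 1; with enough iterations the search tends to settle on it.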
Kocsis and Szepesvári recommend, in the selection step of the above algorithm, choosing at each level the child node that maximizes the following expression,

\[ \frac{w_i}{n_i} + c \sqrt{\frac{\ln{t}}{n_i}} \]

where \(w_i\) stands for the number of wins after the i-th move, \(n_i\) stands for the number of simulations after the i-th move, \(c\) is the exploration parameter (theoretically equal to \(\sqrt{2}\), but in practice usually chosen empirically), and \(t\) stands for the total number of simulations for the parent of the node considered, equal to the sum of all the \(n_i\).
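Plugging illustrative numbers into the formula (the values are made up for this example): a node with \(w_i = 11\) wins out of \(n_i = 20\) simulations, whose parent has run \(t = 100\) simulations in total, scores:

```python
import math

# Illustrative numbers only: w_i = 11 wins, n_i = 20 simulations,
# t = 100 total simulations at the parent, c = sqrt(2).
w_i, n_i, t, c = 11, 20, 100, math.sqrt(2)
uct = w_i / n_i + c * math.sqrt(math.log(t) / n_i)
print(round(uct, 3))  # exploitation 0.55 plus exploration ~0.679
```

Rarely visited children get a large exploration term, so they are tried occasionally even when their win rate is low.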
And that’s it!
