Convolutional Neural Network for Game

Game, or "Game playing" , is the classic problem of artificial intelligence. The game whose outcome could be objectively determined not only provides a perfect test platform for the artificial intelligence algorithm, but also make it possible to make computer players and human players to play against. According to different classification methods, the game can be divided into single and double (or multiplayer) games, board games and video games, collaborative games and war games, etc. Double or more people's games are often called Game, and Game can be divided into perfect information games (such as chess and chess) and incomplete information games (such as poker and military chess.) Early artificial intelligence algorithms are often dependent on the search algorithm, which in simple, limited search space (Such as Tic-Tac-Toe) is very effective.With the increase in the difficulty of the search, more complex games tend to use the combination of search and machine learning algorithms, especially Reinforcement learning algorithm.

Reinforcement learning is not deep learning. It is an important branch of machine learning that focuses on sequential decision making [1-2], and the most advanced algorithms in many games are based on reinforcement learning. Befor AlphaGo, the best Go algorithm (such as CrazyStone and Zen) combines reinforcement learning with Monte Carlo Tree search (MCTS). The reinforcement learning focuses on two unique features of learning, compared to unsupervised learning and supervised learning Problem: Exploitation vs. exploitation and Temporal credit assignment. A reinforcement learning algorithm needs to answer two basic questions:

How to evaluate a policy
How to find an optimal strategy for the problem .

Deep learning in the application of the game, often by helping reinforcement learning to better solve these two problems (such as shown in Figure 1).

Fig. 1 A convolutional neural network learns a mapping from game screens to game policy

Deep Reinforcement Learning

The traditional study of Reinforcement Learning focuses on Tabular Representations or Linear Function Approximations. For large-scale and complex sequential decision-making processes in reality, simple tabular representation and linear approximation are not enough. Deep learning can provide end-to-end, non-linear function approximation[1] so that reinforcement learning can solve more complex problems(such as how to express the state of the board in Go). On the other hand, in order to solve some of the observable Markov decision problems (POMDP) [3-5] which are common in games, the algorithm of reinforcement learning needs effectively process the sequence of actions and observations data (such as effectively summarizing historical actions and observations into a state representation) to find the optimal strategy. In areas such as audio, video, and text deep learning has been shown to be successful as learning sequence, so it is also used to learn the state representation in POMDP. Reinforcement learning with deep learning optimization is known as deep reinforcement learning, and a review of deep reinforcement learning can be found in [6].

The Application of Deep Reinforcement Learning in Game

In recent years, deep reinforcement learning has been increasingly used in games, including a variety of single-player and multiplayer games. The advantage of deep learning is that the non-linear representation can be learned from a large amount of training data, so as to achieve better prediction effect. When deep learning is applied to a new field, we need to focus on three questions: what is the goal of the prediction? Training data come from? What does learning result mean?

It has been found that when deep learning is introduced into a game, its predictive goal is often the value or probability (i.e. the valuation and strategy) of each possible action in a game state. The training data is often derived from the process of playing games on the computer (more precisely, the sequence of state-action-reward records), the result of learning is often a non-linear representation of the state and strategy of the game.

Fig. 2 Using CNNs Play Super Mario

Atari Game

The Arcade Learning Environment (ALE) is a common evaluation platform for artificial intelligence researchers, and its role is similar to the ImageNet challenge in image classification [7]. ALE provides an Atari 2600 (ATARI 2600) simulator and about 50 games (mostly single-player) [7]. These games are built on the same 160-pixel, 210-pixel screen with 128 colors for each pixel. Each game has a different action Space, including up to 18 possible actions. To play well these games, a successful algorithm needs to solve the game state representation and strategy to choose two challenges, which is almost as deep reinforcement learning tailor-made: the deep reinforcement learning algorithm can learn the representation of the current game state from the high-dimensional, partially observable screen, and learn the optimal strategy from the sparse, highly delayed reward information (Reward).

The classic reinforcement learning algorithm assumes that its value function can be represented by a table, where each entry corresponds to a state or a state + pairing of actions. For example, the Q-learning algorithm uses a table to record the value of each state + action pair. These values are updated during the learning process using the following formula:

(1)	$\begin{array}{l}\displaystyle Q(s_t,a_t) = Q(s_t,a_t) + \alpha (r_t+ \gamma max_b Q(s_{t+1},b)- Q(s_t,a_t))\end{array}$

Where $\begin{array}{l}s_t\end{array}$ , $\begin{array}{l}a_t\end{array}$ , $\begin{array}{l}r_t\end{array}$ is the game state, action, and reward at time $\begin{array}{l}t\end{array}$ , and $\begin{array}{l}s_{t+1}\end{array}$ is the state at time $\begin{array}{l}t+1\end{array}$ .

Fig. 3 Pac Man from Atari

Such table representations can be applied to situations where the game states and actions are not too numerous.When the number of possible game states and actions is large, the key problem with reinforcement learning is not the amount of space required to store this large table, but the amount of time and data required to correctly fill in the target values in the table.So the real challenge here is generalization: how to get the values corresponding to possible actions in the unexperienced state space when only a limited game state is experienced. The generalization method commonly used in reinforcement learning is function approximation. Function approximation Obtains some input-to-output mappings from (unknown) objective functions and attempts to generalize them to the whole function definition field to construct an approximation to the entire objective function. For example, in a linear function approximation, the value function of an action is represented as:

(2)	$\begin{array}{l}\displaystyle Q(s,a;\theta) = \theta^T \varphi (s,a)\end{array}$

Where $\begin{array}{l}\theta\end{array}$ is the parameter that can be learned and $\begin{array}{l}\varphi(\cdot)\end{array}$ is the feature function defined on the pair of states + actions. The linear function approximation method extracts the artificially designed features from the most recent frame of the game frame, and uses linear combination of these features to express and learn value functions. Bellemare of the University of Alberta first used a linear function approximation SARSA algorithm on ALE and the following four new generic artificial design feature sets:

First, the screen is divided into blocks of intersection, for each block a vector is used to indicate whether each color is present or not and set the set of these vectors as the BASIC feature set of the current game screen.
Based on the basic feature set, we add pairing combinations of basic features, which constitute BASS feature set.
The objects on the screen are first extracted and classified by clustering. And the position and velocity of these objects are inferred from the information between the frames.Types, positions and speeds of all the objects on the screen constitute the DISCO feature set of the screen of current game;
Locality-sensitive Hashing is applied to the game screen, and the resulting low-dimensional representation is used as the LSH feature set[7].

They have proposed Contingency Awareness in the follow-up work, which is based on the original feature set Has additional features are added to indicate which elements of the screen are directly affected by the player's input.46 Bellemare et al later proposed a further extension of the feature set by using the tug-of-war sketch. But in general, the reinforcement learning algorithm based on linear value function approximation is much weaker than that of the human player. In addition, it is not an easy task to design the feature function manually.

DeepMind's Wang et al. Proposed a new "dueling" neural network architecture that separates the value functions of the game state from those of the state-dependent actions, which makes the valuations of the individual actions no longer independent. Actions sharing more generalized state-valued functions are useful for reducing the estimation bias for different actions.

Go

Go is one of the most complex and complete information games in the world.Compared with other board games (such as chess and checkers), Go's search space is very large, the description and evaluation are very difficult.As the convolution neural network Atari has been a huge success in the game, and people expect the same method can be used to solve the computer Go.It turns out, however, that it is difficult to approximate the strategy or value function of a computer on a straight-line basis, either because the strategy and situation of the game are too complex, since the true value of each checkerboard state is Unknown, if there is not enough good initial strategy as the basis of reinforcement learning, the use of inaccurate action and reward information training neural network may not be able to do reliable prediction.In order to solve this problem, the researchers thought of using human chess manual do the training set of the deep neural network because the choice of each move in chess and the outcome of each chess game are known.The researchers expect that the strategies learned from the human chess are more suitable as the initial strategy of reinforcement learning .

Clark et al, of the University of Edinburgh, used the historical go manual of human players to train convolutional neural networks to predict the strategy of the human player (ie, the next step of the current situation) .This neural network, also known as the strategy network. These strategies use a feature set designed specifically for Go. For example, symmetric information is hard-coded into the input of the deep neural network, and the unreasonable position on the chessboard is blocked. Even if the prediction accuracy is less than 50%, the trained convolutional neural network can defeat the well-known Go program GnuGo (but lost to the more advanced Go program Fuego) [8].

Maddison et al. of the University of Toronto and the researchers from DeepMind have trained a 12-layer deep convolution neural network with a network warfare platform (KGS), which achieved 55% accuracy by predicting a human player's strategy. The well trained convolutional network can achieve a 97% winning rate for GnuGo and can match the most advanced Monte Carlo tree search algorithm that simulates 2 million times per step[9]. A similar approach has also been used by Tian et al. in Facebook developed the computer Go program Dark Forest (DarkForest) [10].

The strategy network trained from the human chess makes the basic strategy sufficiently good, which makes the state-action-reward information observed by the reinforcement learning algorithm in the self-game more close to the ground-truth of the optimal strategy. AlphaGo then uses a self-checker's board state and the outcome to train a value network to predict the value of any checkerboard state. This value network further enhances the AlphaGo gameplay. To keep the training data as independent as possible, only one situation is selected for training. This idea is similar to the random selection of experience playback in Atari games as much as possible.

On the other hand, Monte Carlo tree search is still the most advanced algorithms for dealing with large and complex sequential decisions, including computer chess. As the number of stochastic simulations and the search tree become larger and the estimate of the value becomes more accurate. However, its computational cost is too large for the actual Go game. Researchers have also attempted to use Monte Carlo tree search with deep learning (or simpler linear learning) to increase the accuracy of predictions or to reduce the computational cost of searching. AlphaGo introduced a new Monte Carlo tree Search algorithm. The algorithm combines Monte Carlo simulation with a network of values and strategies to reduce the computational cost. Value networks are used to assess the current state of the board (ie, to predict the value of a given situation), thus effectively reducing the depth of the search plan. The strategy network is then used to select the next step (ie, to predict the probability distribution of the next move for a given situation), thereby effectively reducing the search width. The training of these networks is a sophisticated combination of supervised learning of human expert action and intensive learning of self-play, and AlphaGo takes the lead in the European Go champion and world-class player Li Shi-shih .

Fig. 4 AlphaGo vs. Lee Sedol

Poker

Poker game challenge comes from incomplete information, that players can only obsere part of the historical events, and the opponent's information is unvisible. Bowling et al. of the University of Alberta, who proposed a near-Nash equilibrium solution for Texas Hold'em poker games, which is based on the idea of using two Regret minimization algorithms to repeatedly self-match. Last year, Yakovenko ce al from Columbia University trained a convolutional neural network for poker in a similar fashion and were able to compete against human experts in three common poker games. Heinrich et al. from DeepMind and Silver et al. [8] have proposed a more robust reinforcement learning algorithm that is more suitable for incomplete information games called "neural fictitious self-play" (NFSP).The main idea of NFSP is the classical "Fictitious Play" model in approximation game theory, which does not rely on a priori knowledge and can converge to Nash equilibrium through self-game in zero-sum games or potential games.

Conclusion

Game Theory has always been an important branch of artificial intelligence.The success of deep learning in other areas brought to unprecedented inspiration to the game artificial intelligence. For example, the computer Go has experienced the earliest rule-based algorithm, heuristic-based situation assessment algorithm, Monte Carlo tree-based search algorithm, reinforcement learning algorithm, and finally got a qualitative leap by the deep reinforcement learning. In the near future, we will see the deep learning algorithms and ideas applied to more and more games. The future research direction should be from the classic game gradually biased towards more complex, more complete information multiplayer games, especially video games. Deep learning in the game still has many problems to be solved, such as how to generalize the acquired knowledge to new issues, how to make more effective use of expert knowledge, how to learn new knowledge in the game, how to change the game according to the opponent's strategy and so on. We also expect to see successful game algorithms being applied across a wide range of industries, such as health and education, to truly improve the lives of ordinary people.

Literature

Sutton R S, Barto A G. Reinforcement Learning: An Introduction. Cambridge: MIT Press, 1998.
Kaelbling L P, Littman M L, Moore A W. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 1996, 4: 237−285
Hausknecht M, Stone P. Deep recurrent q-learning for partially observable MDPS. In: Proceedings of the 2015 AAAI Fall Symposium Series. The Westin Arlington Gateway, Arlington, Virginia: AIAA, 2015.
Bakker B, Zhumatiy V, Gruener G, Schmidhuber J. A robot that reinforcement-learns to identify and memorize important previous observations. In: Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. Manno-Lugano, Switzerland: IEEE, 2003
Wierstra D, F¨orster A, Peters J, Schmidhuber J. Recurrent policy gradients. Logic Journal of IGPL, 2010, 18(5): 620−634
Schmidhuber J. Deep learning in neural networks: an overview. Neural Networks, 2015, 61: 85−117
Bellemare M, Naddaf Y, Veness J, Bowling M. The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research, 2013, 47: 253−279
Clark C, Storkey A. Training deep convolutional neural networks to play go. In: Proceedings of the 32nd International Conference on Machine Learning. Lille, France: ICML, 2015.
Maddison C J, Huang A, Sutskever I, Silver D. Move evaluation in Go using deep convolutional neural networks. In: Proceedings of the 2014 International Conference on Learning Representations. Rimrock Resort Hotel, Banff, Canada: ICRR, 2014.
Tian Y D, Zhu Y. Better computer go player with neural network and long-term prediction. In: Proceeding of the 2016 International Conference on Learning Representations. Caribe Hilton, San Juan, Puerto Rico: ICLR, 2016.

Seitenhierarchie