Website about Evan (Z) Wang. Undergraduate student studying computer science at UMD. Produces music sometimes.
Developed an environment to train models to play “connect” games like Five-in-a-Row, Tic-Tac-Toe, and Connect-4. Click here to play against the trained agent (Five-in-a-Row).
The overall algorithm was inspired by AlphaGo Zero and AlphaZero.
The overall training environment and pipeline were written in Python; the neural networks were built and trained with PyTorch. ONNX.js was used to port the neural networks to JavaScript.
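For reference, exporting a trained PyTorch model to an ONNX file that ONNX.js can load in the browser typically looks something like the sketch below. The tiny network, the three input planes, and the file name are placeholders for illustration, not the project’s actual architecture or names.

```python
import torch
import torch.nn as nn

# Placeholder network standing in for the actual trained model (the real
# architecture, input planes, and output heads are not shown here).
class TinyNet(nn.Module):
    def __init__(self, board_size=10):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.policy = nn.Linear(8 * board_size * board_size, board_size * board_size)
        self.value = nn.Linear(8 * board_size * board_size, 1)

    def forward(self, x):
        h = torch.relu(self.conv(x)).flatten(1)
        return self.policy(h), torch.tanh(self.value(h))

model = TinyNet()
dummy_input = torch.zeros(1, 3, 10, 10)   # one 10x10 board with 3 feature planes (assumed shape)
torch.onnx.export(
    model,
    dummy_input,
    "five_in_a_row.onnx",                 # file a browser page would load via ONNX.js
    input_names=["board"],
    output_names=["policy", "value"],
)
```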
You can play against a trained agent for Five-in-a-Row on a 10×10 board here.
The code (for training and low-level gameplay) can be found at https://github.com/evanzwang/connect-ai.
The agent uses a modified Monte Carlo Tree Search (MCTS) algorithm, as specified in AlphaZero.
The algorithm builds a tree of possible game states. When it reaches a state it hasn’t seen before, the neural network predicts a state value and search probabilities (a prior over the possible moves) for that state. The search then stops, and the predicted value is propagated back up the tree, updating the visit counts and value estimates of the moves along the path.
However, if the state has already been visited (the neural network has already predicted its value and move probabilities), the algorithm instead selects the next move to explore, favoring moves with high predicted probability and value that have been visited relatively few times. The search then continues from the resulting board state until it reaches a state it hasn’t seen before.
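Here is a rough sketch of what one such simulation can look like in Python. The `Node` class and the `state.is_terminal()`, `state.legal_actions()`, `state.play()`, and `net.predict()` names are illustrative stand-ins rather than the project’s actual interfaces; the selection rule shown is the PUCT formula used by AlphaZero.

```python
import math

class Node:
    """One move/state in the search tree (names here are illustrative)."""
    def __init__(self, prior):
        self.prior = prior        # move probability predicted by the network
        self.visit_count = 0      # how many simulations passed through this move
        self.value_sum = 0.0      # running total of backed-up values
        self.children = {}        # action -> Node

    def value(self):
        # Average value of this move so far (0 if never visited).
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def simulate(node, state, net, c_puct=1.5):
    """One MCTS simulation; returns the position's value for the player to move."""
    # Finished games just return the known result (assumed interface).
    if state.is_terminal():
        return state.terminal_value()

    # A state seen for the first time: ask the network for a value and move
    # probabilities, expand the node, and stop the descent here.
    if not node.children:
        priors, value = net.predict(state)            # assumed interface
        for action in state.legal_actions():          # assumed interface
            node.children[action] = Node(priors[action])
        return value

    # An already-visited state: pick the next move with the PUCT rule, which
    # trades off the average value against a prior-weighted exploration bonus.
    total_visits = sum(c.visit_count for c in node.children.values())
    action, child = max(
        node.children.items(),
        key=lambda item: item[1].value()
        + c_puct * item[1].prior * math.sqrt(total_visits + 1) / (1 + item[1].visit_count),
    )

    # Continue the search from the resulting board state. The returned value
    # is from the opponent's point of view, so negate it before backing it up.
    value = -simulate(child, state.play(action), net, c_puct)
    child.visit_count += 1
    child.value_sum += value
    return value
```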
This search is run for a set number of iterations (around 250 per move). The move that was visited the most during the search is then selected as the actual move to play.
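The outer loop that runs the simulations and picks the most-visited move, again as an illustrative sketch built on the function above:

```python
def choose_move(state, net, num_simulations=250):
    """Search from the current position, then play the most-visited move."""
    root = Node(prior=1.0)
    for _ in range(num_simulations):
        simulate(root, state, net)
    # The move explored the most during the search is the one actually played.
    return max(root.children, key=lambda a: root.children[a].visit_count)
```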
To train the neural network (from the previous section), data is collected from many games of self-play, where the same agent plays both sides (around 8,000 games). For every move, three things are stored: the board state, the search probabilities (the normalized visit counts from the tree search), and, once the game finishes, the game’s result from that player’s perspective.
After a set number of games (10), the neural network parameters are updated with backpropagation using the collected data. First, each stored board state is fed into the neural network, which outputs a state value and search probabilities for that position. The network’s value (ranging from -1 to 1) is trained to match 1 if the player went on to win and -1 if the player lost (mean-squared error). The network’s output probabilities are trained to match the search probabilities recorded during self-play, i.e. the normalized visit counts of the selected actions (cross-entropy).
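A minimal sketch of what one such update step can look like in PyTorch. The `net`, the batch layout, and the output-head shapes are assumptions for illustration, not the project’s actual code; the two loss terms mirror the value and policy targets described above.

```python
import torch
import torch.nn.functional as F

def training_step(net, optimizer, batch):
    """One gradient update on a batch of self-play data.

    `batch` is assumed to hold:
      states     - encoded board positions, shape (B, ...)
      target_pi  - normalized MCTS visit counts per move, shape (B, num_moves)
      target_z   - game outcome for the player to move (+1 win, -1 loss), shape (B,)
    """
    states, target_pi, target_z = batch

    # The network is assumed to return move logits and a scalar value in [-1, 1].
    policy_logits, value = net(states)

    # Value head: mean-squared error against the actual game result.
    value_loss = F.mse_loss(value.squeeze(-1), target_z)

    # Policy head: cross-entropy against the recorded search probabilities.
    policy_loss = -(target_pi * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()

    loss = value_loss + policy_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```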