This repository provides starter code and instructions for two tutorials on reinforcement learning (RL) in games. Each tutorial isolates the essential parts of the RL algorithm into TODO sections, so you can fill them in and run a working AI program without implementing routine components such as the game logic.
You will implement the temporal difference (TD) learning algorithm to train a value function for the game 2048.
The goal is to train an agent that can reliably reach the 2048 tile.
You will implement Monte Carlo Tree Search with PUCT selection in an AlphaZero framework.
The goal is to train the agents to play Connect4 and TicTacToe against a random agent baseline.
- Temporal Difference Learning and afterstate updates
- Value function approximation using tuple networks
- Monte Carlo Tree Search for policy improvement
- AlphaZero training pipeline (self-play, optimization)
Train a strong 2048 player using the TD(0) learning algorithm.
def select_best_move(self, b : board) -> move:
    # ============== TODO ==============
    # hint: use self.estimate(b) to retrieve V(b)
    moves = [ move(b, opcode) for opcode in range(4) ]
    random.shuffle(moves)
    for mv in moves:
        if mv.is_valid():
            return mv # select a legal move randomly
    return move() # no legal move
- Iterate over the four possible move directions.
- Exclude illegal moves.
- Return the move with the highest $r + V(s')$ (see the sketch below).
- Expected result:
  - Average score > 1100.
  - The maximum tile should reach 512 or 1024.
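A minimal sketch of one possible completion is shown below. It assumes the `move` object exposes `reward()` and `afterstate()` accessors for the merge reward $r$ and the resulting board $s'$; these names are placeholders, so adjust them to the actual API in the starter code.

```python
def select_best_move(self, b: board) -> move:
    best_mv, best_value = move(), -float('inf')   # default: no legal move
    for opcode in range(4):                       # iterate the four directions
        mv = move(b, opcode)
        if not mv.is_valid():                     # exclude illegal moves
            continue
        # evaluate r + V(s') with the afterstate value function
        # (reward() and afterstate() are assumed accessor names)
        value = mv.reward() + self.estimate(mv.afterstate())
        if value > best_value:
            best_mv, best_value = mv, value
    return best_mv
```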
def learn_from_episode(self, path : list[move], alpha : float = 0.1) -> None:
    # ============== TODO ==============
    # hint: use self.estimate(b) to retrieve V(b);
    #       use self.update(b, u) to update V(b) with an error u
- For each afterstate $s'_t$, update: $V(s'_t) \leftarrow V(s'_t) + \alpha\,\big(r_{t+1} + V(s'_{t+1}) - V(s'_t)\big)$ (see the sketch below).
- For the last afterstate: $V(s'_{T-1}) \leftarrow V(s'_{T-1}) + \alpha\,\big(0 - V(s'_{T-1})\big)$
- Expected result:
  - Average score > 3000 after 100 training games.
  - A 2048 tile should appear within 2000 training games.
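Below is a minimal sketch of the backward TD(0) pass, under the same assumption that each `move` in `path` exposes `reward()` and `afterstate()`. It also assumes `self.update(b, u)` expects the already-scaled error, so check whether the starter code applies the learning rate inside `update` instead. Walking the episode backwards lets each update use the freshly updated $V(s'_{t+1})$.

```python
def learn_from_episode(self, path: list[move], alpha: float = 0.1) -> None:
    target = 0.0                                  # last afterstate s'_{T-1} has target 0
    for mv in reversed(path):                     # walk the episode backwards
        s_after = mv.afterstate()                 # assumed accessor for s'_t
        error = target - self.estimate(s_after)   # TD error for V(s'_t)
        self.update(s_after, alpha * error)       # apply the scaled error
        # the target for the preceding afterstate s'_{t-1} is r_t + V(s'_t),
        # using the value that was just updated
        target = mv.reward() + self.estimate(s_after)
```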
Beyond the tutorials, you might want to dive deeper and explore more:
- Features of N-Tuple Network: experiment with different tuple architectures to improve performance.
- Expectimax Search: implement a lookahead search procedure for better action selection.
Train a strong AlphaZero-based agent for Connect4 and TicTacToe using MCTS with PUCT.
def select_child(self, parent: Node) -> Node:
    # ============== TODO ==============
    # hint: select the child with the highest PUCT score
    # hint: self.PUCT_C1 and self.PUCT_C2 are PUCT constants
    best_child = np.random.choice(parent.children)  # placeholder: random child
    return best_child
- Select the best child node by $\arg\max_{a}\,\big(Q(s,a) + U(s,a)\big)$, where $U(s,a) = P(s,a)\,\frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}\left[c_1 + \log\!\left(\frac{\sum_b N(s,b) + c_2 + 1}{c_2}\right)\right]$ (see the sketch below).
- $N(s,a)$ is the visit count of node $s$ when taking action $a$.
- $\sum_b N(s,b)$ is the total visit count over all actions $b$ at node $s$, which typically equals the visit count of the parent node.
- If multiple child nodes have the same $Q(s,a) + U(s,a)$ score, select the one with the highest prior $P(s,a)$.
- Expected result:
  - For TicTacToe, after about 50 training iterations, draws should become the most frequent outcome.
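A minimal sketch of PUCT selection is shown below. The `visit_count`, `prior`, and `q_value` attributes on `Node` are assumptions standing in for $N(s,a)$, $P(s,a)$, and $Q(s,a)$; rename them to match the actual `Node` fields in the starter code.

```python
import math  # assumed available; add at module level

def select_child(self, parent: Node) -> Node:
    total_visits = sum(child.visit_count for child in parent.children)  # sum_b N(s,b)
    best_child, best_score, best_prior = None, -float('inf'), -float('inf')
    for child in parent.children:
        # exploration factor: c1 + log((sum_b N(s,b) + c2 + 1) / c2)
        exploration = self.PUCT_C1 + math.log(
            (total_visits + self.PUCT_C2 + 1) / self.PUCT_C2)
        # U(s,a) = P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)) * exploration
        u = child.prior * math.sqrt(total_visits) / (1 + child.visit_count) * exploration
        score = child.q_value + u
        # break ties by the larger prior P(s,a)
        if score > best_score or (score == best_score and child.prior > best_prior):
            best_child, best_score, best_prior = child, score, child.prior
    return best_child
```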
Beyond the tutorials, you might want to dive deeper and explore more:
- Network architecture: add convolutional or residual layers.
- Feature design: add history channels or new board encodings.
- M. Szubert and W. Jaśkowski, "Temporal difference learning of N-tuple networks for the game 2048," CIG 2014.
- I-C. Wu, K.-H. Yeh, C.-C. Liang, C.-C. Chang, and H. Chiang, "Multi-stage temporal difference learning for 2048," TAAI 2014.
- K. Matsuzaki, "Systematic selection of N-tuple networks with consideration of interinfluence for game 2048," TAAI 2016.
- D. Silver et al., "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science 362, 2018.