This repository provides starter code and instructions for two tutorials on reinforcement learning (RL) in games. Each tutorial isolates the essential parts of the RL algorithm into TODO sections, so you can fill them in and run a working AI program without implementing routine components such as the game logic.
You will implement the temporal difference (TD) learning algorithm to train a value function for the game 2048.
The goal is to train an agent that can reliably reach the 2048 tile.
You will implement Monte Carlo Tree Search with PUCT selection in an AlphaZero framework.
The goal is to train the agents to play Connect4 and TicTacToe against a random agent baseline.
- Temporal Difference Learning and afterstate updates
- Value function approximation using tuple networks
- Monte Carlo Tree Search for policy improvement
- AlphaZero training pipeline (self-play, optimization)
Train a strong 2048 player using the TD(0) learning algorithm.
def select_best_move(self, b : board) -> move:
    # ============== TODO ==============
    # hint: use self.estimate(b) to retrieve V(b)
    moves = [ move(b, opcode) for opcode in range(4) ]
    random.shuffle(moves)
    for mv in moves:
        if mv.is_valid():
            return mv # select a legal move randomly
    return move() # no legal move
- Iterate over the four possible move directions.
- Exclude illegal moves.
- Return the move with the highest $r + V(s')$ (see the sketch below).
- Expected result:
  - Average score > 1100.
  - The maximum tile should reach 512 or 1024.
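A minimal sketch of one possible completion is shown below. It assumes the `move` object exposes `reward()` and `afterstate()` accessors for the merge reward $r$ and the resulting board $s'$; these names are placeholders, so adjust them to the actual API in the starter code.

```python
def select_best_move(self, b: board) -> move:
    best_mv, best_value = move(), -float('inf')   # default: no legal move
    for opcode in range(4):                       # iterate the four directions
        mv = move(b, opcode)
        if not mv.is_valid():                     # exclude illegal moves
            continue
        # evaluate r + V(s') with the afterstate value function
        # (reward() and afterstate() are assumed accessor names)
        value = mv.reward() + self.estimate(mv.afterstate())
        if value > best_value:
            best_mv, best_value = mv, value
    return best_mv
```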
def learn_from_episode(self, path : list[move], alpha : float = 0.1) -> None:
    # ============== TODO ==============
    # hint: use self.estimate(b) to retrieve V(b);
    #       use self.update(b, u) to update V(b) with an error u
- For each afterstate $s'_t$, update: $V(s'_t) \leftarrow V(s'_t) + \alpha\,\big(r_{t+1} + V(s'_{t+1}) - V(s'_t)\big)$ (see the sketch below).
- For the last afterstate: $V(s'_{T-1}) \leftarrow V(s'_{T-1}) + \alpha\,\big(0 - V(s'_{T-1})\big)$
- Expected result:
  - Average score > 3000 after 100 training games.
  - A 2048 tile should appear within 2000 training games.
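Below is a minimal sketch of the backward TD(0) pass, under the same assumption that each `move` in `path` exposes `reward()` and `afterstate()`. It also assumes `self.update(b, u)` expects the already-scaled error, so check whether the starter code applies the learning rate inside `update` instead. Walking the episode backwards lets each update use the freshly updated $V(s'_{t+1})$.

```python
def learn_from_episode(self, path: list[move], alpha: float = 0.1) -> None:
    target = 0.0                                  # last afterstate s'_{T-1} has target 0
    for mv in reversed(path):                     # walk the episode backwards
        s_after = mv.afterstate()                 # assumed accessor for s'_t
        error = target - self.estimate(s_after)   # TD error for V(s'_t)
        self.update(s_after, alpha * error)       # apply the scaled error
        # the target for the preceding afterstate s'_{t-1} is r_t + V(s'_t),
        # using the value that was just updated
        target = mv.reward() + self.estimate(s_after)
```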
Beyond the tutorials, you might want to dive deeper and explore more:
- Features of N-Tuple Network: experiment with different tuple architectures to improve performance.
- Expectimax Search: implement a lookahead search procedure for better action selection.
Train a strong AlphaZero-based agent for Connect4 and TicTacToe using MCTS with PUCT.
def select_child(self, parent: Node) -> Node:
    # ============== TODO ==============
    # hint: select the child with the highest PUCT score
    # hint: self.PUCT_C1 and self.PUCT_C2 are PUCT constants
    best_child = np.random.choice(parent.children)  # placeholder: random child
    return best_child
- Select the best child node by $\arg\max_{a}\,\big(Q(s,a) + U(s,a)\big)$, where $U(s,a) = P(s,a)\,\frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}\left[c_1 + \log\!\left(\frac{\sum_b N(s,b) + c_2 + 1}{c_2}\right)\right]$ (see the sketch below).
- $N(s,a)$ is the visit count of node $s$ when taking action $a$.
- $\sum_b N(s,b)$ is the total visit count over all actions $b$ at node $s$, which typically equals the visit count of the parent node.
- If multiple child nodes have the same $Q(s,a) + U(s,a)$ score, select the one with the highest prior $P(s,a)$.
- Expected result:
  - For TicTacToe, after about 50 training iterations, draws should become the most frequent outcome.
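A minimal sketch of PUCT selection is shown below. The `visit_count`, `prior`, and `q_value` attributes on `Node` are assumptions standing in for $N(s,a)$, $P(s,a)$, and $Q(s,a)$; rename them to match the actual `Node` fields in the starter code.

```python
import math  # assumed available; add at module level

def select_child(self, parent: Node) -> Node:
    total_visits = sum(child.visit_count for child in parent.children)  # sum_b N(s,b)
    best_child, best_score, best_prior = None, -float('inf'), -float('inf')
    for child in parent.children:
        # exploration factor: c1 + log((sum_b N(s,b) + c2 + 1) / c2)
        exploration = self.PUCT_C1 + math.log(
            (total_visits + self.PUCT_C2 + 1) / self.PUCT_C2)
        # U(s,a) = P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)) * exploration
        u = child.prior * math.sqrt(total_visits) / (1 + child.visit_count) * exploration
        score = child.q_value + u
        # break ties by the larger prior P(s,a)
        if score > best_score or (score == best_score and child.prior > best_prior):
            best_child, best_score, best_prior = child, score, child.prior
    return best_child
```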
Beyond the tutorials, you might want to dive deeper and explore more:
- Network architecture: add convolutional or residual layers.
- Feature design: add history channels or new board encodings.
- M. Szubert and W. Jaśkowski, "Temporal difference learning of N-tuple networks for the game 2048," CIG 2014.
- I-C. Wu, K.-H. Yeh, C.-C. Liang, C.-C. Chang, and H. Chiang, "Multi-stage temporal difference learning for 2048," TAAI 2014.
- K. Matsuzaki, "Systematic selection of N-tuple networks with consideration of interinfluence for game 2048," TAAI 2016.
- D. Silver et al., "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science 362, 2018.