n-Step TD Prediction

Medium
Reinforcement Learning

Implement n-step Temporal Difference (TD) prediction for estimating state values. n-step TD bridges the gap between TD(0), which bootstraps after a single step, and Monte Carlo, which waits until the episode ends. By looking n steps ahead before bootstrapping, n-step TD often learns more efficiently than either extreme. Your task is to compute the n-step return for each visited state and update the state values accordingly.
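In standard RL notation (R for rewards, V for the value function, S for states), the n-step return and its update rule can be written as:

```latex
G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})
```

```latex
V(S_t) \leftarrow V(S_t) + \alpha \left[ G_{t:t+n} - V(S_t) \right]
```

If the episode terminates before step t+n, the return is truncated at the terminal step and the bootstrap term is dropped.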

Examples

Example 1:
Input: states=[0,1,2], rewards=[1,1,1], V={0:0, 1:0, 2:0, 3:0}, n=2, gamma=0.9, alpha=0.1
Output: V[0] updated using G = r1 + gamma*r2 + gamma^2*V[2] = 1 + 0.9*1 + 0.81*0 = 1.9
Explanation: For state 0 with n=2, we look 2 steps ahead: get rewards r1=1 and r2=1, then bootstrap from V[S2]=V[2]=0. The 2-step return is 1.9, so V[0] += 0.1*(1.9 - 0) = 0.19.
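The arithmetic in Example 1 can be checked directly with a few lines of Python (variable names here are illustrative, not part of the required solution):

```python
gamma, alpha = 0.9, 0.1
r1, r2 = 1.0, 1.0
V2 = 0.0  # current estimate of V[2], the bootstrap state

G = r1 + gamma * r2 + gamma**2 * V2   # 2-step return: 1.9
new_V0 = 0.0 + alpha * (G - 0.0)      # updated V[0]: 0.19
```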

Starter Code

import numpy as np

def n_step_td_prediction(
    episodes: list[list[tuple[int, float]]],
    n_states: int,
    n: int,
    gamma: float,
    alpha: float
) -> np.ndarray:
    """
    Perform n-step TD prediction to estimate state values.
    
    Args:
        episodes: List of episodes. Each episode is a list of (state, reward) tuples.
                 The reward at index i is the reward received AFTER leaving state i.
                 Episodes end with a terminal transition (last state's reward is the final reward).
        n_states: Number of states (states are integers 0 to n_states-1)
        n: Number of steps for n-step TD
        gamma: Discount factor
        alpha: Learning rate
        
    Returns:
        V: Estimated state values as numpy array of shape (n_states,)
    """
    # Your code here
    pass
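
One way the starter code could be completed is sketched below. It assumes an offline pass over each episode, updating states in time order; for a single episode this matches the update order of the online n-step TD algorithm, since each bootstrap value V[S_{t+n}] is read before that state's own update. This is a sketch under those assumptions, not the only valid solution:

```python
import numpy as np

def n_step_td_prediction(episodes, n_states, n, gamma, alpha):
    V = np.zeros(n_states)
    for episode in episodes:
        states = [s for s, _ in episode]
        rewards = [r for _, r in episode]
        T = len(episode)
        for t in range(T):
            # Rewards r_{t+1}..r_{t+n} are stored at indices t..t+n-1;
            # truncate at episode end if fewer than n steps remain.
            end = min(t + n, T)
            G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
            if end < T:
                # Episode did not terminate within n steps: bootstrap
                # from the value of the state n steps ahead.
                G += gamma ** n * V[states[end]]
            # Move V[S_t] toward the n-step return.
            V[states[t]] += alpha * (G - V[states[t]])
    return V
```

Running this on Example 1 (episode `[(0, 1.0), (1, 1.0), (2, 1.0)]`, n=2, gamma=0.9, alpha=0.1) reproduces the expected update V[0] = 0.19.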