Implement n-step Temporal Difference (TD) prediction for estimating state values. n-Step TD bridges the gap between TD(0), which bootstraps after one step, and Monte Carlo, which waits until the episode ends. By looking n steps ahead before bootstrapping, n-step TD can learn more efficiently in many environments. Your task is to compute the n-step return and update state values accordingly.
Examples
Example 1:
Input:
states=[0,1,2], rewards=[1,1,1], V={0:0, 1:0, 2:0, 3:0}, n=2, gamma=0.9, alpha=0.1
Output:
V[0] updated using G = r1 + gamma*r2 + gamma^2*V[2] = 1 + 0.9*1 + 0.81*0 = 1.9
Explanation: For state 0 with n=2, we look 2 steps ahead: get rewards r1=1 and r2=1, then bootstrap from V[S2]=V[2]=0. The 2-step return is 1.9, so V[0] += 0.1*(1.9 - 0) = 0.19.
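The arithmetic in the example can be checked directly with a short snippet (variable names are illustrative, not part of the required solution):

```python
gamma = 0.9
alpha = 0.1
rewards = [1, 1, 1]
V = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0}

# 2-step return from state 0: r1 + gamma*r2 + gamma^2 * V[S2]
G = rewards[0] + gamma * rewards[1] + gamma**2 * V[2]   # ≈ 1.9

# TD update for state 0
V[0] += alpha * (G - V[0])                              # ≈ 0.19
```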
Starter Code
import numpy as np

def n_step_td_prediction(
    episodes: list[list[tuple[int, float]]],
    n_states: int,
    n: int,
    gamma: float,
    alpha: float,
) -> np.ndarray:
    """
    Perform n-step TD prediction to estimate state values.

    Args:
        episodes: List of episodes. Each episode is a list of (state, reward) tuples.
            The reward at index i is the reward received AFTER leaving state i.
            Episodes end with a terminal transition (last state's reward is the final reward).
        n_states: Number of states (states are integers 0 to n_states-1)
        n: Number of steps for n-step TD
        gamma: Discount factor
        alpha: Learning rate

    Returns:
        V: Estimated state values as numpy array of shape (n_states,)
    """
    # Your code here
    pass
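As a rough guide, one way to fill in the starter function is sketched below. It applies the n-step updates in a single forward pass over each episode, bootstrapping from the current value estimate of the state n steps ahead when one exists; the function name and structure are assumptions for illustration, not the official reference solution.

```python
import numpy as np

def n_step_td_prediction_sketch(episodes, n_states, n, gamma, alpha):
    # Illustrative sketch, not the official solution.
    V = np.zeros(n_states)
    for episode in episodes:
        states = [s for s, _ in episode]
        rewards = [r for _, r in episode]
        T = len(episode)
        for t in range(T):
            # Sum up to n discounted rewards starting at time t.
            G = 0.0
            for k in range(min(n, T - t)):
                G += gamma**k * rewards[t + k]
            # If the episode has not ended within n steps,
            # bootstrap from the value of the state n steps ahead.
            if t + n < T:
                G += gamma**n * V[states[t + n]]
            # TD update toward the n-step return.
            V[states[t]] += alpha * (G - V[states[t]])
    return V
```

On Example 1's episode [(0, 1), (1, 1), (2, 1)] with n=2, gamma=0.9, alpha=0.1, this updates V[0] to 0.19, matching the worked calculation above.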