Implement TD(λ) prediction with eligibility traces for estimating state values. TD(λ) unifies TD(0) and Monte Carlo methods through a decay parameter λ. Eligibility traces track which states are 'eligible' for learning—recently visited states get more credit for the current TD error. When λ=0, this reduces to TD(0); when λ=1, it approximates Monte Carlo. Your task is to implement the backward view of TD(λ) using accumulating eligibility traces.
Examples
Example 1:
Input:
episode=[(0,1), (1,1), (2,0)], gamma=0.9, lambda=0.8, alpha=0.1
Output:
States 0, 1, 2 updated with decaying eligibility
Explanation: At each step, we compute the TD error δ, update the eligibility traces (decay all traces by γλ, then increment the current state's trace), and update every state value in proportion to its eligibility: V(s) += α * δ * e(s).
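The per-step update described above can be demonstrated on one transition. The numbers below are illustrative (they happen to correspond to the second step of the example episode, after the first update), not part of the problem statement:

```python
import numpy as np

gamma, lambd, alpha = 0.9, 0.8, 0.1
V = np.array([0.1, 0.0, 0.0])  # current value estimates
e = np.array([1.0, 0.0, 0.0])  # eligibility traces: state 0 was just visited

s, r, s_next = 1, 1.0, 2            # leave state 1, receive reward 1
delta = r + gamma * V[s_next] - V[s]  # TD error: 1 + 0.9*0 - 0 = 1.0
e = gamma * lambd * e               # decay all traces: [0.72, 0, 0]
e[s] += 1.0                         # accumulate current state: [0.72, 1, 0]
V = V + alpha * delta * e           # V becomes [0.172, 0.1, 0.0]
```

Note that state 0 is still updated even though the transition left state 1: its decayed trace of 0.72 gives it partial credit for the TD error.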
Starter Code
import numpy as np

def td_lambda_prediction(
    episodes: list[list[tuple[int, float]]],
    n_states: int,
    gamma: float,
    lambd: float,
    alpha: float
) -> np.ndarray:
    """
    Estimate state values using TD(λ) with accumulating eligibility traces.

    Args:
        episodes: List of episodes. Each episode is a list of (state, reward) tuples.
            The reward at index i is the reward received AFTER leaving state i.
        n_states: Number of states (states are integers 0 to n_states-1)
        gamma: Discount factor
        lambd: Trace decay parameter (λ). Use 'lambd' to avoid Python keyword.
        alpha: Learning rate

    Returns:
        V: Estimated state values as numpy array of shape (n_states,)
    """
    # Your code here
    pass
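A minimal sketch of one possible solution, assuming the value of the state following the last transition in an episode is 0 (terminal):

```python
import numpy as np

def td_lambda_prediction(
    episodes: list[list[tuple[int, float]]],
    n_states: int,
    gamma: float,
    lambd: float,
    alpha: float
) -> np.ndarray:
    V = np.zeros(n_states)
    for episode in episodes:
        e = np.zeros(n_states)  # fresh traces at the start of each episode
        for i, (s, r) in enumerate(episode):
            # Bootstrap from the next state's value; 0 at the terminal transition
            v_next = V[episode[i + 1][0]] if i + 1 < len(episode) else 0.0
            delta = r + gamma * v_next - V[s]  # TD error
            e *= gamma * lambd                 # decay all traces
            e[s] += 1.0                        # accumulating trace for current state
            V += alpha * delta * e             # update all states by eligibility
    return V
```

On the worked example, `td_lambda_prediction([[(0, 1.0), (1, 1.0), (2, 0.0)]], 3, 0.9, 0.8, 0.1)` yields approximately [0.172, 0.1, 0.0]. With `lambd=0` the trace decay zeroes all past traces each step, so the update reduces exactly to TD(0).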