TD(λ) with Eligibility Traces

Hard
Reinforcement Learning

Implement TD(λ) prediction with eligibility traces for estimating state values. TD(λ) unifies TD(0) and Monte Carlo methods through a decay parameter λ. Eligibility traces track which states are 'eligible' for learning—recently visited states get more credit for the current TD error. When λ=0, this reduces to TD(0); when λ=1, it approximates Monte Carlo. Your task is to implement the backward view of TD(λ) using accumulating eligibility traces.

Examples

Example 1:
Input: episodes=[[(0,1), (1,1), (2,0)]], n_states=3, gamma=0.9, lambd=0.8, alpha=0.1
Output: V ≈ [0.172, 0.1, 0.0]
Explanation: At each step, we compute the TD error δ = r + γV(s') − V(s), decay all eligibility traces by γλ, increment the trace of the current state by 1, then update every state value in proportion to its trace: V(s) += α · δ · e(s). Here only the first two steps have δ = 1 (the final step's δ is 0), so state 0 receives credit twice—once directly and once through its decayed trace γλ = 0.72—giving V(0) = 0.1 + 0.1·0.72 = 0.172.

Starter Code

import numpy as np

def td_lambda_prediction(
    episodes: list[list[tuple[int, float]]],
    n_states: int,
    gamma: float,
    lambd: float,
    alpha: float
) -> np.ndarray:
    """
    Estimate state values using TD(λ) with accumulating eligibility traces.
    
    Args:
        episodes: List of episodes. Each episode is a list of (state, reward) tuples.
                 The reward at index i is the reward received AFTER leaving state i.
        n_states: Number of states (states are integers 0 to n_states-1)
        gamma: Discount factor
        lambd: Trace decay parameter (λ). Use 'lambd' to avoid Python keyword.
        alpha: Learning rate
        
    Returns:
        V: Estimated state values as numpy array of shape (n_states,)
    """
    # Your code here
    pass
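
Reference Sketch

One possible backward-view implementation, assuming the terminal state's value is 0 (i.e., after the last (state, reward) pair the episode ends), with traces reset at the start of each episode:

```python
import numpy as np

def td_lambda_prediction(
    episodes: list[list[tuple[int, float]]],
    n_states: int,
    gamma: float,
    lambd: float,
    alpha: float
) -> np.ndarray:
    V = np.zeros(n_states)
    for episode in episodes:
        # Accumulating eligibility traces, reset at the start of each episode.
        e = np.zeros(n_states)
        for t, (s, r) in enumerate(episode):
            # Bootstrap from the next state's value; 0 if the episode ends here.
            v_next = V[episode[t + 1][0]] if t + 1 < len(episode) else 0.0
            delta = r + gamma * v_next - V[s]  # TD error
            e *= gamma * lambd                 # decay all traces
            e[s] += 1.0                        # accumulate for the current state
            V += alpha * delta * e             # update every state at once
    return V
```

On the example above, the first two TD errors are 1 and the last is 0, so only states 0 and 1 change: V(0) = 0.1 + 0.1·0.72 = 0.172 and V(1) = 0.1.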
The AI Interview - Master AI/ML Interviews