Implement a function to compute the Temporal Difference (TD) error for a single state transition in reinforcement learning.
The TD error measures how much the current value estimate differs from a better estimate that incorporates the immediate reward and the bootstrapped value of the next state.
Given:
v_s: The current estimate of the value for state s
reward: The immediate reward received after transitioning from state s
v_s_prime: The current estimate of the value for the next state s'
gamma: The discount factor (between 0 and 1)
done: A boolean indicating if the next state is terminal
Return the TD error as a float. Note that when the episode terminates (done=True), there is no future value to bootstrap from since the episode ends.
Only use NumPy.
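The terminal-state rule above can be folded directly into the TD target: when done is True, the bootstrap term gamma * V(s') is zeroed out. A minimal sketch of that target computation (the helper name td_target is illustrative, not part of the required interface):

```python
def td_target(reward: float, v_s_prime: float, gamma: float, done: bool) -> float:
    # Bootstrap from V(s') only while the episode continues;
    # a terminal next state contributes no future value.
    return reward + gamma * v_s_prime * (0.0 if done else 1.0)
```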
Examples
Example 1:
Input:
v_s=5.0, reward=1.0, v_s_prime=10.0, gamma=0.9, done=False
Output:
5.0
Explanation: TD target = reward + gamma * V(s') = 1.0 + 0.9 * 10.0 = 10.0. TD error = TD target - V(s) = 10.0 - 5.0 = 5.0. The positive TD error indicates the current value estimate was too low.
Starter Code
import numpy as np
def compute_td_error(v_s: float, reward: float, v_s_prime: float, gamma: float, done: bool) -> float:
"""
Compute the Temporal Difference (TD) error for a single transition.
Args:
v_s: Current state value estimate V(s)
reward: Immediate reward received
v_s_prime: Next state value estimate V(s')
gamma: Discount factor (0 <= gamma <= 1)
done: True if s' is a terminal state
Returns:
The TD error delta
"""
# Your code here
    pass
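One possible implementation, sketched under the problem's own definitions (TD error = target - V(s), with the bootstrap term dropped on terminal transitions):

```python
import numpy as np


def compute_td_error(v_s: float, reward: float, v_s_prime: float,
                     gamma: float, done: bool) -> float:
    # TD target: immediate reward plus discounted next-state value,
    # with no bootstrapping when s' is terminal.
    target = reward + gamma * v_s_prime * (0.0 if done else 1.0)
    # TD error: how far the target is from the current estimate.
    return float(np.float64(target) - np.float64(v_s))
```

On Example 1 this returns 5.0, matching the expected output.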