Implement a function to compute the Temporal Difference (TD) error for a single state transition in reinforcement learning.
The TD error measures how much the current value estimate differs from a better estimate that incorporates the immediate reward and the bootstrapped value of the next state.
Given:
v_s: The current estimate of the value for state s
reward: The immediate reward received after transitioning from state s
v_s_prime: The current estimate of the value for the next state s'
gamma: The discount factor (between 0 and 1)
done: A boolean indicating if the next state is terminal
Return the TD error as a float. Note that when the episode terminates (done=True), there is no future value to bootstrap from since the episode ends.
Only use NumPy.
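The terminal-state rule above can be folded directly into the TD target: when done is True, the bootstrap term gamma * V(s') is zeroed out. A minimal sketch of that target computation (the helper name td_target is illustrative, not part of the required interface):

```python
def td_target(reward: float, v_s_prime: float, gamma: float, done: bool) -> float:
    # Bootstrap from V(s') only while the episode continues;
    # a terminal next state contributes no future value.
    return reward + gamma * v_s_prime * (0.0 if done else 1.0)
```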
Examples
Example 1:
Input:
v_s=5.0, reward=1.0, v_s_prime=10.0, gamma=0.9, done=False
Output:
5.0
Explanation: TD target = reward + gamma * V(s') = 1.0 + 0.9 * 10.0 = 10.0. TD error = TD target - V(s) = 10.0 - 5.0 = 5.0. The positive TD error indicates the current value estimate was too low.
Starter Code
import numpy as np
def compute_td_error(v_s: float, reward: float, v_s_prime: float, gamma: float, done: bool) -> float:
"""
Compute the Temporal Difference (TD) error for a single transition.
Args:
v_s: Current state value estimate V(s)
reward: Immediate reward received
v_s_prime: Next state value estimate V(s')
gamma: Discount factor (0 <= gamma <= 1)
done: True if s' is a terminal state
Returns:
The TD error delta
"""
# Your code here
    pass
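One possible implementation, sketched under the problem's own definitions (TD error = target - V(s), with the bootstrap term dropped on terminal transitions):

```python
import numpy as np


def compute_td_error(v_s: float, reward: float, v_s_prime: float,
                     gamma: float, done: bool) -> float:
    # TD target: immediate reward plus discounted next-state value,
    # with no bootstrapping when s' is terminal.
    target = reward + gamma * v_s_prime * (0.0 if done else 1.0)
    # TD error: how far the target is from the current estimate.
    return float(np.float64(target) - np.float64(v_s))
```

On Example 1 this returns 5.0, matching the expected output.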