Implement the policy gradient estimator using the REINFORCE algorithm. The policy is parameterized by a 2D NumPy array theta of shape (num_states, num_actions), and the policy for each state s is the softmax over theta[s, :]. Given a list of episodes (each a list of (state, action, reward) tuples), compute the gradient of the log-policy at each visited (state, action) pair, weight it by the return from that time step onward, sum over time steps, and average over episodes.
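For a softmax policy, the gradient of log pi(a|s) with respect to theta[s, :] has a closed form: the one-hot vector for the taken action minus the policy probabilities. A minimal sketch of that identity (helper names are illustrative, not part of the required API):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    # Subtract the max before exponentiating for numerical stability.
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def grad_log_softmax(theta_row: np.ndarray, action: int) -> np.ndarray:
    # d/d theta[s, :] of log pi(a | s) = onehot(a) - pi(s, :)
    pi = softmax(theta_row)
    grad = -pi
    grad[action] += 1.0
    return grad
```

With theta[s, :] = [0, 0] the policy is uniform, so grad_log_softmax(np.zeros(2), 1) is [-0.5, 0.5], matching the per-step contributions in the example below.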
Examples
Example 1:
Input:
theta = np.zeros((2,2)); episodes = [[(0,1,0), (1,0,1)], [(0,0,0)]]
Output:
[[-0.25, 0.25], [0.25, -0.25]]
Explanation: In episode 1 the reward of 1 at t=1 gives a return of 1 at both t=0 and t=1, so both visited (state, action) pairs receive nonzero gradient; episode 2 has zero return and contributes nothing. The result is averaged over the two episodes.
Starter Code
import numpy as np
def compute_policy_gradient(theta: np.ndarray, episodes: list[list[tuple[int, int, float]]]) -> np.ndarray:
"""
Estimate the policy gradient using REINFORCE.
Args:
theta: (num_states x num_actions) policy parameters.
episodes: List of episodes, where each episode is a list of (state, action, reward).
Returns:
Average policy gradient (same shape as theta).
"""
# Your code here
passPython3
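One possible implementation sketch, using undiscounted reward-to-go returns (consistent with the example above); the variable names beyond the given signature are my own:

```python
import numpy as np

def compute_policy_gradient(theta: np.ndarray, episodes: list) -> np.ndarray:
    grad = np.zeros_like(theta, dtype=float)
    for episode in episodes:
        rewards = [r for (_, _, r) in episode]
        # Undiscounted return-to-go: G_t = sum of rewards from t onward.
        G = np.cumsum(rewards[::-1])[::-1]
        for t, (s, a, _) in enumerate(episode):
            # Softmax policy for state s, computed stably.
            z = np.exp(theta[s] - np.max(theta[s]))
            pi = z / z.sum()
            # grad log pi(a|s) = onehot(a) - pi, weighted by the return G_t.
            g = -pi * G[t]
            g[a] += G[t]
            grad[s] += g
    # Average the summed per-episode gradients over episodes.
    return grad / len(episodes)
```

On the example input this returns [[-0.25, 0.25], [0.25, -0.25]], as expected.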