Implement the policy gradient estimator using the REINFORCE algorithm. The policy is parameterized by a 2D NumPy array theta of shape (num_states, num_actions), and the policy for each state s is the softmax over theta[s, :]. Given a list of episodes (each a list of (state, action, reward) tuples), compute the gradient of the log-policy at each visited (state, action) pair, weight it by the return from that time step onward, sum over time steps, and average over episodes.
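For a softmax policy, the gradient of log pi(a|s) with respect to theta[s, :] has a closed form: the one-hot vector for the taken action minus the policy probabilities. A minimal sketch of that identity (helper names are illustrative, not part of the required API):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    # Subtract the max before exponentiating for numerical stability.
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def grad_log_softmax(theta_row: np.ndarray, action: int) -> np.ndarray:
    # d/d theta[s, :] of log pi(a | s) = onehot(a) - pi(s, :)
    pi = softmax(theta_row)
    grad = -pi
    grad[action] += 1.0
    return grad
```

With theta[s, :] = [0, 0] the policy is uniform, so grad_log_softmax(np.zeros(2), 1) is [-0.5, 0.5], matching the per-step contributions in the example below.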
Examples
Example 1:
Input:
theta = np.zeros((2,2)); episodes = [[(0,1,0), (1,0,1)], [(0,0,0)]]
Output:
[[-0.25, 0.25], [0.25, -0.25]]
Explanation: In episode 1 the reward of 1 at t=1 gives a return of 1 at both t=0 and t=1, so both visited (state, action) pairs receive nonzero gradient; episode 2 has zero return and contributes nothing. The result is averaged over the two episodes.
Starter Code
import numpy as np
def compute_policy_gradient(theta: np.ndarray, episodes: list[list[tuple[int, int, float]]]) -> np.ndarray:
"""
Estimate the policy gradient using REINFORCE.
Args:
theta: (num_states x num_actions) policy parameters.
episodes: List of episodes, where each episode is a list of (state, action, reward).
Returns:
Average policy gradient (same shape as theta).
"""
# Your code here
passPython3
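One possible implementation sketch, using undiscounted reward-to-go returns (consistent with the example above); the variable names beyond the given signature are my own:

```python
import numpy as np

def compute_policy_gradient(theta: np.ndarray, episodes: list) -> np.ndarray:
    grad = np.zeros_like(theta, dtype=float)
    for episode in episodes:
        rewards = [r for (_, _, r) in episode]
        # Undiscounted return-to-go: G_t = sum of rewards from t onward.
        G = np.cumsum(rewards[::-1])[::-1]
        for t, (s, a, _) in enumerate(episode):
            # Softmax policy for state s, computed stably.
            z = np.exp(theta[s] - np.max(theta[s]))
            pi = z / z.sum()
            # grad log pi(a|s) = onehot(a) - pi, weighted by the return G_t.
            g = -pi * G[t]
            g[a] += G[t]
            grad[s] += g
    # Average the summed per-episode gradients over episodes.
    return grad / len(episodes)
```

On the example input this returns [[-0.25, 0.25], [0.25, -0.25]], as expected.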