Implement policy evaluation for a 5x5 gridworld. Given a policy (mapping each state to action probabilities), compute the state-value function V(s) for each cell using the Bellman expectation equation. The agent can move up, down, left, or right, receiving a constant reward of -1 for each move; a move that would leave the grid leaves the agent in its current cell. Terminal states (the four corners) are fixed at 0. Iterate until the largest change in V across all states is less than a given threshold. Use only Python built-ins; no external RL libraries.
Examples
Example 1:
Input:
policy = {(i, j): {'up': 0.25, 'down': 0.25, 'left': 0.25, 'right': 0.25} for i in range(5) for j in range(5)}
gamma = 0.9
threshold = 0.001
V = gridworld_policy_evaluation(policy, gamma, threshold)
print(round(V[2][2], 4))

Output:
-7.0902

Explanation: The policy is uniform (equal chance of each move). The agent receives -1 per step. After iterative updates, the center state value converges to about -7.09, and corners remain at 0.
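For reference, the Bellman expectation update described above can be sketched as follows. This is a minimal sketch, not the official solution: it assumes (as the example implies) that a move off the grid leaves the agent in place, and it updates values in place during each sweep (Gauss-Seidel style), which converges to the same fixed point as a synchronous sweep.

```python
def gridworld_policy_evaluation(policy, gamma, threshold):
    n = 5
    terminals = {(0, 0), (0, n - 1), (n - 1, 0), (n - 1, n - 1)}
    moves = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
    V = [[0.0] * n for _ in range(n)]
    while True:
        delta = 0.0
        for i in range(n):
            for j in range(n):
                if (i, j) in terminals:
                    continue  # terminal values stay fixed at 0
                new_v = 0.0
                for action, prob in policy[(i, j)].items():
                    di, dj = moves[action]
                    ni, nj = i + di, j + dj
                    # assumption: moving off the grid leaves the agent in place
                    if not (0 <= ni < n and 0 <= nj < n):
                        ni, nj = i, j
                    # Bellman expectation backup: reward -1 plus discounted successor value
                    new_v += prob * (-1 + gamma * V[ni][nj])
                delta = max(delta, abs(new_v - V[i][j]))
                V[i][j] = new_v
        if delta < threshold:
            return V
```

Running this with the uniform policy from Example 1 reproduces a center value near -7.09; note that the convergence threshold bounds the per-sweep change, not the distance to the exact fixed point, so the last decimal places can vary slightly between update schemes.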
Starter Code
def gridworld_policy_evaluation(policy: dict, gamma: float, threshold: float) -> list[list[float]]:
    """
    Evaluate the state-value function for a policy on a 5x5 gridworld.

    Args:
        policy: dict mapping (row, col) to action probability dicts
        gamma: discount factor
        threshold: convergence threshold

    Returns:
        5x5 list of floats
    """
    # Your code here
    pass