Write a function that performs one step of value iteration for a given Markov Decision Process (MDP) using the Bellman equation. The function should update the state-value function V(s) for each state based on possible actions, transition probabilities, rewards, and the discount factor gamma. Only use NumPy.
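The update being asked for is the standard Bellman optimality backup. In the (prob, next_state, reward, done) convention used in the example below, terminal transitions (done = True) contribute no future value:

```latex
V_{k+1}(s) = \max_{a} \sum_{(p,\, s',\, r,\, \text{done})} p \left( r + \gamma \, V_k(s') \, [1 - \text{done}] \right)
```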
Examples
Example 1:
Input:
import numpy as np
transitions = [
{0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
{0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]}
]
V = np.array([0.0, 0.0])
gamma = 0.9
new_V = bellman_update(V, transitions, gamma)
print(np.round(new_V, 2))

Output:
[1. 1.]

Explanation: For state 0, the best action (action 1) moves to state 1 with a reward of 1, so its value becomes 1. For state 1, action 1 gives a reward of 1 and ends the episode (done = True), so no discounted future value is added and its value is 1.
Starter Code
import numpy as np

def bellman_update(V, transitions, gamma):
    """
    Perform one step of value iteration using the Bellman equation.

    Args:
        V: np.ndarray, state values, shape (n_states,)
        transitions: list of dicts. transitions[s][a] is a list of
            (prob, next_state, reward, done) tuples.
        gamma: float, discount factor

    Returns:
        np.ndarray, updated state values
    """
    # TODO: Implement Bellman update
    pass
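One possible solution, sketched against the (prob, next_state, reward, done) convention shown in the example above (a reference implementation, not the only valid one):

```python
import numpy as np

def bellman_update(V, transitions, gamma):
    """One sweep of value iteration over all states.

    For each state, compute the expected return of every action
    (zeroing future value on terminal transitions) and keep the max.
    """
    new_V = np.zeros_like(V, dtype=float)
    for s, actions in enumerate(transitions):
        q_values = []
        for a, outcomes in actions.items():
            # Q(s, a) = sum over outcomes of p * (r + gamma * V(s') * (1 - done))
            q = sum(prob * (reward + gamma * V[next_s] * (not done))
                    for prob, next_s, reward, done in outcomes)
            q_values.append(q)
        new_V[s] = max(q_values)
    return new_V

# Running the example from the problem statement:
transitions = [
    {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
    {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
]
V = np.array([0.0, 0.0])
new_V = bellman_update(V, transitions, gamma=0.9)
print(np.round(new_V, 2))  # -> [1. 1.]
```

Note that the `done` flag matters for state 1: without the `(1 - done)` factor, repeated sweeps would keep bootstrapping value past the terminal transition.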