Write a function that performs one step of value iteration for a given Markov Decision Process (MDP) using the Bellman equation. The function should update the state-value function V(s) for each state based on possible actions, transition probabilities, rewards, and the discount factor gamma. Only use NumPy.
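The update being asked for is the standard Bellman optimality backup. In the (prob, next_state, reward, done) convention used in the example below, terminal transitions (done = True) contribute no future value:

```latex
V_{k+1}(s) = \max_{a} \sum_{(p,\, s',\, r,\, \text{done})} p \left( r + \gamma \, V_k(s') \, [1 - \text{done}] \right)
```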
Examples
Example 1:
Input:
import numpy as np
transitions = [
{0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
{0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]}
]
V = np.array([0.0, 0.0])
gamma = 0.9
new_V = bellman_update(V, transitions, gamma)
print(np.round(new_V, 2))

Output:
[1. 1.]

Explanation: For state 0, the best action (action 1) moves to state 1 with a reward of 1, so its value becomes 1. For state 1, action 1 gives a reward of 1 and ends the episode (done = True), so no discounted future value is added and its value is 1.
Starter Code
import numpy as np

def bellman_update(V, transitions, gamma):
    """
    Perform one step of value iteration using the Bellman equation.

    Args:
        V: np.ndarray, state values, shape (n_states,)
        transitions: list of dicts. transitions[s][a] is a list of
            (prob, next_state, reward, done) tuples.
        gamma: float, discount factor

    Returns:
        np.ndarray, updated state values
    """
    # TODO: Implement Bellman update
    pass
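One possible solution, sketched against the (prob, next_state, reward, done) convention shown in the example above (a reference implementation, not the only valid one):

```python
import numpy as np

def bellman_update(V, transitions, gamma):
    """One sweep of value iteration over all states.

    For each state, compute the expected return of every action
    (zeroing future value on terminal transitions) and keep the max.
    """
    new_V = np.zeros_like(V, dtype=float)
    for s, actions in enumerate(transitions):
        q_values = []
        for a, outcomes in actions.items():
            # Q(s, a) = sum over outcomes of p * (r + gamma * V(s') * (1 - done))
            q = sum(prob * (reward + gamma * V[next_s] * (not done))
                    for prob, next_s, reward, done in outcomes)
            q_values.append(q)
        new_V[s] = max(q_values)
    return new_V

# Running the example from the problem statement:
transitions = [
    {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
    {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
]
V = np.array([0.0, 0.0])
new_V = bellman_update(V, transitions, gamma=0.9)
print(np.round(new_V, 2))  # -> [1. 1.]
```

Note that the `done` flag matters for state 1: without the `(1 - done)` factor, repeated sweeps would keep bootstrapping value past the terminal transition.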