Evaluate Expected Value in a Markov Decision Process

Medium
Reinforcement Learning

Given an MDP (Markov Decision Process) specified by a set of states, actions, transition probabilities, and rewards, write a function to compute the expected value of taking a particular action in a particular state, assuming a discount factor gamma. Formally, compute Q(s, a) = sum over next states s' of P(s' | s, a) * (R(s, a, s') + gamma * V(s')). Use only NumPy.

Examples

Example 1:
Input:
states = [0, 1]
actions = ['a', 'b']
P = {0: {'a': {0: 0.5, 1: 0.5}, 'b': {0: 1.0}},
     1: {'a': {1: 1.0}, 'b': {0: 0.7, 1: 0.3}}}
R = {0: {'a': {0: 5, 1: 10}, 'b': {0: 2}},
     1: {'a': {1: 0}, 'b': {0: -1, 1: 3}}}
gamma = 0.9
V = np.array([1.0, 2.0])
print(expected_action_value(0, 'a', P, R, V, gamma))
Output: 8.85
Explanation: For state 0 and action 'a':
- Next state 0: 0.5 * (5 + 0.9 * 1.0) = 0.5 * 5.9 = 2.95
- Next state 1: 0.5 * (10 + 0.9 * 2.0) = 0.5 * 11.8 = 5.9
Total: 2.95 + 5.9 = 8.85

Starter Code

import numpy as np

def expected_action_value(state, action, P, R, V, gamma):
    """
    Computes the expected value of taking `action` in `state` for the given MDP.
    Args:
      state: int or str, the current state
      action: str, the chosen action
      P: dict of dicts, P[s][a][s'] = prob of next state s' if a in s
      R: dict of dicts, R[s][a][s'] = reward for (s, a, s')
      V: np.ndarray, the value function vector, indexed by state
      gamma: float, discount factor
    Returns:
      float: expected value
    """
    # Your code here
    pass
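One way to fill in the stub (a minimal sketch, not an official solution) is a direct probability-weighted sum over the transition dictionary, following the Q(s, a) formula from the problem statement:

```python
import numpy as np

def expected_action_value(state, action, P, R, V, gamma):
    """Expected value of taking `action` in `state`:
    sum over s' of P[s][a][s'] * (R[s][a][s'] + gamma * V[s'])."""
    total = 0.0
    # Iterate over all next states reachable under this (state, action) pair.
    for next_state, prob in P[state][action].items():
        total += prob * (R[state][action][next_state] + gamma * V[next_state])
    return total
```

With the data from Example 1, `expected_action_value(0, 'a', P, R, V, 0.9)` evaluates to 8.85. A vectorized variant (stacking the probabilities and rewards into NumPy arrays and taking a dot product) is equivalent here; the dictionary-based loop is simpler because the transition structure is sparse.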