Given an MDP (Markov Decision Process) specified by a set of states, actions, transition probabilities, and rewards, write a function to compute the expected value of taking a particular action in a particular state, assuming a discount factor gamma. Use only NumPy.
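The quantity being asked for is the standard one-step Bellman backup for an action value. Using the same symbols as the problem (P, R, V, gamma), it can be written as:

```latex
Q(s, a) = \sum_{s'} P(s' \mid s, a)\,\bigl[\, R(s, a, s') + \gamma\, V(s') \,\bigr]
```

The sum runs over every next state s' reachable from s under action a.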
Examples
Example 1:
Input:
states = [0, 1]
actions = ['a', 'b']
P = {0: {'a': {0: 0.5, 1: 0.5}, 'b': {0: 1.0}}, 1: {'a': {1: 1.0}, 'b': {0: 0.7, 1: 0.3}}}
R = {0: {'a': {0: 5, 1: 10}, 'b': {0: 2}}, 1: {'a': {1: 0}, 'b': {0: -1, 1: 3}}}
gamma = 0.9
V = np.array([1.0, 2.0])
print(expected_action_value(0, 'a', P, R, V, gamma))
Output:
8.85
Explanation: For state 0 and action 'a':
- Next state 0: 0.5 * (5 + 0.9*1.0) = 0.5 * 5.9 = 2.95
- Next state 1: 0.5 * (10 + 0.9*2.0) = 0.5 * 11.8 = 5.9
Total: 2.95 + 5.9 = 8.85
Starter Code
import numpy as np

def expected_action_value(state, action, P, R, V, gamma):
    """
    Computes the expected value of taking `action` in `state` for the given MDP.

    Args:
        state: int or str, the current state
        action: str, the chosen action
        P: dict of dicts, P[s][a][s'] = probability of next state s' if a is taken in s
        R: dict of dicts, R[s][a][s'] = reward for the transition (s, a, s')
        V: np.ndarray, the value function vector, indexed by state
        gamma: float, discount factor

    Returns:
        float: the expected value
    """
    # Your code here
    pass
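One possible implementation (a sketch, not necessarily the intended reference solution): gather the successor probabilities, rewards, and values into NumPy arrays, then take the probability-weighted sum of reward plus discounted next-state value. This assumes states are integers usable as indices into V, as in the example.

```python
import numpy as np

def expected_action_value(state, action, P, R, V, gamma):
    """Expected value of taking `action` in `state`:
    sum over s' of P[s][a][s'] * (R[s][a][s'] + gamma * V[s'])."""
    next_states = list(P[state][action].keys())
    probs = np.array([P[state][action][s2] for s2 in next_states])
    rewards = np.array([R[state][action][s2] for s2 in next_states])
    # Assumes integer states that index directly into V, as in the example.
    values = np.array([V[s2] for s2 in next_states])
    return float(np.sum(probs * (rewards + gamma * values)))
```

On the example above, `expected_action_value(0, 'a', P, R, V, 0.9)` evaluates to 0.5 * (5 + 0.9 * 1.0) + 0.5 * (10 + 0.9 * 2.0) = 8.85.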