Implement the epsilon-greedy method for action selection in an n-armed bandit problem. Given a set of estimated action values (Q-values), select an action using the epsilon-greedy policy: with probability epsilon, choose an action uniformly at random; with probability 1 - epsilon, choose the action with the highest estimated value.
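The policy can also be written as a probability distribution over actions. The derivation below assumes n actions, a unique maximizing action, and a random branch that samples uniformly from all n actions, including the greedy one. Under those assumptions the greedy action is selected slightly more often than 1 - ε:

$$
\pi(a) =
\begin{cases}
1 - \varepsilon + \dfrac{\varepsilon}{n}, & a = \arg\max_{a'} Q(a') \\[6pt]
\dfrac{\varepsilon}{n}, & \text{otherwise}
\end{cases}
$$

For example, with Q = [0.5, 2.3, 1.7] and ε = 0.1, action 1 is chosen with probability 0.9 + 0.1/3 ≈ 0.933, and actions 0 and 2 are each chosen with probability 0.1/3 ≈ 0.033.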
Examples
Example 1:
Input:
Q = np.array([0.5, 2.3, 1.7])
epsilon = 0.0
action = epsilon_greedy(Q, epsilon)
print(action)

Output:
1

Explanation: With epsilon=0.0 (always greedy), the highest Q-value is 2.3 at index 1, so the function always returns 1.
Starter Code
import numpy as np

def epsilon_greedy(Q, epsilon=0.1):
    """
    Selects an action using epsilon-greedy policy.
    Q: np.ndarray of shape (n,) -- estimated action values
    epsilon: float in [0, 1]
    Returns: int, selected action index
    """
    # Your code here
    pass
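Below is a minimal sketch of one possible solution. It makes two assumptions that the problem statement does not fix. First, ties for the highest Q-value are broken by lowest index, which is how `np.argmax` behaves. Second, the random branch samples uniformly from all n actions. The optional `rng` parameter is a hypothetical addition that is not part of the starter signature. It accepts a NumPy `Generator` so that results can be reproduced.

```python
import numpy as np

def epsilon_greedy(Q, epsilon=0.1, rng=None):
    """
    Selects an action using an epsilon-greedy policy.
    Q: np.ndarray of shape (n,) -- estimated action values
    epsilon: float in [0, 1]
    rng: optional np.random.Generator (hypothetical extra parameter for reproducibility)
    Returns: int, selected action index
    """
    Q = np.asarray(Q, dtype=float)
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be in [0, 1]")
    if rng is None:
        rng = np.random.default_rng()
    # Explore: rng.random() is in [0, 1), so this branch runs with probability epsilon
    # (never when epsilon == 0.0, always when epsilon == 1.0).
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))  # uniform over all n actions
    # Exploit: return the index of the highest estimate (ties -> lowest index).
    return int(np.argmax(Q))


Q = np.array([0.5, 2.3, 1.7])
print(epsilon_greedy(Q, 0.0))  # 1
```

Checking `rng.random() < epsilon` draws from [0, 1). This guarantees that epsilon = 0.0 is purely greedy and epsilon = 1.0 is purely random, with no special-case code for either.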