Implement the Upper Confidence Bound (UCB) action selection strategy for the multi-armed bandit problem. Write a function that, given the current number of times each action has been selected, the average reward for each action, and the current timestep t, returns the action to select according to the UCB1 formula: choose the action a that maximizes Q(a) + c * sqrt(ln(t) / N(a)), where Q(a) is the average reward of action a, N(a) is the number of times it has been selected, and c is the exploration coefficient. Use only NumPy.
Examples
Example 1:
Input:
import numpy as np
counts = np.array([1, 1, 1, 1])
values = np.array([1.0, 2.0, 1.5, 0.5])
t = 4
c = 2.0
print(ucb_action(counts, values, t, c))
Output:
1
Explanation: At t=4, each action has been tried once, so all actions have the same confidence bound; action 1 has the highest average reward (2.0) and is therefore chosen.
Starter Code
import numpy as np
def ucb_action(counts, values, t, c):
    """
    Choose an action using the UCB1 formula.

    Args:
        counts (np.ndarray): Number of times each action has been chosen
        values (np.ndarray): Average reward of each action
        t (int): Current timestep (starts from 1)
        c (float): Exploration coefficient

    Returns:
        int: Index of action to select
    """
    # TODO: Implement the UCB action selection
    pass
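One way to complete the starter code is sketched below. It computes the UCB1 score for each action and returns the argmax. The zero-count guard is an assumption beyond the stated example (where every count is at least 1): an action that has never been tried has an undefined bonus, so it is selected first.

```python
import numpy as np

def ucb_action(counts, values, t, c):
    """Choose an action using the UCB1 formula."""
    counts = np.asarray(counts, dtype=float)
    values = np.asarray(values, dtype=float)

    # Assumption: any action never tried yet is selected first,
    # since its confidence bonus (division by zero) is undefined.
    untried = np.where(counts == 0)[0]
    if untried.size > 0:
        return int(untried[0])

    # UCB1 score: average reward plus exploration bonus.
    ucb = values + c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(ucb))
```

On the example above, every action has the same exploration bonus (all counts equal 1), so the function reduces to picking the highest average reward and returns 1. Ties in np.argmax are broken by the lowest index.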