Upper Confidence Bound (UCB) Action Selection

Easy
Reinforcement Learning

Implement the Upper Confidence Bound (UCB) action selection strategy for the multi-armed bandit problem. Write a function that, given the number of times each action has been selected, the average reward of each action, and the current timestep t, returns the action chosen by the UCB1 rule: select the action a that maximizes values[a] + c * sqrt(ln(t) / counts[a]), where the square-root term is the exploration bonus that shrinks as an action is tried more often. Use only NumPy.

Examples

Example 1:
Input:
  import numpy as np
  counts = np.array([1, 1, 1, 1])
  values = np.array([1.0, 2.0, 1.5, 0.5])
  t = 4
  c = 2.0
  print(ucb_action(counts, values, t, c))
Output: 1
Explanation: At t=4, each action has been tried exactly once, so every action has the same exploration bonus c * sqrt(ln(4) / 1). The UCB scores therefore differ only by the average rewards, and action 1 wins with the highest average (2.0).

Starter Code

import numpy as np

def ucb_action(counts, values, t, c):
    """
    Choose an action using the UCB1 formula.
    Args:
      counts (np.ndarray): Number of times each action has been chosen
      values (np.ndarray): Average reward of each action
      t (int): Current timestep (starts from 1)
      c (float): Exploration coefficient
    Returns:
      int: Index of action to select
    """
    # TODO: Implement the UCB action selection
    pass
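
One possible reference solution (a sketch, not the only valid one): compute the UCB1 score values + c * sqrt(ln(t) / counts) for every action at once with NumPy and take the argmax. Actions with counts of zero would divide by zero, so this version selects any untried action first, which is the standard convention (an untried action has an infinite confidence bound).

```python
import numpy as np

def ucb_action(counts, values, t, c):
    """
    Choose an action using the UCB1 formula.
    Args:
      counts (np.ndarray): Number of times each action has been chosen
      values (np.ndarray): Average reward of each action
      t (int): Current timestep (starts from 1)
      c (float): Exploration coefficient
    Returns:
      int: Index of action to select
    """
    # Untried actions have an effectively infinite upper bound:
    # select the first one before computing any scores.
    untried = np.where(counts == 0)[0]
    if untried.size > 0:
        return int(untried[0])
    # UCB1 score: exploitation term (average reward) plus
    # exploration bonus that decays with the visit count.
    ucb_scores = values + c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(ucb_scores))
```

On the example above, every count is 1, so the bonus term is identical across actions and the argmax falls on the highest average reward, action 1.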