Implement the gradient bandit algorithm for action selection in a multi-armed bandit setting. Write a class GradientBandit that maintains a set of action preferences and updates them after each reward. The class should provide a method select_action() to sample an action using the softmax of preferences, and a method update(action, reward) to update preferences using the gradient ascent update rule. Use only NumPy.
Examples
Example 1:
Input:
import numpy as np
gb = GradientBandit(num_actions=3, alpha=0.1)
a = gb.select_action()
gb.update(a, reward=1.0)
probs = gb.softmax()
print(np.round(probs, 2).tolist())
Output:
[0.32, 0.34, 0.34]
Explanation: After a positive reward, the selected action's preference is increased, boosting its softmax probability. Note that select_action() samples randomly, so which index is boosted varies between runs.
Starter Code
import numpy as np

class GradientBandit:
    def __init__(self, num_actions, alpha=0.1):
        """
        num_actions (int): Number of possible actions
        alpha (float): Step size for preference updates
        """
        self.num_actions = num_actions
        self.alpha = alpha
        self.preferences = np.zeros(num_actions)
        self.avg_reward = 0.0
        self.time = 0

    def softmax(self):
        # Compute softmax probabilities from preferences
        pass

    def select_action(self):
        # Sample an action according to the softmax distribution
        pass

    def update(self, action, reward):
        # Update action preferences using the gradient ascent update
        pass
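One way to fill in the starter code is sketched below. It uses the standard gradient bandit update H(a) += alpha * (R - R̄) * (1 - pi(a)) for the selected action and H(a') -= alpha * (R - R̄) * pi(a') for the others, where R̄ is an incremental average-reward baseline. The ordering chosen here (apply the preference update with the old baseline, then fold the new reward into the baseline) is an assumption; a grader's reference solution may update the baseline first.

```python
import numpy as np

# A minimal sketch, not necessarily the reference solution: the baseline
# is updated AFTER the preference step, which is one common convention.
class GradientBandit:
    def __init__(self, num_actions, alpha=0.1):
        self.num_actions = num_actions
        self.alpha = alpha
        self.preferences = np.zeros(num_actions)
        self.avg_reward = 0.0  # running-average baseline R̄
        self.time = 0          # number of updates seen so far

    def softmax(self):
        # Subtract the max preference for numerical stability.
        z = self.preferences - np.max(self.preferences)
        e = np.exp(z)
        return e / e.sum()

    def select_action(self):
        # Sample an action index according to the softmax distribution.
        return int(np.random.choice(self.num_actions, p=self.softmax()))

    def update(self, action, reward):
        probs = self.softmax()
        # Gradient ascent on expected reward:
        #   H(a)  += alpha * (R - R̄) * (1 - pi(a))  for the taken action
        #   H(a') -= alpha * (R - R̄) * pi(a')       for every other action
        advantage = reward - self.avg_reward
        one_hot = np.zeros(self.num_actions)
        one_hot[action] = 1.0
        self.preferences += self.alpha * advantage * (one_hot - probs)
        # Fold the new reward into the incremental average baseline.
        self.time += 1
        self.avg_reward += (reward - self.avg_reward) / self.time
```

With zero initial preferences the softmax starts uniform, and a single above-baseline reward raises the taken action's probability while lowering the others by a shared amount.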