Implement the gradient bandit algorithm for action selection in a multi-armed bandit setting. Write a class GradientBandit that maintains a set of action preferences and updates them after each reward. The class should provide a method select_action() to sample an action using the softmax of preferences, and a method update(action, reward) to update preferences using the gradient ascent update rule. Use only NumPy.
Examples
Example 1:
Input:
import numpy as np
gb = GradientBandit(num_actions=3, alpha=0.1)
a = gb.select_action()
gb.update(a, reward=1.0)
probs = gb.softmax()
print(np.round(probs, 2).tolist())
Output:
[0.32, 0.34, 0.34]
Explanation: After a positive reward, the selected action's preference is increased, boosting its softmax probability. Note that select_action() samples randomly, so which index is boosted varies between runs.
Starter Code
import numpy as np

class GradientBandit:
    def __init__(self, num_actions, alpha=0.1):
        """
        num_actions (int): Number of possible actions
        alpha (float): Step size for preference updates
        """
        self.num_actions = num_actions
        self.alpha = alpha
        self.preferences = np.zeros(num_actions)
        self.avg_reward = 0.0
        self.time = 0

    def softmax(self):
        # Compute softmax probabilities from preferences
        pass

    def select_action(self):
        # Sample an action according to the softmax distribution
        pass

    def update(self, action, reward):
        # Update action preferences using the gradient ascent update
        pass
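One way to fill in the starter code is sketched below. It uses the standard gradient bandit update H(a) += alpha * (R - R̄) * (1 - pi(a)) for the selected action and H(a') -= alpha * (R - R̄) * pi(a') for the others, where R̄ is an incremental average-reward baseline. The ordering chosen here (apply the preference update with the old baseline, then fold the new reward into the baseline) is an assumption; a grader's reference solution may update the baseline first.

```python
import numpy as np

# A minimal sketch, not necessarily the reference solution: the baseline
# is updated AFTER the preference step, which is one common convention.
class GradientBandit:
    def __init__(self, num_actions, alpha=0.1):
        self.num_actions = num_actions
        self.alpha = alpha
        self.preferences = np.zeros(num_actions)
        self.avg_reward = 0.0  # running-average baseline R̄
        self.time = 0          # number of updates seen so far

    def softmax(self):
        # Subtract the max preference for numerical stability.
        z = self.preferences - np.max(self.preferences)
        e = np.exp(z)
        return e / e.sum()

    def select_action(self):
        # Sample an action index according to the softmax distribution.
        return int(np.random.choice(self.num_actions, p=self.softmax()))

    def update(self, action, reward):
        probs = self.softmax()
        # Gradient ascent on expected reward:
        #   H(a)  += alpha * (R - R̄) * (1 - pi(a))  for the taken action
        #   H(a') -= alpha * (R - R̄) * pi(a')       for every other action
        advantage = reward - self.avg_reward
        one_hot = np.zeros(self.num_actions)
        one_hot[action] = 1.0
        self.preferences += self.alpha * advantage * (one_hot - probs)
        # Fold the new reward into the incremental average baseline.
        self.time += 1
        self.avg_reward += (reward - self.avg_reward) / self.time
```

With zero initial preferences the softmax starts uniform, and a single above-baseline reward raises the taken action's probability while lowering the others by a shared amount.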