Implement an efficient method to update the mean reward for a k-armed bandit action after receiving each new reward, without storing the full history of rewards. Given the previous mean estimate (Q_prev), the number of times the action has been selected (k), and a new reward (R), compute the updated mean using the incremental formula.
Note: Recomputing a regular mean by storing all past rewards uses memory that grows without bound. Your solution should use only the previous mean, the count, and the new reward.
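As a quick sanity check (not part of the required solution), the incremental update can be compared against the batch mean on a small reward list:

```python
rewards = [2.0, 6.0, 3.0, 7.0]

Q = 0.0
for k, R in enumerate(rewards, start=1):
    Q += (R - Q) / k  # incremental update after the k-th reward

# The incremental result matches the batch mean (up to floating-point error).
print(round(Q, 6))                              # 4.5
print(round(sum(rewards) / len(rewards), 6))    # 4.5
```

Note that the loop never stores more than the running estimate `Q` and the loop counter, which is exactly the property the problem asks for.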
Examples
Example 1:
Input:
Q_prev = 2.0
k = 2
R = 6.0
new_Q = incremental_mean(Q_prev, k, R)
print(round(new_Q, 2))
Output:
4.0
Explanation: The updated mean is Q_prev + (1/k) * (R - Q_prev) = 2.0 + (1/2) * (6.0 - 2.0) = 2.0 + 2.0 = 4.0
Starter Code
def incremental_mean(Q_prev, k, R):
    """
    Q_prev: previous mean estimate (float)
    k: number of times the action has been selected (int)
    R: new observed reward (float)
    Returns: new mean estimate (float)
    """
    # Your code here
    pass
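One possible solution, directly applying the incremental formula from the explanation above (a sketch, not the only valid answer):

```python
def incremental_mean(Q_prev, k, R):
    """
    Q_prev: previous mean estimate (float)
    k: number of times the action has been selected (int)
    R: new observed reward (float)
    Returns: new mean estimate (float)
    """
    # New mean = old mean + (1/k) * (new reward - old mean).
    # Uses O(1) memory: only the previous estimate, the count, and the reward.
    return Q_prev + (R - Q_prev) / k

# Example from the problem statement: mean of [2.0, 6.0] is 4.0
print(round(incremental_mean(2.0, 2, 6.0), 2))  # 4.0
```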