Implement Q-Learning Algorithm for MDPs

Medium
Reinforcement Learning

Write a function that implements the Q-Learning algorithm to learn the optimal Q-table for a given Markov Decision Process (MDP). The function should take the number of states, number of actions, transition probabilities matrix, rewards matrix, list of terminal states, learning rate, discount factor, epsilon for exploration, and the number of episodes as inputs. Use these parameters to iteratively update the Q-table with the Q-Learning update rule, employing an epsilon-greedy strategy for action selection. Ensure that each episode starts from a randomly chosen non-terminal state and ends as soon as a terminal state is reached.

Constraints:

  • num_states: Integer greater than or equal to 1.
  • num_actions: Integer greater than or equal to 1.
  • P: A 3D NumPy array of shape (num_states, num_actions, num_states) where each element is a probability between 0 and 1, and each distribution P[s, a] over next states sums to 1.
  • R: A 2D NumPy array of shape (num_states, num_actions) with float or integer values.
  • terminal_states: A list or NumPy array of integers, each between 0 and num_states - 1, with no duplicates.
  • alpha: A float between 0 and 1.
  • gamma: A float between 0 and 1.
  • epsilon: A float between 0 and 1.
  • num_episodes: An integer greater than or equal to 1.

The function should return a 2D NumPy array of shape (num_states, num_actions) representing the learned Q-table.
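As a brief illustration of how these inputs are used, the two sampling steps of each iteration can be written as small helpers (the helper names `select_action` and `sample_next_state` are illustrative, not part of the required interface):

```python
import numpy as np

# Illustrative helpers only. Assumes Q is a (num_states, num_actions)
# array and P[s, a] is a probability distribution over next states.
def select_action(Q, s, epsilon, num_actions):
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)  # explore: random action
    return int(np.argmax(Q[s]))                # exploit: greedy action

def sample_next_state(P, s, a, num_states):
    # Draw the next state according to the transition distribution P[s, a].
    return int(np.random.choice(num_states, p=P[s, a]))
```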

Examples

Example 1:
Input: import numpy as np; np.random.seed(42); P = np.array([[[0, 1], [1, 0]], [[1, 0], [1, 0]]]); R = np.array([[1, 0], [0, 0]]); terminal_states = [1]; print(q_learning(2, 2, P, R, terminal_states, 0.1, 0.9, 0.1, 10))
Output: [[0.65132156, 0.052902], [0., 0.]]
Explanation: The Q-Learning algorithm initializes the Q-table with zeros and updates it over 10 episodes. Each episode starts from a random non-terminal state, selects actions via an epsilon-greedy policy, samples next states from the transition probabilities (P), reads rewards from (R), and applies the update rule Q(s, a) += alpha * (reward + gamma * max(Q(next_state)) - Q(s, a)). With the random seed fixed to 42, this process produces the Q-table [[0.65132156, 0.052902], [0., 0.]]. The second row remains zero because state 1 is terminal, so no actions are ever taken from it.

Starter Code

import numpy as np
def q_learning(num_states, num_actions, P, R, terminal_states, alpha, gamma, epsilon, num_episodes):
    # Your code here
    pass
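One possible way to fill in the starter function (a sketch following the update rule from the explanation above, not necessarily the official reference solution):

```python
import numpy as np

def q_learning(num_states, num_actions, P, R, terminal_states,
               alpha, gamma, epsilon, num_episodes):
    Q = np.zeros((num_states, num_actions))
    terminal = set(terminal_states)
    non_terminal = [s for s in range(num_states) if s not in terminal]
    for _ in range(num_episodes):
        # Start each episode from a random non-terminal state.
        s = np.random.choice(non_terminal)
        while s not in terminal:
            # Epsilon-greedy: explore with probability epsilon, else exploit.
            if np.random.rand() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = int(np.argmax(Q[s]))
            # Sample the next state from the transition distribution P[s, a].
            s_next = int(np.random.choice(num_states, p=P[s, a]))
            reward = R[s, a]
            # Q-Learning update rule.
            Q[s, a] += alpha * (reward + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```

Note that terminal states are never updated, so their rows stay at zero, matching the example output. Exact Q-values depend on the order of random draws, so a different sequence of RNG calls can produce different (but equally valid) numbers.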
The AI Interview - Master AI/ML Interviews