Implement the self-attention mechanism, a fundamental component of transformer models used in NLP and computer vision.
Your task is to implement the self_attention function that computes attention output given Query (Q), Key (K), and Value (V) matrices.
The self-attention formula is: Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
where d_k is the dimensionality of the key vectors (number of columns in K).
Input:
- Q: Query matrix of shape (seq_len, d_k)
- K: Key matrix of shape (seq_len, d_k)
- V: Value matrix of shape (seq_len, d_v)
Output:
- Attention output matrix of shape (seq_len, d_v)
Steps:
- Compute attention scores: scores = Q * K^T / sqrt(d_k)
- Apply softmax row-wise to get attention weights (each row should sum to 1)
- Compute output: output = attention_weights * V
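The three steps above can be sketched in NumPy as follows (one possible reference implementation, including a max-subtraction trick for numerical stability that the problem does not require):

```python
import numpy as np

def self_attention_sketch(Q, K, V):
    d_k = K.shape[1]
    # Step 1: scaled dot-product scores, shape (seq_len, seq_len)
    scores = Q @ K.T / np.sqrt(d_k)
    # Step 2: row-wise softmax (subtracting the row max avoids overflow)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Step 3: weighted sum of the value vectors
    return weights @ V
```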
Note: The helper function compute_qkv is provided to compute Q, K, V from input X and weight matrices.
Examples
Example 1:
Input:
Q = np.array([[1, 0], [0, 1]])
K = np.array([[1, 0], [0, 1]])
V = np.array([[1, 2], [3, 4]])
output = self_attention(Q, K, V)
Output:
[[1.660477, 2.660477], [2.339523, 3.339523]]
Explanation:
1. Compute scores: Q @ K.T / sqrt(2) = [[0.7071, 0], [0, 0.7071]]
2. Apply softmax row-wise: approximately [[0.67, 0.33], [0.33, 0.67]]
3. Multiply by V: attention_weights @ V gives the final contextualized output
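The walkthrough above can be reproduced numerically with a few lines of NumPy (a quick verification sketch, not part of the required solution):

```python
import numpy as np

Q = np.array([[1, 0], [0, 1]], dtype=float)
K = np.array([[1, 0], [0, 1]], dtype=float)
V = np.array([[1, 2], [3, 4]], dtype=float)

# Step 1: scaled scores; here Q @ K.T is the identity, so scores = I / sqrt(2)
scores = Q @ K.T / np.sqrt(K.shape[1])
# Step 2: row-wise softmax; each row sums to 1
weights = np.exp(scores)
weights = weights / weights.sum(axis=1, keepdims=True)
# Step 3: weighted combination of value rows
output = weights @ V
print(output.round(6))  # [[1.660477 2.660477] [2.339523 3.339523]]
```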
Starter Code
import numpy as np

def compute_qkv(X, W_q, W_k, W_v):
    """Compute Query, Key, Value matrices from input X and weight matrices."""
    Q = np.dot(X, W_q)
    K = np.dot(X, W_k)
    V = np.dot(X, W_v)
    return Q, K, V

def self_attention(Q, K, V):
    """
    Compute scaled dot-product self-attention.

    Args:
        Q: Query matrix of shape (seq_len, d_k)
        K: Key matrix of shape (seq_len, d_k)
        V: Value matrix of shape (seq_len, d_v)

    Returns:
        Attention output of shape (seq_len, d_v)
    """
    # Your code here
    pass
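As a sanity check on shapes, the compute_qkv helper can be exercised with small random matrices (the dimensions below are illustrative choices, not part of the problem):

```python
import numpy as np

def compute_qkv(X, W_q, W_k, W_v):
    """Project input X into Query, Key, Value spaces, as in the starter code."""
    return np.dot(X, W_q), np.dot(X, W_k), np.dot(X, W_v)

rng = np.random.default_rng(0)
seq_len, d_model, d_k, d_v = 4, 8, 3, 5  # hypothetical sizes for illustration

X = rng.standard_normal((seq_len, d_model))
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_v))

Q, K, V = compute_qkv(X, W_q, W_k, W_v)
print(Q.shape, K.shape, V.shape)  # (4, 3) (4, 3) (4, 5)
```

Note that Q and K must share the same inner dimension d_k (their dot product defines the scores), while V may have a different width d_v, which becomes the width of the output.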