Implement the training phase of the Byte Pair Encoding (BPE) algorithm, a subword tokenization method widely used in modern NLP models like GPT and RoBERTa.
Given a corpus represented as a dictionary mapping space-separated token sequences to their frequencies, and a number of merge operations to perform, implement the BPE training algorithm.
The algorithm works iteratively:
- Count all adjacent token pairs across the corpus, weighted by word frequency
- Find the most frequent pair
- Merge that pair everywhere it appears in the corpus (replacing the two tokens with their concatenation)
- Record the merge operation
- Repeat for the specified number of merges
Note: When counting pairs in a word, if a pair appears multiple times in the same word (e.g., 'a b a b' contains 'a b' twice), count each occurrence separately, multiplied by the word frequency.
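The weighted pair count described above can be sketched with `collections.Counter`; the helper name `count_pairs` is illustrative, not part of the required interface:

```python
from collections import Counter

def count_pairs(corpus: dict) -> Counter:
    # Count adjacent token pairs across all words, weighted by word frequency.
    # Each occurrence within a word counts separately, so 'a b a b'
    # contributes the pair ('a', 'b') twice per unit of frequency.
    pairs = Counter()
    for word, freq in corpus.items():
        tokens = word.split()
        for pair in zip(tokens, tokens[1:]):
            pairs[pair] += freq
    return pairs

# ('a', 'b') occurs twice in the word, so with frequency 3 its count is 6.
count_pairs({"a b a b": 3})
```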
If there are fewer possible merges than num_merges (e.g., all words are single tokens), stop early.
The special token '</w>' represents end-of-word and should be treated as a regular token that can participate in merges.
Write a function that returns the list of merge operations performed, where each merge is a tuple of the two tokens that were merged.
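Putting the steps together, one possible shape of the training loop is sketched below. This is a reference sketch, not the only valid solution; it assumes ties between equally frequent pairs may be broken arbitrarily, which the examples here never exercise:

```python
from collections import Counter

def byte_pair_encoding(corpus: dict, num_merges: int) -> list:
    corpus = dict(corpus)  # avoid mutating the caller's dictionary
    merges = []
    for _ in range(num_merges):
        # Step 1: count adjacent pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            tokens = word.split()
            for pair in zip(tokens, tokens[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # no adjacent pairs left (e.g., all single tokens): stop early
        # Step 2: pick the most frequent pair (ties broken arbitrarily here).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Step 3: replace every occurrence of the pair in every word.
        new_corpus = {}
        for word, freq in corpus.items():
            tokens = word.split()
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            key = " ".join(out)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges
```

Note that '</w>' needs no special handling: since it is just another token in the space-separated sequence, it participates in counting and merging like any other.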
Examples
Input: corpus = {"h u g </w>": 10, "p u g </w>": 5, "p u g s </w>": 5}, num_merges = 2
Output: [('u', 'g'), ('ug', '</w>')]

Starter Code
def byte_pair_encoding(corpus: dict, num_merges: int) -> list:
    """
    Train a BPE tokenizer on the given corpus.

    Args:
        corpus: Dictionary mapping space-separated token sequences to their frequencies.
            Example: {"l o w </w>": 5, "n e w </w>": 6}
        num_merges: Number of merge operations to perform.

    Returns:
        List of tuples, where each tuple contains the two tokens that were merged.
        Example: [('l', 'o'), ('lo', 'w')]
    """
    pass