Byte Pair Encoding (BPE) Tokenizer

Medium
NLP

Implement the training phase of the Byte Pair Encoding (BPE) algorithm, a subword tokenization method widely used in modern NLP models like GPT and RoBERTa.

Given a corpus represented as a dictionary mapping space-separated token sequences to their frequencies, and a number of merge operations to perform, implement the BPE training algorithm.

The algorithm works iteratively:

  1. Count all adjacent token pairs across the corpus, weighted by word frequency
  2. Find the most frequent pair
  3. Merge that pair everywhere it appears in the corpus (replacing the two tokens with their concatenation)
  4. Record the merge operation
  5. Repeat for the specified number of merges

Note: When counting pairs in a word, if a pair appears multiple times in the same word (e.g., 'a b a b' contains 'a b' twice), count each occurrence separately, multiplied by the word frequency.
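The weighted pair count described above can be sketched as follows (the helper name `count_pairs` is illustrative, not part of the required interface):

```python
from collections import Counter

def count_pairs(corpus: dict) -> Counter:
    # Count adjacent token pairs across all words, weighted by word frequency.
    pairs = Counter()
    for word, freq in corpus.items():
        tokens = word.split()
        # zip(tokens, tokens[1:]) yields every adjacent pair, so repeated
        # pairs within the same word are counted once per occurrence.
        for a, b in zip(tokens, tokens[1:]):
            pairs[(a, b)] += freq
    return pairs

# 'a b a b' contains the pair ('a', 'b') twice, each weighted by frequency 3.
print(count_pairs({"a b a b": 3}))  # Counter({('a', 'b'): 6, ('b', 'a'): 3})
```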

If there are fewer possible merges than num_merges (e.g., all words are single tokens), stop early.

The special token '</w>' represents end-of-word and should be treated as a regular token that can participate in merges.

Write a function that returns the list of merge operations performed, where each merge is a tuple of the two tokens that were merged.

Examples

Example 1:
Input: corpus = {"h u g </w>": 10, "p u g </w>": 5, "p u g s </w>": 5}, num_merges = 2
Output: [('u', 'g'), ('ug', '</w>')]
Explanation:
Step 1: Count all adjacent pairs weighted by frequency. Pair (u,g) has count 10+5+5=20, (h,u) has 10, (g,</w>) has 10+5=15, (p,u) has 5+5=10, (g,s) has 5, and (s,</w>) has 5. The most frequent pair is (u,g) with 20. Merge u and g everywhere: the corpus becomes {"h ug </w>": 10, "p ug </w>": 5, "p ug s </w>": 5}.
Step 2: Recount pairs. (ug,</w>) has count 10+5=15, (h,ug) has 10, (p,ug) has 10, (ug,s) has 5, and (s,</w>) has 5. The most frequent is (ug,</w>) with 15. Merge ug and </w>.
Final merges: [('u','g'), ('ug','</w>')].
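The merge step used in the explanation, replacing every adjacent occurrence of the chosen pair with its concatenation, can be sketched like this (the helper name `merge_word` is an assumption for illustration):

```python
def merge_word(word: str, pair: tuple) -> str:
    # Rewrite one space-separated word, fusing every adjacent occurrence
    # of `pair` into a single concatenated token.
    tokens = word.split()
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2  # skip both tokens of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return " ".join(merged)

print(merge_word("h u g </w>", ("u", "g")))  # h ug </w>
```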

Starter Code

def byte_pair_encoding(corpus: dict, num_merges: int) -> list:
    """
    Train a BPE tokenizer on the given corpus.
    
    Args:
        corpus: Dictionary mapping space-separated token sequences to their frequencies.
                Example: {"l o w </w>": 5, "n e w </w>": 6}
        num_merges: Number of merge operations to perform.
    
    Returns:
        List of tuples, where each tuple contains the two tokens that were merged.
        Example: [('l', 'o'), ('lo', 'w')]
    """
    pass
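One possible implementation of the training loop, combining the pair counting and merge steps described above (a sketch, not the only valid solution; ties between equally frequent pairs are broken arbitrarily here, which suffices for the examples):

```python
from collections import Counter

def byte_pair_encoding(corpus: dict, num_merges: int) -> list:
    corpus = dict(corpus)  # work on a copy; don't mutate the caller's dict
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            tokens = word.split()
            for a, b in zip(tokens, tokens[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # all words are single tokens: stop early
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Merge the best pair everywhere it appears in the corpus.
        new_corpus = {}
        for word, freq in corpus.items():
            tokens = word.split()
            merged, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    merged.append(tokens[i] + tokens[i + 1])
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            key = " ".join(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges

corpus = {"h u g </w>": 10, "p u g </w>": 5, "p u g s </w>": 5}
print(byte_pair_encoding(corpus, 2))  # [('u', 'g'), ('ug', '</w>')]
```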