
Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

Date: October 11, 2018 (Last revised: May 24, 2019)

Link: https://arxiv.org/abs/1810.04805

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

[Figure: BERT Architecture]

Introduction

BERT (Bidirectional Encoder Representations from Transformers) represents a paradigm shift in natural language processing. Introduced by researchers at Google AI Language, BERT is a revolutionary language representation model that fundamentally changed how we approach NLP tasks through deep bidirectional pre-training.

The Pre-BERT Landscape

Before BERT, language representation models primarily used unidirectional approaches:

  • Left-to-right models: Process text sequentially from beginning to end
  • Shallow bidirectionality: Concatenate left-to-right and right-to-left representations
  • Task-specific architectures: Required substantial modifications for different NLP tasks

These approaches had significant limitations in capturing full contextual information.

BERT's Revolutionary Approach

Bidirectional Pre-training

BERT's key innovation is pre-training deep bidirectional representations by jointly conditioning on both left and right context in all layers. Unlike previous models, BERT can "see" the entire sentence at once, understanding context from both directions simultaneously.

Traditional (left-to-right): The → [cat] → sat → on → the → mat
Traditional (right-to-left): The ← [cat] ← sat ← on ← the ← mat
BERT (bidirectional):        The ↔ [cat] ↔ sat ↔ on ↔ the ↔ mat
                                     ↑
                       ([cat] sees the full sentence as context)

Two-Stage Framework

BERT operates in two stages:

  1. Pre-training: Train on unlabeled data over different pre-training tasks
  2. Fine-tuning: Initialize with pre-trained parameters and fine-tune on downstream tasks

Pre-training Tasks

BERT uses two unsupervised tasks for pre-training:

1. Masked Language Model (MLM)

The MLM task randomly masks 15% of tokens and trains the model to predict them:

import random

def create_masked_lm_predictions(tokens, vocab, masked_lm_prob=0.15):
    """
    Create masked language model predictions.

    Args:
        tokens: Input token sequence (list of strings)
        vocab: List of vocabulary tokens to sample random replacements from
        masked_lm_prob: Probability of selecting each token for prediction

    Returns:
        Masked tokens and per-position labels (None where no prediction is made)
    """
    masked_tokens = tokens.copy()
    labels = []

    for i, token in enumerate(tokens):
        # Select this token for prediction with probability masked_lm_prob
        if random.random() < masked_lm_prob:
            prob = random.random()

            if prob < 0.8:
                # 80% of the time: replace with [MASK]
                masked_tokens[i] = '[MASK]'
            elif prob < 0.9:
                # 10% of the time: replace with a random vocabulary token
                masked_tokens[i] = random.choice(vocab)
            # 10% of the time: keep the original token

            labels.append(token)
        else:
            labels.append(None)

    return masked_tokens, labels

Example:

Original:  "The quick brown fox jumps over the lazy dog"
Masked:    "The [MASK] brown fox [MASK] over the lazy dog"
Predict:   "quick" and "jumps"

2. Next Sentence Prediction (NSP)

The NSP task trains BERT to understand relationships between sentences:

import random

def create_nsp_training_data(sentences, corpus_sentences):
    """
    Create next sentence prediction training pairs.

    Args:
        sentences: Sentences of one document, in order
        corpus_sentences: Pool of sentences to draw negatives from
            (in the paper, negatives come from *other* documents)

    Returns:
        (sentence_a, sentence_b, label) triples with label 'IsNext' or 'NotNext'
    """
    training_pairs = []

    for i in range(len(sentences) - 1):
        sentence_a = sentences[i]

        # 50% of the time: actual next sentence (positive example)
        if random.random() < 0.5:
            sentence_b = sentences[i + 1]
            label = 'IsNext'
        # 50% of the time: random sentence from the corpus (negative example)
        else:
            sentence_b = random.choice(corpus_sentences)
            label = 'NotNext'

        training_pairs.append((sentence_a, sentence_b, label))

    return training_pairs

BERT Architecture

Model Configurations

BERT comes in two main sizes:

BERT_BASE:

  • Layers (L): 12
  • Hidden size (H): 768
  • Attention heads (A): 12
  • Total parameters: 110M

BERT_LARGE:

  • Layers (L): 24
  • Hidden size (H): 1024
  • Attention heads (A): 16
  • Total parameters: 340M
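These headline counts can be roughly reproduced from L and H alone using the standard Transformer shapes. The sketch below assumes the WordPiece vocabulary of 30,522 tokens, 512 positions, a feed-forward size of 4H, and the pooler layer; it is an approximation, not the exact accounting from the paper:

```python
def approx_bert_params(num_layers, hidden, vocab=30522, max_pos=512):
    # Embeddings: token + position + segment tables, plus their LayerNorm
    embed = (vocab + max_pos + 2) * hidden + 2 * hidden
    # Per encoder layer: Q/K/V/output projections (weights + biases)
    attn = 4 * (hidden * hidden + hidden)
    # Feed-forward: H -> 4H -> H, with biases
    ffn = hidden * 4 * hidden + 4 * hidden + 4 * hidden * hidden + hidden
    # Two LayerNorms per layer (scale + shift each)
    layer = attn + ffn + 2 * (2 * hidden)
    # Pooler: one dense layer over the [CLS] vector
    pooler = hidden * hidden + hidden
    return embed + num_layers * layer + pooler

approx_bert_params(12, 768)    # ~110M  (BERT_BASE)
approx_bert_params(24, 1024)   # ~335M  (BERT_LARGE)
```

The count is dominated by the encoder layers (each costs roughly 12·H² parameters), which is why doubling L and growing H takes BASE's 110M to LARGE's 340M.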

Input Representation

BERT's input combines three types of embeddings:

import torch.nn as nn

class BERTEmbedding(nn.Module):
    def __init__(self, vocab_size, hidden_size, max_position, type_vocab_size):
        super().__init__()

        # Token embeddings: one vector per vocabulary entry
        self.token_embedding = nn.Embedding(vocab_size, hidden_size)

        # Positional embeddings: learned, one per position in the sequence
        self.position_embedding = nn.Embedding(max_position, hidden_size)

        # Segment embeddings: distinguish sentence A from sentence B
        self.segment_embedding = nn.Embedding(type_vocab_size, hidden_size)

        self.layer_norm = nn.LayerNorm(hidden_size)

    def forward(self, token_ids, segment_ids, position_ids):
        # Sum the three embeddings elementwise
        token_embed = self.token_embedding(token_ids)
        position_embed = self.position_embedding(position_ids)
        segment_embed = self.segment_embedding(segment_ids)

        embeddings = token_embed + position_embed + segment_embed
        embeddings = self.layer_norm(embeddings)

        return embeddings

Special Tokens

BERT uses special tokens for different purposes:

  • [CLS]: Classification token (first token of every sequence)
  • [SEP]: Separator token (separates sentences)
  • [MASK]: Mask token (for MLM task)
  • [PAD]: Padding token
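To make the role of these tokens concrete, here is a minimal sketch (a hypothetical helper, not code from the paper) of how a single sentence or a sentence pair is packed into BERT's input format, together with the segment IDs the embedding layer consumes:

```python
def build_bert_input(tokens_a, tokens_b=None):
    # Every sequence starts with [CLS]; every segment ends with [SEP]
    tokens = ['[CLS]'] + tokens_a + ['[SEP]']
    segment_ids = [0] * len(tokens)               # segment A -> id 0
    if tokens_b is not None:
        tokens += tokens_b + ['[SEP]']
        segment_ids += [1] * (len(tokens_b) + 1)  # segment B -> id 1
    return tokens, segment_ids

tokens, segs = build_bert_input(['the', 'cat'], ['it', 'sat'])
# tokens: ['[CLS]', 'the', 'cat', '[SEP]', 'it', 'sat', '[SEP]']
# segs:   [0, 0, 0, 0, 1, 1, 1]
```

In practice shorter sequences in a batch are then right-padded with [PAD], with the attention mask set to 0 over the padding.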

Fine-tuning for Downstream Tasks

One of BERT's most powerful features is its simplicity in fine-tuning:

Single Sentence Classification

class BERTClassifier(nn.Module):
    def __init__(self, bert_model, num_classes):
        super().__init__()
        self.bert = bert_model
        self.classifier = nn.Linear(bert_model.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask, token_type_ids):
        # Get BERT output
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )

        # Use [CLS] token representation for classification
        pooled_output = outputs.pooler_output

        # Classification layer
        logits = self.classifier(pooled_output)

        return logits

Question Answering

class BERTForQuestionAnswering(nn.Module):
    def __init__(self, bert_model):
        super().__init__()
        self.bert = bert_model
        self.qa_outputs = nn.Linear(bert_model.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        # Get BERT sequence output
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )

        sequence_output = outputs.last_hidden_state

        # Predict start and end positions
        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = logits.split(1, dim=-1)

        return start_logits.squeeze(-1), end_logits.squeeze(-1)

Benchmark Results

BERT achieved state-of-the-art results across multiple NLP benchmarks:

GLUE Benchmark

The General Language Understanding Evaluation (GLUE) benchmark:

  • BERT_BASE: 78.5% average score
  • BERT_LARGE: 80.5% average score
  • Improvement: 7.7 percentage points absolute over the prior state of the art

SQuAD (Question Answering)

Stanford Question Answering Dataset results:

SQuAD v1.1:

  • Test F1: 93.2
  • Improvement: 1.5 points absolute

SQuAD v2.0:

  • Test F1: 83.1
  • Improvement: 5.1 points absolute

MultiNLI (Natural Language Inference)

  • Accuracy: 86.7%
  • Improvement: 4.6% absolute

Named Entity Recognition (NER)

On CoNLL-2003 NER:

  • F1 Score: 92.8
  • New state-of-the-art

Technical Innovations

Bidirectional Context

The bidirectional nature allows BERT to understand nuanced meanings:

Example: "The bank of the river"
- Left context: "The bank" → could be financial institution
- Right context: "of the river" → clarifies it's a riverbank
- BERT sees both: correctly identifies riverbank context

Transfer Learning Excellence

BERT's pre-trained representations transfer exceptionally well:

# Pre-trained on general text
bert_base = load_pretrained_bert()

# Fine-tune for specific task with minimal data
task_model = BERTClassifier(bert_base, num_classes=3)
train(task_model, task_data, epochs=3)  # Often just 2-4 epochs!

WordPiece Tokenization

BERT uses WordPiece tokenization to handle rare words:

Input: "unbelievable"
Tokens: ["un", "##believe", "##able"]  (the exact split depends on the learned vocabulary)

By composing rare words from known subword pieces, BERT avoids out-of-vocabulary failures entirely.
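The split itself can be sketched with the greedy longest-match-first procedure WordPiece applies at inference time. This is a simplified version over a toy vocabulary (the real vocabulary of ~30K pieces is learned from the corpus):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first split of one word into WordPiece subwords."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking until it's in vocab
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = '##' + piece   # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ['[UNK]']           # no piece matches: fall back to unknown
        tokens.append(match)
        start = end
    return tokens

toy_vocab = {'un', 'break', '##break', '##able'}
wordpiece_tokenize('unbreakable', toy_vocab)  # ['un', '##break', '##able']
```

Longest-match-first keeps common words as single tokens while still decomposing anything the vocabulary does not cover whole.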

Impact on NLP

BERT's impact on the field has been transformative:

1. Pre-training Paradigm

Established pre-training + fine-tuning as the standard approach for NLP:

General Pre-training → Task-Specific Fine-tuning
(Large unlabeled data) → (Small labeled data)

2. Democratization of NLP

Made state-of-the-art NLP accessible:

  • Download pre-trained BERT
  • Fine-tune on your task
  • Achieve excellent results

3. Research Catalyst

Inspired numerous variants and improvements:

  • RoBERTa: Robustly Optimized BERT
  • ALBERT: A Lite BERT
  • DistilBERT: Distilled version
  • ELECTRA: More efficient pre-training
  • SpanBERT: Span-based pre-training

4. Multilingual Models

BERT demonstrated effective multilingual learning:

  • Multilingual BERT supports 104 languages
  • Shows cross-lingual transfer capabilities

Practical Applications

BERT has been deployed in numerous real-world applications:

Search Engines

Google integrated BERT into Search:

  • Better understanding of search queries
  • Improved contextual relevance
  • Handling of conversational queries

Chatbots and Virtual Assistants

Enhanced understanding of:

  • User intent
  • Contextual conversation flow
  • Nuanced language patterns

Content Recommendation

Improved content matching based on:

  • Semantic similarity
  • Contextual relevance
  • User intent understanding

Best Practices for Using BERT

1. Choose the Right Model Size

# For resource-constrained environments
model = BertModel.from_pretrained('bert-base-uncased')

# For maximum performance (at higher memory and compute cost)
model = BertModel.from_pretrained('bert-large-uncased')

# For case-sensitive tasks such as NER
model = BertModel.from_pretrained('bert-base-cased')

2. Proper Fine-tuning

from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Small learning rate with linear warmup and decay
optimizer = AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps
)

# Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Few epochs (usually 2-4)
for epoch in range(3):
    train_epoch(model, train_dataloader, optimizer, scheduler)

3. Task-Specific Adaptations

Different tasks require different approaches:

  • Classification: Use [CLS] token
  • Token classification: Use all token representations
  • Question answering: Predict span start and end
  • Sentence pairs: Use [SEP] to separate
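For question answering, the span still has to be decoded from the start/end logits the model produces. Below is a minimal sketch of the usual argmax-over-valid-pairs decoding, written over plain Python lists; the `max_answer_len` cap is a common heuristic, not a value from the paper:

```python
def decode_span(start_logits, end_logits, max_answer_len=30):
    """Pick the (start, end) pair with the highest combined score, start <= end."""
    best, best_score = (0, 0), float('-inf')
    for s, s_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length cap
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

decode_span([0.1, 5.0, 1.0], [0.2, 1.0, 6.0])  # (1, 2)
```

Constraining the search to start <= end is what rules out the degenerate spans a naive independent argmax over each head could produce.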

Limitations and Considerations

While revolutionary, BERT has limitations:

1. Computational Requirements

  • Large memory footprint
  • Slow inference for real-time applications
  • Requires significant GPU resources

2. Maximum Sequence Length

  • Limited to 512 tokens
  • Long documents require truncation or splitting
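Splitting is commonly done with overlapping windows so that tokens near chunk boundaries keep some surrounding context. A sketch of this approach (the `stride` overlap is a common convention, not a value from the paper):

```python
def chunk_tokens(tokens, max_len=512, stride=128):
    """Split a long token list into overlapping chunks that fit BERT's limit."""
    # Reserve two slots for [CLS] and [SEP] in every chunk
    window = max_len - 2
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(['[CLS]'] + tokens[start:start + window] + ['[SEP]'])
        if start + window >= len(tokens):
            break
        # Step forward, keeping `stride` tokens of overlap with the previous chunk
        start += window - stride
    return chunks

chunks = chunk_tokens(['tok'] * 1000)
# 1000 tokens -> 3 overlapping chunks, each at most 512 tokens
```

Per-chunk predictions are then aggregated (e.g. max-pooled scores for classification, or best span across chunks for QA).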

3. Domain Adaptation

  • General pre-training may not capture domain-specific language
  • May require domain-specific pre-training

Conclusion

BERT represents a watershed moment in natural language processing. By introducing deep bidirectional pre-training, it demonstrated that:

  • Bidirectional context is crucial for language understanding
  • Pre-training + fine-tuning is a powerful paradigm
  • Transfer learning works exceptionally well for NLP
  • Simplicity in design can lead to remarkable results

The impact of BERT extends far beyond its impressive benchmark results. It democratized state-of-the-art NLP, inspired countless research directions, and fundamentally changed how we approach language understanding tasks.

BERT's legacy continues through its many successors and variants, and its core principles remain foundational to modern NLP. Whether you're building a chatbot, improving search, or developing language understanding systems, BERT's innovations provide the foundation for success.


Citation:

@article{devlin2018bert,
  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1810.04805},
  year={2018}
}