Samin Chandeepa
15 min read

Building a High-Quality Medical Pretraining Dataset for Small Language Models

Large language models like GPT-4 or Gemini are trained on trillions of tokens scraped from the open web. But when your goal is a Small Language Model (SLM) with only ~300 million parameters, targeted at the medical domain, quality matters far more than quantity. Every token the model sees during pretraining needs to carry signal, not noise.

In this post, we build a high-quality medical pretraining dataset by curating content from three authoritative sources: PubMed abstracts, PMC open-access full-text articles, and clinical practice guidelines. We implement a nine-stage pipeline that includes data loading with fallbacks, text cleaning, quality filtering, deduplication, tokenization, and efficient document packing.

#Pipeline Overview

The pipeline is organized into nine stages, each building on the output of the previous one.

  1. Data Loading (PubMed, PMC OA, Guidelines)
  2. Text Cleaning (boilerplate removal, normalization)
  3. Quality Filtering (length, language, content checks)
  4. Exact Deduplication (MD5 hashing)
  5. Near-Duplicate Removal (MinHash LSH)
  6. Tokenization (GPT-2 tokenizer)
  7. Document Packing (greedy, EOS separators)
  8. Export (HuggingFace Hub)
  9. Visualization (plots and statistics)
Figure: Nine-stage pipeline for building high-quality medical pretraining data.
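
To make the flow concrete, here is a minimal driver sketch chaining the stages in order. It assumes the functions defined in the rest of this post; load_pmc_oa, load_guidelines, and passes_quality_filters are hypothetical siblings of the loaders and filters shown below.

def build_corpus(config):
    # Stage 1: load all three sources (loader functions defined below)
    corpus = (
        load_pubmed(config["MAX_SAMPLES"]["pubmed"])
        + load_pmc_oa(config["MAX_SAMPLES"]["pmc_oa"])          # hypothetical sibling loader
        + load_guidelines(config["MAX_SAMPLES"]["guidelines"])  # hypothetical sibling loader
    )
    # Stage 2: clean each document in place
    for doc in corpus:
        doc["text"] = clean_text(doc["text"])
    # Stage 3: quality filtering
    corpus = [doc for doc in corpus if passes_quality_filters(doc)]
    # Stages 4-5: exact, then near-duplicate removal
    corpus = exact_dedup(corpus)
    corpus = near_dedup(corpus, config)
    # Stages 6-7: tokenize and pack into fixed-size chunks
    chunks = pack_documents(corpus, tokenizer, MAX_CHUNK_TOKENS, sep_ids)
    return corpus, chunks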

#Why These Sources

Medical data quality is paramount. We selected three authoritative sources that represent the gold standard in medical literature.

PubMed Abstracts

Curated summaries of peer-reviewed medical research, providing high-signal content without the noise of full papers.

PMC Open Access

Full-text articles from PubMed Central's open access collection, offering comprehensive medical content.

Clinical Guidelines

Evidence-based recommendations from medical societies, representing current best practices.

This combination ensures our dataset covers research findings, detailed methodologies, and clinical applications.

CONFIG = {
  "MAX_SAMPLES": {
    "pubmed":     50_000,   # PubMed abstracts
    "pmc_oa":     20_000,   # PMC full-text articles
    "guidelines": 10_000,   # Clinical guidelines
  },
}

These caps total 80,000 documents; in practice the loaders returned 78,080 raw documents across the three sources:

| Source | Documents (cap) | Est. Tokens |
| --- | --- | --- |
| PubMed Abstracts | 50,000 | ~25M |
| PMC Open Access | 20,000 | ~30M |
| Clinical Guidelines | 10,000 | ~15M |
| Total | 80,000 | ~70M |

Figure: Corpus composition: document count and estimated token volume by source.
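
The token counts above are estimates. One cheap way to produce such estimates (a sketch we add for illustration, reusing the GPT-2 tokenizer introduced later in the pipeline) is to tokenize a small random sample and extrapolate:

import random
from transformers import AutoTokenizer

def estimate_total_tokens(docs, sample_size=500, seed=42):
    """Extrapolate a corpus-level token count from a random sample."""
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    random.seed(seed)
    sample = random.sample(docs, min(sample_size, len(docs)))
    sample_total = sum(len(tokenizer.encode(d["text"])) for d in sample)
    return int(sample_total / len(sample) * len(docs))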

#Data Loading Implementation

Each source has its own loader function that returns a list of dictionaries with text, source, and id fields. We handle dataset-script deprecation errors and network issues with automatic fallbacks.

from datasets import load_dataset
from tqdm import tqdm

def load_pubmed(max_samples):
    """Load PubMed abstracts, falling back to a mirror if the primary script fails."""
    records = []
    try:
        ds = load_dataset("ncbi/pubmed", split="train", streaming=True)
        for i, row in enumerate(tqdm(ds, total=max_samples)):
            if i >= max_samples:
                break
            records.append({
                "text": row["MedlineCitation"]["Article"]["Abstract"]["AbstractText"],
                "source": "pubmed",
                "id": row["MedlineCitation"]["PMID"],
            })
    except Exception:
        # Fallback: the summarization mirror exposes plain "article"/"abstract" columns
        ds = load_dataset("ccdv/pubmed-summarization", split="train", streaming=True)
        for i, row in enumerate(tqdm(ds, total=max_samples)):
            if i >= max_samples:
                break
            records.append({
                "text": row["abstract"],
                "source": "pubmed",
                "id": f"pubmed_fallback_{i}",
            })
    return records
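
The same try/except fallback pattern repeats for every source. One way to factor it out (a sketch of ours, not code from the original pipeline) is a small wrapper that takes a primary and a backup loader:

def load_with_fallback(primary_loader, fallback_loader, max_samples):
    """Try the primary loader; on any failure, switch to the backup source."""
    try:
        return primary_loader(max_samples)
    except Exception as exc:
        print(f"Primary source failed ({exc}); falling back.")
        return fallback_loader(max_samples)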

#Text Cleaning

Raw medical text contains substantial boilerplate that reduces training signal. We remove common patterns using regex and normalization.

import re
import unicodedata

BOILERPLATE_PATTERNS = [
    r"copyright\s*©?\s*\d{4}",                  # copyright notices
    r"this (?:article|work) is licensed under", # license statements
    r"funding[:\s].*?(?:\.|$)",                 # funding statements
    r"acknowledgements?\s*:?",
    r"conflict[s]?\s+of\s+interest",
    r"author\s+contributions?",
    r"(?:https?://|www\.)\S+",                  # URLs
    r"doi:\s*\S+",                              # DOIs
    r"\[\d+(?:[-,–]\d+)*\]",                    # bracketed citation markers, e.g. [1-3]
    r"&[a-zA-Z]+;",                             # HTML entities
    r"<[^>]+>",                                 # HTML tags
    r"[=\-_]{10,}",                             # ASCII divider lines
]
compiled_patterns = [re.compile(p, re.IGNORECASE) for p in BOILERPLATE_PATTERNS]

def clean_text(text):
    # Normalize Unicode (ligatures, full-width characters, etc.)
    text = unicodedata.normalize("NFKC", text)
    # Strip boilerplate patterns
    for pattern in compiled_patterns:
        text = pattern.sub(" ", text)
    # Truncate at the reference section, if present
    for line in text.split("\n"):
        if re.match(r"^\s*(References|Bibliography|Works Cited)", line):
            text = text[: text.find(line)]
            break
    # Drop lines that are mostly digits (table fragments, page numbers)
    lines = [
        l for l in text.split("\n")
        if not (len(l) > 0 and sum(c.isdigit() for c in l) / len(l) > 0.5)
    ]
    text = "\n".join(lines)
    # Collapse runs of whitespace and blank lines
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
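
A quick sanity check on a synthetic snippet (the example text is ours, not drawn from the corpus):

raw = (
    "Aspirin reduces cardiovascular risk [1-3].\n"
    "Copyright © 2021 The Authors. doi: 10.1000/example\n"
    "References\n"
    "1. Smith J, et al. Aspirin and the heart."
)
print(clean_text(raw))
# The citation marker, copyright notice, and DOI are stripped, and
# everything from the "References" line onward is truncated.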

#Quality Filtering

We apply multiple quality filters to ensure only high-signal content reaches the model.

| Filter | Threshold | Purpose |
| --- | --- | --- |
| Word Count | ≥ 100 words | Remove stubs/short abstracts |
| Language | English only | Tokenizer compatibility |
| Content Quality | < 30% boilerplate | High signal-to-noise ratio |
| Medical Relevance | Keyword score > 0.3 | Domain relevance |
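
The medical-relevance check scores documents by keyword coverage. The exact scoring function isn't shown here, so the following is one plausible formulation (the keyword list is illustrative and deliberately short):

import re

MEDICAL_KEYWORDS = {
    "patient", "treatment", "clinical", "diagnosis", "therapy",
    "disease", "symptom", "dose", "trial", "efficacy",
}

def medical_score(text):
    """Fraction of the keyword set that appears at least once in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    hits = sum(1 for kw in MEDICAL_KEYWORDS if kw in words)
    return hits / len(MEDICAL_KEYWORDS)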

The quality filter also includes language detection using the langdetect library. We only keep English documents, since our model and tokenizer are designed for English.

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def is_english(text):
    try:
        # A 500-character prefix is enough for reliable detection and much faster
        return detect(text[:500]) == "en"
    except LangDetectException:
        return True  # Keep the document if detection fails (e.g. too little text)
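
Putting the four checks together, a composite filter might look like this (a sketch: boilerplate_ratio is an assumed helper that measures how much of the text the boilerplate regexes would match):

def passes_quality_filters(doc):
    """Apply the four checks from the filter table above."""
    text = doc["text"]
    if len(text.split()) < 100:            # word-count floor
        return False
    if not is_english(text):               # English-only corpus
        return False
    if boilerplate_ratio(text) >= 0.30:    # assumed helper: share of boilerplate matches
        return False
    if medical_score(text) <= 0.3:         # domain-relevance threshold
        return False
    return True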

The filtering results tell an important story:

| Stage | Documents | Removed | Reason |
| --- | --- | --- | --- |
| Raw Load | 78,080 | — | — |
| After Cleaning | 78,080 | 0 | Text normalization only |
| After Filtering | 70,272 | 7,808 (10%) | Quality thresholds |
| After Dedup | 44,187 | 26,085 (37%) | Duplicates removed |

#Deduplication

Duplicate content severely degrades training quality. We implement both exact and near-duplicate removal.

Exact Deduplication (MD5 Hashing)

Simple but effective: normalize whitespace and hash the text.

import hashlib
import re

def exact_dedup(corpus):
    """Drop documents whose whitespace-normalized text is byte-identical."""
    seen_hashes = set()
    unique = []
    for doc in corpus:
        # Lowercase and collapse whitespace so trivial variants hash identically
        normalized = re.sub(r"\s+", " ", doc["text"].lower().strip())
        doc_hash = hashlib.md5(normalized.encode()).hexdigest()
        if doc_hash not in seen_hashes:
            seen_hashes.add(doc_hash)
            unique.append(doc)
    return unique

Near-Duplicate Removal (MinHash LSH)

For documents that are similar but not identical, we use MinHash LSH with character-level n-grams.

from datasketch import MinHash, MinHashLSH
from tqdm import tqdm

def get_minhash(text, num_perm, ngram_size):
    """Build a MinHash signature from character-level n-grams."""
    mh = MinHash(num_perm=num_perm)
    for i in range(len(text) - ngram_size + 1):
        ngram = text[i:i + ngram_size]
        mh.update(ngram.encode("utf8"))
    return mh


def near_dedup(corpus, config):
    """Keep only documents whose signature has no close match among those kept so far."""
    lsh = MinHashLSH(
        threshold=config["MINHASH_THRESHOLD"],
        num_perm=config["MINHASH_NUM_PERM"]
    )
    unique = []
    for doc in tqdm(corpus):
        mh = get_minhash(doc["text"], config["MINHASH_NUM_PERM"], config["NGRAM_SIZE"])
        if not lsh.query(mh):          # no near-duplicate indexed yet
            lsh.insert(doc["id"], mh)
            unique.append(doc)
    return unique
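
The LSH parameters come from CONFIG. Typical values (illustrative; the settings used in the original run are not shown here) might be:

CONFIG.update({
    "MINHASH_THRESHOLD": 0.8,  # Jaccard similarity above which two docs count as near-duplicates
    "MINHASH_NUM_PERM": 128,   # hash permutations: more = better accuracy, slower hashing
    "NGRAM_SIZE": 5,           # character n-gram width used for shingling
})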

#Tokenization & Packing

We use GPT-2's tokenizer and pack documents efficiently with EOS separators.

from transformers import AutoTokenizer
from tqdm import tqdm

TOKENIZER_NAME = "gpt2"
MAX_CHUNK_TOKENS = 1024
DOC_SEPARATOR = "<|endoftext|>"

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)
# GPT-2 already uses <|endoftext|> as its EOS token; this is a no-op safeguard
tokenizer.add_special_tokens({"eos_token": DOC_SEPARATOR})
sep_ids = tokenizer.encode(DOC_SEPARATOR)

def pack_documents(corpus, tokenizer, max_tokens, sep_ids):
    chunks = []
    current_chunk = []

    for doc in tqdm(corpus):
        # Truncate overlong documents at the chunk size, then append the EOS separator
        token_ids = tokenizer.encode(
            doc["text"],
            add_special_tokens=False,
            truncation=True,
            max_length=max_tokens
        ) + sep_ids

        if len(current_chunk) + len(token_ids) > max_tokens:
            # Flush the current chunk and start a new one with this document
            chunks.append(current_chunk[:max_tokens])
            current_chunk = token_ids
        else:
            current_chunk.extend(token_ids)

    if current_chunk:
        chunks.append(current_chunk[:max_tokens])

    return chunks

Our packing achieved 98.3% average chunk fill rate, meaning less than 2% of tokens are wasted as padding.
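
The fill rate is easy to verify from the packed output (a small check, assuming chunks is the return value of pack_documents):

def fill_rate(chunks, max_tokens):
    """Average fraction of each chunk occupied by real tokens rather than padding."""
    return sum(len(c) for c in chunks) / (len(chunks) * max_tokens)

print(f"Fill rate: {fill_rate(chunks, MAX_CHUNK_TOKENS):.1%}")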

Final corpus: 44,187 documents, ~44.7M tokens, 98.3% fill rate.

#Export

We export both the raw documents and packed chunks to HuggingFace Hub.

from datasets import Dataset

# `df` holds the final deduplicated documents (text, source, id)
dataset_dict = Dataset.from_pandas(df).train_test_split(
    test_size=0.05, seed=CONFIG["SEED"]
)

dataset_dict.push_to_hub("Saminx22/medical_data_for_slm", config_name="documents")

packed_chunks_dataset = Dataset.from_dict({
    "input_ids": all_chunks,
    "token_count": [len(chunk) for chunk in all_chunks],
    "chunk_id": [f"chunk_{i}" for i in range(len(all_chunks))]
})
packed_chunks_dataset.push_to_hub("Saminx22/medical_data_for_slm", config_name="chunks", split="train")
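
Anyone can then pull either configuration straight from the Hub:

from datasets import load_dataset

docs = load_dataset("Saminx22/medical_data_for_slm", "documents")
chunks = load_dataset("Saminx22/medical_data_for_slm", "chunks", split="train")
print(chunks[0]["token_count"])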

#Chinchilla Scaling Check

The Chinchilla scaling laws suggest that a compute-optimal language model should be trained on approximately 20 tokens per parameter.

300M params × 20 tokens/param = 6 billion tokens (optimal)

Our dataset: ~44.7M tokens

At 44.7M tokens, our dataset is substantially below the Chinchilla-optimal threshold. This is intentional for a prototype—we are demonstrating the pipeline, not training a production model.
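
The gap is easy to quantify:

params = 300e6
optimal_tokens = 20 * params     # Chinchilla heuristic: ~20 tokens per parameter
actual_tokens = 44.7e6
print(f"{actual_tokens / optimal_tokens:.1%} of the compute-optimal budget")  # ~0.7%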

#Key Takeaways

  1. Quality over quantity. For small models, every token matters. Aggressive filtering removed 10% of documents; deduplication removed another 37%.
  2. Stream everything. Medical datasets can be massive. Streaming from HuggingFace with sample caps prevents OOM crashes.
  3. Pack efficiently. Greedy document packing with <|endoftext|> separators achieved 98.3% utilization, wasting almost no tokens.
  4. Build fallbacks. Data sources break, APIs change, scripts get deprecated. Automatic fallback loaders keep the pipeline running.
  5. Visualize your corpus. Statistics and plots catch problems that code cannot: distribution skew, outlier documents, unexpected source imbalances.

In the companion blog post, we use this dataset to pretrain MedSLM — a 330M-parameter transformer with RMSNorm, Rotary Positional Embeddings, SwiGLU activations, and Grouped-Query Attention.

#Resources

Explore other posts in this series.