Samin Chandeepa
15 min read

Building a High-Quality Medical Pretraining Dataset for Small Language Models

Large language models like GPT-4 or Gemini are trained on trillions of tokens scraped from the open web. But when your goal is a Small Language Model (SLM) with only ~300 million parameters, targeted at the medical domain, quality matters far more than quantity. Every token the model sees during pretraining needs to carry signal, not noise.

In this post, we build a high-quality medical pretraining dataset by curating content from three authoritative sources: PubMed abstracts, PMC open-access full-text articles, and clinical practice guidelines. We implement a nine-stage pipeline that includes data loading with fallbacks, text cleaning, quality filtering, deduplication, tokenization, and efficient document packing.

#Pipeline Overview

The pipeline is organized into nine stages, each building on the output of the previous one.

  1. Data Loading (PubMed, PMC OA, Guidelines)
  2. Text Cleaning (boilerplate removal, normalization)
  3. Quality Filtering (length, language, content checks)
  4. Exact Deduplication (MD5 hashing)
  5. Near-Duplicate Removal (MinHash LSH)
  6. Tokenization (GPT-2 tokenizer)
  7. Document Packing (greedy, EOS separators)
  8. Export (HuggingFace Hub)
  9. Visualization (plots and statistics)
Figure: Nine-stage pipeline for building high-quality medical pretraining data.
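
To make the flow concrete, here is a minimal driver sketch chaining the stages in order. It assumes the functions defined in the rest of this post; load_pmc_oa, load_guidelines, and passes_quality_filters are hypothetical siblings of the loaders and filters shown below.

def build_corpus(config):
    # Stage 1: load all three sources (loader functions defined below)
    corpus = (
        load_pubmed(config["MAX_SAMPLES"]["pubmed"])
        + load_pmc_oa(config["MAX_SAMPLES"]["pmc_oa"])          # hypothetical sibling loader
        + load_guidelines(config["MAX_SAMPLES"]["guidelines"])  # hypothetical sibling loader
    )
    # Stage 2: clean each document in place
    for doc in corpus:
        doc["text"] = clean_text(doc["text"])
    # Stage 3: quality filtering
    corpus = [doc for doc in corpus if passes_quality_filters(doc)]
    # Stages 4-5: exact, then near-duplicate removal
    corpus = exact_dedup(corpus)
    corpus = near_dedup(corpus, config)
    # Stages 6-7: tokenize and pack into fixed-size chunks
    chunks = pack_documents(corpus, tokenizer, MAX_CHUNK_TOKENS, sep_ids)
    return corpus, chunks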

#Why These Sources

Medical data quality is paramount. We selected three authoritative sources that represent the gold standard in medical literature.

PubMed Abstracts

Curated summaries of peer-reviewed medical research, providing high-signal content without the noise of full papers.

PMC Open Access

Full-text articles from PubMed Central's open access collection, offering comprehensive medical content.

Clinical Guidelines

Evidence-based recommendations from medical societies, representing current best practices.

This combination ensures our dataset covers research findings, detailed methodologies, and clinical applications.

CONFIG = {
  "MAX_SAMPLES": {
    "pubmed":     50_000,   # PubMed abstracts
    "pmc_oa":     20_000,   # PMC full-text articles
    "guidelines": 10_000,   # Clinical guidelines
  },
}

These caps total 80,000 documents; in practice the loaders returned 78,080 raw documents across the three sources:

| Source | Documents (cap) | Est. Tokens |
| --- | --- | --- |
| PubMed Abstracts | 50,000 | ~25M |
| PMC Open Access | 20,000 | ~30M |
| Clinical Guidelines | 10,000 | ~15M |
| Total | 80,000 | ~70M |

Figure: Corpus composition: document count and estimated token volume by source.
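
The token counts above are estimates. One cheap way to produce such estimates (a sketch we add for illustration, reusing the GPT-2 tokenizer introduced later in the pipeline) is to tokenize a small random sample and extrapolate:

import random
from transformers import AutoTokenizer

def estimate_total_tokens(docs, sample_size=500, seed=42):
    """Extrapolate a corpus-level token count from a random sample."""
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    random.seed(seed)
    sample = random.sample(docs, min(sample_size, len(docs)))
    sample_total = sum(len(tokenizer.encode(d["text"])) for d in sample)
    return int(sample_total / len(sample) * len(docs))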

#Data Loading Implementation

Each source has its own loader function that returns a list of dictionaries with text, source, and id fields. We handle dataset-script deprecation errors and network issues with automatic fallbacks.

from datasets import load_dataset
from tqdm import tqdm

def load_pubmed(max_samples):
    """Load PubMed abstracts, falling back to a mirror if the primary script fails."""
    records = []
    try:
        ds = load_dataset("ncbi/pubmed", split="train", streaming=True)
        for i, row in enumerate(tqdm(ds, total=max_samples)):
            if i >= max_samples:
                break
            records.append({
                "text": row["MedlineCitation"]["Article"]["Abstract"]["AbstractText"],
                "source": "pubmed",
                "id": row["MedlineCitation"]["PMID"],
            })
    except Exception:
        # Fallback: the summarization mirror exposes plain "article"/"abstract" columns
        ds = load_dataset("ccdv/pubmed-summarization", split="train", streaming=True)
        for i, row in enumerate(tqdm(ds, total=max_samples)):
            if i >= max_samples:
                break
            records.append({
                "text": row["abstract"],
                "source": "pubmed",
                "id": f"pubmed_fallback_{i}",
            })
    return records
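
The same try/except fallback pattern repeats for every source. One way to factor it out (a sketch of ours, not code from the original pipeline) is a small wrapper that takes a primary and a backup loader:

def load_with_fallback(primary_loader, fallback_loader, max_samples):
    """Try the primary loader; on any failure, switch to the backup source."""
    try:
        return primary_loader(max_samples)
    except Exception as exc:
        print(f"Primary source failed ({exc}); falling back.")
        return fallback_loader(max_samples)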

#Text Cleaning

Raw medical text contains substantial boilerplate that reduces training signal. We remove common patterns using regex and normalization.

import re
import unicodedata

BOILERPLATE_PATTERNS = [
    r"copyright\s*©?\s*\d{4}",                  # copyright notices
    r"this (?:article|work) is licensed under", # license statements
    r"funding[:\s].*?(?:\.|$)",                 # funding statements
    r"acknowledgements?\s*:?",
    r"conflict[s]?\s+of\s+interest",
    r"author\s+contributions?",
    r"(?:https?://|www\.)\S+",                  # URLs
    r"doi:\s*\S+",                              # DOIs
    r"\[\d+(?:[-,–]\d+)*\]",                    # bracketed citation markers, e.g. [1-3]
    r"&[a-zA-Z]+;",                             # HTML entities
    r"<[^>]+>",                                 # HTML tags
    r"[=\-_]{10,}",                             # ASCII divider lines
]
compiled_patterns = [re.compile(p, re.IGNORECASE) for p in BOILERPLATE_PATTERNS]

def clean_text(text):
    # Normalize Unicode (ligatures, full-width characters, etc.)
    text = unicodedata.normalize("NFKC", text)
    # Strip boilerplate patterns
    for pattern in compiled_patterns:
        text = pattern.sub(" ", text)
    # Truncate at the reference section, if present
    for line in text.split("\n"):
        if re.match(r"^\s*(References|Bibliography|Works Cited)", line):
            text = text[: text.find(line)]
            break
    # Drop lines that are mostly digits (table fragments, page numbers)
    lines = [
        l for l in text.split("\n")
        if not (len(l) > 0 and sum(c.isdigit() for c in l) / len(l) > 0.5)
    ]
    text = "\n".join(lines)
    # Collapse runs of whitespace and blank lines
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
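
A quick sanity check on a synthetic snippet (the example text is ours, not drawn from the corpus):

raw = (
    "Aspirin reduces cardiovascular risk [1-3].\n"
    "Copyright © 2021 The Authors. doi: 10.1000/example\n"
    "References\n"
    "1. Smith J, et al. Aspirin and the heart."
)
print(clean_text(raw))
# The citation marker, copyright notice, and DOI are stripped, and
# everything from the "References" line onward is truncated.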

#Quality Filtering

We apply multiple quality filters to ensure only high-signal content reaches the model.

| Filter | Threshold | Purpose |
| --- | --- | --- |
| Word Count | ≥ 100 words | Remove stubs/short abstracts |
| Language | English only | Tokenizer compatibility |
| Content Quality | < 30% boilerplate | High signal-to-noise ratio |
| Medical Relevance | Keyword score > 0.3 | Domain relevance |
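
The medical-relevance check scores documents by keyword coverage. The exact scoring function isn't shown here, so the following is one plausible formulation (the keyword list is illustrative and deliberately short):

import re

MEDICAL_KEYWORDS = {
    "patient", "treatment", "clinical", "diagnosis", "therapy",
    "disease", "symptom", "dose", "trial", "efficacy",
}

def medical_score(text):
    """Fraction of the keyword set that appears at least once in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    hits = sum(1 for kw in MEDICAL_KEYWORDS if kw in words)
    return hits / len(MEDICAL_KEYWORDS)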

The quality filter also includes language detection using the langdetect library. We only keep English documents, since our model and tokenizer are designed for English.

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def is_english(text):
    try:
        # A 500-character prefix is enough for reliable detection and much faster
        return detect(text[:500]) == "en"
    except LangDetectException:
        return True  # Keep the document if detection fails (e.g. too little text)
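
Putting the four checks together, a composite filter might look like this (a sketch: boilerplate_ratio is an assumed helper that measures how much of the text the boilerplate regexes would match):

def passes_quality_filters(doc):
    """Apply the four checks from the filter table above."""
    text = doc["text"]
    if len(text.split()) < 100:            # word-count floor
        return False
    if not is_english(text):               # English-only corpus
        return False
    if boilerplate_ratio(text) >= 0.30:    # assumed helper: share of boilerplate matches
        return False
    if medical_score(text) <= 0.3:         # domain-relevance threshold
        return False
    return True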

The filtering results tell an important story:

| Stage | Documents | Removed | Reason |
| --- | --- | --- | --- |
| Raw Load | 78,080 | — | — |
| After Cleaning | 78,080 | 0 | Text normalization only |
| After Filtering | 70,272 | 7,808 (10%) | Quality thresholds |
| After Dedup | 44,187 | 26,085 (37%) | Duplicates removed |

#Deduplication

Duplicate content severely degrades training quality. We implement both exact and near-duplicate removal.

Exact Deduplication (MD5 Hashing)

Simple but effective: normalize whitespace and hash the text.

import hashlib
import re

def exact_dedup(corpus):
    """Drop documents whose whitespace-normalized text is byte-identical."""
    seen_hashes = set()
    unique = []
    for doc in corpus:
        # Lowercase and collapse whitespace so trivial variants hash identically
        normalized = re.sub(r"\s+", " ", doc["text"].lower().strip())
        doc_hash = hashlib.md5(normalized.encode()).hexdigest()
        if doc_hash not in seen_hashes:
            seen_hashes.add(doc_hash)
            unique.append(doc)
    return unique

Near-Duplicate Removal (MinHash LSH)

For documents that are similar but not identical, we use MinHash LSH with character-level n-grams.

from datasketch import MinHash, MinHashLSH
from tqdm import tqdm

def get_minhash(text, num_perm, ngram_size):
    """Build a MinHash signature from character-level n-grams."""
    mh = MinHash(num_perm=num_perm)
    for i in range(len(text) - ngram_size + 1):
        ngram = text[i:i + ngram_size]
        mh.update(ngram.encode("utf8"))
    return mh


def near_dedup(corpus, config):
    """Keep only documents whose signature has no close match among those kept so far."""
    lsh = MinHashLSH(
        threshold=config["MINHASH_THRESHOLD"],
        num_perm=config["MINHASH_NUM_PERM"]
    )
    unique = []
    for doc in tqdm(corpus):
        mh = get_minhash(doc["text"], config["MINHASH_NUM_PERM"], config["NGRAM_SIZE"])
        if not lsh.query(mh):          # no near-duplicate indexed yet
            lsh.insert(doc["id"], mh)
            unique.append(doc)
    return unique
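
The LSH parameters come from CONFIG. Typical values (illustrative; the settings used in the original run are not shown here) might be:

CONFIG.update({
    "MINHASH_THRESHOLD": 0.8,  # Jaccard similarity above which two docs count as near-duplicates
    "MINHASH_NUM_PERM": 128,   # hash permutations: more = better accuracy, slower hashing
    "NGRAM_SIZE": 5,           # character n-gram width used for shingling
})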

#Tokenization & Packing

We use GPT-2's tokenizer and pack documents efficiently with EOS separators.

from transformers import AutoTokenizer
from tqdm import tqdm

TOKENIZER_NAME = "gpt2"
MAX_CHUNK_TOKENS = 1024
DOC_SEPARATOR = "<|endoftext|>"

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)
# GPT-2 already uses <|endoftext|> as its EOS token; this is a no-op safeguard
tokenizer.add_special_tokens({"eos_token": DOC_SEPARATOR})
sep_ids = tokenizer.encode(DOC_SEPARATOR)

def pack_documents(corpus, tokenizer, max_tokens, sep_ids):
    chunks = []
    current_chunk = []

    for doc in tqdm(corpus):
        # Truncate overlong documents at the chunk size, then append the EOS separator
        token_ids = tokenizer.encode(
            doc["text"],
            add_special_tokens=False,
            truncation=True,
            max_length=max_tokens
        ) + sep_ids

        if len(current_chunk) + len(token_ids) > max_tokens:
            # Flush the current chunk and start a new one with this document
            chunks.append(current_chunk[:max_tokens])
            current_chunk = token_ids
        else:
            current_chunk.extend(token_ids)

    if current_chunk:
        chunks.append(current_chunk[:max_tokens])

    return chunks

Our packing achieved 98.3% average chunk fill rate, meaning less than 2% of tokens are wasted as padding.
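
The fill rate is easy to verify from the packed output (a small check, assuming chunks is the return value of pack_documents):

def fill_rate(chunks, max_tokens):
    """Average fraction of each chunk occupied by real tokens rather than padding."""
    return sum(len(c) for c in chunks) / (len(chunks) * max_tokens)

print(f"Fill rate: {fill_rate(chunks, MAX_CHUNK_TOKENS):.1%}")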

Final corpus: 44,187 documents, ~44.7M tokens, 98.3% fill rate.

#Export

We export both the raw documents and packed chunks to HuggingFace Hub.

from datasets import Dataset

# `df` holds the final deduplicated documents (text, source, id)
dataset_dict = Dataset.from_pandas(df).train_test_split(
    test_size=0.05, seed=CONFIG["SEED"]
)

dataset_dict.push_to_hub("Saminx22/medical_data_for_slm", config_name="documents")

packed_chunks_dataset = Dataset.from_dict({
    "input_ids": all_chunks,
    "token_count": [len(chunk) for chunk in all_chunks],
    "chunk_id": [f"chunk_{i}" for i in range(len(all_chunks))]
})
packed_chunks_dataset.push_to_hub("Saminx22/medical_data_for_slm", config_name="chunks", split="train")
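
Anyone can then pull either configuration straight from the Hub:

from datasets import load_dataset

docs = load_dataset("Saminx22/medical_data_for_slm", "documents")
chunks = load_dataset("Saminx22/medical_data_for_slm", "chunks", split="train")
print(chunks[0]["token_count"])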

#Chinchilla Scaling Check

The Chinchilla scaling laws suggest that a compute-optimal language model should be trained on approximately 20 tokens per parameter.

300M params × 20 tokens/param = 6 billion tokens (optimal)

Our dataset: ~44.7M tokens

At 44.7M tokens, our dataset is substantially below the Chinchilla-optimal threshold. This is intentional for a prototype—we are demonstrating the pipeline, not training a production model.
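
The gap is easy to quantify:

params = 300e6
optimal_tokens = 20 * params     # Chinchilla heuristic: ~20 tokens per parameter
actual_tokens = 44.7e6
print(f"{actual_tokens / optimal_tokens:.1%} of the compute-optimal budget")  # ~0.7%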

#Key Takeaways

  1. Quality over quantity. For small models, every token matters. Aggressive filtering removed 10% of documents; deduplication removed another 37%.
  2. Stream everything. Medical datasets can be massive. Streaming from HuggingFace with sample caps prevents OOM crashes.
  3. Pack efficiently. Greedy document packing with <|endoftext|> separators achieved 98.3% utilization, wasting almost no tokens.
  4. Build fallbacks. Data sources break, APIs change, scripts get deprecated. Automatic fallback loaders keep the pipeline running.
  5. Visualize your corpus. Statistics and plots catch problems that code cannot: distribution skew, outlier documents, unexpected source imbalances.

In the companion blog post, we use this dataset to pretrain MedSLM — a 330M-parameter transformer with RMSNorm, Rotary Positional Embeddings, SwiGLU activations, and Grouped-Query Attention.

#Resources

Explore other posts in this series.