Curating a Medical SFT Dataset: From Raw QA Pairs to Instruction-Ready Data
In this post, we build a high-quality Supervised Fine-Tuning (SFT) dataset for medical question answering. We combine three curated sources, apply multi-stage quality filtering and near-duplicate removal, and produce a clean instruction-following dataset.
In the previous posts, we built a medical pretraining corpus and trained MedSLM from scratch. Now we prepare the instruction-following data that will teach MedSLM to answer medical questions like a conversational assistant.
#Why SFT Data Matters
After pre-training on ~148M tokens of raw medical text (PubMed, PMC, clinical guidelines), MedSLM can generate fluent medical text — but it behaves like an autocomplete engine, not a conversational assistant. Supervised Fine-Tuning (SFT) bridges this gap by training the model on curated (instruction, response) pairs, teaching it to answer medical questions accurately and concisely, follow a consistent question-answering format, and provide helpful, structured medical information.
#Pipeline Overview
- Dataset Selection & Loading — Load high-quality medical QA datasets from HuggingFace
- Data Exploration & Quality Assessment — Understand data distributions, quality, and coverage
- Instruction Format Conversion — Convert raw QA pairs into a structured chat template
- Quality Filtering & Cleaning — Remove low-quality, too-short, or malformed examples
- Near-Duplicate Removal — Remove semantically similar duplicates via MinHash LSH
- Train / Validation / Test Split — Stratified splitting for robust evaluation
- Dataset Quality Evaluation — Automated quality checks and sample review
- Upload to HuggingFace Hub — Push the final dataset for downstream use
#Dataset Selection
For high-quality medical SFT data, we combine three curated sources that cover different aspects of medical knowledge. By combining these sources, we get a diverse dataset that covers detailed medical explanations, textbook-style knowledge, and concise factual recall.
| Source | Repository | Examples | Format | Strength |
|---|---|---|---|---|
| MedQuAD | keivalya/MedQuad-MedicalQnADataset | 16,407 | QA pairs | Gold-standard NIH medical QA with detailed answers |
| WikiDoc | medalpaca/medical_meadow_wikidoc | 10,000 | Alpaca-style | Broad coverage with textbook-quality explanations |
| Flashcards | medalpaca/medical_meadow_medical_flashcards | 33,955 | Alpaca-style | Concise fact-based QA for factual recall |
Together, the three sources contribute 60,362 raw QA examples.
#Global Configuration
All pipeline parameters are defined in a single configuration block. These control quality filtering thresholds, deduplication sensitivity, and output settings. Key design decisions include a minimum answer length of 50 characters (ensuring answers are substantive, not just "Yes" or "No"), a maximum answer length of 4,096 characters (preventing overflow beyond the model's context window), and a MinHash threshold of 0.80 (slightly more aggressive than pre-training since instruction data tends to have more near-duplicates).
```python
CONFIG = {
    "MIN_QUESTION_LENGTH": 10,
    "MAX_QUESTION_LENGTH": 512,
    "MIN_ANSWER_LENGTH": 50,
    "MAX_ANSWER_LENGTH": 4096,
    "MIN_ANSWER_WORDS": 10,
    "MAX_SPECIAL_CHAR_RATIO": 0.25,
    "MINHASH_NUM_PERM": 128,
    "MINHASH_THRESHOLD": 0.80,
    "NGRAM_SIZE": 5,
    "TRAIN_RATIO": 0.90,
    "VAL_RATIO": 0.05,
    "TEST_RATIO": 0.05,
    "TOKENIZER_NAME": "gpt2",
    "SEED": 42,
}
```
#Instruction Format Conversion
Each dataset has a different schema, so we normalize them into a unified instruction-following format. We use a structured Alpaca-style chat template with clear role markers — System, User, and Assistant — that the model can learn to follow during fine-tuning. This format provides clear role separation, a consistent structure across all examples, extensibility for multi-turn conversations, and compatibility with inference (we provide System + User and the model generates the Assistant response).
```python
SYSTEM_PROMPT = "You are a medical AI assistant. Provide accurate, evidence-based answers to medical questions."

def format_instruction(question: str, answer: str) -> str:
    return (
        f"### System:\n{SYSTEM_PROMPT}\n\n"
        f"### User:\n{question.strip()}\n\n"
        f"### Assistant:\n{answer.strip()}"
    )
```
We also apply Unicode normalization (NFKD) and whitespace collapsing to each text field before formatting. After conversion, each source produces clean instruction-formatted records: MedQuAD yields 16,407 examples, WikiDoc yields 9,998, and Flashcards yields 33,547, for a total of 59,952 combined records.
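The normalization step can be sketched as follows; `normalize_text` is an illustrative helper (not the pipeline's exact implementation), built on Python's standard `unicodedata` module:

```python
import re
import unicodedata

def normalize_text(text: str) -> str:
    # NFKD decomposes compatibility characters, e.g. the "fi" ligature -> "fi"
    text = unicodedata.normalize("NFKD", text)
    # Collapse any run of whitespace (spaces, tabs, newlines) to a single space
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

Note that this sketch collapses all whitespace uniformly; a production pipeline might instead preserve intentional paragraph breaks inside long answers.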
#Quality Filtering
Not all examples are suitable for SFT. We apply a series of independent quality filters and track how many examples each filter removes. Every training example must be well-formed (both question and answer present), substantive (answers long enough to be useful), within length bounds (fitting the model's context window), clean (low ratio of special characters), English-only, and informative (actual questions rather than headers or metadata).
```python
from langdetect import detect, LangDetectException

def compute_special_char_ratio(text: str) -> float:
    if not text:
        return 0.0
    special = sum(1 for c in text if not c.isalnum() and not c.isspace())
    return special / len(text)

def is_english(text: str) -> bool:
    # Skip detection on very short strings, where langdetect is unreliable
    if len(text) < 50:
        return True
    try:
        return detect(text[:500]) == 'en'
    except LangDetectException:
        return True
```
| Filter Reason | Examples Removed |
|---|---|
| Few words in answer | 2,164 |
| Short answer (< 50 chars) | 905 |
| Long answer (> 4,096 chars) | 866 |
| Non-English | 56 |
| Total Removed | 3,991 (6.7%) |
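As a sketch of how these thresholds combine, a single predicate over the CONFIG values might look like this (`passes_filters` is an illustrative name, and the language check is omitted here to keep the example self-contained):

```python
CONFIG = {
    "MIN_QUESTION_LENGTH": 10, "MAX_QUESTION_LENGTH": 512,
    "MIN_ANSWER_LENGTH": 50, "MAX_ANSWER_LENGTH": 4096,
    "MIN_ANSWER_WORDS": 10, "MAX_SPECIAL_CHAR_RATIO": 0.25,
}

def special_char_ratio(text: str) -> float:
    if not text:
        return 0.0
    return sum(1 for c in text if not c.isalnum() and not c.isspace()) / len(text)

def passes_filters(question: str, answer: str) -> bool:
    if not question or not answer:
        return False  # malformed: a field is missing
    if not (CONFIG["MIN_QUESTION_LENGTH"] <= len(question) <= CONFIG["MAX_QUESTION_LENGTH"]):
        return False  # question outside length bounds
    if not (CONFIG["MIN_ANSWER_LENGTH"] <= len(answer) <= CONFIG["MAX_ANSWER_LENGTH"]):
        return False  # answer too short or too long
    if len(answer.split()) < CONFIG["MIN_ANSWER_WORDS"]:
        return False  # too few words to be substantive
    if special_char_ratio(answer) > CONFIG["MAX_SPECIAL_CHAR_RATIO"]:
        return False  # likely markup debris or malformed text
    return True
```

Because each check is independent, running them in sequence and recording which one fires first is what produces the per-filter removal counts in the table above.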
After filtering, 55,961 of the 59,952 converted examples remain (a 6.7% removal rate).
#Near-Duplicate Removal
Medical QA datasets often contain near-duplicate questions phrased slightly differently but asking the same thing. Training on duplicates wastes compute and can cause the model to memorize specific phrasings rather than learning generalizable medical knowledge. We use MinHash Locality-Sensitive Hashing (LSH) — the same technique used in the pre-training data pipeline — to efficiently find and remove near-duplicates. The process involves shingling (converting each question into character 5-grams), computing MinHash signatures (compact hash representations), LSH bucketing (grouping similar items), and deduplication (keeping one representative per group).
```python
from datasketch import MinHash, MinHashLSH

def get_shingles(text: str, n: int = 5) -> set:
    text = text.lower().strip()
    if len(text) < n:
        return {text}
    return {text[i:i+n] for i in range(len(text) - n + 1)}

def create_minhash(shingles: set, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for s in shingles:
        m.update(s.encode('utf-8'))
    return m

lsh = MinHashLSH(threshold=0.80, num_perm=128)
keep_indices = []
for idx, record in enumerate(filtered_records):
    shingles = get_shingles(record["question"], 5)
    mh = create_minhash(shingles, 128)
    if not lsh.query(mh):  # no similar question indexed yet: keep this one
        lsh.insert(f"doc_{idx}", mh)
        keep_indices.append(idx)
```
Deduplication removes 4,665 near-duplicate questions (8.3% of the filtered set), leaving 51,296 examples.
We deduplicate based on questions (not answers), since the same question with different answers would be a data quality issue. After deduplication, the source distribution is: Flashcards 30,011, MedQuAD 12,580, WikiDoc 8,705.
#Train / Validation / Test Split
The dataset is split into three partitions using stratified splitting by source, ensuring each split maintains the same proportional representation of MedQuAD, WikiDoc, and Flashcards. We shuffle with a fixed random seed (42) for reproducibility.
| Split | Ratio | Examples | Purpose |
|---|---|---|---|
| Train | 90% | 46,166 | Model fine-tuning |
| Validation | 5% | 2,565 | Hyperparameter tuning, early stopping |
| Test | 5% | 2,565 | Final evaluation (never seen during training) |
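A stratified split along these lines can be sketched with the standard library; `stratified_split` is an illustrative helper that groups records by their `source` field, shuffles each group with the fixed seed, then carves off the three partitions per group:

```python
import random
from collections import defaultdict

def stratified_split(records, train_ratio=0.90, val_ratio=0.05, seed=42):
    # Group by source so each split keeps the same source proportions
    by_source = defaultdict(list)
    for rec in records:
        by_source[rec["source"]].append(rec)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for group in by_source.values():
        rng.shuffle(group)
        n_train = int(len(group) * train_ratio)
        n_val = int(len(group) * val_ratio)
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # remainder goes to test
    return train, val, test
```

Splitting within each source group, rather than over the pooled dataset, is what guarantees the proportions hold in every partition rather than only in expectation.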
#Dataset Quality Evaluation
Before uploading, we run five automated quality checks to validate that the dataset meets our standards.
| Check | Result | Details |
|---|---|---|
| Format Consistency | PASS | 51,296 / 51,296 correctly formatted |
| Data Leakage | PASS | 0 val/test questions found in train |
| Answer Quality | PASS | 93.0% end with proper punctuation, avg 5.9 sentences |
| Topic Diversity | PASS | 19/19 medical keywords covered |
| Random Sample Review | PASS | Manual inspection of random examples |
The topic diversity check confirms broad medical coverage: the dataset mentions symptoms (22,928 times), patient (21,920), disease (19,227), blood (19,153), treatment (16,779), infection (10,541), cancer (9,842), heart (9,002), pain (7,588), and therapy (7,120) among other medical keywords.
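The leakage check, for instance, reduces to a set intersection over normalized questions; `check_leakage` below is an illustrative sketch that returns any val/test records whose question also appears in train:

```python
def check_leakage(train_records, eval_records):
    # Normalize to lowercase, stripped form so trivial variants still match
    train_questions = {r["question"].strip().lower() for r in train_records}
    return [r for r in eval_records
            if r["question"].strip().lower() in train_questions]
```

An empty return list corresponds to the "0 val/test questions found in train" result in the table above; stricter variants could normalize punctuation or compare MinHash signatures instead of exact strings.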
#Final Dataset Schema
The uploaded HuggingFace dataset contains four fields per example. The text field is what the model trains on directly — it contains the complete formatted instruction-following template.
| Field | Type | Description |
|---|---|---|
| text | string | Full formatted instruction (System + User + Assistant) |
| question | string | The raw medical question |
| answer | string | The raw medical answer |
| source | string | Origin dataset (medquad, wikidoc, flashcards) |
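Because the `text` field embeds the template verbatim, the raw question and answer can be recovered by splitting on the role markers. The sketch below pairs the post's `format_instruction` with an illustrative inverse, `parse_formatted` (a hypothetical helper, not part of the uploaded dataset's tooling):

```python
SYSTEM_PROMPT = "You are a medical AI assistant. Provide accurate, evidence-based answers to medical questions."

def format_instruction(question: str, answer: str) -> str:
    return (
        f"### System:\n{SYSTEM_PROMPT}\n\n"
        f"### User:\n{question.strip()}\n\n"
        f"### Assistant:\n{answer.strip()}"
    )

def parse_formatted(text: str):
    # Invert the template: drop the system block, then split user/assistant
    _, rest = text.split("### User:\n", 1)
    question, answer = rest.split("\n\n### Assistant:\n", 1)
    return question.strip(), answer.strip()
```

Round-tripping like this is also a cheap sanity check that every `text` entry actually contains all three role markers in order.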
#Key Takeaways
- Source diversity matters. Combining MedQuAD (detailed explanations), WikiDoc (textbook knowledge), and Flashcards (concise recall) produces a well-rounded SFT dataset.
- Multi-stage filtering is essential. Our six-filter pipeline removed 6.7% of low-quality examples — short answers, non-English text, and malformed questions.
- Near-duplicate removal saves compute. MinHash LSH identified 8.3% near-duplicate questions that would have wasted training time and encouraged memorization.
- Stratified splits preserve distribution. Splitting by source ensures balanced representation in train, validation, and test sets.
- Automated quality checks build confidence. Five checks — format consistency, leakage detection, answer quality, topic diversity, and sample review — validate the dataset before fine-tuning.
#Resources
Explore other posts in this series.

Building a High-Quality Medical Pretraining Dataset for Small Language Models
Large language models like GPT-4 or Gemini are trained on trillions of tokens scraped from the open web. But when your goal is a Small Language Model (SLM) with only ~300 million parameters, targeted at the medical domain, quality matters far more than quantity.

Building MedSLM: A 330M Parameter Medical Language Model
In this post, we build MedSLM - a 330M parameter transformer trained from scratch on our curated medical dataset. We implement modern architecture choices like RMSNorm, Rotary Positional Embeddings, SwiGLU activations, and Grouped-Query Attention.

Training MedSLM-SFT: Supervised Fine-Tuning for Medical Instruction Following
With our pretraining corpus complete and MedSLM trained from scratch, we now focus on instruction fine-tuning. This stage teaches the model to act as a helpful medical assistant by training it on curated (instruction, response) pairs.