Samin Chandeepa · 15 min read

Curating a Medical SFT Dataset: From Raw QA Pairs to Instruction-Ready Data

In this post, we build a high-quality Supervised Fine-Tuning (SFT) dataset for medical question answering. We combine three curated sources, apply multi-stage quality filtering and near-duplicate removal, and produce a clean instruction-following dataset.

In the previous posts, we built a medical pretraining corpus and trained MedSLM from scratch. Now we prepare the instruction-following data that will teach MedSLM to answer medical questions like a conversational assistant.

#Why SFT Data Matters

After pre-training on ~148M tokens of raw medical text (PubMed, PMC, clinical guidelines), MedSLM can generate fluent medical text — but it behaves like an autocomplete engine, not a conversational assistant. Supervised Fine-Tuning (SFT) bridges this gap by training the model on curated (instruction, response) pairs, teaching it to answer medical questions accurately and concisely, follow a consistent question-answering format, and provide helpful, structured medical information.

#Pipeline Overview

  1. Dataset Selection & Loading — Load high-quality medical QA datasets from HuggingFace
  2. Data Exploration & Quality Assessment — Understand data distributions, quality, and coverage
  3. Instruction Format Conversion — Convert raw QA pairs into a structured chat template
  4. Quality Filtering & Cleaning — Remove low-quality, too-short, or malformed examples
  5. Near-Duplicate Removal — Remove semantically similar duplicates via MinHash LSH
  6. Train / Validation / Test Split — Stratified splitting for robust evaluation
  7. Dataset Quality Evaluation — Automated quality checks and sample review
  8. Upload to HuggingFace Hub — Push the final dataset for downstream use

#Dataset Selection

For high-quality medical SFT data, we combine three curated sources that cover different aspects of medical knowledge. By combining these sources, we get a diverse dataset that covers detailed medical explanations, textbook-style knowledge, and concise factual recall.

| Source | Repository | Examples | Format | Strength |
| --- | --- | --- | --- | --- |
| MedQuAD | keivalya/MedQuad-MedicalQnADataset | 16,407 | QA pairs | Gold-standard NIH medical QA with detailed answers |
| WikiDoc | medalpaca/medical_meadow_wikidoc | 10,000 | Alpaca-style | Broad coverage with textbook-quality explanations |
| Flashcards | medalpaca/medical_meadow_medical_flashcards | 33,955 | Alpaca-style | Concise fact-based QA for factual recall |
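
As a quick sketch, loading the three sources with the datasets library looks like the following (assuming each repo exposes a single train split; column names differ per source and are normalized in the conversion step below):

from datasets import load_dataset

# Pull the three source datasets from the HuggingFace Hub.
medquad    = load_dataset("keivalya/MedQuad-MedicalQnADataset", split="train")
wikidoc    = load_dataset("medalpaca/medical_meadow_wikidoc", split="train")
flashcards = load_dataset("medalpaca/medical_meadow_medical_flashcards", split="train")

print(len(medquad), len(wikidoc), len(flashcards))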

Totals: 59,952 examples · 3 sources · Alpaca-style chat format · English

#Global Configuration

All pipeline parameters are defined in a single configuration block. These control quality filtering thresholds, deduplication sensitivity, and output settings. Key design decisions include a minimum answer length of 50 characters (ensuring answers are substantive, not just "Yes" or "No"), a maximum answer length of 4,096 characters (preventing overflow beyond the model's context window), and a MinHash threshold of 0.80 (slightly more aggressive than pre-training since instruction data tends to have more near-duplicates).

CONFIG = {
    "MIN_QUESTION_LENGTH":    10,
    "MAX_QUESTION_LENGTH":    512,
    "MIN_ANSWER_LENGTH":      50,
    "MAX_ANSWER_LENGTH":      4096,
    "MIN_ANSWER_WORDS":       10,
    "MAX_SPECIAL_CHAR_RATIO": 0.25,
    "MINHASH_NUM_PERM":       128,
    "MINHASH_THRESHOLD":      0.80,
    "NGRAM_SIZE":             5,
    "TRAIN_RATIO":            0.90,
    "VAL_RATIO":              0.05,
    "TEST_RATIO":             0.05,
    "TOKENIZER_NAME":         "gpt2",
    "SEED":                   42,
}

#Instruction Format Conversion

Each dataset has a different schema, so we normalize them into a unified instruction-following format. We use a structured Alpaca-style chat template with clear role markers — System, User, and Assistant — that the model can learn to follow during fine-tuning. This format provides clear role separation, a consistent structure across all examples, extensibility for multi-turn conversations, and compatibility with inference (we provide System + User and the model generates the Assistant response).

SYSTEM_PROMPT = "You are a medical AI assistant. Provide accurate, evidence-based answers to medical questions."

def format_instruction(question: str, answer: str) -> str:
    return (
        f"### System:\n{SYSTEM_PROMPT}\n\n"
        f"### User:\n{question.strip()}\n\n"
        f"### Assistant:\n{answer.strip()}"
    )

We also apply Unicode normalization (NFKD) and whitespace collapsing to each text field before formatting. After conversion, each source produces clean instruction-formatted records: MedQuAD yields 16,407 examples, WikiDoc yields 9,998, and Flashcards yields 33,547 — for a total of 59,952 combined records.
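
The normalization helper itself isn't shown above; a minimal version, assuming only the standard library, could be:

import re
import unicodedata

def normalize_text(text: str) -> str:
    # NFKD Unicode normalization (decomposes ligatures and compatibility forms).
    text = unicodedata.normalize("NFKD", text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()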

#Quality Filtering

Not all examples are suitable for SFT. We apply a series of independent quality filters and track how many examples each filter removes. Every training example must be well-formed (both question and answer present), substantive (answers long enough to be useful), within length bounds (fitting the model's context window), clean (low ratio of special characters), English-only, and informative (actual questions rather than headers or metadata).

from langdetect import detect, LangDetectException

def compute_special_char_ratio(text: str) -> float:
    # Fraction of characters that are neither alphanumeric nor whitespace.
    if not text:
        return 0.0
    special = sum(1 for c in text if not c.isalnum() and not c.isspace())
    return special / len(text)

def is_english(text: str) -> bool:
    # Texts under 50 chars are too short for reliable detection; keep them.
    if len(text) < 50:
        return True
    try:
        # Detect on the first 500 chars for speed.
        return detect(text[:500]) == 'en'
    except LangDetectException:
        # If detection fails, err on the side of keeping the example.
        return True

| Filter Reason | Examples Removed |
| --- | --- |
| Few words in answer | 2,164 |
| Short answer (< 50 chars) | 905 |
| Long answer (> 4,096 chars) | 866 |
| Non-English | 56 |
| Total Removed | 3,991 (6.7%) |
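
Putting the individual checks together, a single filtering pass could look like the sketch below. Here passes_quality_filters is an illustrative helper (not the original pipeline code) that wires the helpers above to the thresholds in CONFIG, and records holds the combined instruction-formatted examples:

def passes_quality_filters(record: dict) -> bool:
    q, a = record["question"], record["answer"]
    # Well-formed: both fields present.
    if not q or not a:
        return False
    # Length bounds on question and answer (characters).
    if not (CONFIG["MIN_QUESTION_LENGTH"] <= len(q) <= CONFIG["MAX_QUESTION_LENGTH"]):
        return False
    if not (CONFIG["MIN_ANSWER_LENGTH"] <= len(a) <= CONFIG["MAX_ANSWER_LENGTH"]):
        return False
    # Substantive: enough words in the answer.
    if len(a.split()) < CONFIG["MIN_ANSWER_WORDS"]:
        return False
    # Clean: low special-character ratio.
    if compute_special_char_ratio(a) > CONFIG["MAX_SPECIAL_CHAR_RATIO"]:
        return False
    # English-only.
    return is_english(q) and is_english(a)

filtered_records = [r for r in records if passes_quality_filters(r)]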

Before filtering: 59,952 · After filtering: 55,961 · Removal rate: 6.7%

#Near-Duplicate Removal

Medical QA datasets often contain near-duplicate questions phrased slightly differently but asking the same thing. Training on duplicates wastes compute and can cause the model to memorize specific phrasings rather than learning generalizable medical knowledge. We use MinHash Locality-Sensitive Hashing (LSH) — the same technique used in the pre-training data pipeline — to efficiently find and remove near-duplicates. The process involves shingling (converting each question into character 5-grams), computing MinHash signatures (compact hash representations), LSH bucketing (grouping similar items), and deduplication (keeping one representative per group).

from datasketch import MinHash, MinHashLSH

def get_shingles(text: str, n: int = 5) -> set:
    # Character n-grams of the lowercased, stripped text.
    text = text.lower().strip()
    if len(text) < n:
        return {text}
    return {text[i:i+n] for i in range(len(text) - n + 1)}

def create_minhash(shingles: set, num_perm: int = 128) -> MinHash:
    # Compact signature whose collisions approximate Jaccard similarity.
    m = MinHash(num_perm=num_perm)
    for s in shingles:
        m.update(s.encode('utf-8'))
    return m

lsh = MinHashLSH(threshold=0.80, num_perm=128)
keep_indices = []

for idx, record in enumerate(filtered_records):
    shingles = get_shingles(record["question"], 5)
    mh = create_minhash(shingles, 128)
    # Keep this record only if no similar question is already indexed.
    if not lsh.query(mh):
        lsh.insert(f"doc_{idx}", mh)
        keep_indices.append(idx)

Before dedup: 55,961 · After dedup: 51,296 · Duplicates removed: 4,665 (8.3%)

We deduplicate on questions rather than answers, since the same question appearing with different answers would itself be a data quality issue rather than a legitimate duplicate. After deduplication, the source distribution is: Flashcards 30,011, MedQuAD 12,580, WikiDoc 8,705.

#Train / Validation / Test Split

The dataset is split into three partitions using stratified splitting by source, ensuring each split maintains the same proportional representation of MedQuAD, WikiDoc, and Flashcards. We shuffle with a fixed random seed (42) for reproducibility; a sketch of the split logic follows the table.

| Split | Ratio | Examples | Purpose |
| --- | --- | --- | --- |
| Train | 90% | 46,166 | Model fine-tuning |
| Validation | 5% | 2,565 | Hyperparameter tuning, early stopping |
| Test | 5% | 2,565 | Final evaluation (never seen during training) |
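
A minimal sketch of a two-stage stratified split on the source field, assuming scikit-learn (deduped_records and the other variable names are illustrative):

from sklearn.model_selection import train_test_split

sources = [r["source"] for r in deduped_records]

# First carve off 10% as a holdout, stratified by source.
train, holdout = train_test_split(
    deduped_records, test_size=0.10, stratify=sources, random_state=42
)

# Then split the holdout evenly into validation and test, again stratified.
holdout_sources = [r["source"] for r in holdout]
val, test = train_test_split(
    holdout, test_size=0.50, stratify=holdout_sources, random_state=42
)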

#Dataset Quality Evaluation

Before uploading, we run five automated quality checks to validate the dataset meets our standards.

| Check | Result | Details |
| --- | --- | --- |
| Format Consistency | PASS | 51,296 / 51,296 correctly formatted |
| Data Leakage | PASS | 0 val/test questions found in train |
| Answer Quality | PASS | 93.0% end with proper punctuation, avg 5.9 sentences |
| Topic Diversity | PASS | 19/19 medical keywords covered |
| Random Sample Review | PASS | Manual inspection of random examples |
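
The data leakage check, for instance, reduces to a set membership test over normalized questions. A minimal sketch, assuming train, val, and test hold the split records:

def count_leaked_questions(train, val, test) -> int:
    # Count val/test questions that also appear verbatim in train.
    train_questions = {r["question"].strip().lower() for r in train}
    return sum(
        1 for r in val + test
        if r["question"].strip().lower() in train_questions
    )

assert count_leaked_questions(train, val, test) == 0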

The topic diversity check confirms broad medical coverage: the dataset mentions symptoms (22,928 times), patient (21,920), disease (19,227), blood (19,153), treatment (16,779), infection (10,541), cancer (9,842), heart (9,002), pain (7,588), and therapy (7,120) among other medical keywords.

#Final Dataset Schema

The uploaded HuggingFace dataset contains four fields per example. The text field is what the model trains on directly — it contains the complete formatted instruction-following template.

| Field | Type | Description |
| --- | --- | --- |
| text | string | Full formatted instruction (System + User + Assistant) |
| question | string | The raw medical question |
| answer | string | The raw medical answer |
| source | string | Origin dataset (medquad, wikidoc, flashcards) |
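
Assembling the three splits into a DatasetDict and pushing them to the Hub takes only a few lines with the datasets library (the repo id below is a placeholder):

from datasets import Dataset, DatasetDict

dataset = DatasetDict({
    "train":      Dataset.from_list(train),
    "validation": Dataset.from_list(val),
    "test":       Dataset.from_list(test),
})

# Requires authentication beforehand (huggingface-cli login or an HF token).
dataset.push_to_hub("your-username/medical-sft-dataset")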

#Key Takeaways

  1. Source diversity matters. Combining MedQuAD (detailed explanations), WikiDoc (textbook knowledge), and Flashcards (concise recall) produces a well-rounded SFT dataset.
  2. Multi-stage filtering is essential. Our six-filter pipeline removed 6.7% of low-quality examples — short answers, non-English text, and malformed questions.
  3. Near-duplicate removal saves compute. MinHash LSH identified 8.3% near-duplicate questions that would have wasted training time and encouraged memorization.
  4. Stratified splits preserve distribution. Splitting by source ensures balanced representation in train, validation, and test sets.
  5. Automated quality checks build confidence. Five checks — format consistency, leakage detection, answer quality, topic diversity, and sample review — validate the dataset before fine-tuning.
