Curating a Medical SFT Dataset: From Raw QA Pairs to Instruction-Ready Data
In this post, we build a high-quality Supervised Fine-Tuning (SFT) dataset for medical question answering. We combine three curated sources, apply multi-stage quality filtering and near-duplicate removal, and produce a clean instruction-following dataset.
In the previous posts, we built a medical pretraining corpus and trained MedSLM from scratch. Now we prepare the instruction-following data that will teach MedSLM to answer medical questions like a conversational assistant.
#Why SFT Data Matters
After pre-training on ~148M tokens of raw medical text (PubMed, PMC, clinical guidelines), MedSLM can generate fluent medical text — but it behaves like an autocomplete engine, not a conversational assistant. Supervised Fine-Tuning (SFT) bridges this gap by training the model on curated (instruction, response) pairs, teaching it to answer medical questions accurately and concisely, follow a consistent question-answering format, and provide helpful, structured medical information.
#Pipeline Overview
- Dataset Selection & Loading — Load high-quality medical QA datasets from HuggingFace
- Data Exploration & Quality Assessment — Understand data distributions, quality, and coverage
- Instruction Format Conversion — Convert raw QA pairs into a structured chat template
- Quality Filtering & Cleaning — Remove low-quality, too-short, or malformed examples
- Near-Duplicate Removal — Remove semantically similar duplicates via MinHash LSH
- Train / Validation / Test Split — Stratified splitting for robust evaluation
- Dataset Quality Evaluation — Automated quality checks and sample review
- Upload to HuggingFace Hub — Push the final dataset for downstream use
#Dataset Selection
For high-quality medical SFT data, we combine three curated sources that cover different aspects of medical knowledge. By combining these sources, we get a diverse dataset that covers detailed medical explanations, textbook-style knowledge, and concise factual recall.
| Source | Repository | Examples | Format | Strength |
|---|---|---|---|---|
| MedQuAD | keivalya/MedQuad-MedicalQnADataset | 16,407 | QA pairs | Gold-standard NIH medical QA with detailed answers |
| WikiDoc | medalpaca/medical_meadow_wikidoc | 10,000 | Alpaca-style | Broad coverage with textbook-quality explanations |
| Flashcards | medalpaca/medical_meadow_medical_flashcards | 33,955 | Alpaca-style | Concise fact-based QA for factual recall |
Together, the three sources contribute 60,362 raw QA examples.
#Global Configuration
All pipeline parameters are defined in a single configuration block. These control quality filtering thresholds, deduplication sensitivity, and output settings. Key design decisions include a minimum answer length of 50 characters (ensuring answers are substantive, not just "Yes" or "No"), a maximum answer length of 4,096 characters (preventing overflow beyond the model's context window), and a MinHash threshold of 0.80 (slightly more aggressive than pre-training since instruction data tends to have more near-duplicates).
```python
CONFIG = {
    "MIN_QUESTION_LENGTH": 10,
    "MAX_QUESTION_LENGTH": 512,
    "MIN_ANSWER_LENGTH": 50,
    "MAX_ANSWER_LENGTH": 4096,
    "MIN_ANSWER_WORDS": 10,
    "MAX_SPECIAL_CHAR_RATIO": 0.25,
    "MINHASH_NUM_PERM": 128,
    "MINHASH_THRESHOLD": 0.80,
    "NGRAM_SIZE": 5,
    "TRAIN_RATIO": 0.90,
    "VAL_RATIO": 0.05,
    "TEST_RATIO": 0.05,
    "TOKENIZER_NAME": "gpt2",
    "SEED": 42,
}
```
#Instruction Format Conversion
Each dataset has a different schema, so we normalize them into a unified instruction-following format. We use a structured Alpaca-style chat template with clear role markers — System, User, and Assistant — that the model can learn to follow during fine-tuning. This format provides clear role separation, a consistent structure across all examples, extensibility for multi-turn conversations, and compatibility with inference (we provide System + User and the model generates the Assistant response).
```python
SYSTEM_PROMPT = "You are a medical AI assistant. Provide accurate, evidence-based answers to medical questions."

def format_instruction(question: str, answer: str) -> str:
    return (
        f"### System:\n{SYSTEM_PROMPT}\n\n"
        f"### User:\n{question.strip()}\n\n"
        f"### Assistant:\n{answer.strip()}"
    )
```
We also apply Unicode normalization (NFKD) and whitespace collapsing to each text field before formatting. After conversion, each source produces clean instruction-formatted records: MedQuAD yields 16,407 examples, WikiDoc yields 9,998, and Flashcards yields 33,547, for a total of 59,952 combined records.
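The normalization step can be sketched as follows; `normalize_text` is an illustrative helper (not the pipeline's exact implementation), built on Python's standard `unicodedata` module:

```python
import re
import unicodedata

def normalize_text(text: str) -> str:
    # NFKD decomposes compatibility characters, e.g. the "fi" ligature -> "fi"
    text = unicodedata.normalize("NFKD", text)
    # Collapse any run of whitespace (spaces, tabs, newlines) to a single space
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

Note that this sketch collapses all whitespace uniformly; a production pipeline might instead preserve intentional paragraph breaks inside long answers.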
#Quality Filtering
Not all examples are suitable for SFT. We apply a series of independent quality filters and track how many examples each filter removes. Every training example must be well-formed (both question and answer present), substantive (answers long enough to be useful), within length bounds (fitting the model's context window), clean (low ratio of special characters), English-only, and informative (actual questions rather than headers or metadata).
```python
from langdetect import detect, LangDetectException

def compute_special_char_ratio(text: str) -> float:
    if not text:
        return 0.0
    special = sum(1 for c in text if not c.isalnum() and not c.isspace())
    return special / len(text)

def is_english(text: str) -> bool:
    # Skip detection on very short strings, where langdetect is unreliable
    if len(text) < 50:
        return True
    try:
        return detect(text[:500]) == 'en'
    except LangDetectException:
        return True
```
| Filter Reason | Examples Removed |
|---|---|
| Few words in answer | 2,164 |
| Short answer (< 50 chars) | 905 |
| Long answer (> 4,096 chars) | 866 |
| Non-English | 56 |
| Total Removed | 3,991 (6.7%) |
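As a sketch of how these thresholds combine, a single predicate over the CONFIG values might look like this (`passes_filters` is an illustrative name, and the language check is omitted here to keep the example self-contained):

```python
CONFIG = {
    "MIN_QUESTION_LENGTH": 10, "MAX_QUESTION_LENGTH": 512,
    "MIN_ANSWER_LENGTH": 50, "MAX_ANSWER_LENGTH": 4096,
    "MIN_ANSWER_WORDS": 10, "MAX_SPECIAL_CHAR_RATIO": 0.25,
}

def special_char_ratio(text: str) -> float:
    if not text:
        return 0.0
    return sum(1 for c in text if not c.isalnum() and not c.isspace()) / len(text)

def passes_filters(question: str, answer: str) -> bool:
    if not question or not answer:
        return False  # malformed: a field is missing
    if not (CONFIG["MIN_QUESTION_LENGTH"] <= len(question) <= CONFIG["MAX_QUESTION_LENGTH"]):
        return False  # question outside length bounds
    if not (CONFIG["MIN_ANSWER_LENGTH"] <= len(answer) <= CONFIG["MAX_ANSWER_LENGTH"]):
        return False  # answer too short or too long
    if len(answer.split()) < CONFIG["MIN_ANSWER_WORDS"]:
        return False  # too few words to be substantive
    if special_char_ratio(answer) > CONFIG["MAX_SPECIAL_CHAR_RATIO"]:
        return False  # likely markup debris or malformed text
    return True
```

Because each check is independent, running them in sequence and recording which one fires first is what produces the per-filter removal counts in the table above.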
After filtering, 55,961 of the 59,952 converted examples remain (a 6.7% removal rate).
#Near-Duplicate Removal
Medical QA datasets often contain near-duplicate questions phrased slightly differently but asking the same thing. Training on duplicates wastes compute and can cause the model to memorize specific phrasings rather than learning generalizable medical knowledge. We use MinHash Locality-Sensitive Hashing (LSH) — the same technique used in the pre-training data pipeline — to efficiently find and remove near-duplicates. The process involves shingling (converting each question into character 5-grams), computing MinHash signatures (compact hash representations), LSH bucketing (grouping similar items), and deduplication (keeping one representative per group).
```python
from datasketch import MinHash, MinHashLSH

def get_shingles(text: str, n: int = 5) -> set:
    text = text.lower().strip()
    if len(text) < n:
        return {text}
    return {text[i:i+n] for i in range(len(text) - n + 1)}

def create_minhash(shingles: set, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for s in shingles:
        m.update(s.encode('utf-8'))
    return m

lsh = MinHashLSH(threshold=0.80, num_perm=128)
keep_indices = []
for idx, record in enumerate(filtered_records):
    shingles = get_shingles(record["question"], 5)
    mh = create_minhash(shingles, 128)
    if not lsh.query(mh):  # no similar question indexed yet: keep this one
        lsh.insert(f"doc_{idx}", mh)
        keep_indices.append(idx)
```
Deduplication removes 4,665 near-duplicate questions (8.3% of the filtered set), leaving 51,296 examples.
We deduplicate based on questions (not answers), since the same question with different answers would be a data quality issue. After deduplication, the source distribution is: Flashcards 30,011, MedQuAD 12,580, WikiDoc 8,705.
#Train / Validation / Test Split
The dataset is split into three partitions using stratified splitting by source, ensuring each split maintains the same proportional representation of MedQuAD, WikiDoc, and Flashcards. We shuffle with a fixed random seed (42) for reproducibility.
| Split | Ratio | Examples | Purpose |
|---|---|---|---|
| Train | 90% | 46,166 | Model fine-tuning |
| Validation | 5% | 2,565 | Hyperparameter tuning, early stopping |
| Test | 5% | 2,565 | Final evaluation (never seen during training) |
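A stratified split along these lines can be sketched with the standard library; `stratified_split` is an illustrative helper that groups records by their `source` field, shuffles each group with the fixed seed, then carves off the three partitions per group:

```python
import random
from collections import defaultdict

def stratified_split(records, train_ratio=0.90, val_ratio=0.05, seed=42):
    # Group by source so each split keeps the same source proportions
    by_source = defaultdict(list)
    for rec in records:
        by_source[rec["source"]].append(rec)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for group in by_source.values():
        rng.shuffle(group)
        n_train = int(len(group) * train_ratio)
        n_val = int(len(group) * val_ratio)
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # remainder goes to test
    return train, val, test
```

Splitting within each source group, rather than over the pooled dataset, is what guarantees the proportions hold in every partition rather than only in expectation.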
#Dataset Quality Evaluation
Before uploading, we run five automated quality checks to validate that the dataset meets our standards.
| Check | Result | Details |
|---|---|---|
| Format Consistency | PASS | 51,296 / 51,296 correctly formatted |
| Data Leakage | PASS | 0 val/test questions found in train |
| Answer Quality | PASS | 93.0% end with proper punctuation, avg 5.9 sentences |
| Topic Diversity | PASS | 19/19 medical keywords covered |
| Random Sample Review | PASS | Manual inspection of random examples |
The topic diversity check confirms broad medical coverage: the dataset mentions symptoms (22,928 times), patient (21,920), disease (19,227), blood (19,153), treatment (16,779), infection (10,541), cancer (9,842), heart (9,002), pain (7,588), and therapy (7,120) among other medical keywords.
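The leakage check, for instance, reduces to a set intersection over normalized questions; `check_leakage` below is an illustrative sketch that returns any val/test records whose question also appears in train:

```python
def check_leakage(train_records, eval_records):
    # Normalize to lowercase, stripped form so trivial variants still match
    train_questions = {r["question"].strip().lower() for r in train_records}
    return [r for r in eval_records
            if r["question"].strip().lower() in train_questions]
```

An empty return list corresponds to the "0 val/test questions found in train" result in the table above; stricter variants could normalize punctuation or compare MinHash signatures instead of exact strings.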
#Final Dataset Schema
The uploaded HuggingFace dataset contains four fields per example. The text field is what the model trains on directly — it contains the complete formatted instruction-following template.
| Field | Type | Description |
|---|---|---|
| text | string | Full formatted instruction (System + User + Assistant) |
| question | string | The raw medical question |
| answer | string | The raw medical answer |
| source | string | Origin dataset (medquad, wikidoc, flashcards) |
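Because the `text` field embeds the template verbatim, the raw question and answer can be recovered by splitting on the role markers. The sketch below pairs the post's `format_instruction` with an illustrative inverse, `parse_formatted` (a hypothetical helper, not part of the uploaded dataset's tooling):

```python
SYSTEM_PROMPT = "You are a medical AI assistant. Provide accurate, evidence-based answers to medical questions."

def format_instruction(question: str, answer: str) -> str:
    return (
        f"### System:\n{SYSTEM_PROMPT}\n\n"
        f"### User:\n{question.strip()}\n\n"
        f"### Assistant:\n{answer.strip()}"
    )

def parse_formatted(text: str):
    # Invert the template: drop the system block, then split user/assistant
    _, rest = text.split("### User:\n", 1)
    question, answer = rest.split("\n\n### Assistant:\n", 1)
    return question.strip(), answer.strip()
```

Round-tripping like this is also a cheap sanity check that every `text` entry actually contains all three role markers in order.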
#Key Takeaways
- Source diversity matters. Combining MedQuAD (detailed explanations), WikiDoc (textbook knowledge), and Flashcards (concise recall) produces a well-rounded SFT dataset.
- Multi-stage filtering is essential. Our six-filter pipeline removed 6.7% of low-quality examples — short answers, non-English text, and malformed questions.
- Near-duplicate removal saves compute. MinHash LSH identified 8.3% near-duplicate questions that would have wasted training time and encouraged memorization.
- Stratified splits preserve distribution. Splitting by source ensures balanced representation in train, validation, and test sets.
- Automated quality checks build confidence. Five checks — format consistency, leakage detection, answer quality, topic diversity, and sample review — validate the dataset before fine-tuning.
#Resources
Explore other posts in this series.

Building a High-Quality Medical Pretraining Dataset for Small Language Models
Large language models like GPT-4 or Gemini are trained on trillions of tokens scraped from the open web. But when your goal is a Small Language Model (SLM) with only ~300 million parameters, targeted at the medical domain, quality matters far more than quantity.

Building MedSLM: A 330M Parameter Medical Language Model
In this post, we build MedSLM - a 330M parameter transformer trained from scratch on our curated medical dataset. We implement modern architecture choices like RMSNorm, Rotary Positional Embeddings, SwiGLU activations, and Grouped-Query Attention.

Training MedSLM-SFT: Supervised Fine-Tuning for Medical Instruction Following
With our pretraining corpus complete and MedSLM trained from scratch, we now focus on instruction fine-tuning. This stage teaches the model to act as a helpful medical assistant by training it on curated (instruction, response) pairs.