
Building a High-Quality Medical Pretraining Dataset for Small Language Models
Large language models such as GPT-4 and Gemini are trained on trillions of tokens scraped from the open web. But when your goal is a Small Language Model (SLM) with only ~300 million parameters, aimed at the medical domain, data quality matters far more than data quantity.


