Model distillation is a technique where a smaller "student" model is trained to replicate the behavior of a larger "teacher" model, preserving most of the capability at a fraction of the computational cost.
Model distillation (also called knowledge distillation) is a compression technique formalized by Hinton, Vinyals, and Dean in 2015, building on earlier model-compression work by Buciluă et al. (2006). The core idea is straightforward: run a large, expensive model (the teacher) on a dataset and capture not just its final answers but its full output probability distribution — the "soft labels" that encode nuanced relationships between classes. A smaller model (the student) is then trained to match these soft outputs rather than the original hard labels. Because the teacher's probability distribution contains richer information than a simple correct/incorrect label, the student learns faster and generalizes better than it would from the raw training data alone.
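To make the soft-label idea concrete, here is a minimal NumPy sketch of the classic distillation loss: the teacher's logits are softened with a temperature, and the student is penalized by the KL divergence between the two soft distributions. This is an illustrative toy, not any production training loop; the logit values are made up for the example.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher T spreads probability mass
    across classes, exposing the teacher's 'dark knowledge'."""
    z = logits / temperature
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between teacher and student soft distributions,
    scaled by T^2 as in Hinton et al. (2015) so gradient magnitudes
    stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)   # teacher soft labels
    q = softmax(student_logits, temperature)   # student predictions
    kl = np.sum(p * (np.log(p) - np.log(q)))
    return temperature ** 2 * kl

# Toy logits: the teacher is confident in class 0 but leaves real mass
# on class 1 -- information a one-hot hard label would throw away.
teacher = np.array([4.0, 2.0, -1.0])
student = np.array([3.5, 1.0, 0.0])
loss = distillation_loss(student, teacher)
```

In a real training loop this term is usually mixed with the ordinary cross-entropy on hard labels, weighted by a hyperparameter, and the loss is minimized by gradient descent over the student's parameters.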
In practice, distillation is how many production AI systems operate today. OpenAI's GPT-4o mini, Anthropic's Haiku, and Google's Gemma models are widely understood to leverage distillation from their larger siblings. The process typically involves generating millions of prompt-completion pairs from the teacher model, then fine-tuning the student on that synthetic dataset. Techniques like **response distillation** (matching final outputs), **logit distillation** (matching output probabilities), and **feature distillation** (matching intermediate layer representations) offer different trade-offs between fidelity and training cost.
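The data-generation step of response distillation can be sketched in a few lines. Everything here is hypothetical scaffolding: `call_teacher` is a stand-in for a real teacher-model API call (no specific SDK is implied), and the JSONL prompt-completion layout is one common fine-tuning format, not a prescribed one.

```python
import json

def call_teacher(prompt):
    """Placeholder for querying the teacher model.
    A real pipeline would call a model API here; this stub returns
    canned completions so the sketch is self-contained."""
    canned = {
        "Summarize: the cat sat on the mat.": "A cat sat on a mat.",
    }
    return canned.get(prompt, "")

def build_distillation_dataset(prompts, out_path):
    """Collect prompt-completion pairs from the teacher and write them
    as JSONL, one record per line, ready for student fine-tuning."""
    records = []
    for prompt in prompts:
        records.append({
            "prompt": prompt,
            "completion": call_teacher(prompt),
        })
    with open(out_path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return records

rows = build_distillation_dataset(
    ["Summarize: the cat sat on the mat."],
    "distill_data.jsonl",
)
```

Logit and feature distillation replace the synthetic-text step with richer targets (output probabilities or intermediate activations), which requires white-box access to the teacher rather than just its generated responses.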
For agent builders, distillation is a critical path to deployment. A large frontier model can prototype an agentic workflow during development, but serving it at scale may be cost-prohibitive. By distilling the agent's reasoning patterns, tool-use decisions, and domain knowledge into a smaller model, teams can reduce inference latency by 5–10x and cost by 10–50x while retaining 85–95% of task performance. This makes distillation one of the most practical techniques for moving AI agents from prototype to production.