Distillation with Reasoning: Can DeepSeek R1 Teach Better Than Humans?


Including reasoning "chains of thought" (CoT) in the model output considerably improves its quality, but it also increases inference cost.

  • Distillation transfers reasoning knowledge from an expensive teacher model to a more cost-efficient student, reducing overall inference cost.
  • DeepSeek R1 can produce detailed CoT, making it an excellent teacher model.
  • Synthetic data generated by DeepSeek R1 may outperform data produced by human experts.

    Introduction

    The recent release of DeepSeek R1 has taken the AI community by storm, offering performance on par with leading frontier models, such as OpenAI's o1, at a fraction of the cost. Still, R1 can be expensive for use cases with high traffic or low latency requirements.

    DeepSeek R1's strength lies in its explicit step-by-step reasoning. Before producing a final answer, it generates an internal "chain of thought" (CoT) to methodically reason through each problem. This process is a form of test-time computation, allowing the model to dynamically allocate more compute to harder problems. However, these extended reasoning sequences typically increase inference cost.

    Distillation

    Distillation is a method for transferring knowledge from a large, more capable teacher model to a smaller, more cost-efficient student model. According to the DeepSeek R1 paper, R1 is highly effective in this teacher role. Its detailed CoT sequences guide the student model to break down complex tasks into smaller, more manageable steps.

    Comparing Distillation to Human-Labeled Data

    Although fine-tuning with human-labeled data can produce specialized models, collecting both final answers and their corresponding reasoning steps is expensive. Distillation scales more easily: instead of relying on human annotations, the teacher model automatically generates the training data for the student.

    A Side Note on Terminology

    The term "distillation" can describe different approaches:

    Distribution Distillation: Aligns the student model's output token distribution with the teacher's using Kullback-Leibler divergence (KL-divergence). Works best when both models share the same architecture, tokenizer, and pre-training data.
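
    As a rough illustration, a minimal PyTorch sketch of the KL-divergence objective used in distribution distillation might look like the following; the temperature scaling is common practice and an assumption here, not a detail from this post:

```python
import torch.nn.functional as F

def distribution_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL-divergence between teacher and student token distributions.

    Both logits tensors have shape (batch, seq_len, vocab_size) and must come
    from models that share the same tokenizer/vocabulary.
    """
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scaling by temperature^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2
```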

    Data Distillation: Uses the teacher model to generate completions for a set of prompts. Fine-tunes the student model using a standard cross-entropy loss on these generated outputs, skipping the KL-divergence term. Allows the teacher and student to be different model families and tokenizers (though if the teacher uses special tokens like __, it can be useful for both models to recognize them).
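
    For contrast, here is a minimal sketch of data distillation: the teacher generates completions as plain text, and the student is fine-tuned on that text with ordinary cross-entropy, which is what lets the two models use different tokenizers. The model ids and the single-example training step are illustrative assumptions, not the setup from this post:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model ids, chosen only for illustration.
teacher_tok = AutoTokenizer.from_pretrained("some-org/large-teacher")
teacher = AutoModelForCausalLM.from_pretrained("some-org/large-teacher")
student_tok = AutoTokenizer.from_pretrained("some-org/small-student")
student = AutoModelForCausalLM.from_pretrained("some-org/small-student")

def data_distillation_step(prompt, optimizer):
    # 1) The teacher generates a completion (reasoning + answer) for the prompt.
    teacher_inputs = teacher_tok(prompt, return_tensors="pt")
    with torch.no_grad():
        completion_ids = teacher.generate(**teacher_inputs, max_new_tokens=512)
    completion_text = teacher_tok.decode(completion_ids[0], skip_special_tokens=True)

    # 2) The student is fine-tuned with plain cross-entropy on the generated text.
    #    Passing text (not token ids) between the models is what allows different tokenizers.
    student_inputs = student_tok(completion_text, return_tensors="pt")
    loss = student(**student_inputs, labels=student_inputs["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```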

    In this post, we focus on data distillation because it supports a wider range of student-teacher pairs.

    Data Generation

    Training data is often a bottleneck in model development. In a recent post (add link), we explored how to generate labels by combining model output with a verification function. Distillation takes a different approach, using a teacher model to synthesize missing completions.

    DeepSeek R1 stands out because it not only provides final answers but also exposes its detailed chain of thought, unlike other reasoning models that keep this internal process hidden. If your dataset includes ground truth answers, you can identify high-quality synthetic CoTs through rejection sampling, selecting only the best chains to further improve your fine-tuned model. Rejection sampling can remove incorrect data examples either by comparing the generated data against ground truth labels or by applying a user-defined validation function. From the interface perspective, the validation function resembles the verifiable reward function used by value-model-free RL methods like those described in our recent blog post.
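
    A minimal sketch of the rejection-sampling step described above; generate_cot and extract_answer are hypothetical helpers (one calls the teacher, e.g. DeepSeek R1, and one parses the final answer out of a reasoning chain), not functions from this post:

```python
def rejection_sample_cots(problem, ground_truth, generate_cot, extract_answer, n_samples=8):
    """Sample several teacher CoTs and keep only those whose final answer is correct."""
    accepted = []
    for _ in range(n_samples):
        cot = generate_cot(problem)              # full reasoning chain from the teacher
        if extract_answer(cot) == ground_truth:  # or: a user-defined validation function
            accepted.append(cot)
    return accepted
```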

    Case Study: GSM8K

    GSM8K (Grade School Math 8K) is a dataset of 8.5K diverse grade-school math word problems. Each data point includes (see the loading sketch after this list):

    1. A problem description.
    2. A human expert's chain of thought.
    3. The final answer.
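
    For reference, a minimal sketch of loading GSM8K from the Hugging Face Hub and splitting a record into these three parts; the openai/gsm8k dataset id and the "####" answer delimiter are standard for this dataset, but verify them for your environment:

```python
from datasets import load_dataset

# Each GSM8K record has a "question" and an "answer"; the answer contains the
# human expert CoT followed by "#### <final answer>".
gsm8k = load_dataset("openai/gsm8k", "main", split="train")

example = gsm8k[0]
human_cot, final_answer = example["answer"].split("####")
print(example["question"])     # 1. problem description
print(human_cot.strip())       # 2. human expert chain of thought
print(final_answer.strip())    # 3. final answer
```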

    We expanded this dataset by adding:

    Synthetic R1 reasoning, i.e., the CoT generated by DeepSeek R1.

    Then, we fine-tuned three variants of the model (using LoRA on llama-3.1-8B-instruct), each with a different training target:

  • Direct Answer Only: Generate the final answer without exposing any reasoning.
  • Human Expert CoT: Generate the final answer alongside a reasoning chain resembling the human expert's.
  • Synthetic R1 CoT: Generate the final answer together with DeepSeek R1's synthetic reasoning chain.

    The table below summarizes average accuracy and reasoning length:

    - Note: The accuracy for the 5-shot baseline may differ from numbers reported elsewhere due to different evaluation setups. The key focus is on comparing relative performance across distillation techniques, not on beating other models.
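
    Below is a minimal sketch of how the three LoRA fine-tunes described above might be set up with the peft library; the Hub model id, LoRA hyperparameters, and prompt template are illustrative assumptions, not the exact configuration behind the reported results:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumed Hub id for llama-3.1-8B-instruct
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

# Typical LoRA setup: low-rank adapters on the attention projections only.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

def make_target(question, answer, cot=None):
    """Build the training text for one of the three variants:
    cot=None              -> Direct Answer Only
    cot=human expert CoT  -> Human Expert CoT
    cot=R1-generated CoT  -> Synthetic R1 CoT
    """
    if cot is None:
        return f"Question: {question}\nAnswer: {answer}"
    return f"Question: {question}\nReasoning: {cot}\nAnswer: {answer}"
```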

    From this study, synthetic reasoning CoTs from DeepSeek R1 appear superior to human-expert CoTs in improving performance, albeit with a higher inference cost due to their greater length.

    Fireworks AI Inference and Fine-Tuning Platform

    DeepSeek R1 is available on the Fireworks AI platform. An easy-to-use distillation interface will soon be part of FireOptimizer. If you need earlier access, please get in touch to explore options.
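
    As an illustration, DeepSeek R1 can be queried through Fireworks' OpenAI-compatible API; the base URL and model id below are assumptions based on Fireworks' usual naming and should be checked against the current documentation:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_FIREWORKS_API_KEY",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-r1",  # assumed model id
    messages=[{"role": "user", "content": "A train travels 90 miles in 1.5 hours. What is its average speed?"}],
)
print(response.choices[0].message.content)  # includes R1's chain of thought and final answer
```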

    Conclusions

    By incorporating reasoning-based data through distillation, organizations can dramatically improve model performance without bearing the full cost of human-annotated datasets. DeepSeek R1's ability to produce long, high-quality reasoning chains makes it a powerful teacher model, showing that, sometimes, the machine may simply out-teach the human.