DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also comes with fully MIT-licensed weights. This marks it as the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible manner.
What makes DeepSeek-R1 especially impressive is its transparency. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper.
The model is also remarkably cost-efficient, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).
Until around GPT-4, the conventional wisdom was that better models required more data and compute. While that's still valid, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
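To make the price gap concrete, here is a rough sketch that plugs the per-million-token prices quoted above into a cost calculation. The workload (1M input and 1M output tokens) and the function name are made up for illustration, and R1's input price is taken at the upper end of the quoted range.

```python
# Prices in USD per million tokens, as quoted above.
R1_PRICES = {"input": 0.55, "output": 2.19}  # upper end of the $0.14-0.55 input range
O1_PRICES = {"input": 15.00, "output": 60.00}


def request_cost(prices, input_tokens, output_tokens):
    """Cost in USD for the given number of input and output tokens."""
    total = input_tokens * prices["input"] + output_tokens * prices["output"]
    return total / 1_000_000


# Hypothetical workload: 1M input tokens and 1M output tokens.
r1_cost = request_cost(R1_PRICES, 1_000_000, 1_000_000)
o1_cost = request_cost(O1_PRICES, 1_000_000, 1_000_000)
print(f"R1: ${r1_cost:.2f}, o1: ${o1_cost:.2f}, ratio: {o1_cost / r1_cost:.1f}x")
```

For this particular mix, o1 comes out at roughly 27 times the cost of R1. The ratio shifts with the input/output split and with which end of R1's input price range applies.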
The Essentials
The DeepSeek-R1 paper presented several models, but the main ones are R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.
DeepSeek-R1 uses two major ideas:

1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a <think> tag before answering with a final summary.
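As a small illustration of that output shape, here is a sketch that separates the reasoning from the final answer. It assumes the reasoning is wrapped in <think>...</think>, as in the released R1 models. The function name and the sample completion are hypothetical.

```python
import re


def split_reasoning(completion: str):
    """Return (reasoning, answer) for a completion shaped like <think>...</think>answer."""
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if match is None:
        # No reasoning block found: treat the whole completion as the answer.
        return "", completion.strip()
    reasoning = match.group(1).strip()
    answer = completion[match.end():].strip()
    return reasoning, answer


# Hypothetical completion, written by hand for illustration.
sample = "<think>2 + 2 is 4, and 4 * 3 is 12.</think>The result is 12."
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```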
R1-Zero vs R1
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.
It is interesting how some languages may express certain concepts better, which leads the model to choose the most expressive language for the task.
Training Pipeline
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they created such strong reasoning models, and what you can expect from each phase. This includes the problems that the resulting models from each phase have, and how they solved them in the next phase.
It's interesting that their training pipeline differs from the usual one:
The usual training strategy: Pretraining on a large dataset (train to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
R1-Zero: Pretrained → RL
R1: Pretrained → Multistage training pipeline with multiple SFT and RL steps
1. Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to start RL from.
2. First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing.
3. Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
4. Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
5. Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.
The teacher is typically a larger model than the student.
Group Relative Policy Optimization (GRPO)
The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
Rather than adding a separate module at inference time, the training procedure itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.

What makes their approach particularly intriguing is its reliance on straightforward, rule-based reward functions.
Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected <think>/<answer> formatting, and if the language of the answer matches that of the prompt.
Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
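To show how little machinery such rewards need, here is a minimal sketch of a rule-based reward. It is not DeepSeek's actual reward code: the function names, the exact-match accuracy check, and the 0.5 format weight are all assumptions for illustration, and the language-consistency check is left out. It assumes completions are shaped like <think>...</think><answer>...</answer>.

```python
import re

# Assumed completion shape: <think>...</think><answer>...</answer>
THINK_ANSWER = re.compile(
    r"^<think>.+?</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the expected tag layout, else 0.0."""
    return 1.0 if THINK_ANSWER.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the text inside <answer> exactly matches the reference, else 0.0."""
    match = THINK_ANSWER.match(completion.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0


def total_reward(completion: str, reference: str, format_weight: float = 0.5) -> float:
    """Combine both signals; the weighting is a hypothetical choice."""
    return accuracy_reward(completion, reference) + format_weight * format_reward(completion)


good = "<think>3 * 4 = 12</think><answer>12</answer>"
wrong = "<think>3 * 4 = 14</think><answer>14</answer>"
untagged = "12"
for completion in (good, wrong, untagged):
    print(total_reward(completion, "12"))
```

Every check here is a plain string operation, so scoring a completion needs no extra model in memory.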
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:

1. For each input prompt, the model generates multiple responses.
2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are normalized relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative rewards. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy does not stray too far from its original behavior.
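The steps above can be sketched numerically. This is a toy illustration under stated assumptions, not a training loop: rewards are normalized with the group's mean and (population) standard deviation, the clipped term follows the usual PPO-style surrogate, and the KL term uses the estimator form given in the DeepSeekMath paper. All function names and the example rewards are hypothetical.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Step 3: score each response relative to its group (zero mean, unit scale)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps guards against sigma == 0
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_objective(ratio, advantage, clip_eps=0.2):
    """Step 4: PPO-style clipped surrogate for one response.

    ratio is pi_new(response) / pi_old(response); clipping removes the
    incentive to move the ratio outside [1 - clip_eps, 1 + clip_eps].
    """
    clipped_ratio = min(max(ratio, 1 - clip_eps), 1 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_estimate(logp_policy, logp_ref):
    """Non-negative KL estimate between the policy and a reference model."""
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1


# Hypothetical rewards for four sampled responses to the same prompt (steps 1-2).
rewards = [1.5, 0.5, 0.0, 1.5]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Because the baseline is just the group mean, no separate critic network is needed. Responses that score above their siblings get positive advantages, and the rest get negative ones.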
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance awarding a bonus when the model correctly uses the <think> syntax, to guide the training.
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
Is RL on LLMs the path to AGI?
As a final note on explaining DeepSeek-R1 and the methods they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.

"These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities."
In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely present in the pretrained model.
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
Consequently, while RL methods such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
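One way to picture this distinction is to compare two metrics over K sampled answers: whether any sample is correct (Pass@K) and whether the majority answer is correct (Maj@K). The sketch below uses hand-made toy samples, not measured results, and the function names are hypothetical. It only shows how the correct answer can already be present before RL while RL makes it the dominant one.

```python
from collections import Counter


def pass_at_k(samples, reference):
    """True if at least one of the K sampled answers is correct."""
    return reference in samples


def maj_at_k(samples, reference):
    """True if the most frequent of the K sampled answers is correct."""
    most_common_answer, _count = Counter(samples).most_common(1)[0]
    return most_common_answer == reference


reference = "12"
# Toy samples written by hand for illustration (K = 4).
before_rl = ["12", "10", "14", "10"]  # correct answer present but not dominant
after_rl = ["12", "12", "10", "12"]   # same answers available, correct one dominant

for name, samples in (("before RL", before_rl), ("after RL", after_rl)):
    print(name, "pass@4:", pass_at_k(samples, reference), "maj@4:", maj_at_k(samples, reference))
```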
Running DeepSeek-R1
I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.
Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
671B via llama.cpp
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp.
29 layers seemed to be the sweet spot given this configuration.
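A back-of-the-envelope estimate shows why only part of the model can live on the GPU. The sketch below treats 1.58 bits as the average size per weight across all 671B parameters and assumes an 80 GB H100. Both are simplifying assumptions, and the KV-cache and runtime overhead are ignored.

```python
def quantized_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


weights_gb = quantized_size_gb(671e9, 1.58)
gpu_memory_gb = 80  # assumption: 80 GB H100 variant

print(f"Approximate weight size: {weights_gb:.1f} GB")
print(f"Fits entirely on the GPU: {weights_gb <= gpu_memory_gb}")
```

Under these assumptions the weights alone exceed the GPU's memory, so the remaining layers have to run from system RAM on the CPU.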
Performance:
An r/localllama user described being able to get over 2 tok/sec with DeepSeek R1 671B on their local gaming setup, without using their GPU.
Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.
As you can see, the tokens/s isn't really bearable for any serious work, but it's fun to run these large models on accessible hardware.
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than other models, but their usefulness is also usually higher.
We need to both maximize usefulness and minimize time-to-usefulness.
70B via Ollama
70.6B params, 4-bit (Q4_K_M) quantized DeepSeek-R1 running via Ollama:
GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.
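For anyone who wants to script against such a local setup, here is a minimal sketch that calls Ollama's REST API using only the standard library. It assumes Ollama is running on its default local port and that the model was pulled under the tag deepseek-r1:70b. The tag, the helper names, and the example prompt are assumptions, not taken from my runs above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt: str, model: str = "deepseek-r1:70b") -> urllib.request.Request:
    """Build a non-streaming generate request (the model tag is an assumption)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "deepseek-r1:70b") -> str:
    """Send the prompt to the local Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as response:
        return json.loads(response.read())["response"]


# Example call (requires a running Ollama server with the model pulled):
# print(generate("What is 12 * 12?"))
```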
Resources
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube).
DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs.
The Illustrated DeepSeek-R1 - by Jay Alammar.
Explainer: What's R1 & Everything Else? - Tim Kellogg.
DeepSeek R1 Explained to your grandma - YouTube
DeepSeek
- Try R1 at chat.deepseek.com.
GitHub - deepseek-ai/DeepSeek-R1.
deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source reasoning model that matches the performance of OpenAI's o1. It presents a detailed methodology for training such models using large-scale reinforcement learning techniques.
DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This paper explores scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to improve code generation and infilling.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.

Interesting events

- Hong Kong University replicates R1 results (Jan 25, '25).
- Hugging Face announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
- OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.
