I haven't written about DeepSeek r1 for a while because there was so much information to digest: rumors, interviews, use cases, admiration, and skepticism everywhere. Now I think it's time to sort through some of the claims. Let's go:
❌ "DeepSeek spent $6 million to train the model."
Not exactly. DeepSeek spent that amount only on the final training run that produced the model we use today. This figure doesn't include all prior experiments (of which there were certainly many), nor the costs for personnel, data, and GPUs. Moreover, r1 is a model built on top of another foundational model—DeepSeek-V3. Obviously, V3 didn't come out of thin air or for free either.
And one more thing: where did they get the training data? It's clear that part of it was collected in-house. However, it seems they also synthesized some data using other models—at least OpenAI, Anthropic, and, surprisingly, Yandex. This assumption comes from the fact that r1 occasionally identifies itself with another model's name. And synthesizing data isn't free either, of course.
❌ "DeepSeek r1 is a side project."
Also unlikely. It's being framed as "a few guys made an o1-level model for fun from sticks and stones." This take gained traction after a tweet by Han Xiao (source), even though he isn't directly involved with DeepSeek.
Meanwhile, DeepSeek is entirely funded by the Chinese hedge fund High-Flyer, which manages $7 billion in assets. Its founder, Liang Wenfeng, is also the founder of DeepSeek. According to Reuters (source), in March 2023, High-Flyer announced on WeChat that they were moving beyond trading and concentrating resources on creating "a new and independent research group to study the essence of AGI." Later that same year, DeepSeek was born. Doesn't sound like a side project, does it?
✅ "DeepSeek managed with a small number of GPUs."
This seems partially true. For the V3 foundational model, they report using 2,048 H800 GPUs. It's claimed that they avoided H100 GPUs because U.S. export restrictions made them difficult to acquire. Instead, they optimized their model and training process for H800s, which have lower chip-to-chip (NVLink) interconnect bandwidth than H100s but could be obtained legally.
To work around the H800's limitations, they employed tricks like low-level PTX programming to manage GPU-to-GPU communication efficiently, FP8 mixed-precision training, multi-token prediction, and a Mixture-of-Experts architecture. It's impressive, no doubt!
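To give a feel for the Mixture-of-Experts part: a small router picks a few "experts" per token, so only a fraction of the parameters does any work on a given forward pass (DeepSeek reports roughly 37B active parameters per token out of the 671B total). Below is a minimal toy sketch in PyTorch; the class name, layer sizes, and expert count are made up for illustration and are not DeepSeek's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy top-k Mixture-of-Experts feed-forward layer (illustrative only)."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # The router scores every expert for every token.
        self.gate = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = self.gate(x)                            # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)             # renormalize the kept scores
        out = torch.zeros_like(x)
        # Only the selected experts run, so most parameters stay idle for any single token.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

x = torch.randn(4, 512)                 # 4 "tokens"
print(ToyMoELayer()(x).shape)           # torch.Size([4, 512])
```

With 8 experts and top-2 routing, each token only touches a quarter of the expert parameters; scaling that idea up is what lets a 671B-parameter model train and run far cheaper than a dense model of the same size.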
However, two points to consider:
2,048 H800s cost about $50 million (some "side project," right?); a quick back-of-envelope check follows this list.
Scale AI CEO Alexandr Wang claims (source) that DeepSeek has 50,000 H100 GPUs, implying they bypassed sanctions. This is an unverified rumor, though. Elon Musk responded to it with "Obviously," but he's known for his theatrics. There's also a related tweet (source) suggesting that DeepSeek has 50,000 Hopper GPUs (without specifying H100 or H800). Either way, the source of these rumors is "trust me, bro," but I wouldn't be surprised if they turned out to be true.
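Back-of-envelope check on the first point: $50,000,000 / 2,048 cards ≈ $24,400 per H800, so the total hinges entirely on what you assume a single card costs, and it still excludes networking, power, and the datacenter around the GPUs.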
✅ "DeepSeek r1 is a great model that shook up big players like OpenAI."
Yes, absolutely. The model is genuinely impressive and open-source. Its reasoning chain is fascinating to follow, and its web version includes a search engine, making it usable like Google Deep Research. All this is free on the web.
And since the weights are open, I could theoretically run it at work on 8x H100 GPUs. As far as I can tell, there's no better open alternative right now. Plus, its API pricing is dirt cheap compared to o1.
It seems that r1's release prompted Sam Altman (source) to make o3-mini free. Google, too, began boasting (source) that their latest Gemini model, with a large context window and integrated search, is available for free.
Still, the independent benchmarks I've seen suggest that r1 lags behind o1, and my own tests confirm this.
❌ "There are 6 versions of DeepSeek r1 in different sizes."
This is incorrect. r1 is a single MoE (Mixture of Experts) model with 671 billion parameters. The smaller "versions" are just Qwen and Llama models fine-tuned (distilled) on r1's outputs. The key point is that these fine-tunes lack the Reinforcement Learning stage that gives r1 its magic.
If you see news about someone running r1 on a phone—that's nonsense.
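You can check this yourself by inspecting the config of one of the distilled checkpoints on Hugging Face. A small sketch, assuming the repo name below is still current: the config identifies the model as a plain dense Qwen2 architecture, not a 671B MoE.

```python
# Quick sanity check (assumes the Hugging Face repo name hasn't changed):
# the "r1" distills are ordinary dense Qwen/Llama architectures fine-tuned
# on r1 outputs, not the 671B Mixture-of-Experts model itself.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
print(cfg.model_type)                           # "qwen2" -- a dense Qwen architecture
print(cfg.num_hidden_layers, cfg.hidden_size)   # 7B-scale dimensions, nowhere near 671B
```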
❓ "Stock markets crashed because of DeepSeek."
Honestly, I don't know. Maybe it played a role, but I find it doubtful. There is a certain logic to it: concerns about the competitiveness of American big tech, and doubts about whether such massive AI infrastructure investments are necessary. But it seems there are bigger factors at play in the world.
A couple of personal speculations
DeepSeek has given us a great model and has pushed Meta on the open-source front—a big deal. Personally, I believe they already have other models (e.g., r2), and we'll hear about them soon. At the same time, I think the low prices for r1 won't last long—seems like a case of strategic underpricing.
Looking forward to the next episode: r2 is coming soon! 🍿