DeepSeek-V3 Technical Report
2,788,000 GPU-hours * 350W TDP of H800 = 975,800,000 GPU Watt-hours
975,800,000 GPU Wh * (1.2 to account for non-GPU hardware) * (1.3 PUE[1]) = 1,522,248,000 Total Wh, or 1,522,248 kWh to train DeepSeek-V3
(1,522,248 kWh) * (0.582kg CO2eq/kWh in China[2]) = 885,948 kg CO2 equivalents to train DeepSeek-V3
A typical US passenger vehicle emits about 4.6 metric tons of CO2 per year.[3]
885,948 kg CO2 per DeepSeek / 4,600 kg CO2 per car = 192.6 cars per DeepSeek
So the final training run for DeepSeek-V3 emitted about as much greenhouse gas as putting roughly 193 additional cars on the road for a year.
I also did some more math and found that this training run used about as much electricity as 141 US households would use over the course of a year.[4]
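The arithmetic above can be reproduced in a few lines. This is a sketch under the same assumptions stated above (350 W TDP per H800, a 1.2x non-GPU hardware multiplier, 1.3 PUE, 0.582 kg CO2eq/kWh for China's grid, 4,600 kg CO2/yr per car); the per-household figure of 10,791 kWh/yr is my reading of the EIA page linked in [4] and may differ slightly from the exact value used:

```python
# Back-of-envelope estimate of DeepSeek-V3 training energy and emissions.
gpu_hours = 2_788_000                 # from the V3 technical report
tdp_watts = 350                       # H800 TDP
non_gpu_overhead = 1.2                # CPUs, networking, storage, etc.
pue = 1.3                             # data-center power usage effectiveness [1]
grid_kg_co2_per_kwh = 0.582           # China grid average [2]
car_kg_co2_per_year = 4_600           # typical US passenger vehicle [3]
household_kwh_per_year = 10_791       # assumed US average, per the EIA FAQ [4]

gpu_wh = gpu_hours * tdp_watts                       # 975,800,000 Wh
total_kwh = gpu_wh * non_gpu_overhead * pue / 1000   # ~1,522,248 kWh
kg_co2 = total_kwh * grid_kg_co2_per_kwh             # ~885,948 kg CO2eq
cars = kg_co2 / car_kg_co2_per_year                  # ~192.6 car-years
households = total_kwh / household_kwh_per_year      # ~141 household-years

print(f"{total_kwh:,.0f} kWh, {kg_co2:,.0f} kg CO2eq, "
      f"{cars:.1f} car-years, {households:.0f} household-years")
```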
[1] https://enviliance.com/regions/east-asia/cn/report_10060
[2] https://ourworldindata.org/grapher/carbon-intensity-electric...
[3] https://www.epa.gov/greenvehicles/greenhouse-gas-emissions-t...
[4] divided total kWh by the value here: https://www.eia.gov/tools/faqs/faq.php?id=97&t=3
DeepSeek would have to fully train a brand-new V3 every week to approach the power consumption of individual Bitcoin mining facilities.
The energy use from BTC is ludicrous.
(I'm assuming 155 TWh/yr for Bitcoin, using the low-end estimate from here: https://www.polytechnique-insights.com/en/columns/energy/bit... )
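The scale gap is easy to check. A minimal sketch comparing the annualized energy of one hypothetical V3 training run per week against the 155 TWh/yr low-end Bitcoin network estimate cited above (the per-run kWh figure carries over from the earlier calculation):

```python
# Scale comparison: weekly V3 retrains vs. the Bitcoin network (low end).
training_kwh = 1_522_248                   # one full V3 training run (from above)
weekly_retrains_kwh_per_year = training_kwh * 52   # ~79 GWh/yr
btc_network_kwh_per_year = 155e9                   # 155 TWh/yr, low-end estimate

ratio = btc_network_kwh_per_year / weekly_retrains_kwh_per_year
print(f"Weekly retrains: ~{weekly_retrains_kwh_per_year/1e6:,.1f} GWh/yr; "
      f"the BTC network uses roughly {ratio:,.0f}x that")
```

Even at one full training run per week, the whole-network Bitcoin figure is still three orders of magnitude larger.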
A group at Stanford has been benchmarking model providers by transparency here: https://crfm.stanford.edu/fmti/May-2024/index.html
I think a great way to create positive change in the world is to pressure OpenAI, Anthropic, Google, XAI, and Meta to all share details about the energy cost of training and inference for their models. If every major provider provided this transparency, it would be less valuable to keep that info secret from a "keep your competitors in the dark" perspective. It would also allow customers to make decisions based on more than just performance and cost.
A cluster of 2,000 GPUs is what a second-tier AI lab has access to. And it shows that you can play in the state-of-the-art LLM game with some capital and a lot of brains.
I don't know what your household budget is, but $60M might not be what most people associate with "some capital".
And the GPUs would be a shared resource, so what you should calculate is what it would have cost to rent them, probably something like $2M.
That being said, I'm amazed how far 1B models have come. I remember when TinyLlama came out a few years ago; it was not great ($40K training cost, IIRC).
That was a 1B model, but these days even 0.5B models are remarkably coherent.