Is it true that you need at least 46 hours and $500 to pre-train a Large Language Model*?
A lab published an ICLR 2025 paper training a large language model with 850 million parameters, 2 million tokens per batch, a context size of 2048, an embedding dimension of 1536, 100 billion training tokens, and the AdamW optimizer.
I pretrained a Large Language Model with 124 million parameters, 0.5 million tokens per batch (a batch size of 64 samples), a context size of 1024, an embedding dimension of 1536, and 10 billion training tokens, using the AdamW optimizer on 8x NVIDIA A100 for 1.5 hours.
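Quick sanity check on my batch numbers, as a minimal sketch. It assumes the 64-sample batch is per GPU across the 8x A100; that reading is mine, not spelled out above.

```python
# Sanity check: does 64 samples per GPU at context 1024 on 8 GPUs
# really add up to ~0.5 million tokens per batch?
samples_per_gpu = 64      # assumption: the 64-sample batch is per GPU
context_size = 1024       # tokens per sample
num_gpus = 8              # 8x NVIDIA A100

tokens_per_batch = samples_per_gpu * context_size * num_gpus
print(f"{tokens_per_batch:,} tokens per batch")  # 524,288, i.e. ~0.5M
```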
The problem is that training a Large Language Model with 850 million parameters at this configuration requires roughly 466 GB of VRAM. The weights themselves are only a few GB; gradients, AdamW optimizer states, and activations for a 2-million-token batch make up the rest. 1x NVIDIA A100 (80 GB) will not be enough, so we need to divide the model into smaller chunks across GPUs.
We need 8x NVIDIA A100 to load 1 model. To put this into perspective, I can load the Large Language Model with 124 million parameters inside each individual NVIDIA A100; in other words, 8 models training in parallel. Meanwhile, the Large Language Model with 850 million parameters needs all 8x NVIDIA A100 to train 1 model.
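The GPU count follows directly from the numbers above. Here is the back-of-envelope version, taking the ~466 GB training footprint as given and assuming the 80 GB A100 variant:

```python
import math

a100_vram_gb = 80            # assumption: 80 GB A100 variant
footprint_850m_gb = 466      # training footprint quoted above

gpus_for_850m = math.ceil(footprint_850m_gb / a100_vram_gb)
print(f"850M model needs at least {gpus_for_850m} A100s")  # 6, so the full 8-GPU node in practice

# The 124M model fits inside a single 80 GB A100, so the same
# 8x A100 node can train 8 independent copies in parallel.
```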
Suppose 1x iteration takes 0.3 seconds and we need 10x iterations to make a single update to the Large Language Model with 850 million parameters. At 2 million tokens per update, 100 billion tokens means 50,000 updates, which is about 42 hours of pure step time; call it roughly 46 hours once you add overhead.
Suppose 1x NVIDIA A100 costs $1.27/hour. Then it costs $468 to rent 8x NVIDIA A100 for 46 hours.
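The time and cost claims are just this arithmetic, using the 0.3 s/iteration and $1.27/GPU-hour figures above:

```python
# Wall-clock and rental cost for the 850M run, using the numbers above.
total_tokens = 100e9
tokens_per_update = 2e6
updates = total_tokens / tokens_per_update                 # 50,000 optimizer updates

iters_per_update = 10
seconds_per_iter = 0.3
hours = updates * iters_per_update * seconds_per_iter / 3600
print(f"~{hours:.0f} hours of pure step time")             # ~42 h; roughly 46 h with overhead

gpu_hourly_rate = 1.27                                     # $/hour per A100
cost = gpu_hourly_rate * 8 * 46                            # 8x A100 for ~46 hours
print(f"~${cost:.0f} to rent the node")                    # ~$467, i.e. the ~$468 above
```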
Funny thing: the paper ran multiple variations of the model architecture. How much did they spend on that paper? Feels like burning money.
#largelanguagemodel