A Field Guide to Cutting LLM Training and Fine-tuning Costs: Where to Save Memory and Compute

  • Audience: engineers who train or fine-tune LLMs themselves, or who build a model and then want it smaller and faster
  • Prerequisites: a rough picture of how Transformers are trained and the basics of GPU memory. Each technique is explained in the text
  • Reading time: about 15 min

Overview

Running an LLM and training one draw on completely different resources. For inference you only read the weights. For training you have to hold gradients, optimizer state, and activations alongside those same parameters, all at once. That is why a full fine-tune is far heavier than inference, and why even a 7B model will not fit on a single GPU if you do it naively.

That is the spine of this article. The cost of training shows up in two places: compute (FLOPs / GPU hours) and memory (weights + gradients + optimizer state + activations). Inference optimization was about cutting the number of bytes you move per token. Training optimization splits along which of those two you decide to cut. Both are about “making it lighter,” but the thing you cut is different, so the technique that helps is different too.

There are roughly four directions. Shrink the data structures training itself carries (mixed precision, gradient checkpointing, ZeRO/FSDP). Cut the number of weights you actually train (PEFT methods like LoRA/QLoRA). Buy more model capacity for the same compute budget (MoE). And pay the training cost up front to produce a small, fast final model (distillation, pruning, QAT).

The short version: if all you need is to fine-tune an existing model, LoRA/QLoRA is close to the only practical answer, and it puts large models within reach of a single GPU. If you are training a giant model from scratch, mixed precision, gradient checkpointing, and ZeRO/FSDP to shard memory are the table stakes. If you want a smarter model for the same compute budget, MoE helps, but you pay for it with memory and training instability. The rest of this article lines these up by what each one cuts and what it costs you.

Why training eats more than inference

Inference needs the model weights and the activations for the tokens you are processing right now, and not much else. Training holds four things per parameter at the same time: the weight itself, the gradient computed during backprop, the state the optimizer keeps, and the activations from the forward pass for every layer, kept around so backprop can use them.

Take the standard setup, mixed precision (FP16/BF16) with Adam. The per-parameter memory stacks up like this: 2 bytes for the FP16 weight, 2 bytes for the FP16 gradient, and 12 bytes of optimizer state (a 4-byte FP32 master copy of the weight, 4 bytes for the first moment, 4 bytes for the second moment). That comes to about 16 bytes per parameter [1]. Activations are separate and grow with batch size, sequence length, and layer count.

Those 16 bytes add up fast. A 7B model needs about 112GB just for weights, gradients, and optimizer state, before activations. For comparison, inference runs a 7B model in 14GB at FP16, or under 4GB at 4-bit quantization, which shows how lopsided the two sides are. A one-trillion-parameter model needs about 16TB to hold the optimizer state, gradients, and parameters under 16-bit Adam training [1]. Training optimization is, at bottom, the question of which slice of that huge bill you cut.
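The arithmetic above is easy to reproduce. The sketch below is a minimal illustration rather than a sizing tool: the function names are made up for this example, 1 GB is taken as 10^9 bytes, and activations are left out entirely.

```python
# Per-parameter bytes for mixed-precision Adam, following the breakdown above.
BYTES_WEIGHT_FP16 = 2
BYTES_GRAD_FP16 = 2
BYTES_OPTIMIZER = 4 + 4 + 4  # FP32 master weight + first moment + second moment


def training_state_gb(num_params: float) -> float:
    """Weights + gradients + optimizer state in GB (activations excluded)."""
    per_param = BYTES_WEIGHT_FP16 + BYTES_GRAD_FP16 + BYTES_OPTIMIZER
    return num_params * per_param / 1e9


def inference_weights_gb(num_params: float, bits_per_weight: int) -> float:
    """Weights only, in GB, at the given storage precision."""
    return num_params * bits_per_weight / 8 / 1e9


print(training_state_gb(7e9))         # 7B model, training state
print(inference_weights_gb(7e9, 16))  # 7B model, FP16 inference
print(inference_weights_gb(7e9, 4))   # 7B model, 4-bit inference
print(training_state_gb(1e12))        # 1T model, training state
```

The four printed values correspond to the 112GB, 14GB, under-4GB, and 16TB figures quoted above.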

Cutting training memory: mixed precision, gradient checkpointing, ZeRO/FSDP

Start with the techniques that shrink the memory breakdown itself, the ones that pay off when you train a large model from scratch (or with a full fine-tune).

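Mixed precision training is the foundation. It keeps weights, activations, and gradients in FP16/BF16 during the forward and backward passes, roughly halving their memory [2]. You still keep a separate FP32 master copy of the weights (the 4 bytes from the 16-byte breakdown above). FP16 has a narrow dynamic range and small gradients can flush to zero, so you need loss scaling: multiply the loss by a constant before backprop, then divide the gradients by the same constant before the optimizer step [2]. BF16 keeps FP32's exponent range, so it usually trains without loss scaling.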
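To see why loss scaling matters, the sketch below rounds a single tiny gradient to FP16 with and without scaling. It uses only the standard library (the `struct` half-precision format) instead of a real training framework, and the scale of 1024 and the gradient value are illustrative assumptions.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


LOSS_SCALE = 1024.0  # illustrative; real trainers often adjust this dynamically
true_grad = 1e-8     # below the smallest value FP16 can represent

# Without scaling, the gradient flushes to zero when stored in FP16.
unscaled = to_fp16(true_grad)

# Multiplying the loss by LOSS_SCALE multiplies every gradient by the same
# factor (chain rule), so the stored FP16 value survives.
scaled = to_fp16(true_grad * LOSS_SCALE)

# Divide the scale back out in higher precision before the optimizer update.
recovered = scaled / LOSS_SCALE

print(unscaled)   # 0.0
print(recovered)  # close to the original 1e-8
```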

Gradient checkpointing (activation recomputation) targets activation memory specifically. The forward pass keeps only a few checkpoints, and any activation that backprop needs is recomputed from the nearest one. This drops activation memory for an n-layer network from O(n) to O(√n). The price is roughly one extra forward pass. The original paper measured a 1000-layer model going from 48GB of activations to 7GB, with about a 30% increase in runtime [3].
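The mechanism can be sketched on a toy chain of scalar layers. This is not how a framework implements it: the `tanh` layer, the function names, and the way "memory" is counted (number of stored activation values) are all assumptions made for illustration. With a segment length near √n, the stored values drop from about n to about 2√n while the gradient stays the same.

```python
import math


def layer(x: float, w: float) -> float:
    """Toy layer: y = tanh(w * x)."""
    return math.tanh(w * x)


def layer_input_grad(x: float, w: float, grad_out: float) -> float:
    """Gradient with respect to the layer input, given the upstream gradient."""
    y = math.tanh(w * x)
    return grad_out * (1.0 - y * y) * w


def backward_full(x0, weights):
    """Store every activation in the forward pass, then backprop."""
    acts = [x0]
    for w in weights:
        acts.append(layer(acts[-1], w))
    grad = 1.0
    for i in reversed(range(len(weights))):
        grad = layer_input_grad(acts[i], weights[i], grad)
    return grad, len(acts)  # gradient, number of stored values


def backward_checkpointed(x0, weights, segment):
    """Store only segment inputs, recompute the rest during backprop."""
    n = len(weights)
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x  # input of the layer that starts a segment
        x = layer(x, w)
    grad, peak = 1.0, 0
    for start in reversed(range(0, n, segment)):
        end = min(start + segment, n)
        acts = [checkpoints[start]]
        for i in range(start, end - 1):  # the extra forward work
            acts.append(layer(acts[-1], weights[i]))
        peak = max(peak, len(checkpoints) + len(acts))
        for i in reversed(range(start, end)):
            grad = layer_input_grad(acts[i - start], weights[i], grad)
    return grad, peak


weights = [1.0 + 0.01 * (i % 5) for i in range(100)]
g_full, stored_full = backward_full(0.5, weights)
g_ckpt, stored_ckpt = backward_checkpointed(0.5, weights, segment=10)
print(stored_full, stored_ckpt)
```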

ZeRO (DeepSpeed) cuts the redundancy in data parallelism. Ordinary data parallelism has every GPU hold a full, redundant copy of the weights, gradients, and optimizer state. ZeRO shards these instead. It comes in stages: Stage 1 shards only the optimizer state for about a 4x reduction, Stage 2 also shards gradients for about 8x, and Stage 3 shards the parameters too, so the reduction scales with the data-parallel degree (up to 64x on 64 GPUs) [1]. PyTorch’s FSDP is a native implementation of that Stage 3 behavior, sharding parameters, gradients, and optimizer state [4]. If the model still does not fit on the GPU, you can spill optimizer state to the CPU with ZeRO-Offload (which trains a 13B model on a single V100) [5], or go all the way to CPU and NVMe with ZeRO-Infinity [6]. All of these add communication for the sharding or spilling, so you have to overlap compute with communication or you lose speed.
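The stage-by-stage reductions follow directly from the 16-byte breakdown: each stage divides one more component by the data-parallel degree. The sketch below works that out per GPU. The function name and the 7.5B-parameter, 64-GPU input are illustrative choices, and activations, buffers, and fragmentation are ignored.

```python
def zero_bytes_per_gpu(num_params: float, dp_degree: int, stage: int) -> float:
    """Model-state bytes per GPU under mixed-precision Adam for ZeRO stage 0-3."""
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= dp_degree    # Stage 1: shard optimizer state
    if stage >= 2:
        grads /= dp_degree    # Stage 2: also shard gradients
    if stage >= 3:
        weights /= dp_degree  # Stage 3: also shard parameters
    return num_params * (weights + grads + optim)


baseline = zero_bytes_per_gpu(7.5e9, 64, stage=0)
for stage in range(4):
    used = zero_bytes_per_gpu(7.5e9, 64, stage)
    print(f"stage {stage}: {used / 1e9:.1f} GB per GPU, {baseline / used:.1f}x smaller")
```

With 64 GPUs the ratios come out close to 4x, 8x, and 64x, which is where the stage figures above come from.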

Cutting the weights you train: LoRA / QLoRA

If you are adapting an existing model to your task rather than training from scratch, the whole thing gets much lighter. The trick is not to update all the weights. That is what PEFT (parameter-efficient fine-tuning) does, and the current standard is the LoRA family.

LoRA freezes the pretrained weights and injects a small pair of low-rank matrices into each layer, training only those [7]. Since you only update the small injected matrices, the number of trainable parameters drops sharply. The original paper, on GPT-3 175B, reported cutting trainable parameters by up to 10,000x and training VRAM from 1.2TB to 350GB (roughly a 3x reduction) [7]. The savings come from gradients and optimizer state: frozen weights need neither, so most of those 16 bytes disappear for every frozen parameter. Per-task storage is small too. A 175B checkpoint is 350GB, but a LoRA one is around 35MB [7]. And because the trained matrices can be merged back into the original weights at inference time, there is no added inference latency [7]. Quality is reported to match or beat full fine-tuning from RoBERTa up to GPT-3 scale [7].
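A small sketch makes both claims concrete: how few parameters a rank-r pair adds, and why merging leaves the output unchanged. It uses plain Python lists instead of a tensor library, and the matrix sizes, the rank, and the scale value (standing in for alpha / rank) are hypothetical.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """B is d_out x rank and A is rank x d_in, so the pair has rank * (d_out + d_in) entries."""
    return rank * (d_out + d_in)


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def merge(w, a, b, scale):
    """Return W + scale * (B @ A) so inference uses a single matrix."""
    rank = len(a)
    return [
        [w[i][j] + scale * sum(b[i][r] * a[r][j] for r in range(rank))
         for j in range(len(w[0]))]
        for i in range(len(w))
    ]


# One 4096 x 4096 projection: full fine-tuning vs a rank-8 adapter.
full = 4096 * 4096
adapter_params = lora_trainable_params(4096, 4096, rank=8)
print(full, adapter_params, full // adapter_params)

# Tiny numeric check: 2 x 3 frozen weight with a rank-1 adapter.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen
A = [[0.5, -0.5, 1.0]]                   # rank x d_in, trained
B = [[2.0], [1.0]]                       # d_out x rank, trained
scale = 2.0                              # alpha / rank, hypothetical value
x = [1.0, 2.0, 3.0]

# Unmerged: frozen path plus adapter path, two extra matmuls per layer.
delta = [scale * v for v in matvec(B, matvec(A, x))]
unmerged = [h + d for h, d in zip(matvec(W, x), delta)]

# Merged: one matrix, same output.
merged = matvec(merge(W, A, B, scale), x)
print(unmerged, merged)
```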

QLoRA pushes this to the limit. It quantizes the base model to 4-bit (NF4) and freezes it, then backpropagates gradients through that frozen 4-bit model into small LoRA adapters [8]. Three pieces (4-bit NormalFloat, Double Quantization that also quantizes the quantization constants, and Paged Optimizers to absorb memory spikes) brought a 65B fine-tune that normally needs over 780GB down onto a single 48GB GPU, while keeping the task performance of full 16-bit training [8]. There are further refinements like DoRA, which splits a weight into magnitude and direction and applies LoRA to the direction, reportedly improving quality without adding inference overhead [9].
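Double Quantization is easiest to understand as bookkeeping on the quantization constants. The sketch below assumes the block sizes described in the QLoRA paper (one 32-bit constant per block of 64 weights, then those constants stored in 8 bits with one 32-bit constant per 256 of them); treat these as assumptions if your setup differs. The function names are made up for this example.

```python
def constant_overhead_bits(block_size: int = 64, constant_bits: int = 32) -> float:
    """Extra bits per weight when each block stores one full-precision constant."""
    return constant_bits / block_size


def double_quant_overhead_bits(block_size: int = 64, first_bits: int = 8,
                               second_block: int = 256, second_bits: int = 32) -> float:
    """Extra bits per weight when the constants themselves are quantized."""
    return first_bits / block_size + second_bits / (block_size * second_block)


def gigabytes(num_params: float, bits_per_param: float) -> float:
    return num_params * bits_per_param / 8 / 1e9


params = 65e9
plain = constant_overhead_bits()
double = double_quant_overhead_bits()
print(gigabytes(params, 4))       # 4-bit base weights
print(gigabytes(params, plain))   # constants without Double Quantization
print(gigabytes(params, double))  # constants with Double Quantization
```

Under these assumptions the constants shrink from 0.5 to about 0.127 bits per weight, which is roughly 3GB on a 65B model.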

There are traps. If you keep adapters for several tasks separate and switch them dynamically without merging, you pay inference latency [7]. Merge them and you can no longer switch between multiple tasks within a single batch. The rank is also task dependent: too small and the adapter lacks expressive power.

More capacity for the same compute budget: MoE

So far this has been about cutting cost. MoE is a bit different. It lets you train a larger, smarter model for the same amount of compute.

The mechanism is sparse activation. Split the FFN into several experts and have a router send each token through only some of them. The compute (FLOPs) each token uses stays fixed while the total parameter count, the model’s capacity, grows [10]. So for the same compute budget you can train a model larger than a dense one. Switch Transformer reported up to a 7x pretraining speedup over T5 for the same compute [11], and Hugging Face’s writeup summarizes it the same way: MoE pretrains much faster than dense and lets you scale the model and data substantially for a given budget [10]. As a concrete case, DeepSeek-V3, a 671B-parameter MoE, reports about 2.788 million H800 GPU hours for its full training run (around $5.6M at $2/GPU-hour), of which pretraining took about 2.664 million GPU hours and under two months. This has become a reference point for MoE training economics [12] (the figure covers the production run only and does not include research or trial-and-error costs). When you actually distribute a giant model, you use expert parallelism to place each expert on a separate device; GShard trained a 600B+ model on 2048 TPU v3 chips in 4 days this way [13].
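The routing step can be sketched in a few lines. This is a toy, single-token version with a Switch-style gate (the expert output is multiplied by its router probability, with no renormalization over the chosen experts); real implementations batch tokens and differ in such details. The router weights, the experts, and the call counter are hypothetical and exist only to show that unselected experts do no work.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(token, router_weights, experts, top_k=1):
    """Route one token to its top_k experts and mix their outputs."""
    # One router logit per expert: dot product of a router row with the token.
    logits = [sum(w * t for w, t in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    out = [0.0] * len(token)
    for i in chosen:  # only the chosen experts run
        y = experts[i](token)
        out = [o + probs[i] * v for o, v in zip(out, y)]
    return out, chosen


calls = [0, 0, 0, 0]


def make_expert(index, factor):
    def expert(token):
        calls[index] += 1  # count how often each expert actually computes
        return [factor * v for v in token]
    return expert


experts = [make_expert(i, f) for i, f in enumerate([1.0, 2.0, 3.0, 4.0])]
router_weights = [[2.0, 0.0], [0.0, 0.0], [-1.0, 0.0], [0.5, 0.0]]
output, chosen = moe_layer([1.0, 0.0], router_weights, experts, top_k=1)
print(chosen, calls)
```

Four experts hold parameters, but only one of them computes for this token, which is the sense in which capacity grows while per-token FLOPs stay fixed.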

MoE training has rough spots that dense training does not. First, if the router funnels tokens to a few experts, the rest never develop, so you need an auxiliary load balancing loss to even things out. Switch Transformer set that coefficient to 0.01 [11]. Second, training is unstable to begin with; ST-MoE stabilized it and improved fine-tuning quality with a router z-loss that keeps the router logits in check [14]. Third, the capacity factor cuts both ways: shrink it and overflowing tokens get dropped, hurting quality; make it large and you waste memory and compute [10]. Fine-tuning tends to overfit, and there is a reported pattern where, for the same pretraining perplexity, sparse models lag dense ones on downstream tasks (though instruction tuning works especially well on MoE) [10]. At inference, MoE is fast because few parameters are active, but memory does not shrink because all experts have to sit in VRAM. Mixtral 8x7B has 47B total parameters and needs VRAM equivalent to a dense 47B at inference [15]. MoE, in short, saves compute at the cost of memory and operational complexity.
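The load balancing loss and the capacity factor are both simple formulas. The sketch below follows the Switch Transformer formulation (coefficient times number of experts times the sum over experts of the fraction of tokens routed there times the mean router probability for that expert). The function names and the toy inputs are assumptions for illustration.

```python
def load_balancing_loss(assignments, router_probs, num_experts, alpha=0.01):
    """alpha * N * sum_i(f_i * P_i) over experts i.

    assignments:  expert index chosen for each token
    router_probs: per-token list of router probabilities, one per expert
    """
    num_tokens = len(assignments)
    total = 0.0
    for e in range(num_experts):
        f = sum(1 for a in assignments if a == e) / num_tokens  # fraction routed to e
        p = sum(probs[e] for probs in router_probs) / num_tokens  # mean router prob for e
        total += f * p
    return alpha * num_experts * total


def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    """Maximum tokens one expert accepts; tokens beyond this overflow."""
    return int(tokens_per_batch / num_experts * capacity_factor)


# Perfectly balanced routing: the loss sits at its minimum, alpha.
balanced = load_balancing_loss([0, 1, 2, 3], [[0.25] * 4] * 4, num_experts=4)

# Collapsed routing: every token goes to expert 0 with full confidence.
collapsed = load_balancing_loss([0, 0, 0, 0], [[1.0, 0.0, 0.0, 0.0]] * 4, num_experts=4)

print(balanced, collapsed)
print(expert_capacity(1024, 8, 1.25))
```

The collapsed case costs N times more than the balanced one, which is the pressure that keeps the router from starving experts.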

Paying for training up front to get a small, fast model

Last are the compression techniques that pay the training cost now to get a permanently lighter model after deployment. These are a trade: make training heavier now to make later inference lighter.

Knowledge distillation transfers knowledge by having a small student model imitate the output distribution of a large teacher. DistilBERT cut parameters by 40% and sped up inference by 60% relative to BERT, while keeping 97% of its language understanding [16]. The up-front cost is the teacher’s inference plus the student’s training, but once built, generation is cheap. One use is distilling an MoE down to dense, keeping about 30% of the quality gain with roughly 1/20 the parameters [11].
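"Imitating the output distribution" usually means minimizing a divergence between temperature-softened teacher and student distributions. The sketch below shows one common form, KL divergence scaled by the squared temperature. The temperature of 2.0 and the logits are made-up values, and real recipes typically combine this term with other losses.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [2.0, 1.0, 0.0]
same = distillation_loss(teacher, teacher)          # student matches the teacher
close = distillation_loss([1.8, 1.0, 0.1], teacher)  # student nearly matches
far = distillation_loss([0.0, 1.0, 2.0], teacher)    # student ranks classes backwards
print(same, close, far)
```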

Pruning removes low-importance weights and retrains the rest to recover accuracy. A classic study shrank AlexNet by 9x and VGG-16 by 13x while preserving accuracy [17]. Combined with quantization and Huffman coding, Deep Compression reported 35x to 49x compression [17]. The catch: unstructured pruning of individual weights produces sparse matrices, and without specialized hardware or libraries, the compression ratio does not translate directly into speed. Retraining costs too.
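A minimal sketch of unstructured magnitude pruning, using weight magnitude as the importance score (one common criterion, assumed here). The function name and the weights are hypothetical. The returned mask is what keeps pruned weights at zero during retraining.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights.

    Returns the pruned weights and a 0/1 mask to reapply during retraining.
    """
    num_to_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:num_to_prune])
    mask = [0 if i in pruned_idx else 1 for i in range(len(weights))]
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask


weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.02, 0.5, -0.1, 0.04, 0.8]
pruned, mask = magnitude_prune(weights, sparsity=0.5)
print(pruned)
print(sum(mask), "of", len(mask), "weights kept")
```

Half the weights are now zero, but the list is still ten entries long. A dense kernel does the same amount of work on it, which is why the compression ratio does not turn into speed without sparse-aware storage and kernels.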

Quantization-aware training (QAT) simulates the rounding of quantization during training, producing a model that holds up better at low bit widths [18]. Post-training quantization (PTQ, such as GPTQ/AWQ, covered in the inference article) needs no training and is easy, but QAT pays a training cost in exchange for keeping accuracy at lower bit widths. You choose between them based on your target bit width and whether you can afford the training cost.
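"Simulating the rounding" means a quantize-then-dequantize step in the forward pass, so the model trains against the values it will actually have after deployment. The sketch below uses symmetric 4-bit quantization with a fixed scale for simplicity, and a straight-through estimator for the backward pass. Both are assumptions chosen for illustration, not the exact scheme of the cited paper.

```python
def fake_quantize(x: float, scale: float, qmin: int = -8, qmax: int = 7) -> float:
    """Round to the integer grid, clamp to the 4-bit range, map back to float."""
    q = round(x / scale)
    q = max(qmin, min(qmax, q))
    return q * scale


def straight_through_grad(x: float, scale: float, grad_out: float,
                          qmin: int = -8, qmax: int = 7) -> float:
    """Backward pass: treat rounding as identity inside the range, zero outside."""
    inside = qmin * scale <= x <= qmax * scale
    return grad_out if inside else 0.0


scale = 0.25
print(fake_quantize(0.3, scale))    # snaps to the nearest grid point
print(fake_quantize(5.0, scale))    # clamped to the top of the range
print(fake_quantize(-0.6, scale))
print(straight_through_grad(0.3, scale, 1.0))  # gradient passes through
print(straight_through_grad(5.0, scale, 1.0))  # gradient blocked by the clamp
```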

Side by side

Here are the training-side techniques lined up by what each cuts and what it costs. Every figure is a representative value the cited source reported under specific conditions, and your results will vary by environment.

| Technique | What it mainly cuts | What you pay | Representative figure | Main trap |
| --- | --- | --- | --- | --- |
| Mixed precision training | Memory for weights, gradients, activations | The hassle of loss scaling | Memory roughly halved [2] | FP16 underflow |
| Gradient checkpointing | Activation memory | One extra forward pass (+~30% time) | Activations O(√n), 48→7GB [3] | Recompute slows training |
| ZeRO / FSDP | Weights, gradients, optimizer state (sharded) | More inter-GPU communication | 4x / 8x / N_d x at Stage 1/2/3, where N_d is the data-parallel degree [1][4] | Must overlap comm and compute |
| ZeRO-Offload/Infinity | GPU memory (spilled to CPU/NVMe) | PCIe/NVMe bandwidth, slowdown | 13B on a single V100 [5] | Bandwidth is the bottleneck |
| LoRA | Trainable params, optimizer state | Rank choice, task dependent | Trainable params cut up to 10,000x [7] | Latency when loaded separately |
| QLoRA | Training memory (4-bit frozen + LoRA) | Quantization accuracy, kernel dependency | 65B on a single 48GB GPU [8] | bitsandbytes dependency |
| MoE | Training and inference FLOPs (capacity grows) | High memory, unstable training, aux loss | 7x pretraining vs T5 [11] | All experts resident, overfitting |
| Distillation | Final model size, inference cost | Teacher inference + student training up front | 40% smaller, 60% faster, 97% kept [16] | Hard to beat the teacher |
| Pruning + retraining | Final model parameter count | Retraining, (unstructured) specialized HW | 9-13x compression (accuracy kept) [17] | Compression ratio ≠ speedup |
| QAT | Final model bit width | Training cost (simulating quantization) | High accuracy even at low bits [18] | More work than PTQ |

LoRA/QLoRA, ZeRO/FSDP, mixed precision, and gradient checkpointing all “make training itself lighter.” Distillation, pruning, and QAT “make training heavier to make the final model lighter.” MoE sits in between: it saves compute and buys capacity, paying with memory and complexity.

Where to start

By goal, it breaks down like this.

If you just want to adapt an existing model to your task, start with LoRA, and reach for QLoRA when memory is tight. Even a single GPU puts large models within reach, and the quality rarely lags full fine-tuning by much. Most real work stops here.

If you are training a large model from scratch or doing a full fine-tune, build on mixed precision, add gradient checkpointing when activations get tight, and shard memory with ZeRO/FSDP if you have multiple GPUs. If it still does not fit, spill with ZeRO-Offload/Infinity. The economical order is mixed precision → gradient checkpointing → ZeRO Stage 1/2/3 → offload, turning up the dial only as far as you need.

If you want a smarter model for the same compute budget, consider MoE. Use it knowing the costs: memory, training instability, and fine-tuning overfitting.

If, training cost aside, you just want a lighter deployment, build a small, fast final model with distillation, pruning, or QAT. You pay once, and every later inference stays light. What comes after that (how to run the resulting small model fast) is the inference side of the story.

Caveats and limits

Every figure depends on the model, hardware, data, and configuration. They are representative values the sources reported, not a guarantee you will see the same numbers in your environment. Training has a high trial-and-error cost, so it is safer to run a small configuration first and scale up.

A few things that are easy to misread. What LoRA/QLoRA cut is mostly optimizer state and gradients; activation memory still applies (which is why gradient checkpointing is used alongside them for long sequences). MoE reduces training and inference compute but not memory. A pruning compression ratio is not a speedup. QLoRA’s “matches full 16-bit” and DistilBERT’s “97% kept” are results on the benchmarks tested, not guarantees across all tasks.

And the big assumption underneath all of this: what we covered is how to cut the resources of training and building, not what you train on (the quality and quantity of data). No amount of resource saving makes a good model if the training data does not fit the use case.

Wrapping up

Training-side optimization runs on a different axis from inference. Training cost shows up in compute (FLOPs / GPU hours) and memory (weights + gradients + optimizer state + activations). Which slice of that you cut determines the technique: LoRA/QLoRA for adapting an existing model, mixed precision + gradient checkpointing + ZeRO/FSDP for training a giant one, MoE for adding capacity, and distillation/pruning/QAT for a lighter deployment.

| Goal | First move | Next move |
| --- | --- | --- |
| Fine-tune an existing model | LoRA | QLoRA (when memory is tight) |
| Train a giant model | Mixed precision + gradient checkpointing | ZeRO/FSDP → offload |
| Smarter for the same compute | MoE (sparse activation) | Distribute with expert parallelism |
| Lighter deployment | Distillation | Pruning / QAT |

To try something today, run a small QLoRA fine-tune on a model you have and watch how much the GPU memory drops compared to a full fine-tune. Once you feel most of the “16 bytes × parameters” collapse down to the adapter alone, you start to see where the training resources can actually be cut.

This article focused on saving resources on the training and building side. Running the finished model fast and light (quantization, KV cache, speculative decoding, batching, serving frameworks) is covered in the sister article, “A Field Guide to Making LLM Inference Faster and Lighter.”


References

References are listed in citation order, matching the numbers in the text. Multipliers and reduction rates are values each source reported under specific conditions and will vary by environment.

  1. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models - Samyam Rajbhandari et al. (2019, SC 2020). Removes data-parallel redundancy, sharding optimizer state/gradients/parameters at Stage 1/2/3 (about 4x / 8x / N_d x reduction, where N_d is the data-parallel degree). Also explains the per-parameter ~16-byte breakdown and ~16TB for a trillion parameters under mixed-precision Adam (see the Microsoft Research writeup). [Reliability: high (peer reviewed, top venue)]

  2. Mixed Precision Training - Paulius Micikevicius et al. (2017, ICLR 2018). Storing weights, activations, and gradients in half precision roughly halves memory. Keeps an FP32 master copy of the weights and uses loss scaling to handle FP16 underflow. [Reliability: medium to high (peer reviewed, standard technique)]

  3. Training Deep Nets with Sublinear Memory Cost - Tianqi Chen et al. (2016). Gradient checkpointing (activation recomputation) brings activation memory to O(√n). Measured a 1000-layer model going from 48GB to 7GB, with about +30% runtime. [Reliability: medium to high (widely adopted)]

  4. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel - Yanli Zhao et al. (2023, VLDB 2023). A PyTorch-native implementation that shards parameters, gradients, and optimizer state (equivalent to ZeRO Stage 3). Official docs. [Reliability: medium to high (peer reviewed, official)]

  5. ZeRO-Offload: Democratizing Billion-Scale Model Training - Jie Ren et al. (2021, USENIX ATC 2021). Spills optimizer state and update computation to the CPU, training 13B parameters on a single V100. [Reliability: medium to high (peer reviewed)]

  6. ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning - Samyam Rajbhandari et al. (2021, SC 2021). Uses the three tiers of GPU/CPU/NVMe to enable fine-tuning a trillion parameters on a single node. [Reliability: medium to high (peer reviewed)]

  7. LoRA: Low-Rank Adaptation of Large Language Models - Edward Hu et al., Microsoft (2021). Freezes the weights and trains only low-rank matrices. On GPT-3 175B, cut trainable parameters by up to 10,000x and training VRAM from 1.2TB to 350GB, with checkpoints going from 350GB to 35MB. No inference latency added once merged. See also Hugging Face PEFT docs (merging and switching adapters). [Reliability: medium to high (peer reviewed, widely adopted)]

  8. QLoRA: Efficient Finetuning of Quantized LLMs - Tim Dettmers et al. (2023, NeurIPS 2023). LoRA on a base frozen in 4-bit NF4. Brought a 65B fine-tune that normally needs over 780GB onto a single 48GB GPU while keeping full 16-bit-equivalent performance. [Reliability: medium to high (peer reviewed)]

  9. DoRA: Weight-Decomposed Low-Rank Adaptation - Shih-Yang Liu et al., NVIDIA (2024). Decomposes a weight into magnitude and direction and applies LoRA to the direction. Reported to consistently outperform LoRA without adding inference overhead. [Reliability: medium (preprint, widely adopted)]

  10. Mixture of Experts Explained - Hugging Face (2023). Explains that MoE scales pretraining substantially for a given compute budget, while covering the costs: all experts resident in VRAM, overfitting during fine-tuning, training instability, the load balancing auxiliary loss, and the capacity factor. [Reliability: medium to high (official explainer blog)]

  11. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity - William Fedus, Barret Zoph, Noam Shazeer (2021, JMLR 2022). Up to 7x pretraining speedup over T5 for the same compute with 1-expert routing. Load balancing auxiliary loss coefficient of 0.01, and MoE→dense distillation that keeps about 30% of the quality gain at roughly 1/20 the parameters. [Reliability: medium to high (peer reviewed)]

  12. DeepSeek-V3 Technical Report - DeepSeek-AI (2024). Trained a 671B-parameter MoE in about 2.788 million H800 GPU hours in total, of which about 2.664 million were pretraining. A concrete example of MoE training economics (production-run GPU hours only, excluding research and trial costs). [Reliability: medium to high (technical report, measured)]

  13. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding - Dmitry Lepikhin et al. (2020). Trained a 600B+ MoE translation model on 2048 TPU v3 chips in 4 days using expert parallelism. [Reliability: medium to high (widely referenced)]

  14. ST-MoE: Designing Stable and Transferable Sparse Expert Models - Barret Zoph et al. (2022). Improves training instability and fine-tuning quality with a router z-loss. [Reliability: medium to high (widely referenced)]

  15. Mixtral of Experts - Mistral AI (2024). 47B total, with about 13B active per token. Fast at inference, but all experts must be resident, requiring VRAM equivalent to a dense 47B (memory does not shrink). [Reliability: medium to high (technical report, measured)]

  16. DistilBERT, a distilled version of BERT - Victor Sanh et al. (2019). Knowledge distillation cuts parameters by 40%, speeds up inference by 60%, and keeps 97% of language understanding. [Reliability: medium to high (widely validated original)]

  17. Deep Compression - Song Han et al. (2015, ICLR 2016). Pruning + quantization + Huffman coding compresses AlexNet 35x and VGG-16 49x (accuracy kept). Pruning + retraining alone gives 9-13x (original). [Reliability: medium to high (peer reviewed, classic)]

  18. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference - Benoit Jacob et al., Google (2017, CVPR 2018). QAT simulates quantization during training. Enables integer-only inference and holds accuracy better than PTQ at lower bit widths. [Reliability: medium to high (peer reviewed)]

This post is licensed under CC BY 4.0 by the author.