A Field Guide to Making LLM Inference Faster and Lighter: Choosing Techniques by Bottleneck

Posted Jun 14, 2026

AI-Generated Content

This article was generated by AI. The accuracy of the content is not guaranteed, and we accept no responsibility for any damages resulting from use of this article. By continuing to read, you agree to the Terms of Use.

Audience: Engineers running LLMs locally or in the cloud who are stuck on one of three problems: it’s too slow, it won’t fit in VRAM, or it costs too much
Prerequisites: A rough picture of Transformer inference and the basics of GPU memory. Each technique is explained in the text
Reading time: about 25 min

Overview

Your LLM is slow, or it won’t fit in VRAM. When that happens, it’s tempting to reach for the nearest lever: quantize it, add a GPU, see what sticks. Sometimes that works and sometimes it doesn’t, and the reason is simple. LLM inference does not have one bottleneck. Is the model too big for the card? Is each token slow to produce? Can it not keep up with concurrent requests? Is a long prompt slow to get started? Different problems need different fixes.

One fact runs through this whole article. LLM generation, the decode step, is limited by memory bandwidth, not by compute¹. Producing each token means reading the model’s entire set of weights out of VRAM, and the amount of arithmetic done on those weights is tiny by comparison. That’s why quantization (fewer bits per weight), switching to a smaller model, and compressing the KV cache all speed things up for the same underlying reason: they reduce the number of bytes you have to move per token. The flip side is that tricks aimed at making the arithmetic faster don’t help decode much.

This holds at low batch sizes, which is what you get with low-latency interactive use. Once you batch many requests together, the cost of reading the weights gets amortized across many tokens, and compute starts to matter again¹. So optimization is something you choose after working out which phase and which bottleneck your particular workload is hitting.

This article sorts the main techniques (quantization, KV cache reduction, speculative decoding, batching, model compression, parallelism) by which axis they move: memory capacity, decode bandwidth (latency), throughput, or prefill (time to first token). For each one I’ll give the upside and the trap. Then I’ll lay out what’s realistic in different settings as of early 2026 (cloud production, a single local GPU, the edge, managed APIs), and finish with how to pick by bottleneck and how to choose a serving framework. The short version of the priority order: quantize first; if it still won’t fit, cut the KV cache and offload; to speed up a conversation, add speculative decoding; to serve a lot of traffic, use continuous batching.

First principle: where is the bottleneck

Before any optimization, it helps to know that inference splits into two phases.

Prefill processes the prompt. It computes all the input tokens at once, the matrix multiplications are large, and the GPU’s arithmetic units stay busy. This phase is compute-bound. Decode generates one token at a time. Each step reads all the weights and the KV cache out of VRAM to compute a single token. There’s far more data to move than there is arithmetic to do, so decode is memory-bandwidth-bound¹.

Why does this matter? On a roofline model (a performance model that plots arithmetic intensity on the x-axis and achieved performance on the y-axis), each layer of Llama-2-7B sits around 1 OPs/byte during decode, well below the inflection point and squarely in the bandwidth-bound region¹. In that region, performance is set by how many bytes you move. Drop the weights from FP16 to INT4 and you move roughly a quarter of the weight data, so token generation in decode can speed up by up to about the same factor¹. That’s all “quantization is faster” and “small models are faster” really mean.
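
To put a rough number on that, here is a minimal sketch of the bandwidth bound. It assumes weight reads dominate each decode step and ignores the KV cache, activations, and kernel overhead, so it gives a ceiling rather than a prediction. The function name and the 7B / 1000 GB/s inputs are illustrative assumptions, not figures from the cited survey.

```python
def max_decode_tokens_per_second(params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound on single-stream decode speed if every token reads all weights once."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes


# Illustrative assumptions: a 7B-parameter model on a GPU with 1000 GB/s of memory bandwidth.
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    bound = max_decode_tokens_per_second(7, bits, 1000)
    print(f"{name}: at most about {bound:.0f} tokens/s")
```

Halving the bits per weight doubles the ceiling. Real kernels land somewhere below it, which is why measuring still matters.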

A caveat up front: “decode is bandwidth-bound” depends on batch size. Bundle requests into a large batch and the weights you read once get reused across many tokens, arithmetic intensity climbs, and you move toward compute-bound¹. Low-latency online inference and large-batch high-throughput serving call for different techniques. The rest of this article keeps that axis in mind as it walks through each one.
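
The batch-size effect can be sketched the same way. This is a simplified model that counts only the weight matrices (one multiply-add per weight per sequence in the batch) and ignores KV cache reads, which do not amortize across requests. The GPU figures are a hypothetical round-number device, not a specific product.

```python
def arithmetic_intensity(batch_size, bytes_per_weight):
    """FLOPs per byte of weights read: one multiply-add (2 FLOPs) per weight per sequence."""
    return 2 * batch_size / bytes_per_weight


def ridge_point(peak_tflops, bandwidth_gb_s):
    """Arithmetic intensity where the roofline turns from bandwidth-bound to compute-bound."""
    return (peak_tflops * 1e12) / (bandwidth_gb_s * 1e9)


def regime(batch_size, bytes_per_weight, peak_tflops, bandwidth_gb_s):
    """Classify a decode step under this simplified weight-only model."""
    intensity = arithmetic_intensity(batch_size, bytes_per_weight)
    if intensity >= ridge_point(peak_tflops, bandwidth_gb_s):
        return "compute-bound"
    return "bandwidth-bound"


# Hypothetical GPU: 300 TFLOP/s of FP16 compute and 2000 GB/s of memory bandwidth.
for batch in (1, 16, 64, 256):
    print(batch, arithmetic_intensity(batch, 2), regime(batch, 2, 300, 2000))
```

With FP16 weights the intensity is roughly equal to the batch size, which matches the "around 1 OPs/byte" figure at batch 1. Under these assumed numbers the crossover sits at a batch of about 150.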

Quantization: the highest-leverage move

Quantization represents weights or data with fewer bits than FP16. It cuts memory capacity directly and speeds up the bandwidth-bound decode step. The effect is large and it’s worth trying first. But what you quantize changes where it helps.

Weight-only methods quantize just the weights to 4 bits: GPTQ, AWQ, and GGUF. GPTQ is a post-training quantization (PTQ) method that uses second-order information per layer to correct rounding error. It can quantize a 175B model to 4 bits in about four GPU hours and reports roughly 3.25x (A100) to 4.5x (A6000) speedups over FP16². AWQ identifies the roughly 1% of channels that matter from the activation distribution and protects them, reaching around 3x the FP16 speed with reduced memory on its own kernels³. GGUF (llama.cpp), the local standard, comes in many types with different bit allocations. For a 7B model, the balanced Q4_K_M lands around 3.80 GB with a perplexity increase of about +0.05, while Q8_0 is 6.70 GB and essentially lossless⁴. Keep in mind that weight-only mainly cuts decode bandwidth, so it does little for large-batch or prefill compute throughput.
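
For a feel of what "4 bits per weight" means mechanically, here is a minimal sketch of plain round-to-nearest group quantization. It is deliberately not GPTQ, AWQ, or a GGUF K-quant: those add error correction, channel protection, or nested scales on top of this basic idea. The function names and the sample weights are made up for illustration.

```python
def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of float weights."""
    levels = (1 << bits) - 1              # 15 steps between min and max for 4 bits
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0     # avoid a zero scale when all weights are equal
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Reconstruct approximate float weights from the integer codes."""
    return [c * scale + lo for c in codes]


# Made-up weights for one small group.
group = [0.12, -0.40, 0.33, 0.05, -0.21, 0.48, -0.02, 0.27]
codes, scale, lo = quantize_group(group)
restored = dequantize_group(codes, scale, lo)
max_error = max(abs(a - b) for a, b in zip(group, restored))
print(codes)
print(f"scale={scale:.4f}, max error={max_error:.4f}")
```

Each weight is stored as a code from 0 to 15 plus a shared scale and offset per group, and the rounding error is bounded by half a step. The methods named above differ mainly in how they choose what to round and how to compensate for it.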

If you quantize activations to 8 bits as well, W8A8 (INT8 or FP8), the matrix multiplications can run on low-precision Tensor Cores, so it also helps compute-bound prefill and large batches. SmoothQuant shifts outliers to the weight side with an equivalent transform so INT8 works, giving up to 1.56x speedup and 2x memory reduction⁵. FP8 assumes native hardware support, which means Hopper (H100) or Ada Lovelace and later, where it cuts memory by about 2x and raises throughput by up to 1.6x with minimal accuracy impact⁶. The corollary is that an A100-generation card, which has no FP8 Tensor Cores, gets essentially no compute speedup from FP8.
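
The "equivalent transform" behind SmoothQuant can be shown on a toy matrix. This sketch divides each activation channel by a factor s and multiplies the matching weight row by the same s, so the product is unchanged while the activation outlier shrinks. The per-channel formula follows my reading of the paper; the value alpha = 0.5 and the sample matrices are illustrative assumptions, and the code assumes no channel is all zeros.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (rows x k) and b (k x cols)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smooth(x, w, alpha=0.5):
    """Move activation outliers into the weights without changing x @ w."""
    channels = range(len(w))                       # rows of w are input channels
    x_max = [max(abs(row[j]) for row in x) for j in channels]
    w_max = [max(abs(v) for v in w[j]) for j in channels]
    s = [x_max[j] ** alpha / w_max[j] ** (1 - alpha) for j in channels]
    x_s = [[row[j] / s[j] for j in channels] for row in x]
    w_s = [[v * s[j] for v in w[j]] for j in channels]
    return x_s, w_s, s


# Made-up activations with one outlier channel (index 1) and small weights.
x = [[0.5, 40.0, -0.3], [-0.2, -35.0, 0.4]]
w = [[0.2, -0.1], [0.05, 0.3], [-0.4, 0.1]]
x_s, w_s, s = smooth(x, w)
print(matmul(x, w))
print(matmul(x_s, w_s))
print(max(abs(row[1]) for row in x_s), max(abs(v) for v in w_s[1]))
```

Both products print the same values, and the outlier channel's range is now split evenly between activations and weights, which is what makes both sides friendly to INT8.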

On the training side, QLoRA (LoRA on top of a frozen 4-bit base) cuts memory heavily, but it’s a technique for making training lighter rather than inference faster, so the details go in the companion article on training⁷.

If the thing eating memory on long inputs is the KV cache rather than the weights, then KV cache quantization is what helps. KVQuant reports keeping perplexity degradation under 0.1 with 3-bit quantization while fitting up to a million tokens of context on a single A100-80GB for LLaMA-7B⁸. It works independently of weight quantization, so it’s worth combining for long-context, large-batch workloads.

Trimming the KV cache and attention

The KV cache holds the Key and Value tensors for every past token. Its size grows as 2 (one Key and one Value tensor) × batch size × sequence length × number of layers × number of KV heads × head dimension × bytes per element, so on long inputs or large batches it can exceed the model weights themselves, and since decode re-reads all of it every step, it also eats bandwidth⁹. Attention variants cut this at the design level.
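
A small calculator makes the growth concrete. The factor of 2 covers Key plus Value, and batch size and bytes per element are included as explicit terms. The shape used in the example (32 layers, 32 KV heads, head dimension 128) is the Llama-2-7B configuration as far as I know; treat it as an assumption and substitute your own model's values.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, batch_size=1, bytes_per_elem=2):
    """Size of the KV cache in bytes: Key and Value for every layer, head, and token."""
    return 2 * batch_size * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Assumed shape: 32 layers, 32 KV heads, head dimension 128, FP16 (2 bytes per element).
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128)
per_token = kv_cache_bytes(seq_len=1, **shape)
print(f"per token: {per_token / 2**20:.2f} MiB")
for tokens in (4096, 32768):
    print(f"{tokens} tokens: {kv_cache_bytes(seq_len=tokens, **shape) / 2**30:.1f} GiB")
```

Under these assumptions each token costs 0.5 MiB, so a single 4,096-token sequence takes 2 GiB and a batch of such sequences quickly rivals the weights.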

MQA shares a single set of KV across all query heads, shrinking the KV cache by a factor equal to the head count¹⁰. The reduction is large but quality tends to drop. GQA sits in between: it splits query heads into groups and shares one set of KV per group. You can convert an existing MHA model with about 5% of its original training compute, keeping speeds close to MQA while staying near MHA quality⁹. It’s effectively the standard in today’s open models (the Llama family, Mistral, and others). MLA, introduced in DeepSeek-V2, compresses Key and Value into a low-rank latent vector before caching. It reports cutting the KV cache by 93.3% versus the previous generation and raising maximum generation throughput by 5.76x¹¹. These are model-architecture choices, so unless you train or convert a model yourself, you benefit by picking a model that’s built that way.
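
The difference between MHA, GQA, and MQA comes down to which KV head each query head reads. The sketch below shows that mapping and the resulting cache reduction. The head counts (32 query heads with 32, 8, or 1 KV heads) are illustrative, and MLA is left out because it caches a compressed latent instead of per-head tensors.

```python
def kv_head_for_query(query_head, n_query_heads, n_kv_heads):
    """Index of the shared KV head that a given query head attends with."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


def kv_reduction_factor(n_query_heads, n_kv_heads):
    """How many times smaller the KV cache is than with one KV head per query head."""
    return n_query_heads / n_kv_heads


# Illustrative head counts: 32 query heads in every variant.
for name, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    mapping = [kv_head_for_query(h, 32, kv_heads) for h in range(8)]
    print(name, mapping, f"{kv_reduction_factor(32, kv_heads):.0f}x smaller KV cache")
```

With 8 KV heads, query heads 0 to 3 share KV head 0, heads 4 to 7 share KV head 1, and so on, for a 4x smaller cache. MQA is the limiting case where every query head maps to KV head 0.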

On the serving side, PagedAttention (vLLM) makes the KV cache efficient. It doesn’t change the attention math; it manages the KV cache in fixed-size blocks like an operating system’s virtual memory, driving fragmentation to nearly zero. The freed memory lets you pack more concurrent requests, giving 2 to 4x the throughput of FasterTransformer or Orca¹².
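
The block-table idea can be sketched in a few lines. This is a toy allocator written for illustration, not vLLM's implementation: the class name, methods, and block size are all hypothetical. It shows the key property, which is that a sequence only ever wastes part of its last block.

```python
class PagedKVAllocator:
    """Toy block allocator: each sequence owns a list of fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> physical block ids, in logical order
        self.seq_lens = {}       # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve space for one more token, taking a new block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:        # first token, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self):
        """Allocated but unused token slots; at most block_size - 1 per sequence."""
        return sum(len(table) * self.block_size - self.seq_lens[seq_id]
                   for seq_id, table in self.block_tables.items())


allocator = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    allocator.append_token("a")
for _ in range(5):
    allocator.append_token("b")
print(allocator.block_tables, allocator.wasted_slots())
```

Compare this with reserving the maximum context length up front for every request: there the waste scales with the gap between the maximum and the actual length, here it is capped at one partial block per sequence.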

To speed up the attention computation itself, there’s the FlashAttention line. It computes the N×N attention matrix in tiles on-chip without writing it to VRAM, dropping memory from O(N²) to O(N) while running faster (about 3x on GPT-2)¹³. FlashAttention-2 raised A100 utilization, and FlashAttention-3 added another 1.5 to 2.0x for Hopper along with FP8 support¹³. Attention is heavy during prefill, so this helps time to first token.
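
The trick that lets attention be computed in tiles is the online softmax: keep a running maximum, a running sum, and a running weighted output, and rescale them whenever a new tile raises the maximum. The sketch below does this for a single query vector in plain Python and checks it against the naive version. It illustrates the numerics only; the real kernels get their speed from doing this in on-chip memory, which Python cannot show. The sample vectors are made up.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Reference: materialize every score, then softmax, then weight the values."""
    scale = 1 / math.sqrt(len(q))
    scores = [_dot(q, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def tiled_attention(q, keys, values, tile_size=2):
    """Same result, visiting keys and values one tile at a time with a small running state."""
    scale = 1 / math.sqrt(len(q))
    running_max, running_sum = float("-inf"), 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile_size):
        scores = [_dot(q, k) * scale for k in keys[start:start + tile_size]]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)   # evaluates to 0.0 on the first tile
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, values[start:start + tile_size]):
            p = math.exp(s - new_max)
            running_sum += p
            acc = [a + p * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


# Made-up query, keys, and values.
q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 0.5], [0.2, 0.3, -0.1], [-1.0, 2.0, 1.5], [0.0, 0.0, 3.0], [0.7, -0.7, 0.1]]
values = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 4.0], [1.5, 1.5]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values))
```

The two functions agree to floating-point precision, and the tiled one never holds more than one tile of scores at a time. That is why the result is exact rather than an approximation.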

Cutting the number of decode steps: speculative decoding

In bandwidth-bound decode, reducing how many times you read the weights (that is, the number of generation steps) makes things faster. Speculative decoding does exactly that. The shared idea: a lightweight model or head drafts several tokens ahead, and the main model verifies them together in one forward pass. The important part is that methods using rejection sampling for verification produce output identical to the main model alone (same distribution). It isn’t a “faster but lower quality” trick; it gets the same output in fewer steps.
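
The "same distribution" claim rests on a specific accept/reject rule. Here is a minimal sketch of that rule for a single drafted token, with an empirical check that the emitted tokens follow the main model's distribution and not the draft's. It follows my understanding of the rejection-sampling scheme from the original speculative decoding paper; the two toy distributions are made up, and a real system applies this position by position over a drafted sequence.

```python
import random


def verify_draft_token(token, p_target, q_draft, rng):
    """Accept or replace one drafted token; returns (emitted_token, accepted)."""
    if rng.random() < min(1.0, p_target[token] / q_draft[token]):
        return token, True
    # On rejection, resample from the normalized positive part of (p - q).
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    return rng.choices(range(len(residual)), weights=residual, k=1)[0], False


def emitted_frequencies(p_target, q_draft, trials=50_000, seed=0):
    """Draft from q, verify against p, and report how often each token is emitted."""
    rng = random.Random(seed)
    counts = [0] * len(p_target)
    for _ in range(trials):
        draft = rng.choices(range(len(q_draft)), weights=q_draft, k=1)[0]
        token, _accepted = verify_draft_token(draft, p_target, q_draft, rng)
        counts[token] += 1
    return [c / trials for c in counts]


p = [0.6, 0.3, 0.1]   # main model's next-token distribution (illustrative)
q = [0.3, 0.3, 0.4]   # draft model's next-token distribution (illustrative)
print(emitted_frequencies(p, q))
```

Even though the draft strongly prefers the third token, the emitted frequencies track p. A worse draft only lowers how often tokens are accepted; it does not change what comes out.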

There are several flavors. The original draft-model approach has a separate small model do the drafting and reaches 2 to 3x with no quality loss¹⁴. If a separate model is too much hassle, Medusa adds multiple prediction heads to the main model and gets 2.2 to 3.6x with no extra draft model¹⁵. The fast current option is the EAGLE line, where a lightweight head predicts using the main model’s intermediate features; EAGLE-3 reports up to 6.5x¹⁶. Lookahead Decoding needs neither a draft model nor head training: it builds n-gram candidates with Jacobi iteration. The gain is more modest, up to about 1.8x, but it’s easy to drop in¹⁷.

Two traps. First, speed depends heavily on the acceptance rate (how often drafts are correct). If your domain is off and drafts miss, there’s no gain. Second, batch size: speculative decoding helps most at low batch, and at large batch the GPU is already saturated so the gain shrinks (EAGLE-3 is the generation aimed at fixing this weakness)¹⁶. For tasks where the output repeats long spans of the input, like summarization or editing, n-gram methods that pull candidates from string matches in the prompt (prompt lookup) are also lightweight and effective.
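
How much the acceptance rate matters can be estimated with the closed-form expressions from the original speculative decoding paper¹⁴, as I read them. They assume every drafted token is accepted independently with the same probability, which real workloads only approximate. The parameter names and the example values (draft length 4, draft cost 10% of a main-model step) are illustrative assumptions.

```python
def expected_tokens_per_step(acceptance_rate, draft_len):
    """Expected tokens emitted per main-model verification pass."""
    if acceptance_rate >= 1.0:
        return draft_len + 1.0
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


def expected_speedup(acceptance_rate, draft_len, draft_cost_ratio):
    """Speedup over plain decoding when one draft step costs draft_cost_ratio main steps."""
    step_cost = draft_len * draft_cost_ratio + 1
    return expected_tokens_per_step(acceptance_rate, draft_len) / step_cost


# Illustrative settings: 4 drafted tokens, each costing 10% of a main-model step.
for rate in (0.3, 0.6, 0.8, 0.9):
    print(f"acceptance {rate:.1f}: "
          f"{expected_tokens_per_step(rate, 4):.2f} tokens/step, "
          f"{expected_speedup(rate, 4, 0.1):.2f}x")
```

Under these assumptions an acceptance rate of 0.8 gives about 2.4x, while 0.3 gives almost nothing once the draft cost is paid. The model also ignores the batch-size effect, which reduces the gain further when the GPU is already busy.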

Raising throughput: batching and caching

If you want to serve many concurrent requests, the axis you care about is throughput, not latency.

Continuous batching does not wait for every request to finish the way static batching does. It injects new requests on an iteration basis whenever a sequence ends, removing GPU idle time. In Anyscale’s benchmark it reports about 8x throughput on its own over naive static batching, and up to 23x combined with PagedAttention¹⁸. That multiplier is measured against a naive baseline, so against an already-optimized dynamic batcher it’s smaller.
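
A toy simulation shows where the idle time goes. It counts decode iterations only, treats every iteration as equally expensive, assumes all requests are already queued, and ignores prefill, so it illustrates the scheduling effect and says nothing about real multipliers. The request lengths are made up.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes; finished slots sit idle."""
    return sum(max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths, batch_size):
    """A freed slot is refilled from the queue at the next iteration."""
    waiting = deque(lengths)
    active = []                                   # remaining tokens per running request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1                                # one decode iteration for every active request
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Made-up output lengths (in tokens, each at least 1): two long requests among short ones.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batching_steps(lengths, batch_size=4))
print(continuous_batching_steps(lengths, batch_size=4))
```

With this mix, static batching needs 200 iterations because each batch waits for its 100-token request, while continuous batching finishes in 110. The more the output lengths vary, the larger the gap.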

Prefix caching reuses the KV cache for a shared prompt prefix to skip prefill. It helps when a system prompt is shared, in multi-turn conversations, and for repeated questions over the same long document, and it lowers time to first token¹⁹. It only helps prefill; the generation (decode) phase doesn’t get a millisecond shorter, which the official docs state plainly¹⁹. For a workload with no shared prefix and long generations, the gain is zero. Prompt caching in commercial APIs works the same way, reducing billed tokens and time to first token.
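
The mechanism can be sketched as a lookup over block-aligned prefixes. This toy version is hypothetical and much simpler than any real implementation: it keys each block by the full token prefix up to that point (real systems chain hashes instead), never evicts, and only counts how many prompt tokens would skip prefill.

```python
class PrefixCache:
    """Toy prefix cache that tracks which block-aligned prompt prefixes were seen before."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.cached = set()

    def prefill(self, tokens):
        """Return (reused_tokens, computed_tokens) for one prompt."""
        reused = 0
        full = len(tokens) - len(tokens) % self.block_size   # only whole blocks are cacheable
        for end in range(self.block_size, full + 1, self.block_size):
            key = tuple(tokens[:end])
            if key in self.cached:
                reused = end
            else:
                self.cached.add(key)
        return reused, len(tokens) - reused


# Made-up token ids: an 8-token shared system prompt followed by different user turns.
system_prompt = list(range(8))
cache = PrefixCache(block_size=4)
first = cache.prefill(system_prompt + [100, 101, 102])
second = cache.prefill(system_prompt + [200, 201, 202, 203, 204])
print(first, second)
```

The first request computes all 11 tokens. The second reuses the 8 shared tokens and computes only its 5 new ones. Nothing here touches the decode loop, which is why generation time is unchanged.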

Making the model smaller, or spreading it out

When techniques can’t squeeze it down enough, you change the model itself or split it across GPUs.

Distillation trains a small student on the output distribution of a large teacher. DistilBERT is the standard reference point: 40% smaller, 60% faster inference, 97% of the performance retained²⁰. Pruning removes low-importance weights. Structured pruning, which removes whole heads or neurons, maps cleanly to hardware and speeds things up; unstructured sparsity, which removes individual weights, rarely shows up as real speed without specialized kernels²¹.

MoE splits the FFN into multiple experts and activates only some per token. Mixtral 8x7B uses only 12.9B of its 46.7B total parameters per token, matching a 70B-class model at inference that’s roughly 6x faster²². One thing to watch: MoE does not reduce memory capacity. At inference you don’t know which expert will be picked, so all experts have to sit in VRAM, and the footprint is that of the full 46.7B parameters, far larger than a dense model with the same active parameter count. The speed comes from reading fewer weights per token (active params), which is the same bandwidth principle again.
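
The capacity-versus-bandwidth split is easy to see with arithmetic. The sketch below uses the Mixtral parameter counts quoted above and assumes FP16 weights (2 bytes each); it ignores the KV cache and any runtime overhead, and the function name is made up.

```python
def moe_memory_gb(total_params_billion, active_params_billion, bytes_per_weight=2.0):
    """Return (GB that must be resident, GB of weights read per token)."""
    resident = total_params_billion * bytes_per_weight
    read_per_token = active_params_billion * bytes_per_weight
    return resident, read_per_token


# Parameter counts from the text; FP16 storage is an assumption.
resident, read = moe_memory_gb(46.7, 12.9)
print(f"Mixtral 8x7B: {resident:.1f} GB resident, {read:.1f} GB read per token")

dense_resident, dense_read = moe_memory_gb(12.9, 12.9)
print(f"Dense 12.9B:  {dense_resident:.1f} GB resident, {dense_read:.1f} GB read per token")
```

Both models read the same amount per token, so their decode bandwidth cost is similar, but the MoE needs more than three times the VRAM to hold.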

When a model won’t fit on one GPU, you reach for parallelism. Tensor Parallelism (TP) splits each layer’s weights across GPUs horizontally and can lower latency, but every layer runs an all-reduce, so it assumes a fast interconnect like NVLink. Pipeline Parallelism (PP) assigns layers to stages, and its only communication is adjacent transfers between stages, so it suits thin-bandwidth setups (multiple nodes or PCIe), though pipeline bubbles increase single-request latency²³. In practice, a hybrid of TP within a node and PP across nodes is common. If it still won’t fit in VRAM, you can use llama.cpp’s partial offload to set how many layers go on the GPU and run the rest on CPU. But there’s a bandwidth cliff: the moment you spill out of VRAM, throughput drops sharply because the whole thing gets limited by the slow bandwidth of CPU RAM²³. That’s the bandwidth principle yet again.
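
The bandwidth cliff follows from the layers running one after another: the time per token is the GPU-resident part read at GPU speed plus the CPU-resident part read at CPU RAM speed. Here is a minimal sketch under that assumption. The model size and both bandwidth figures are illustrative round numbers, and transfer and compute costs are ignored.

```python
def tokens_per_second(weight_gb, gpu_fraction, gpu_bw_gb_s, cpu_bw_gb_s):
    """Decode speed bound when a fraction of the weights sits in VRAM and the rest in RAM."""
    gpu_time = weight_gb * gpu_fraction / gpu_bw_gb_s
    cpu_time = weight_gb * (1 - gpu_fraction) / cpu_bw_gb_s
    return 1 / (gpu_time + cpu_time)


# Illustrative assumptions: a 4 GB quantized model, 1000 GB/s VRAM, 50 GB/s system RAM.
for fraction in (1.0, 0.9, 0.5, 0.0):
    rate = tokens_per_second(4, fraction, 1000, 50)
    print(f"{fraction:.0%} of weights on GPU: about {rate:.0f} tokens/s")
```

With these numbers, moving just 10% of the weights to CPU RAM cuts the bound from 250 to about 86 tokens/s, because that 10% takes longer to read than the other 90% combined.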

Comparison table

Here are the techniques so far, sorted by the axis they move, with a representative number, the quality impact, and the main constraint. The multipliers all depend on model, hardware, and workload; they’re the representative values each source reported.

Technique	Main axis	Representative number	Quality impact	Main constraint / drawback
Quantization (weight-only INT4)	Capacity, decode bandwidth	3 to 4.5x vs FP16, ~1/4 memory²,³	Minor (depends on type and bits)⁴	Little help for large batch / prefill, needs special kernels
Quantization (W8A8 INT8/FP8)	Capacity, compute throughput	1.5 to 1.6x, ~2x less memory⁵,⁶	Minor	FP8 needs Hopper/Ada or later⁶
KV cache quantization	Long-context capacity	3-bit, ppl loss <0.1⁸	Minor	Weak for short inputs / small batch
GQA / MQA / MLA	KV capacity, decode bandwidth	MLA: 93.3% less KV, 5.76x¹¹	MQA degrades, GQA small, MLA on par or better⁹,¹¹	A model-design choice (self-conversion needs training)
PagedAttention	Throughput (KV fragmentation)	2 to 4x¹²	No loss	Kernel implementation complexity
FlashAttention	Prefill speed, activation capacity	O(N²)→O(N), 3x on GPT-2¹³	No loss (exact)	GPU-generation dependent (FA3 is Hopper)
Speculative decoding	Decode latency	2 to 6.5x¹⁴,¹⁵,¹⁶	No loss (distribution match)	Acceptance-rate dependent, shrinks at large batch
Continuous batching	Throughput	8x alone, 23x with PagedAttn¹⁸	No loss	Value is vs naive baseline, depends on concurrency
Prefix caching	TTFT, prefill	Depends on prefix length (no fixed factor)¹⁹	No loss	No effect on decode, needs shared prefix
Distillation	Capacity, bandwidth, latency	40% smaller, 60% faster, 97% kept²⁰	Some loss possible	Requires separate training
Pruning	Capacity, (compute if structured)	Depends on sparsity rate	Degrades at high sparsity²¹	Unstructured needs special kernels
MoE	Compute, decode bandwidth	~6x on Mixtral²²	Design-dependent	Capacity doesn’t drop (all experts resident)
TP / PP	Capacity, latency/throughput	Environment-dependent²³	No loss	TP needs fast interconnect, PP has bubbles

Note that speculative decoding, PagedAttention, continuous batching, prefix caching, and FlashAttention all line up as “no loss.” They speed things up without changing the output, so you can stack them without worrying about a quality tradeoff. Quantization, distillation, pruning, and MoE all trade quality to some degree.

Frameworks bundle the techniques together

You don’t have to implement each technique one by one. Serving frameworks stack the major ones for you. The criterion for choosing is where you’re running.

For a high-throughput endpoint on cloud or server GPUs, vLLM is the standard. With PagedAttention and continuous batching at its core, it stacks quantization, prefix caching, speculative decoding, and torch.compile on top¹²,²⁴. To push NVIDIA GPUs to the limit there’s TensorRT-LLM, but it’s NVIDIA-only and the build cost is high. SGLang is strong at reusing shared prefixes (RadixAttention), structured generation, and agent workloads, and in recent benchmarks it matches or beats vLLM in places²³. For local, edge, or mixed-CPU constrained environments, llama.cpp covers a wide range of hardware with GGUF quantization and CPU+GPU offload. Ollama sits on top of it and sells ease of use for personal and development work. One note: Hugging Face’s TGI, once a go-to, is in maintenance mode as of 2026, so it’s hard to recommend as a first pick for new projects²³.

What’s realistic by environment (early 2026)

Separate from the bottleneck discussion, where you run also changes the sweet spot. This area moves fast and the rankings reshuffle every six months, so read the following as defaults for early 2026, not a permanent verdict. The premise is that you measure on your own workload at the end.

For cloud or server-GPU production serving many users, vLLM is the safe standard. Earn throughput with continuous batching and PagedAttention, then add quantization (FP8 on Hopper-generation GPUs such as the H100, AWQ otherwise), prefix caching, and speculative decoding. For workloads with lots of shared prefix, like RAG or multi-turn, or for reasoning models, SGLang with RadixAttention can beat vLLM. The two keep leapfrogging each other, so the honest move is to narrow it to those two and compare on your own load²⁵. If you’re fixed on NVIDIA, your model is stable, and you can stomach the compile wait and specialization for a gain in the low tens of percent to about 30% (benchmark-dependent), TensorRT-LLM is the fastest²⁵. If that overhead isn’t worth it, vLLM is enough for most teams.

On a single local GPU (24 GB or less, for development and personal use), “does it fit” comes before “is it fast.” The usual move is to quantize around GGUF Q4_K_M and run it with llama.cpp, or Ollama on top if you’d rather keep it simple. Picking a GQA- or MLA-designed model at selection time keeps the KV cache from clogging on long inputs.

In non-NVIDIA environments (CPU-only, Apple Silicon, the edge), the practical choice is the llama.cpp family. You run low-bit GGUF with partial offload, watching for the bandwidth cliff that drops throughput the moment you spill out of VRAM (or RAM). That’s fine for low-concurrency internal tools and offline use, but it doesn’t scale to customer-facing volume²⁵.

If you don’t want to run servers yourself, a managed API with prompt caching is the realistic path. Caching shared prefixes lowers billed tokens and time to first token. If you want to switch between models on cost and quality, you can put a routing layer in front.

When the model is too large for a single GPU, move to multi-GPU parallelism. The common setup is a hybrid of NVLink-based Tensor Parallelism within a node and Pipeline Parallelism across nodes, built with vLLM or TensorRT-LLM.

Environment	Current default	Main techniques to apply
Cloud production (many users, high throughput)	vLLM / SGLang	Continuous batching, PagedAttention, FP8/AWQ, speculative decoding
NVIDIA-fixed, performance first	TensorRT-LLM	Above + FP8 + compile optimization (tradeoff with overhead)
Single local GPU (dev, personal)	llama.cpp / Ollama	GGUF Q4 quantization, GQA/MLA model choice
CPU, Apple Silicon, edge	llama.cpp	Low-bit GGUF, partial offload (mind the bandwidth cliff)
No self-hosting (API / managed)	Prompt-caching API	Prefix/prompt caching, routing
Too large for one GPU	vLLM / TensorRT-LLM + parallelism	TP within node × PP across nodes

Choosing by bottleneck

In the order you’d actually work through it:

Start by solving “does it fit in VRAM” with quantization. Weight-only INT4 (GGUF/AWQ/GPTQ) cuts memory to about a quarter while also speeding up decode, so it’s almost always the first move. If it still won’t fit, use KV cache quantization for long context, and if that’s not enough, go to offload or parallelism. Just watch the bandwidth cliff on offload.

If single-token speed matters, as in a conversation, add speculative decoding on top of weight-only quantization. Since the output is identical, you don’t have to worry about quality, which is the strong part. To go faster still, switch to a model that’s simply smaller (distilled, or GQA/MLA by design).

If you want to serve a lot of requests, continuous batching and PagedAttention matter more than latency. If there’s a shared prompt, cut prefill with prefix caching. In this area it’s faster to lean on vLLM or SGLang than to build it yourself.

If a long prompt is slow to get started (time to first token), use a framework with FlashAttention support, and reuse the prefix if it’s shared. W8A8/FP8 also speeds up prefill compute.

When in doubt, start by running a quantized model on vLLM or llama.cpp, measure, then decide the next move. You can’t know the bottleneck without measuring, and once you measure, the next technique to apply mostly chooses itself.

Caveats and limits

Every number above depends on model, hardware, kernel, and workload. They’re the representative values each source reported, with no guarantee you’ll see the same multiplier in your setup. The thing that matters most is to set up a measurement environment first and compare before and after under the same conditions.
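
A before/after comparison does not need much tooling. The harness below is a minimal sketch: `generate` is a hypothetical callable that you would wrap around your own backend (an HTTP call to a server, a local binding, and so on) and that returns how many tokens it produced. The stand-in used here only sleeps, so the printed number is meaningless except as a smoke test.

```python
import statistics
import time


def measure_tokens_per_second(generate, prompt, max_tokens, runs=3):
    """Median end-to-end throughput of `generate`, which returns the token count produced."""
    generate(prompt, max_tokens)                 # warm-up run, not timed
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        produced = generate(prompt, max_tokens)
        elapsed = time.perf_counter() - start
        rates.append(produced / elapsed)
    return statistics.median(rates)


def fake_generate(prompt, max_tokens):
    """Stand-in for a real backend call; sleeps instead of running a model."""
    time.sleep(0.01)
    return max_tokens


rate = measure_tokens_per_second(fake_generate, "hello", 64)
print(f"{rate:.0f} tokens/s")
```

Keep the prompt, the output length, and the hardware fixed between runs, and change one technique at a time. Because this times the whole call, a long prompt mixes prefill into the number. To separate the two phases, time a run with `max_tokens=1` as a rough stand-in for time to first token.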

A few points that are easy to get wrong, restated. “Decode is bandwidth-bound” assumes low batch; at large batch, compute starts to matter. MoE does not reduce memory capacity (only active params drop). Prefix caching only helps prefill. FP8 compute needs Hopper or Ada and later; the A100 has no native FP8 support. Weight-only quantization only delivers its speed when the right kernels exist. Get these backward and you’ll spend time on techniques that don’t help.

And the big premise: every technique here is for making your current model faster and lighter, not for rescuing a model that doesn’t deliver the quality you need. Before any of this, check that you’ve actually picked a model good enough for the job.

Wrapping up

LLM inference optimization isn’t a single “go faster” button; it’s a fix matched to a bottleneck. The one principle that decode is limited by memory bandwidth runs through all of it and explains why quantization, KV reduction, going smaller, and speculative decoding all work.

Problem to solve first	First move	Next move
Won’t fit in VRAM	Weight-only INT4 quantization	KV quantization → offload/parallelism
Slow per token (conversation)	Quantization + speculative decoding	Distilled / GQA-MLA small model
Can’t keep up with concurrent requests	Continuous batching + PagedAttention	Prefix caching
Long prompt slow to start	FlashAttention	Prefix reuse / W8A8-FP8

To try something today, quantize a model you have to 4 bits (GGUF or AWQ), run it on vLLM or llama.cpp, and measure token generation speed and VRAM use before and after. Once you’ve seen how the bandwidth limit plays out in actual numbers, the call on which technique to reach for next gets a lot steadier.

This article focused on optimizing the side that runs the model. If you go as far as training or fine-tuning your own, the cost shows up in different places. Why MoE can train a larger model on the same compute budget, and how QLoRA and distillation cut training resources, are covered in the companion piece, A Field Guide to Cutting LLM Training & Build Costs.

References

References below are listed in citation order to match the numbers in the text. Multipliers and reduction rates are values each source reported under specific conditions and vary by environment.

1. LLM Inference Unveiled: Survey and Roofline Model Insights - Zhihang Yuan et al. (2024). Shows quantitatively with a roofline model that prefill is compute-bound and decode is memory-bandwidth-bound, and explains why quantization speeds things up by reducing memory access. [Reliability: medium to high (survey paper, preprint)]
2. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers - Elias Frantar et al. (2022). Weight-only PTQ using second-order information. Quantizes 175B to 3 to 4 bits in about four GPU hours, 3.25 to 4.5x vs FP16. [Reliability: medium to high (accepted at ICLR 2023)]
3. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration - Ji Lin et al. (2023, MLSys 2024 Best Paper). 4-bit quantization that protects important channels based on the activation distribution. Around 3x vs FP16 on dedicated kernels. [Reliability: medium to high (peer-reviewed award paper)]
4. llama.cpp quantize README / 7B quantization benchmark - ggml-org. Sizes and perplexity increases for each GGUF type (7B example: Q4_K_M 3.80GB/+0.0535, Q8_0 6.70GB/+0.0004). [Reliability: medium (official repo benchmark)]
5. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models - Guangxuan Xiao et al. (2022, ICML 2023). Moves outliers to the weights with an equivalent transform to enable W8A8 (INT8). Up to 1.56x and 2x less memory. [Reliability: medium to high (peer-reviewed)]
6. FP8 W8A8 Quantization (vLLM official docs) - vLLM. FP8 gives about 2x less memory and up to 1.6x throughput, requiring Hopper (H100) / Ada or later. [Reliability: medium to high (official docs)]
7. QLoRA: Efficient Finetuning of Quantized LLMs - Tim Dettmers et al. (2023, NeurIPS 2023). NF4, double quant, and a paged optimizer fine-tune a 65B model on a single 48GB GPU at 16-bit-equivalent quality. [Reliability: medium to high (peer-reviewed)]
8. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization - Coleman Hooper et al. (2024). 3-bit KV cache quantization with ppl loss <0.1, up to a million tokens of context on a single A100 for LLaMA-7B. [Reliability: medium (preprint)]
9. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints - Joshua Ainslie et al. (2023, EMNLP 2023). Groups query heads to share KV. Converts from MHA with about 5% of original training, MQA-like speed near MHA quality. [Reliability: medium to high (peer-reviewed)]
10. Fast Transformer Decoding: One Write-Head is All You Need - Noam Shazeer (2019). MQA, sharing a single KV across all query heads. Cuts the KV cache by a factor of the head count. [Reliability: medium to high (original paper)]
11. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model - DeepSeek-AI (2024). MLA, compressing Key and Value into a low-rank latent vector. 93.3% less KV cache, up to 5.76x maximum generation throughput. [Reliability: medium to high (technical report, measured)]
12. Efficient Memory Management for Large Language Model Serving with PagedAttention - Woosuk Kwon et al. (2023, SOSP 2023). Manages the KV cache in OS-style pages, nearly eliminating fragmentation, 2 to 4x throughput vs FasterTransformer/Orca. The core of vLLM. [Reliability: high (peer-reviewed, top conference)]
13. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness - Tri Dao et al. (2022). IO-aware tiled computation that avoids writing the N×N matrix, memory O(N²)→O(N) with a speedup. Followed by FlashAttention-2 (2307.08691) and FlashAttention-3 (2407.08608, 1.5 to 2x for Hopper with FP8 support). [Reliability: medium to high (widely adopted, partly peer-reviewed)]
14. Fast Inference from Transformers via Speculative Decoding - Yaniv Leviathan et al. (2022, ICML 2023). A small draft model reads ahead and the main model verifies in parallel. No quality loss, 2 to 3x. [Reliability: medium to high (peer-reviewed)]
15. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads - Tianle Cai et al. (2024). A self-speculative method that adds multiple prediction heads to the main model. No extra draft model, 2.2 to 3.6x. [Reliability: medium (preprint, widely adopted)]
16. EAGLE-3: Scaling up Inference Acceleration of LLMs via Training-Time Test - Yuhui Li et al. (2025, NeurIPS 2025). Speculative decoding with a lightweight head using intermediate features. Up to 6.5x, with improved gains at large batch. [Reliability: medium to high (accepted at a peer-reviewed conference)]
17. Break the Sequential Dependency of LLM Inference Using Lookahead Decoding - Yichao Fu et al. (2024, ICML 2024). No draft model, generates and verifies n-gram candidates with Jacobi iteration. No quality loss, up to 1.8x. [Reliability: medium to high (peer-reviewed)]
18. How continuous batching enables 23x throughput in LLM inference - Anyscale. Continuous batching that injects requests on an iteration basis. 8x on its own vs naive static batching, up to 23x combined with PagedAttention. [Reliability: medium (vendor official blog, measured)]
19. Automatic Prefix Caching (vLLM official docs) - vLLM. Reuse of the KV cache for shared prefixes. States explicitly that it shortens only prefill, not decode. [Reliability: medium to high (official docs)]
20. DistilBERT, a distilled version of BERT - Victor Sanh et al. (2019). Knowledge distillation, 40% smaller, 60% faster inference, 97% performance retained. [Reliability: medium to high (widely verified original)]
21. Model Compression and Efficient Inference for Large Language Models: A Survey - Wenxiao Wang et al. (2024). A survey across quantization, pruning, and distillation. Lays out the tradeoffs of structured vs unstructured sparsity. [Reliability: medium (survey, preprint)]
22. Mixtral of Experts - Mistral AI (2024). A Sparse MoE that activates 12.9B of 46.7B total per token. Matches a 70B-class model at inference roughly 6x faster (“6x vs Llama 2 70B” per Mistral AI’s official announcement). Memory capacity doesn’t drop because all experts stay resident. [Reliability: medium to high (technical report, measured)]
23. BentoML LLM Inference Handbook: parallelism and serving - BentoML. Tradeoffs of Tensor/Pipeline parallelism and guidance on framework choice. See also SGLang (arXiv:2312.07104) and the “bandwidth cliff” of llama.cpp partial offload. [Reliability: medium (practitioner-oriented technical write-up)]
24. Introduction to torch.compile and How It Works with vLLM - vLLM (2025). Kernel fusion and reduced launch overhead via torch.compile and CUDA graphs. Improves latency/throughput with no quality loss. [Reliability: medium to high (official blog)]
25. Choosing the best LLM inference engine in 2026 - BIZON (2026). Guidance by use case (laptop→Ollama, workstation→llama.cpp, many users→vLLM/SGLang, NVIDIA production→TensorRT-LLM). See also the H100 benchmark comparison (Spheron, 2026) and HF’s official move of TGI to maintenance mode with a vLLM/SGLang recommendation. Multipliers and rankings vary by version. [Reliability: medium (vendor/comparison blogs, benchmarks)]

This post is licensed under CC BY 4.0 by the author.