The Reality of Sparse Attention: Examining DeepSeek NSA's Technical Progress and Trade-offs

DeepSeek Sparse Attention is an improved version of technology that has existed since 2019. We look at why GPT-4 and Claude are presumed not to have adopted it, why FlashAttention became the industry standard, and the fundamental problems with Sparse Attention.

Posted Dec 4, 2025

11 min read

AI-Generated Content

This article was generated by AI. The accuracy of the content is not guaranteed, and we accept no responsibility for any damages resulting from use of this article. By continuing to read, you agree to the Terms of Use.

Target Audience: AI Engineers, Machine Learning Engineers, LLM Developers
Prerequisites: Transformer architecture, basics of Attention mechanisms
Reading Time: 30 minutes

Overview

In 2025, DeepSeek announced Sparse Attention, heavily publicized as “50% API cost reduction” and “128K token long-context processing,” as if it were a revolutionary new technology.

However, upon closer examination, questions arise:

Sparse Attention is technology OpenAI published in 2019. Why is it being reported as “innovation” after more than 5 years?
GPT-4, Claude, and Gemini are not known to use Sparse Attention. Why would the most advanced models avoid it?
The industry standard is FlashAttention. Why Flash instead of Sparse?

This article examines DeepSeek Sparse Attention and reveals the real reasons it’s getting attention and the real reasons it’s not being adopted.

Part 1: Not a “New Technology”—The History of Sparse Attention

2019: Technology Published by OpenAI

Sparse Attention was published by OpenAI in April 2019¹. The paper “Generating Long Sequences with Sparse Transformers” presented a method to reduce computational complexity from O(L²) to O(L√L).

“We introduce sparse factorizations of the attention matrix which reduce this to O(n√n).” — OpenAI, 2019
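To make the gap between O(L²) and O(L√L) concrete, the sketch below counts how many (query, key) pairs are scored under full causal attention and under a simplified strided pattern. The pattern here (a local window plus every `stride`-th earlier position, with `stride` set to about √L) is an illustrative assumption in the spirit of the 2019 paper, not a reproduction of its exact factorization, and the function names are hypothetical.

```python
import math


def dense_causal_pairs(n: int) -> int:
    """(query, key) pairs scored by full causal attention: n(n+1)/2."""
    return n * (n + 1) // 2


def strided_sparse_pairs(n: int, stride: int) -> int:
    """Pairs scored when query i attends only to:
    (a) the last `stride` positions, itself included (local window), and
    (b) earlier positions j with (i - j) % stride == 0 (strided hops).
    This is a simplified, assumed pattern for illustration."""
    total = 0
    for i in range(n):
        local = min(i + 1, stride)   # j in (i - stride, i]
        strided = i // stride        # j = i - stride, i - 2*stride, ... >= 0
        total += local + strided
    return total


n = 4096
stride = math.isqrt(n)  # about sqrt(n), as in the O(n * sqrt(n)) argument
dense = dense_causal_pairs(n)
sparse = strided_sparse_pairs(n, stride)
print(f"dense={dense:,} sparse={sparse:,} ratio={dense / sparse:.1f}x")
```

The ratio grows roughly with √L, which is why the savings only become large at long sequence lengths.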

In other words, DeepSeek’s announcement is merely an improved version of technology from over 5 years ago.

2020: Longformer and BigBird Emerge

In 2020, more practical Sparse Attention methods were published in succession:

Method	Publisher	Features
Longformer	Allen AI, 2020	Sliding window + global tokens
BigBird	Google, 2020	Combination of random + local + global

BigBird claimed to handle sequences up to 8 times longer than was previously possible on similar hardware (4,096 tokens, versus the 512 typical of BERT-style models)².
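The three ingredients named in the table (sliding window, global tokens, random keys) can be sketched as a per-query list of allowed key positions. This is a minimal illustration under stated assumptions: the function name and parameters are hypothetical, the pattern is bidirectional, and real implementations work on blocks of tokens rather than single positions.

```python
import random


def sparse_pattern(n, window, global_tokens, num_random=0, seed=0):
    """Return, for each query position, the sorted key positions it may attend to.

    window        -- keys within +/- `window` positions (Longformer-style local attention)
    global_tokens -- positions that attend everywhere and are attended to by everyone
    num_random    -- extra randomly chosen keys per query (BigBird-style)
    """
    rng = random.Random(seed)
    global_set = set(global_tokens)
    pattern = []
    for i in range(n):
        if i in global_set:
            keys = set(range(n))  # a global token sees the whole sequence
        else:
            lo, hi = max(0, i - window), min(n, i + window + 1)
            keys = set(range(lo, hi)) | global_set
            keys |= set(rng.sample(range(n), num_random))
        pattern.append(sorted(keys))
    return pattern


# Longformer-like: window of 2 on each side, token 0 is global.
rows = sparse_pattern(16, window=2, global_tokens=[0])
print(rows[8])  # local neighbours of position 8 plus the global token
```

With a fixed window and a fixed number of global and random keys, each row has a bounded length, so the total work grows linearly with sequence length instead of quadratically.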

Why Is It Getting Attention Now?

The answer is simple: Cost competition.

DeepSeek is entering the market with low cost as its strength. NSA (Native Sparse Attention) builds on traditional Sparse Attention research while including new contributions such as hardware optimization and dynamic sparse patterns. However, marketing tends to emphasize novelty, making the historical context of the technology less visible.

Some analyses point out that “the context window length competition resembles the megapixel competition. Vendors shout 128K, 1M, but increasing token count doesn’t mean the model is smarter—it just means it can process more text.”

Part 2: Why GPT-4, Claude, and Gemini Don’t Adopt It

There are technical reasons why the most advanced models don’t adopt Sparse Attention.

Major Model Architectures

According to multiple analyses (officially undisclosed):

Model	Attention Method	Notes
GPT-4/GPT-4o	Dense Attention (estimated)	Architecture undisclosed
Claude	Dense Transformer (estimated)	Architecture undisclosed
Gemini	MoE + details undisclosed	Sparse MoE but Attention unclear

Notable point: While there are no official announcements, the highest-performing models are presumed to be based on Full Attention (Dense Attention).

Why Choose Full Attention?

Reason 1: Quality First

OpenAI, Anthropic, and Google have not disclosed architectural details, so this is inference rather than confirmed fact. If these models do use Full Attention, the likely motivation is prioritizing quality over cost and avoiding the risk of information loss from sparsification.

Reason 2: FlashAttention Is Sufficient

As we’ll discuss later, FlashAttention speeds up Full Attention without sacrificing quality, so there is little incentive to take on the risks of sparsification.

Reason 3: Degradation in Complex Reasoning Tasks

According to rigorous evaluation in the Sparse Frontier paper, Sparse Attention degrades noticeably in reasoning tasks. The paper notes that “no single sparsification approach or configuration works uniformly well across all tasks and phases.”

Task Type	Compression Tolerance	Notes
Single-query QA	High (20x compression possible)	Simple task
Multi-query (4 queries)	Moderate	Slight degradation
16 queries	Low	Significant accuracy degradation
Reasoning tasks	Low	Requires uniform attention distribution

Part 3: Why FlashAttention Became the Industry Standard

What Is FlashAttention?

FlashAttention (2022) is technology that speeds up Full Attention. It’s fundamentally different from sparsification³.

Characteristic	Sparse Attention	FlashAttention
Approach	Skip computations	Optimize memory access
Accuracy	Approximate (information loss)	Exact match
Speedup	Task-dependent	Consistent 2-4x
Adoption cost	Requires retraining	Drop-in replacement

Why FlashAttention Won

Reason 1: Solves the Real Bottleneck

According to the FlashAttention paper³, conventional methods like sparse or low-rank approximations sacrificed model quality in exchange for theoretical computational reduction. However, these methods failed to achieve actual speedup because they didn’t address the fundamental memory I/O bottleneck.

Sparse Attention focused on reducing FLOPs, but the actual bottleneck was memory I/O. FlashAttention solved this fundamental problem.
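A back-of-the-envelope calculation shows why memory traffic matters more than FLOPs here. Standard attention materializes an L × L score matrix per head, while the Q, K, and V inputs are only L × d each. The sketch below assumes 2-byte (fp16/bf16) elements and a head dimension of 128; both values are illustrative assumptions, not figures from the article's sources.

```python
def score_matrix_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to materialize one head's L x L attention score matrix."""
    return seq_len * seq_len * bytes_per_element


def qkv_bytes(seq_len: int, head_dim: int = 128, bytes_per_element: int = 2) -> int:
    """Bytes for one head's Q, K, and V tensors (each L x head_dim)."""
    return 3 * seq_len * head_dim * bytes_per_element


GIB = 1024 ** 3
MIB = 1024 ** 2

for seq_len in (4_096, 32_768, 131_072):
    scores = score_matrix_bytes(seq_len)
    inputs = qkv_bytes(seq_len)
    print(f"L={seq_len:>7}: scores={scores / GIB:8.3f} GiB  qkv={inputs / MIB:6.1f} MiB")
```

Under these assumptions the intermediate matrix dwarfs the actual inputs at long lengths, so a kernel that never writes it to GPU main memory removes most of the traffic without changing the math.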

Reason 2: Complete Quality Guarantee

According to the FlashAttention paper, a key factor in FlashAttention’s widespread adoption is that it “produces exact attention results.” FlashAttention computes the same attention function as the standard implementation (differences are limited to floating-point rounding), so model quality is unaffected.
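The reason tiling can stay exact is the online softmax trick: process keys block by block while keeping a running maximum and a running denominator, and rescale the partial result whenever the maximum changes. The pure-Python sketch below illustrates that idea for a single query. It is a teaching sketch, not the FlashAttention kernel: the function names are hypothetical and the usual 1/√d scaling is omitted for brevity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Reference: softmax over all scores at once, then a weighted sum of values."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / z
            for d in range(len(values[0]))]


def tiled_attention(q, keys, values, block=2):
    """Same result, but keys/values are visited one block at a time."""
    m = float("-inf")               # running maximum score
    z = 0.0                         # running softmax denominator
    acc = [0.0] * len(values[0])    # running unnormalized output
    for start in range(0, len(keys), block):
        scores = [dot(q, k) for k in keys[start:start + block]]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)  # rescale old statistics (0.0 on the first block)
        z *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values[start:start + block]):
            w = math.exp(s - m_new)
            z += w
            acc = [a + w * x for a, x in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, block=2))
```

No key is skipped and no weight is approximated, which is the difference from sparsification: only the order of the arithmetic changes.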

Reason 3: Faster Even for Short Sequences

“The runtimes of many approximate/sparse attention mechanisms grow linearly with sequence length, but FlashAttention still runs faster than approximate and sparse attention for short sequences due to fewer memory accesses.”

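In the paper’s benchmarks, approximate and sparse methods only begin to catch up with FlashAttention at sequence lengths between 512 and 1024 tokens; below that range FlashAttention is faster. For many practical scenarios, FlashAttention has the advantage.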

graph TD
    subgraph Short["Short Sequences (<1K)"]
        FA1["FlashAttention"] -->|Fast| W1["Winner"]
        SA1["Sparse Attention"] -->|Overhead| L1["Loser"]
    end

    subgraph Long["Long Sequences (4K+)"]
        FA2["FlashAttention"] -->|Quadratic| L2["Limited"]
        SA2["Sparse Attention"] -->|Linear| W2["Advantaged"]
    end

    Short --> Long

Part 4: Fundamental Problems with Sparse Attention

Problem 1: Information Loss Is Inevitable

Sparse Attention ignores “unimportant” tokens. However, it’s impossible to perfectly determine what’s important in advance⁴.

Existing Sparse Attention methods create systematic biases in attention distribution: excessive focus on important tokens amplifies their attention weights, while complete neglect of unimportant tokens causes loss of relevant attention weights.
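The bias described above can be reproduced with a few lines of arithmetic. The sketch below compares a dense softmax with a top-k variant that keeps only the k highest scores and renormalizes. The scores are made-up illustrative values and top-k selection is just one assumed sparsification rule, not the specific method of any cited paper.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def topk_sparse_softmax(scores, k):
    """Keep the k highest scores, zero out the rest, renormalize over the survivors."""
    kept = set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])
    m = max(scores[i] for i in kept)
    exps = [math.exp(s - m) if i in kept else 0.0 for i, s in enumerate(scores)]
    z = sum(exps)
    return [e / z for e in exps]


# One clearly preferred token and five moderately relevant ones (hypothetical scores).
scores = [2.0, 1.0, 1.0, 1.0, 1.0, 1.0]
dense = softmax(scores)
sparse = topk_sparse_softmax(scores, k=2)

dropped_mass = sum(d for d, s in zip(dense, sparse) if s == 0.0)
print(f"top token weight: dense={dense[0]:.3f} sparse={sparse[0]:.3f}")
print(f"dense attention mass discarded by top-2: {dropped_mass:.3f}")
```

In this example the top token’s weight roughly doubles after renormalization, and about half of the dense attention mass, spread thinly over the moderately relevant tokens, is discarded outright.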

Specific Failure Patterns

Problem	Description	Impact
Permanent exclusion	Once excluded, tokens cannot be restored	Information needed later is missing
Cumulative error	Errors accumulate in long generation	Degradation in reasoning tasks
Distributed attention	Tasks requiring uniform attention fail	Problems in reasoning & summarization
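The “permanent exclusion” row can be illustrated with a toy eviction policy: a cache with a fixed budget keeps the tokens that looked most important at one step and discards the rest. Everything here is hypothetical (the function name, the importance scores, and the budget), and real systems use more elaborate policies, but the failure mode is the same.

```python
def evict_lowest(cache, importance, budget):
    """Keep only the `budget` most important token ids; evicted ids cannot come back."""
    ranked = sorted(cache, key=lambda token_id: importance[token_id], reverse=True)
    return set(ranked[:budget])


# Step 1: four cached tokens, with hypothetical importance scores at this step.
cache = {0, 1, 2, 3}
importance_step1 = {0: 0.90, 1: 0.05, 2: 0.03, 3: 0.02}
cache = evict_lowest(cache, importance_step1, budget=2)

# Step 2: a later query turns out to depend on token 3.
needed_later = 3
print("cache after eviction:", sorted(cache))
print("token still available:", needed_later in cache)
```

The importance estimate at step 1 was reasonable given what was known then; the problem is that the decision cannot be revisited once later context changes what matters.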

Problem 2: Hardware Inefficiency

“One of the main impediments to the large scale adoption of sparse attention is the fact that sparse operations are quite inefficient in modern hardware.” — Google Research

GPUs are optimized for contiguous (coalesced) memory access and dense matrix multiplication. Sparse Attention’s scattered lookups fit this poorly, so the theoretical reduction in computation often fails to translate into actual speedup.
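A rough way to see the cost of scattered lookups is to count how many fixed-size memory blocks must be fetched to read the same number of keys. The sketch below uses a block of 16 positions as a stand-in for a hardware transaction unit; that size and the function name are illustrative assumptions, not properties of any particular GPU.

```python
def blocks_touched(indices, block_size):
    """Number of distinct fixed-size memory blocks containing the given positions."""
    return len({i // block_size for i in indices})


BLOCK = 16  # assumed transaction granularity, in key positions

contiguous = range(1024, 1024 + 64)   # 64 neighbouring keys (a local window)
scattered = range(0, 64 * 64, 64)     # 64 keys, one every 64 positions

print("contiguous:", blocks_touched(contiguous, BLOCK), "blocks for 64 keys")
print("scattered: ", blocks_touched(scattered, BLOCK), "blocks for 64 keys")
```

Both access patterns perform the same number of dot products, but under this model the scattered one moves far more data per useful key, which is the gap between theoretical and realized speedup.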

graph TD
    subgraph GPU["GPU Optimization Pattern"]
        Dense["Sequential Memory Access"]
        Dense -->|High Efficiency| Good["High Throughput"]
    end

    subgraph SA["Sparse Attention Reality"]
        Sparse["Scattered Lookups"]
        Sparse -->|Inefficient| Bad["Below Theoretical"]
    end

    GPU --> SA

Problem 3: Lack of Generality

“Method/task adaptivity: No single sparsification approach or configuration works uniformly well across all tasks and phases.”

A sparse pattern that is optimal for one task can be the worst for another. This makes adoption difficult in general-purpose LLMs that handle “diverse tasks with one model.”

Problem 4: Training Inefficiency

Many conventional Sparse Attention methods were applied only during inference, using Full Attention during training. DeepSeek’s NSA is sparse during training too, but still incurs additional training costs⁵.

Part 5: DeepSeek’s “Catch”—A Critical Perspective

Gap Between Marketing and Reality

Trade-offs exist. Sparse Attention isn’t “same quality at 50% cheaper” but rather “50% cheaper with some quality sacrifice.” Understanding this trade-off—gaining speed and memory efficiency in exchange for approximate computation—is necessary.

Questions About Benchmark Evaluation

Cherry-Picking Allegations

According to SemiAnalysis:

“When R1 is compared to o1, benchmarks where it doesn’t lead are not mentioned. While comparable in reasoning performance, it’s not a clear winner across all metrics and is inferior to o1 in many cases.”

Benchmarks DeepSeek emphasizes:

Math (AIME 2024, MATH-500)
Reasoning tasks

Benchmarks tending to be avoided:

Software engineering
Cybersecurity
Multilingual tasks

NIST Independent Evaluation (September 2025)

NIST CAISI conducted an independent evaluation and found significant discrepancies from DeepSeek’s self-reported benchmarks:

Evaluation Item	DeepSeek V3.1	Best US Model	Difference (percentage points)
SWE-bench Verified	55%	63-67%	-8 to -12
Cybench	40%	74% (GPT-5)	-34
Software Engineering Overall	-	-	20%+ behind

NIST’s conclusion: “The best US models outperform the best DeepSeek model (V3.1) on nearly all benchmarks.”

Note: Some analyses point out methodological issues with the CAISI evaluation (US models evaluated via cloud API, DeepSeek in local environment).

Training Cost Opacity

RAND notes:

“The R1 paper contains no mention of computational resources used. This is no coincidence—synthetic data generation and RL require massive computation.”

Furthermore: “DeepSeek operates Asia’s first 10,000 Nvidia A100 cluster, reportedly possessing 50,000 ‘Hopper’ units.”

Developing efficiency technology may have actually required massive trial-and-error and computational resources.

Areas Not Shown in Benchmarks

Complex multi-step reasoning
Subtle relationship comprehension across long texts
Cases requiring precision in legal/medical documents
Security-related tasks

Cost Competition Context

DeepSeek’s pricing is highly competitive:

Provider	Input (per 1M tokens)	Output (per 1M tokens)	Notes
OpenAI o1	$15	$60	-
OpenAI GPT-4o	$5	$20	50% off with caching
DeepSeek V3	$0.27 (miss) / $0.07 (hit)	$1.10	Significant discount on cache hit
DeepSeek R1	$0.55 (miss) / $0.14 (hit)	$2.19	Reasoning-specialized model

Note: Prices as of December 2025. “Miss” = cache miss pricing, “Hit” = cache hit pricing. Check the official DeepSeek and OpenAI pricing pages for the latest information.
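To see how the table translates into a bill, the sketch below prices a single workload using the per-million-token rates listed above. The workload size and the 50% cache hit ratio are made-up assumptions, the function name is hypothetical, and o1 is priced with no cache discount because the table lists none.

```python
def request_cost(input_tokens, output_tokens, in_miss, in_hit, out, hit_ratio=0.0):
    """Cost in USD given per-1M-token prices and the share of input served from cache."""
    hit_tokens = input_tokens * hit_ratio
    miss_tokens = input_tokens - hit_tokens
    return (miss_tokens * in_miss + hit_tokens * in_hit + output_tokens * out) / 1_000_000


# Hypothetical workload: 10M input tokens, 1M output tokens.
workload = dict(input_tokens=10_000_000, output_tokens=1_000_000)

# Prices copied from the table above (USD per 1M tokens).
deepseek_v3 = request_cost(**workload, in_miss=0.27, in_hit=0.07, out=1.10, hit_ratio=0.5)
openai_o1 = request_cost(**workload, in_miss=15.0, in_hit=15.0, out=60.0)

print(f"DeepSeek V3 (50% cache hits): ${deepseek_v3:.2f}")
print(f"OpenAI o1 (no cache discount): ${openai_o1:.2f}")
```

The comparison is only as good as the listed prices and the assumed hit ratio, and it says nothing about whether the two models produce output of comparable quality for the task.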

This price difference cannot be fully explained by technical superiority alone. Strategic pricing for market share acquisition has also been suggested.

Quality Concerns

Some third-party evaluations report cases where DeepSeek models underperform competitors in specific tasks.

While DeepSeek is excellent at coding and math, there are evaluations suggesting gaps with GPT-4 and Claude on general tasks. Whether this difference stems from information loss due to Sparse Attention requires further verification.

Part 6: Where Sparse Attention Is Still Needed

Despite the challenges discussed above, Sparse Attention has valid use cases.

Valid Use Cases

Use Case	Reason
Ultra-long context (100K+ tokens)	Full Attention becomes prohibitively expensive
Cost-constrained applications	Price more important than quality
Simple tasks	Information loss impact is small
Batch processing	Efficiency over latency

Technical Progress Is Real

DeepSeek’s NSA and DSA (DeepSeek Sparse Attention) are technically advanced:

Hardware optimization: Optimized for Hopper/Blackwell generation
Dynamic sparse patterns: From fixed patterns to learning-based
Training support: Sparse during training, not just inference

However, these advances don’t replace Full Attention but expand options under specific conditions.

Part 7: Future Outlook

Realistic Position of Sparse Attention

graph TD
    subgraph Reality["Reality from 2025 Onwards"]
        FA["FlashAttention"] -->|Industry Standard| Standard["Quality-Focused Apps"]
        SA["Sparse Attention"] -->|Specialized Use| Niche["Cost-Focused & Ultra-Long"]
        Hybrid["Hybrid"] -->|Research Stage| Future["Future Possibilities"]
    end

Predictions

FlashAttention continues to dominate: As long as speedup without quality sacrifice is demanded
Sparse Attention for specialized uses: Where Full Attention becomes impractical, like 100M+ token processing
Rise of hybrid approaches: Full for important parts, Sparse for the rest

Implications for Engineers

Prioritize FlashAttention-2/3: 2-4x speedup without sacrificing quality
Evaluate Sparse Attention carefully: Don’t take benchmark results at face value
Consider task characteristics: Full Attention if reasoning or complex relationship comprehension needed
If cost optimization is paramount: DeepSeek API worth considering (with quality trade-off understood)

Conclusion: Trade-offs, Not Innovation

The following points about DeepSeek Sparse Attention have become clear:

Facts

✅ An improved version of technology existing since 2019: it includes new contributions, but the underlying technique is not new

✅ GPT-4, Claude, and Gemini are presumed not to adopt it, and there are plausible reasons why

✅ FlashAttention is the industry standard—speedup without sacrificing quality

✅ Risk of information loss fundamentally exists—especially notable in reasoning tasks

✅ Getting attention in the context of cost competition

Conclusion

Sparse Attention is not a silver bullet. It should be understood as one tool that’s effective under specific conditions, not a universal solution.

Leading AI companies are presumed to choose Full Attention + FlashAttention, which suggests that this combination is currently the best option when quality is the priority.

DeepSeek’s price competitiveness is attractive, but choices should be made understanding what is being sacrificed.

References

Additional References

The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs - (2025). arXiv. [Reliability: High]
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning - Stanford (2023). [Reliability: High]
Longformer: The Long-Document Transformer - Allen AI (2020). arXiv. [Reliability: High]
Rethinking Attention with Performers - Google Research. [Reliability: High]
Constructing Transformers For Longer Sequences with Sparse Attention Methods - Google Research. [Reliability: High]
DeepSeek Models & Pricing - DeepSeek. [Pricing Info]
OpenAI API Pricing - OpenAI. [Pricing Info]
CAISI Evaluation of DeepSeek AI Models - NIST (2025). [Third-party Evaluation]
DeepSeek Debates: Chinese Leadership On Cost, True Training Cost - SemiAnalysis (2025). [Technical Analysis]
The Rise of DeepSeek: What the Headlines Miss - RAND (2025). [Policy Analysis]

About Citation Accuracy: Research cited in this article was verified using these methods:

Academic papers: Confirmed via arXiv, Google Scholar
Technical blogs: Citations confirmed on official blogs like Google Research
Pricing: Confirmed via official API documentation (DeepSeek, OpenAI) as of December 2025

Important Notes:

GPT-4, Claude, Gemini architecture: Officially undisclosed; estimates based on multiple technical analyses
Pricing: Fluctuates frequently. DeepSeek especially has frequent price changes; check official sites for latest
Caching: Listed prices distinguish cache hit/miss. Cache hits are significantly cheaper

¹ Generating Long Sequences with Sparse Transformers - OpenAI (2019). arXiv:1904.10509. [Reliability: High]
² Big Bird: Transformers for Longer Sequences - Google (2020). NeurIPS 2020. [Reliability: High]
³ FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness - Stanford (2022). arXiv:2205.14135. [Reliability: High]
⁴ Post-Training Sparse Attention with Double Sparsity - (2024). arXiv. [Reliability: High]
⁵ Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention - DeepSeek AI (2025). arXiv. [Reliability: High]

This post is licensed under CC BY 4.0 by the author.