
The Reality of Sparse Attention: Examining DeepSeek NSA's Technical Progress and Trade-offs

This article examines the technical progress and trade-offs of DeepSeek Sparse Attention, an improved version of technology that has existed since 2019. It looks at why GPT-4 and Claude are not believed to use it, why FlashAttention became the industry standard, and what the fundamental problems with Sparse Attention are.

  • Target Audience: AI Engineers, Machine Learning Engineers, LLM Developers
  • Prerequisites: Transformer architecture, basics of Attention mechanisms
  • Reading Time: 30 minutes

Overview

In 2025, DeepSeek announced its Sparse Attention, which was heavily publicized with claims such as “50% API cost reduction” and “128K token long-context processing,” as if it were a revolutionary new technology.

However, upon closer examination, questions arise:

  1. Sparse Attention is technology OpenAI published in 2019. Why is it being reported as “innovation” after more than 5 years?
  2. GPT-4, Claude, and Gemini are not believed to use Sparse Attention. Why would the most advanced models avoid it?
  3. The industry standard is FlashAttention. Why Flash instead of Sparse?

This article examines DeepSeek Sparse Attention and reveals the real reasons it’s getting attention and the real reasons it’s not being adopted.


Part 1: Not a “New Technology”—The History of Sparse Attention

2019: Technology Published by OpenAI

Sparse Attention was published by OpenAI in April 2019 [1]. The paper “Generating Long Sequences with Sparse Transformers” presented a method to reduce computational complexity from O(L²) to O(L√L).

“We introduce sparse factorizations of the attention matrix which reduce this to O(n√n).” — OpenAI, 2019

In other words, DeepSeek’s announcement is merely an improved version of technology from over 5 years ago.
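To make the O(L²) versus O(L√L) gap concrete, here is a minimal sketch that counts how many (query, key) pairs each scheme evaluates. It assumes a simplified causal "strided" pattern in the spirit of the 2019 paper: each query looks at its most recent `stride` positions plus every `stride`-th earlier position. The function names and the choice of `stride ≈ √L` are illustrative assumptions, not code from the paper.

```python
import math


def dense_causal_pairs(n: int) -> int:
    """Number of (query, key) pairs evaluated by full causal attention."""
    return n * (n + 1) // 2


def strided_sparse_pairs(n: int, stride: int) -> int:
    """Pairs kept when query i attends to key j <= i only if
    (i - j) < stride (local part) or (i - j) is a multiple of stride."""
    total = 0
    for i in range(n):
        local = min(i + 1, stride)   # offsets 0 .. stride-1
        strided = i // stride        # offsets stride, 2*stride, ...
        total += local + strided
    return total


if __name__ == "__main__":
    for n in (1024, 4096, 16384):
        stride = math.isqrt(n)       # assumption: stride close to sqrt(L)
        dense = dense_causal_pairs(n)
        sparse = strided_sparse_pairs(n, stride)
        print(f"L={n}: dense={dense}, sparse={sparse}, ratio={sparse / dense:.4f}")
```

The ratio shrinks as L grows, which is the whole appeal of the approach: the longer the sequence, the larger the share of pairs that are skipped.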

2020: Longformer and BigBird Emerge

In 2020, more practical Sparse Attention methods were published in succession:

| Method | Publisher | Features |
| --- | --- | --- |
| Longformer | Allen AI, 2020 | Sliding window + global tokens |
| BigBird | Google, 2020 | Combination of random + local + global |

BigBird claimed to handle sequences up to 8 times longer than was previously possible on similar hardware (4,096 tokens) [2].
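As a rough illustration of the "sliding window + global tokens" idea, the sketch below builds a boolean attention mask in plain Python. The function names, the symmetric window, and the example sizes are assumptions made for illustration; this is not the Longformer or BigBird implementation, and BigBird's additional random links are left out.

```python
def sliding_window_global_mask(n, window, global_idx):
    """mask[i][j] is True when token i may attend to token j.

    Local rule:  |i - j| <= window // 2
    Global rule: tokens in global_idx attend to, and are attended by, everyone.
    """
    half = window // 2
    global_set = set(global_idx)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            is_local = abs(i - j) <= half
            is_global = i in global_set or j in global_set
            mask[i][j] = is_local or is_global
    return mask


def density(mask):
    """Fraction of (query, key) pairs that are actually computed."""
    n = len(mask)
    return sum(sum(row) for row in mask) / (n * n)


if __name__ == "__main__":
    m = sliding_window_global_mask(n=512, window=64, global_idx=[0])
    print(f"density at n=512: {density(m):.3f}")
```

With a fixed window, the number of kept pairs per row stays roughly constant, so the density falls as the sequence gets longer.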

Why Is It Getting Attention Now?

The answer is simple: Cost competition.

DeepSeek is entering the market with low cost as its strength. NSA builds on traditional Sparse Attention research while including new contributions such as hardware optimization and dynamic sparse patterns. However, marketing tends to emphasize novelty, making the historical context of the technology less visible.

Some analyses point out that “the context window length competition resembles the megapixel competition. Vendors shout 128K, 1M, but increasing token count doesn’t mean the model is smarter—it just means it can process more text.”


Part 2: Why GPT-4, Claude, and Gemini Don’t Adopt It

There are technical reasons why the most advanced models don’t adopt Sparse Attention.

Major Model Architectures

According to multiple analyses (officially undisclosed):

| Model | Attention Method | Notes |
| --- | --- | --- |
| GPT-4/GPT-4o | Dense Attention (estimated) | Architecture undisclosed |
| Claude | Dense Transformer (estimated) | Architecture undisclosed |
| Gemini | MoE + details undisclosed | Sparse MoE but Attention unclear |

Notable point: While there are no official announcements, the highest-performing models are presumed to be based on Full Attention (Dense Attention).

Why Choose Full Attention?

Reason 1: Quality First

OpenAI, Anthropic, and Google have not disclosed architectural details. However, according to multiple technical analyses, these models are presumed to be based on Full Attention (Dense Attention). They are thought to prioritize quality over cost, avoiding the risk of information loss from sparsification.

Reason 2: FlashAttention Is Sufficient

As we’ll discuss later, FlashAttention is a technology that speeds up Full Attention without sacrificing quality. There’s no need to take on the risks of sparsification.

Reason 3: Degradation in Complex Reasoning Tasks

According to rigorous evaluation in the Sparse Frontier paper, Sparse Attention degrades noticeably in reasoning tasks. The paper notes that “no single sparsification approach or configuration works uniformly well across all tasks and phases.”

| Task Type | Compression Tolerance | Notes |
| --- | --- | --- |
| Single-query QA | High (20x compression possible) | Simple task |
| Multi-query (4 queries) | Moderate | Slight degradation |
| 16 queries | Low | Significant accuracy degradation |
| Reasoning tasks | Low | Requires uniform attention distribution |

Part 3: Why FlashAttention Became the Industry Standard

What Is FlashAttention?

FlashAttention (2022) is technology that speeds up Full Attention. It’s fundamentally different from sparsification [3].

| Characteristic | Sparse Attention | FlashAttention |
| --- | --- | --- |
| Approach | Skip computations | Optimize memory access |
| Accuracy | Approximate (information loss) | Exact match |
| Speedup | Task-dependent | Consistent 2-4x |
| Adoption cost | Requires retraining | Drop-in replacement |

Why FlashAttention Won

Reason 1: Solves the Real Bottleneck

According to the FlashAttention paper [3], conventional methods like sparse or low-rank approximations sacrificed model quality in exchange for theoretical computational reduction. However, these methods failed to achieve actual speedup because they didn’t address the fundamental memory I/O bottleneck.

Sparse Attention focused on reducing FLOPs, but the actual bottleneck was memory I/O. FlashAttention solved this fundamental problem.
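A back-of-the-envelope calculation shows why memory traffic dominates. A standard attention implementation writes the full L×L score matrix to GPU memory and reads it back for the softmax. The sketch below assumes fp16 scores (2 bytes each) for a single head and a single sequence; the helper names are hypothetical.

```python
def score_matrix_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    """Size of one materialized L x L attention score matrix (one head)."""
    return seq_len * seq_len * bytes_per_element


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / 2**30


if __name__ == "__main__":
    for seq_len in (4096, 32768, 131072):
        size = score_matrix_bytes(seq_len)
        print(f"L={seq_len}: {to_gib(size):.3f} GiB per head")
```

At L = 4,096 this is 32 MiB per head; at L = 131,072 (128K) it is 32 GiB per head, before multiplying by the number of heads and the batch size. Avoiding that round trip is what FlashAttention targets, independent of how many FLOPs are skipped.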

Reason 2: Complete Quality Guarantee

According to the FlashAttention paper, a key factor in FlashAttention’s widespread adoption is that it “produces exact attention results.” FlashAttention generates mathematically identical results, with zero impact on model quality.
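The reason exactness is possible is the "online softmax" trick: keys and values can be processed block by block while keeping a running maximum and a running normalizer, and earlier partial sums are rescaled whenever the maximum changes. The sketch below shows this for a single query vector in plain Python. It only illustrates the arithmetic, under the assumption of one query and list-based vectors; it is not the FlashAttention kernel, which does the same bookkeeping on tiles in on-chip memory.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_full(q, keys, values):
    """Reference: softmax(q . K^T / sqrt(d)) V with all scores materialized."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [_dot(q, k) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) / z for c in range(dim)]


def attention_tiled(q, keys, values, block=4):
    """Same quantity, computed one block of keys at a time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [_dot(q, k) * scale for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        probs = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(probs)
        acc = [a * correction + sum(p * v[c] for p, v in zip(probs, v_blk))
               for c, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same vector up to floating-point rounding, regardless of the block size, because no key is ever dropped.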

Reason 3: Faster Even for Short Sequences

“The runtimes of many approximate/sparse attention mechanisms grow linearly with sequence length, but FlashAttention still runs faster than approximate and sparse attention for short sequences due to fewer memory accesses.”

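According to the paper’s benchmarks, approximate and sparse attention only begin to catch up with FlashAttention at sequence lengths between 512 and 1,024 tokens; below that range, Sparse Attention is slower. For many practical scenarios, FlashAttention has the advantage.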

graph TD
    subgraph Short["Short Sequences (<1K)"]
        FA1["FlashAttention"] -->|Fast| W1["Winner"]
        SA1["Sparse Attention"] -->|Overhead| L1["Loser"]
    end

    subgraph Long["Long Sequences (4K+)"]
        FA2["FlashAttention"] -->|Quadratic| L2["Limited"]
        SA2["Sparse Attention"] -->|Linear| W2["Advantaged"]
    end

    Short --> Long

Part 4: Fundamental Problems with Sparse Attention

Problem 1: Information Loss Is Inevitable

Sparse Attention ignores “unimportant” tokens. However, it’s impossible to perfectly determine what’s important in advance [4].

Existing Sparse Attention methods create systematic biases in attention distribution: excessive focus on important tokens amplifies their attention weights, while complete neglect of unimportant tokens causes loss of relevant attention weights.
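One way to see the information loss is to measure how much of the full softmax probability mass survives when only the k highest-scoring tokens are kept. The sketch below uses a generic top-k selection as a stand-in for a sparse method; the function names and the two toy score distributions are assumptions for illustration.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def retained_mass(scores, k):
    """Share of full-attention probability mass that falls on the
    k highest-scoring tokens (the ones a top-k sparse method keeps)."""
    probs = sorted(softmax(scores), reverse=True)
    return sum(probs[:k])


if __name__ == "__main__":
    peaked = [8.0] * 4 + [0.0] * 96   # a few tokens clearly dominate
    uniform = [1.0] * 100             # every token matters equally
    print("peaked :", round(retained_mass(peaked, k=10), 4))
    print("uniform:", round(retained_mass(uniform, k=10), 4))
```

When attention is sharply peaked, dropping 90% of the tokens loses very little. When the task needs attention spread evenly across the context, the same budget keeps only about a tenth of the mass, which matches the pattern of sparse methods struggling on reasoning and summarization.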

Specific Failure Patterns

| Problem | Description | Impact |
| --- | --- | --- |
| Permanent exclusion | Once excluded, tokens cannot be restored | Information needed later is missing |
| Cumulative error | Errors accumulate in long generation | Degradation in reasoning tasks |
| Distributed attention | Tasks requiring uniform attention fail | Problems in reasoning & summarization |

Problem 2: Hardware Inefficiency

“One of the main impediments to the large scale adoption of sparse attention is the fact that sparse operations are quite inefficient in modern hardware.” — Google Research

GPUs are optimized for sequential memory access. With Sparse Attention’s scattered lookups, the theoretical reduction in computation often fails to translate into actual speedup.
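A toy model helps show why the access pattern matters as much as the number of tokens. Assume, purely for illustration, that memory is fetched in fixed-size lines holding `tokens_per_line` neighboring keys, and count how many lines are needed to read the same number of keys in two layouts. The function name and the line size are hypothetical, and real GPU memory behavior is considerably more involved.

```python
def cache_lines_touched(token_indices, tokens_per_line=8):
    """Number of distinct fixed-size memory lines needed to fetch the tokens."""
    return len({i // tokens_per_line for i in token_indices})


if __name__ == "__main__":
    contiguous = list(range(0, 64))           # one block of 64 neighboring keys
    scattered = list(range(0, 64 * 37, 37))   # 64 keys spread across the sequence
    print("contiguous:", cache_lines_touched(contiguous))
    print("scattered :", cache_lines_touched(scattered))
```

Both selections read 64 keys, but the scattered one touches a separate line for every key. This is one reason hardware-aware sparse designs tend to select whole blocks of tokens rather than individual ones.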

graph TD
    subgraph GPU["GPU Optimization Pattern"]
        Dense["Sequential Memory Access"]
        Dense -->|High Efficiency| Good["High Throughput"]
    end

    subgraph SA["Sparse Attention Reality"]
        Sparse["Scattered Lookups"]
        Sparse -->|Inefficient| Bad["Below Theoretical"]
    end

    GPU --> SA

Problem 3: Lack of Generality

“Method/task adaptivity: No single sparsification approach or configuration works uniformly well across all tasks and phases.”

A sparse pattern optimal for one task can be worst for another. This makes adoption difficult in general-purpose LLMs that handle “diverse tasks with one model.”

Problem 4: Training Inefficiency

Many conventional Sparse Attention methods were applied only during inference, using Full Attention during training. DeepSeek’s NSA is sparse during training too, but still incurs additional training costs [5].


Part 5: DeepSeek’s “Catch”—A Critical Perspective

Gap Between Marketing and Reality

Trade-offs exist. Sparse Attention isn’t “same quality at 50% cheaper” but rather “50% cheaper with some quality sacrifice.” Understanding this trade-off—gaining speed and memory efficiency in exchange for approximate computation—is necessary.

Questions About Benchmark Evaluation

Cherry-Picking Allegations

According to an analysis by SemiAnalysis:

“When R1 is compared to o1, benchmarks where it doesn’t lead are not mentioned. While comparable in reasoning performance, it’s not a clear winner across all metrics and is inferior to o1 in many cases.”

Benchmarks DeepSeek emphasizes:

  • Math (AIME 2024, MATH-500)
  • Reasoning tasks

Benchmarks tending to be avoided:

  • Software engineering
  • Cybersecurity
  • Multilingual tasks

NIST Independent Evaluation (September 2025)

NIST CAISI conducted an independent evaluation and found significant discrepancies from DeepSeek’s self-reported benchmarks:

| Evaluation Item | DeepSeek V3.1 | Best US Model | Difference |
| --- | --- | --- | --- |
| SWE-bench Verified | 55% | 63-67% | -12% |
| Cybench | 40% | 74% (GPT-5) | -34% |
| Software Engineering Overall | - | - | 20%+ behind |

NIST’s conclusion: “The best US models outperform the best DeepSeek model (V3.1) on nearly all benchmarks.”

Note: Some analyses point out methodological issues with the CAISI evaluation (US models evaluated via cloud API, DeepSeek in local environment).

Training Cost Opacity

RAND notes:

“The R1 paper contains no mention of computational resources used. This is no coincidence—synthetic data generation and RL require massive computation.”

Furthermore: “DeepSeek operates Asia’s first 10,000 Nvidia A100 cluster, reportedly possessing 50,000 ‘Hopper’ units.”

Developing efficiency technology may have actually required massive trial-and-error and computational resources.

Areas Not Shown in Benchmarks

  • Complex multi-step reasoning
  • Subtle relationship comprehension across long texts
  • Cases requiring precision in legal/medical documents
  • Security-related tasks

Cost Competition Context

DeepSeek’s pricing is highly competitive:

| Provider | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
| --- | --- | --- | --- |
| OpenAI o1 | $15 | $60 | - |
| OpenAI GPT-4o | $5 | $20 | 50% off with caching |
| DeepSeek V3 | $0.27 (miss) / $0.07 (hit) | $1.10 | Significant discount on cache hit |
| DeepSeek R1 | $0.55 (miss) / $0.14 (hit) | $2.19 | Reasoning-specialized model |

Note: Prices as of December 2025. “Miss” = cache miss, “Hit” = cache hit pricing. Check the official DeepSeek and OpenAI pricing pages for the latest information.
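To turn the listed prices into a workload estimate, the sketch below blends cache-hit and cache-miss input pricing by an assumed hit rate. The function name, the example workload (100M input and 20M output tokens per month), and the 50% hit rate are assumptions; the per-token prices are the ones listed above, and the GPT-4o line ignores its caching discount.

```python
def monthly_cost_usd(input_mtok, output_mtok, input_price, output_price,
                     cached_input_price=None, cache_hit_rate=0.0):
    """Estimated cost in USD.

    input_mtok / output_mtok: millions of tokens per month.
    Prices are USD per 1M tokens. cache_hit_rate is between 0.0 and 1.0.
    """
    if cached_input_price is None:
        cached_input_price = input_price
    blended_input = (cache_hit_rate * cached_input_price
                     + (1.0 - cache_hit_rate) * input_price)
    return input_mtok * blended_input + output_mtok * output_price


if __name__ == "__main__":
    deepseek_v3 = monthly_cost_usd(100, 20, input_price=0.27, output_price=1.10,
                                   cached_input_price=0.07, cache_hit_rate=0.5)
    gpt_4o = monthly_cost_usd(100, 20, input_price=5.0, output_price=20.0)
    print(f"DeepSeek V3: ${deepseek_v3:.2f}")
    print(f"GPT-4o     : ${gpt_4o:.2f}")
```

Under these assumptions the workload costs about $39 on DeepSeek V3 and $900 on GPT-4o. The result is sensitive to the cache hit rate and to the input/output mix, so it is worth rerunning with numbers from your own traffic.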

This price difference cannot be fully explained by technical superiority alone. Strategic pricing for market share acquisition has also been suggested.

Quality Concerns

Some third-party evaluations report cases where DeepSeek models underperform competitors in specific tasks.

While DeepSeek is excellent at coding and math, there are evaluations suggesting gaps with GPT-4 and Claude on general tasks. Whether this difference stems from information loss due to Sparse Attention requires further verification.


Part 6: Where Sparse Attention Is Still Needed

Despite the challenges discussed so far, Sparse Attention has valid use cases.

Valid Use Cases

| Use Case | Reason |
| --- | --- |
| Ultra-long context (100K+ tokens) | Full Attention’s quadratic cost becomes prohibitive |
| Cost-constrained applications | Price more important than quality |
| Simple tasks | Information loss impact is small |
| Batch processing | Efficiency over latency |

Technical Progress Is Real

DeepSeek’s NSA and DSA are technically advanced:

  • Hardware optimization: Optimized for Hopper/Blackwell generation
  • Dynamic sparse patterns: From fixed patterns to learning-based
  • Training support: Sparse during training, not just inference

However, these advances don’t replace Full Attention but expand options under specific conditions.
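The "dynamic sparse patterns" idea can be pictured as choosing, per query, which blocks of keys are worth attending to. The sketch below scores each block by the dot product between the query and the block's mean key, then keeps the top-scoring blocks. This is an illustrative assumption about how block selection can work in general; the function names and the mean-pooling score are hypothetical and this is not DeepSeek's NSA or DSA implementation.

```python
def select_blocks(q, keys, block_size, top_n):
    """Return the ids of the top_n key blocks for query q, in ascending order.

    Each block is scored by q . mean(keys in the block).
    """
    scored = []
    for block_id, start in enumerate(range(0, len(keys), block_size)):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        score = sum(x * y for x, y in zip(q, mean_key))
        scored.append((score, block_id))
    best = sorted(scored, reverse=True)[:top_n]
    return sorted(block_id for _, block_id in best)


def selected_token_indices(block_ids, block_size, n_keys):
    """Expand selected block ids into the key positions that get attended to."""
    return [i for b in block_ids
            for i in range(b * block_size, min((b + 1) * block_size, n_keys))]
```

Because the selection depends on the query, the pattern changes from token to token instead of being fixed in advance. Selecting whole blocks keeps the memory reads contiguous, at the price of a coarser notion of which tokens are important.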


Part 7: Future Outlook

Realistic Position of Sparse Attention

graph TD
    subgraph Reality["Reality from 2025 Onwards"]
        FA["FlashAttention"] -->|Industry Standard| Standard["Quality-Focused Apps"]
        SA["Sparse Attention"] -->|Specialized Use| Niche["Cost-Focused & Ultra-Long"]
        Hybrid["Hybrid"] -->|Research Stage| Future["Future Possibilities"]
    end

Predictions

  1. FlashAttention continues to dominate: As long as speedup without quality sacrifice is demanded
  2. Sparse Attention for specialized uses: Where Full Attention is impractical, such as 100M+ token processing
  3. Rise of hybrid approaches: Full for important parts, Sparse for the rest

Implications for Engineers

  • Prioritize FlashAttention-2/3: 2-4x speedup without sacrificing quality
  • Evaluate Sparse Attention carefully: Don’t take benchmark results at face value
  • Consider task characteristics: Full Attention if reasoning or complex relationship comprehension needed
  • If cost optimization is paramount: DeepSeek API worth considering (with quality trade-off understood)

Conclusion: Trade-offs, Not Innovation

The following has become clear about DeepSeek Sparse Attention:

Facts

  • It is an improved version of technology that has existed since 2019: it includes new contributions, but the base technology already existed
  • GPT-4, Claude, and Gemini are not believed to adopt it, and there are reasons for that
  • FlashAttention is the industry standard: it delivers speedup without sacrificing quality
  • The risk of information loss is fundamental and is especially notable in reasoning tasks
  • It is getting attention in the context of cost competition

Conclusion

Sparse Attention is not a silver bullet. It should be understood as one tool that’s effective under specific conditions, not a universal solution.

The fact that leading AI companies are presumed to choose Full Attention + FlashAttention suggests that this is currently the best option when quality is the priority.

DeepSeek’s price competitiveness is attractive, but choices should be made understanding what is being sacrificed.


References



About Citation Accuracy: Research cited in this article was verified using these methods:

  • Academic papers: Confirmed via arXiv, Google Scholar
  • Technical blogs: Citations confirmed on official blogs like Google Research
  • Pricing: Confirmed via official API documentation (DeepSeek, OpenAI) as of December 2025

Important Notes:

  • GPT-4, Claude, Gemini architecture: Officially undisclosed; estimates based on multiple technical analyses
  • Pricing: Fluctuates frequently. DeepSeek especially has frequent price changes; check official sites for latest
  • Caching: Listed prices distinguish cache hit/miss. Cache hits are significantly cheaper
  1. Generating Long Sequences with Sparse Transformers - OpenAI (2019). arXiv:1904.10509. [Reliability: High]

  2. Big Bird: Transformers for Longer Sequences - Google (2020). NeurIPS 2020. [Reliability: High]

  3. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness - Stanford (2022). arXiv:2205.14135. [Reliability: High]

  4. Post-Training Sparse Attention with Double Sparsity - (2024). arXiv. [Reliability: High]

  5. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention - DeepSeek AI (2025). arXiv. [Reliability: High]

This post is licensed under CC BY 4.0 by the author.