LLM Knowledge Limits and the Skills/Rules Boundary: What Prompts Can and Cannot Fix

  • Target audience: Engineers using AI coding tools (Claude Code, Cursor, etc.)
  • Prerequisites: Basic understanding of LLMs and prompt engineering
  • Reading time: 12 minutes

Summary

AI coding tools like Claude Code and Cursor allow you to define rules in CLAUDE.md or skill files to control AI behavior. But this raises a question: “Can adding more rules make AI do anything?”

Here are the conclusions upfront:

  1. More constraints = lower instruction compliance: multiple studies report this
  2. The most effective approach: “Name known concepts, detail only project-specific parts”
  3. For new concepts, provide documentation directly instead of teaching through rules

This article explains these claims based on research data from 2024-2025.

Research Shows: More Constraints = Lower Compliance

Benchmarks Reveal Clear Limits

Multiple studies have quantitatively demonstrated the limits of LLM instruction-following capabilities.

ComplexBench (2024) evaluated LLMs’ ability to handle compound constraints using 1,150 instructions and 5,306 scoring questions [1]. The results were clear:

| Constraint Structure | GPT-4 Score |
|---|---|
| Simple (And) | 0.881 |
| Chain | 0.766 |
| Selection | 0.765 |
| Nested (3+ layers) | 0.626 |

Scores clearly drop as constraints become more complex. Furthermore, even the strongest models achieved only 0.532 accuracy on length constraints, and GPT-4 dropped to 14.9% accuracy on multi-layer nested structures.
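
To put the table in relative terms, the short calculation below expresses each score as a fraction lost against the simple (And) case. It only uses the four GPT-4 scores quoted above; the helper name `relative_drop` is made up for this illustration.

```python
# Scores copied from the ComplexBench table above (GPT-4).
scores = {
    "Simple (And)": 0.881,
    "Chain": 0.766,
    "Selection": 0.765,
    "Nested (3+ layers)": 0.626,
}


def relative_drop(baseline: float, score: float) -> float:
    """Return the fraction of the baseline score that is lost."""
    return (baseline - score) / baseline


baseline = scores["Simple (And)"]
drops = {name: relative_drop(baseline, s) for name, s in scores.items()}

for name, drop in drops.items():
    print(f"{name}: {drop:.1%} below the simple case")
```

By this measure, chain and selection structures lose roughly 13% against the simple case, and nested structures lose roughly 29%.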

Key Findings from the 18-Model Benchmark

AI Muse’s 18-model benchmark [2] confirmed similar issues in a more practical context. On a creative task with 10 constraints (writing a children’s story), zero models achieved a perfect score.

Constraint Violation Patterns:

| Constraint Type | Violation Rate |
|---|---|
| Forbidden words | 94% |
| Name usage limits | 89% |
| Cliché prohibition | 67% |
| Character count range | 39% |

The highest score was GPT-4 o1 at 7/10. In other words, even the most advanced model failed to follow 3 out of 10 constraints.
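
Two of the constraint types in the table, forbidden words and character count range, are mechanical enough to check in code rather than by eye. The following is a minimal sketch of such a check, not the benchmark's own scoring method; the `Constraints` class, the `find_violations` function, and the sample story are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Constraints:
    forbidden_words: tuple[str, ...]
    min_chars: int
    max_chars: int


def find_violations(text: str, c: Constraints) -> list[str]:
    """Return a human-readable list of violated constraints (empty if none)."""
    violations = []
    for word in c.forbidden_words:
        # Whole-word, case-insensitive match so "very" does not flag "every".
        if re.search(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE):
            violations.append(f"forbidden word: {word}")
    if not c.min_chars <= len(text) <= c.max_chars:
        violations.append(
            f"length {len(text)} outside {c.min_chars}-{c.max_chars}"
        )
    return violations


story = "Once upon a time, a small fox found a very shiny key."
rules = Constraints(forbidden_words=("very", "suddenly"), min_chars=20, max_chars=200)
print(find_violations(story, rules))  # ['forbidden word: very']
```

Constraints such as cliché prohibition are harder to express this way and would still need human review.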

The Difficulty of Instruction Compliance

These research results show that expecting “AI will follow rules if you write them” is overly optimistic. Even state-of-the-art models have clear limitations in handling compound constraints, and compliance rates decrease as the number and complexity of constraints increase.

Why More Constraints Lower Compliance

Context Rot

Chroma’s research team reported a phenomenon called “Context Rot” [3].

The key question is: what degrades?

| Type of Degradation | Tolerance | Impact on Work |
|---|---|---|
| Processing speed | High | Just longer wait times |
| Token consumption | High | Costs increase but predictable |
| Instruction compliance | Low | “It doesn’t do what I asked” |
| Output quality | Low | Review burden increases |

The “performance degradation” that the research describes is not mainly about speed or cost; it is a decline in instruction compliance. The reported findings include:

  • Longer inputs lead to performance degradation
  • “Lost in the Middle” problem: Instructions at the beginning and end are followed better; rules in the middle are more likely to be forgotten [4]
  • More complex tasks see worse degradation

The Comprehension-Competence Gap

There’s an even trickier problem. LLMs can verbalize (explain) rules yet fail to apply them in actual tasks [5].

This is called “computational split-brain syndrome,” indicating that understanding instructions and executing actions are functionally separated. In other words, “writing rules doesn’t guarantee they’ll be followed.”

Effective Approach: Name Known Concepts

CLAUDE.md Optimization Research Results

Arize’s research team quantitatively measured the effects of CLAUDE.md optimization [6]:

  • Cross-repository testing: +5.19% accuracy improvement
  • Same-repository testing: +10.87% accuracy improvement

The key insight from this research: effective rules should focus on specific patterns, conventions, and potential pitfalls of the codebase.

The “Name It” Approach

For concepts LLMs already know, naming them is sufficient. By detailing only project-specific parts, you save context and maintain instruction compliance.

❌ Bad example (verbose):
"Please follow Domain-Driven Design principles.
Domain-Driven Design is an approach to software design
that centers on the business domain,
using concepts like entities, value objects, aggregates,
repositories, and services.
Entities have unique identifiers..." (long explanation)

✅ Good example (efficient):
"Follow DDD.
Project-specific rules:
- User aggregate root is UserEntity
- Repositories go in /src/infrastructure/repositories
- Domain events publish to Kafka"

Why This Works

| Approach | Context Usage | Compliance Rate |
|---|---|---|
| Explain concept from scratch | High | Prone to decline |
| Name only | Low | High |
| Name + specific rules | Medium | High |

Re-explaining “what LLMs already know” wastes context and causes instruction dilution.
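
The “name + specific rules” pattern can be sketched as a small helper that emits the concept names on one line and spends the remaining lines only on project-specific details. This is an illustrative sketch; `render_rules` is a hypothetical function, and the rule texts are taken from the good example above.

```python
def render_rules(known_concepts: list[str], project_rules: list[str]) -> str:
    """Name well-known concepts in one line, then list only project specifics."""
    lines = [f"Follow {', '.join(known_concepts)}."]
    if project_rules:
        lines.append("Project-specific rules:")
        lines.extend(f"- {rule}" for rule in project_rules)
    return "\n".join(lines)


print(
    render_rules(
        ["DDD"],
        [
            "User aggregate root is UserEntity",
            "Repositories go in /src/infrastructure/repositories",
            "Domain events publish to Kafka",
        ],
    )
)
```

Keeping the concept explanation out of the helper entirely makes it harder for a rules file to drift back toward the verbose form.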

Examples of Concepts LLMs Know

| Category | Examples | How to Instruct |
|---|---|---|
| Classic patterns | MVC, Layered Architecture | Name only |
| Established principles | SOLID, DRY, KISS | Name only |
| Mature architectures | DDD, CQRS, Clean Architecture | Name + specific application rules |
| Relatively new approaches | Vertical Slice Architecture | Name + detailed supplementation |

Important note: Even if LLMs know “concepts” like DDD or Clean Architecture, the “correct application” in a specific project is a separate matter. That’s why naming the concept and detailing only project-specific application rules is effective.

Handling New Concepts

Teaching Through Rules Is Counterproductive

LLM knowledge stops at the training data cutoff [7]. For libraries released in 2025, APIs after breaking changes, or the latest best practices, LLMs may simply “not know.”

“Can’t we teach by writing detailed explanations in skill files?” is a natural thought, but as mentioned earlier, more explanations = lower compliance rates. This can be counterproductive.

Effective Approaches

1. Include Documentation Directly in Context

# Effective approach
"Read the new API documentation and implement based on that"

# Ineffective approach
"This API changed significantly in v3. The new usage is..."
(writing long explanations in skills)

The former leverages LLMs’ “comprehension ability” to process new information. The latter bloats skills and risks lowering compliance with other rules.
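
When prompts are assembled programmatically, the same idea can be sketched as follows: the documentation travels with the individual request instead of living permanently in the skill file. The function names, the delimiter lines, and the `max_doc_chars` cutoff are assumptions made for this example, not part of any tool's API.

```python
from pathlib import Path


def build_prompt(task: str, doc_text: str, max_doc_chars: int = 20_000) -> str:
    """Place reference documentation next to the task, not in the rules file."""
    # Truncate very long documents so the task itself is not crowded out.
    excerpt = doc_text[:max_doc_chars]
    return (
        "Use the documentation below as the source of truth for the API.\n\n"
        "--- documentation start ---\n"
        f"{excerpt}\n"
        "--- documentation end ---\n\n"
        f"Task: {task}"
    )


def build_prompt_from_file(task: str, doc_path: str) -> str:
    """Same as build_prompt, but reads the documentation from a local file."""
    return build_prompt(task, Path(doc_path).read_text(encoding="utf-8"))


prompt = build_prompt(
    "Migrate the client code to v3",
    "v3: client.connect(url) replaces client.open(url)",
)
print(prompt)
```

Because the documentation is attached per request, it stops consuming context as soon as the task that needs it is finished.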

2. Add Verification Layers

Whether LLMs correctly applied new concepts is unknown until you verify the output.

flowchart TB
    A["Request LLM to use new API"]
    B["Code generation"]
    C["Run automated tests"]
    D{Test result}
    E["Success: Adopt"]
    F["Failure: Feedback and regenerate"]

    A --> B --> C --> D
    D -->|Pass| E
    D -->|Fail| F
    F --> B

The AI Muse benchmark also concluded that “human review is essential” [2]. Since even state-of-the-art models fail to follow 3 out of 10 constraints, deploying to production without verification is risky.
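
The generate, test, and regenerate loop in the flowchart can be sketched in a few lines. The `generate` and `run_tests` callables are placeholders for whatever model call and test runner a project actually uses, and the `max_attempts` cap is an assumption added here so the loop always terminates; the stubs at the bottom exist only to make the example runnable.

```python
from typing import Callable, Optional


def generate_until_passing(
    generate: Callable[[str], str],
    run_tests: Callable[[str], Optional[str]],
    request: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Return code that passes the tests, or None if every attempt fails."""
    prompt = request
    for _ in range(max_attempts):
        code = generate(prompt)
        failure = run_tests(code)  # None means all tests passed
        if failure is None:
            return code
        # Feed the failure back so the next attempt can correct it.
        prompt = f"{request}\n\nThe previous attempt failed:\n{failure}\nFix it."
    return None


# Stub model: wrong on the first try, correct once it sees failure feedback.
def fake_generate(prompt: str) -> str:
    if "previous attempt failed" in prompt:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"


# Stub test runner: returns a failure message, or None on success.
def fake_run_tests(code: str) -> Optional[str]:
    return None if "a + b" in code else "add(1, 2) returned -1, expected 3"


result = generate_until_passing(fake_generate, fake_run_tests, "Write add(a, b)")
print(result)  # def add(a, b): return a + b
```

Returning `None` after the last attempt hands the decision back to a human reviewer instead of silently adopting failing code.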

3. Use WebSearch/RAG

With Claude Code’s WebSearch feature or a RAG system, LLMs can access recent information that was not in their training data.

Practical Rule Design

Lessons from the AI Muse Benchmark

The AI Muse benchmark refined its system prompt in three stages, and average scores rose significantly [2]:

| Version | Average Score |
|---|---|
| S-0 (Initial) | 2.4 |
| S-1 (Full spec in one block) | 6.0 |
| S-2 (Optimized) | 6.3 |

Key finding: “Consolidating all rules as one continuous block in the system prompt” was effective. Partial or delayed constraints were consistently ignored.
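
If rules currently live in several places, one way to follow this finding is to merge them into a single numbered block before they reach the system prompt. The sketch below is an assumption about how that could look, not the benchmark's procedure; `consolidate_rules` and the header text are hypothetical.

```python
def consolidate_rules(*sources: list[str]) -> str:
    """Merge rule lists from several places into one numbered block."""
    seen = set()
    merged = []
    for source in sources:
        for rule in source:
            # Normalize case and whitespace so near-identical rules collapse.
            key = " ".join(rule.lower().split())
            if key and key not in seen:
                seen.add(key)
                merged.append(rule.strip())
    numbered = [f"{i}. {rule}" for i, rule in enumerate(merged, start=1)]
    return "Rules (all apply for the whole session):\n" + "\n".join(numbered)


block = consolidate_rules(
    ["Use TypeScript strict mode"],
    ["use typescript  strict mode", "No default exports"],
)
print(block)
```

The duplicate check here is purely textual, so rules that contradict each other in meaning still have to be found by reading them.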

Rule Design Best Practices

❌ Avoid:
- Rules covering every edge case
- Exceptions to exceptions
- Vague expressions ("appropriately," "as needed")
- Rules scattered across multiple locations

✅ Effective:
- Limit to 5-10 most important rules
- Include concrete examples
- Eliminate contradicting rules
- Consolidate all rules in one block
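
Some of these checks can be automated with a small lint pass over the rule list. The sketch below flags rule counts above ten and a few vague phrases; `lint_rules` is hypothetical, and the phrase list extends the two examples above with "if necessary" and "properly" as an assumption.

```python
import re

# "appropriately" and "as needed" come from the list above; the rest are assumed.
VAGUE_PHRASES = ("appropriately", "as needed", "if necessary", "properly")
MAX_RULES = 10


def lint_rules(rules: list[str]) -> list[str]:
    """Return warnings for rule bloat and vague wording."""
    warnings = []
    if len(rules) > MAX_RULES:
        warnings.append(
            f"{len(rules)} rules; consider trimming to {MAX_RULES} or fewer"
        )
    for i, rule in enumerate(rules, start=1):
        for phrase in VAGUE_PHRASES:
            if re.search(rf"\b{re.escape(phrase)}\b", rule, flags=re.IGNORECASE):
                warnings.append(f"rule {i}: vague phrase '{phrase}'")
    return warnings


print(
    lint_rules(
        [
            "Handle errors appropriately",
            "Repositories go in /src/infrastructure/repositories",
        ]
    )
)  # ["rule 1: vague phrase 'appropriately'"]
```

A pass like this only catches surface problems; contradictions and missing examples still need a manual read.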

Regular Review

Skills and rules become stale over time. Regularly check:

  • Contradicting rules: Do new rules conflict with old ones?
  • Obsolete rules: Have any rules become meaningless as the project evolved?
  • Rule bloat: Are you keeping only truly necessary rules?

Conclusion

What research shows:

  • More constraints = lower compliance (ComplexBench: 0.881 → 0.626)
  • Zero models achieved perfection with 10 constraints (AI Muse)
  • Even state-of-the-art models have clear limits on compound constraints

Effective approaches:

  • Name known concepts: Trust and leverage LLMs’ existing knowledge
  • Detail only specific rules: Conserve context
  • Keep rules few and consolidated: Prevent instruction dilution

Handling new concepts:

  • Writing long explanations in skills is counterproductive
  • Include documentation directly in context
  • Add verification layers to check output

Ultimately, LLMs are “amplification tools,” not “universal tools.” Leveraging concepts LLMs already know and supplementing only the project-specific parts is, based on the research above, currently the most effective approach.

References

References corresponding to citation numbers in the text are listed in numerical order.

About citation accuracy: The research cited in this article is primarily from academic databases (arXiv, ACL Anthology, EMNLP), official company blogs, and reliable technical media. Some preprints are papers before peer review, so reliability levels are explicitly stated.

  1. Benchmarking Complex Instruction-Following with Multiple Constraints Composition - arXiv (2024). [Reliability: Medium-High (Preprint)]

  2. System Prompts Versus User Prompts: Empirical Lessons from an 18-Model LLM Benchmark - AI Muse (2025). [Reliability: Medium]

  3. Context Rot: How Increasing Input Tokens Impacts LLM Performance - Chroma Research (2025). [Reliability: High]

  4. Lost in the Middle: How Language Models Use Long Contexts - Liu et al. (2024). [Reliability: High (TACL peer-reviewed)]

  5. Comprehension Without Competence: Architectural Limits of LLMs - arXiv (2025). [Reliability: Medium-High (Preprint)]

  6. CLAUDE.md: Best Practices Learned from Optimizing Claude Code with Prompt Learning - Arize (2025). [Reliability: Medium-High]

  7. Knowledge cutoff - Wikipedia (2025). [Reliability: Medium-High]

This post is licensed under CC BY 4.0 by the author.