LLM Knowledge Limits and the Skills/Rules Boundary: What Prompts Can and Cannot Fix

Posted Feb 6, 2026


AI-Generated Content

This article was generated by AI. The accuracy of the content is not guaranteed, and we accept no responsibility for any damages resulting from use of this article. By continuing to read, you agree to the Terms of Use.

Target audience: Engineers using AI coding tools (Claude Code, Cursor, etc.)
Prerequisites: Basic understanding of LLMs and prompt engineering
Reading time: 12 minutes

Summary

AI coding tools like Claude Code and Cursor allow you to define rules in CLAUDE.md or skill files to control AI behavior. But this raises a question: “Can adding more rules make AI do anything?”

Here are the conclusions upfront:

More constraints = lower instruction compliance: multiple benchmarks report this pattern
The most effective approach: “Name known concepts, detail only project-specific parts”
For new concepts, provide documentation directly instead of teaching through rules

This article explains these claims based on research data from 2024-2025.

Research Shows: More Constraints = Lower Compliance

Benchmarks Reveal Clear Limits

Multiple studies have quantitatively demonstrated the limits of LLM instruction-following capabilities.

ComplexBench (2024) evaluated LLMs’ ability to handle compound constraints using 1,150 instructions and 5,306 scoring questions¹. The results were clear:

Constraint Structure	GPT-4 Score
Simple (And)	0.881
Chain	0.766
Selection	0.765
Nested (3+ layers)	0.626

Scores clearly drop as constraints become more complex. Furthermore, even the strongest models achieved only 0.532 accuracy on length constraints, and GPT-4 dropped to 14.9% accuracy on multi-layer nested structures.
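To make the size of the drop concrete, the short sketch below computes each structure's decline relative to the simple (And) score. It uses only the four GPT-4 values from the table above; the function and variable names are illustrative.

```python
# GPT-4 scores by constraint structure, copied from the table above.
scores = {
    "Simple (And)": 0.881,
    "Chain": 0.766,
    "Selection": 0.765,
    "Nested (3+ layers)": 0.626,
}

baseline = scores["Simple (And)"]


def relative_drop(score: float, base: float = baseline) -> float:
    """Fractional decline of `score` relative to `base`."""
    return (base - score) / base


for structure, score in scores.items():
    print(f"{structure:<20} {score:.3f}  drop vs. simple: {relative_drop(score):.1%}")
```

By this measure, the nested score sits roughly 29% below the simple case, while chain and selection structures each lose about 13%.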

Key Findings from the 18-Model Benchmark

AI Muse’s 18-model benchmark² confirmed similar issues in a more practical context. On a creative task with 10 constraints (writing a children’s story), zero models achieved a perfect score.

Constraint Violation Patterns:

Constraint Type	Violation Rate
Forbidden words	94%
Name usage limits	89%
Cliché prohibition	67%
Character count range	39%

The highest score was GPT-4 o1 at 7/10. In other words, even the most advanced model failed to follow 3 out of 10 constraints.

The Difficulty of Instruction Compliance

These research results show that expecting “AI will follow rules if you write them” is overly optimistic. Even state-of-the-art models have clear limitations in handling compound constraints, and compliance rates decrease as the number and complexity of constraints increase.
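One simplified way to see why compliance can fall as rules accumulate: assume, purely for illustration, that each constraint is followed independently with the same probability `p`. Neither benchmark cited here claims this model, and the probabilities below are hypothetical, so treat it as a back-of-the-envelope sketch rather than a measured result.

```python
def all_constraints_met(p: float, n: int) -> float:
    """Probability that all n constraints hold, ASSUMING each is followed
    independently with probability p (a simplification, not benchmark data)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return p ** n


# Hypothetical per-constraint compliance rates, chosen only for illustration.
for p in (0.99, 0.95, 0.90):
    row = ", ".join(f"n={n}: {all_constraints_met(p, n):.2f}" for n in (1, 5, 10, 20))
    print(f"p={p:.2f} -> {row}")
```

Under this assumption, a 95% per-rule rate leaves roughly a 60% chance of satisfying ten rules at once. Real constraints interact, so actual numbers will differ, but the direction matches what the benchmarks report.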

Why More Constraints Lower Compliance

Context Rot

Chroma’s research team reported a phenomenon called “Context Rot”³.

The key question is: what degrades?

Type of Degradation	Tolerance	Impact on Work
Processing speed	High	Just longer wait times
Token consumption	High	Costs increase but predictable
Instruction compliance	Low	“It doesn’t do what I asked”
Output quality	Low	Review burden increases

The “performance degradation” that the research describes is not mainly about speed or cost; it is a decline in instruction compliance.

Three patterns stand out:

Longer inputs lead to performance degradation
“Lost in the Middle” problem: Instructions at the beginning and end are followed better; rules in the middle are more likely to be forgotten⁴
More complex tasks see worse degradation
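If position matters, one heuristic that follows from the “Lost in the Middle” observation is to put the most important rules at the start and end of the rules block. The sketch below is a hypothetical helper, not something taken from the cited papers, and whether this ordering helps a particular model is an assumption you would need to test.

```python
def place_at_edges(rules_by_priority: list[str]) -> list[str]:
    """Order rules so the most important ones sit at the start and end.

    `rules_by_priority` is sorted from most to least important. Rules are
    dealt alternately to the front and the back, so the least important
    ones end up in the middle of the returned list.
    """
    front: list[str] = []
    back: list[str] = []
    for index, rule in enumerate(rules_by_priority):
        if index % 2 == 0:
            front.append(rule)
        else:
            back.append(rule)
    # Reverse the back half so the second most important rule is the last line.
    return front + back[::-1]


rules = ["A (critical)", "B (critical)", "C (important)", "D (minor)", "E (minor)"]
print(place_at_edges(rules))
```

With five rules ranked A to E, the two critical rules land first and last, and the two minor ones land in the middle.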

The Comprehension-Competence Gap

There’s an even trickier problem. LLMs can verbalize (explain) rules yet fail to apply them in actual tasks⁵.

This is called “computational split-brain syndrome,” indicating that understanding instructions and executing actions are functionally separated. In other words, “writing rules doesn’t guarantee they’ll be followed.”

Effective Approach: Name Known Concepts

CLAUDE.md Optimization Research Results

Arize’s research team quantitatively measured the effects of CLAUDE.md optimization⁶:

Cross-repository testing: +5.19% accuracy improvement
Same-repository testing: +10.87% accuracy improvement

The key insight from this research: effective rules should focus on specific patterns, conventions, and potential pitfalls of the codebase.

The “Name It” Approach

For concepts LLMs already know, naming them is sufficient. By detailing only project-specific parts, you save context and maintain instruction compliance.

  
❌ Bad example (verbose):
"Please follow Domain-Driven Design principles.
Domain-Driven Design is an approach to software design
that centers on the business domain,
using concepts like entities, value objects, aggregates,
repositories, and services.
Entities have unique identifiers..." (long explanation)

✅ Good example (efficient):
"Follow DDD.
Project-specific rules:
- User aggregate root is UserEntity
- Repositories go in /src/infrastructure/repositories
- Domain events publish to Kafka"
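The “good example” above has a simple shape: one line naming the concept, followed by a short list of local rules. As a minimal sketch, a hypothetical helper (`build_rules_block` is an illustrative name, not part of any tool) could render that shape and report its rough size:

```python
def build_rules_block(concept: str, project_rules: list[str]) -> str:
    """Render a compact rules block: name the concept, list only local rules."""
    lines = [f"Follow {concept}."]
    if project_rules:
        lines.append("Project-specific rules:")
        lines.extend(f"- {rule}" for rule in project_rules)
    return "\n".join(lines)


block = build_rules_block(
    "DDD",
    [
        "User aggregate root is UserEntity",
        "Repositories go in /src/infrastructure/repositories",
        "Domain events publish to Kafka",
    ],
)
print(block)
# Word count is only a crude stand-in for token usage.
print(f"(about {len(block.split())} words)")
```

The word count is a crude proxy for context usage, but it makes the difference from a paragraph-long explanation of DDD easy to see.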

Why This Works

Approach	Context Usage	Compliance Rate
Explain concept from scratch	High	Prone to decline
Name only	Low	High
Name + specific rules	Medium	High

Re-explaining “what LLMs already know” wastes context and causes instruction dilution.

Examples of Concepts LLMs Know

Category	Examples	How to Instruct
Classic patterns	MVC, Layered Architecture	Name only
Established principles	SOLID, DRY, KISS	Name only
Mature architectures	DDD, CQRS, Clean Architecture	Name + specific application rules
Relatively new approaches	Vertical Slice Architecture	Name + detailed supplementation

Important note: Even if LLMs know “concepts” like DDD or Clean Architecture, the “correct application” in a specific project is a separate matter. That’s why naming the concept and detailing only project-specific application rules is effective.

Handling New Concepts

Teaching Through Rules Is Counterproductive

LLM knowledge stops at the training data cutoff⁷. For libraries released after that cutoff, APIs that have since had breaking changes, or recent best practices, the LLM may simply “not know.”

“Can’t we teach by writing detailed explanations in skill files?” is a natural thought, but as mentioned earlier, more explanations = lower compliance rates. This can be counterproductive.

Effective Approaches

1. Include Documentation Directly in Context

# Effective approach
"Read the new API documentation and implement based on that"

# Ineffective approach
"This API changed significantly in v3. The new usage is..."
(writing long explanations in skills)

The former leverages LLMs’ “comprehension ability” to process new information. The latter bloats skills and risks lowering compliance with other rules.
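If you assemble prompts in your own tooling, the first approach might look like the sketch below: the documentation is inlined into a one-off prompt rather than summarized in a skill file. The function name, the `<documentation>` wrapper, the file name, and the `max_chars` cap are all illustrative assumptions, not features of any particular tool.

```python
import tempfile
from pathlib import Path


def build_prompt_with_docs(task: str, doc_path: Path, max_chars: int = 20_000) -> str:
    """Inline a documentation file into a one-off prompt.

    `max_chars` is an arbitrary illustrative cap, not a recommended value.
    """
    doc_text = doc_path.read_text(encoding="utf-8")
    if len(doc_text) > max_chars:
        doc_text = doc_text[:max_chars] + "\n[documentation truncated]"
    return (
        "Read the documentation below and implement based on it.\n\n"
        f"<documentation source='{doc_path.name}'>\n{doc_text}\n</documentation>\n\n"
        f"Task: {task}"
    )


# Demo with a throwaway file; "api_v3.md" and its content are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "api_v3.md"
    doc.write_text("## v3 changes\nclient.fetch() is now async.\n", encoding="utf-8")
    print(build_prompt_with_docs("Migrate the fetch calls to v3", doc))
```

The skill file stays short, and the documentation only occupies context for the tasks that actually need it.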

2. Add Verification Layers

Whether LLMs correctly applied new concepts is unknown until you verify the output.

flowchart TB
    A["Request LLM to use new API"]
    B["Code generation"]
    C["Run automated tests"]
    D{Test result}
    E["Success: Adopt"]
    F["Failure: Feedback and regenerate"]

    A --> B --> C --> D
    D -->|Pass| E
    D -->|Fail| F
    F --> B

The AI Muse benchmark also concluded that “human review is essential”². Since even the top-scoring model missed 3 of 10 constraints, deploying generated code to production without verification is risky.

3. Use WebSearch/RAG

By using Claude Code’s WebSearch feature or a RAG system, the LLM can access recent information that was not in its training data.

Practical Rule Design

Lessons from the AI Muse Benchmark

The AI Muse benchmark revised its system prompt across three versions, and the average score rose from 2.4 to 6.3²:

Version	Average Score
S-0 (Initial)	2.4
S-1 (Full spec in one block)	6.0
S-2 (Optimized)	6.3

Key finding: “Consolidating all rules as one continuous block in the system prompt” was effective. Partial or delayed constraints were consistently ignored.

Rule Design Best Practices

  
❌ Avoid:
- Rules covering every edge case
- Exceptions to exceptions
- Vague expressions ("appropriately," "as needed")
- Rules scattered across multiple locations

✅ Effective:
- Limit to 5-10 most important rules
- Include concrete examples
- Eliminate contradicting rules
- Consolidate all rules in one block
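Some of these checks are mechanical enough to automate. The sketch below is a hypothetical linter, not an existing tool: it flags a rule count above 10, exact duplicates, and the two vague phrases named above. It cannot detect contradictions or judge whether a rule is still relevant, so those still need human review.

```python
import re

VAGUE_TERMS = ("appropriately", "as needed")  # the two examples named above
MAX_RULES = 10                                 # upper end of the 5-10 guideline


def lint_rules(rules: list[str]) -> list[str]:
    """Return human-readable warnings for a list of rule strings."""
    warnings: list[str] = []
    if len(rules) > MAX_RULES:
        warnings.append(f"{len(rules)} rules; consider trimming to {MAX_RULES} or fewer")
    seen: set[str] = set()
    for number, rule in enumerate(rules, start=1):
        # Normalize case and whitespace so trivial variants count as duplicates.
        normalized = re.sub(r"\s+", " ", rule.strip().lower())
        if normalized in seen:
            warnings.append(f"rule {number} duplicates an earlier rule")
        seen.add(normalized)
        for term in VAGUE_TERMS:
            if term in normalized:
                warnings.append(f"rule {number} uses vague wording: '{term}'")
    return warnings


for warning in lint_rules([
    "Handle errors appropriately",
    "Repositories go in /src/infrastructure/repositories",
    "handle errors  appropriately",
]):
    print(warning)
```

Running a check like this whenever the rules file changes is one cheap way to keep rule bloat visible.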

Regular Review

Skills and rules become stale over time. Regularly check:

Contradicting rules: Do new rules conflict with old ones?
Obsolete rules: Have any rules become meaningless as the project evolved?
Rule bloat: Are you keeping only truly necessary rules?

Conclusion

What research shows:

More complex constraints = lower compliance (ComplexBench, GPT-4: 0.881 for simple → 0.626 for nested)
Zero models achieved perfection with 10 constraints (AI Muse)
Even state-of-the-art models have clear limits on compound constraints

Effective approaches:

Name known concepts: Trust and leverage LLMs’ existing knowledge
Detail only specific rules: Conserve context
Keep rules few and consolidated: Prevent instruction dilution

Handling new concepts:

Writing long explanations in skills is counterproductive
Include documentation directly in context
Add verification layers to check output

Ultimately, LLMs are “amplification tools,” not “universal tools.” Leaning on concepts the LLM already knows and supplementing only the project-specific parts is, on current evidence, the most reliable approach.

References

References corresponding to citation numbers in the text are listed in numerical order.

1. Benchmarking Complex Instruction-Following with Multiple Constraints Composition - arXiv (2024). [Reliability: Medium-High (Preprint)]
2. System Prompts Versus User Prompts: Empirical Lessons from an 18-Model LLM Benchmark - AI Muse (2025). [Reliability: Medium]
3. Context Rot: How Increasing Input Tokens Impacts LLM Performance - Chroma Research (2025). [Reliability: High]
4. Lost in the Middle: How Language Models Use Long Contexts - Liu et al. (2024). [Reliability: High (TACL peer-reviewed)]
5. Comprehension Without Competence: Architectural Limits of LLMs - arXiv (2025). [Reliability: Medium-High (Preprint)]
6. CLAUDE.md: Best Practices Learned from Optimizing Claude Code with Prompt Learning - Arize (2025). [Reliability: Medium-High]
7. Knowledge cutoff - Wikipedia (2025). [Reliability: Medium-High]

Additional References (not cited by number in text)

FollowBench: A Multi-level Fine-grained Constraints Following Benchmark - ACL 2024. [Reliability: High (Peer-reviewed)]
RECAST: Expanding the Boundaries of LLMs’ Complex Instruction Following - arXiv (2025). [Reliability: Medium-High (Preprint)]
Best Practices for Claude Code - Anthropic (2025). [Reliability: High]

About citation accuracy: The research cited in this article is primarily from academic databases (arXiv, ACL Anthology, EMNLP), official company blogs, and reliable technical media. Some preprints are papers before peer review, so reliability levels are explicitly stated.


This post is licensed under CC BY 4.0 by the author.