LLM Knowledge Limits and the Skills/Rules Boundary: What Prompts Can and Cannot Fix
This article was generated by AI. The accuracy of the content is not guaranteed, and we accept no responsibility for any damages resulting from use of this article. By continuing to read, you agree to the Terms of Use.
- Target audience: Engineers using AI coding tools (Claude Code, Cursor, etc.)
- Prerequisites: Basic understanding of LLMs and prompt engineering
- Reading time: 12 minutes
Summary
AI coding tools like Claude Code and Cursor allow you to define rules in CLAUDE.md or skill files to control AI behavior. But this raises a question: “Can adding more rules make AI do anything?”
Here are the conclusions upfront:
- More constraints = lower instruction compliance: multiple studies report this pattern
- The most effective approach: “Name known concepts, detail only project-specific parts”
- For new concepts, provide documentation directly instead of teaching through rules
This article explains these claims based on research data from 2024-2025.
Research Shows: More Constraints = Lower Compliance
Benchmarks Reveal Clear Limits
Multiple studies have quantitatively demonstrated the limits of LLM instruction-following capabilities.
ComplexBench (2024) evaluated LLMs’ ability to handle compound constraints using 1,150 instructions and 5,306 scoring questions [1]. The results were clear:
| Constraint Structure | GPT-4 Score |
|---|---|
| Simple (And) | 0.881 |
| Chain | 0.766 |
| Selection | 0.765 |
| Nested (3+ layers) | 0.626 |
Scores clearly drop as constraints become more complex. Furthermore, even the strongest models achieved only 0.532 accuracy on length constraints, and GPT-4 dropped to 14.9% accuracy on multi-layer nested structures.
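To make the size of the drop concrete, the short sketch below expresses each score in the table as a relative drop from the simple (And) baseline. The scores are copied from the table; the `relative_drop` helper is an illustrative name, not something defined by ComplexBench.

```python
# GPT-4 scores by constraint structure, copied from the table above.
scores = {
    "Simple (And)": 0.881,
    "Chain": 0.766,
    "Selection": 0.765,
    "Nested (3+ layers)": 0.626,
}

BASELINE = scores["Simple (And)"]


def relative_drop(score: float, base: float = BASELINE) -> float:
    """Return the drop from the baseline as a fraction of the baseline."""
    return (base - score) / base


for structure, score in scores.items():
    print(f"{structure:<20} {score:.3f}  drop vs. simple: {relative_drop(score):.1%}")
```

By this measure the nested case loses roughly 29% of the simple-case score.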
Key Findings from the 18-Model Benchmark
AI Muse’s 18-model benchmark [2] confirmed similar issues in a more practical context. On a creative task with 10 constraints (writing a children’s story), zero models achieved a perfect score.
Constraint Violation Patterns:
| Constraint Type | Violation Rate |
|---|---|
| Forbidden words | 94% |
| Name usage limits | 89% |
| Cliché prohibition | 67% |
| Character count range | 39% |
The highest score was GPT-4 o1 at 7/10. In other words, even the most advanced model failed to follow 3 out of 10 constraints.
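Several of the constraint types in the table can be checked mechanically instead of trusting the model to follow them. The following is a minimal sketch, assuming constraints of three of the kinds listed above (forbidden words, a name usage limit, a character count range). The function name, parameters, and sample story are hypothetical and are not taken from the benchmark.

```python
import re


def check_constraints(text, forbidden_words, name, max_name_uses, min_chars, max_chars):
    """Return a list of violations; an empty list means every check passed."""
    violations = []
    lowered = text.lower()

    # Forbidden words: case-insensitive whole-word match.
    for word in forbidden_words:
        if re.search(rf"\b{re.escape(word.lower())}\b", lowered):
            violations.append(f"forbidden word used: {word}")

    # Name usage limit: count whole-word, case-sensitive occurrences.
    name_uses = len(re.findall(rf"\b{re.escape(name)}\b", text))
    if name_uses > max_name_uses:
        violations.append(f"name '{name}' used {name_uses} times (limit {max_name_uses})")

    # Character count range, inclusive on both ends.
    if not min_chars <= len(text) <= max_chars:
        violations.append(f"length {len(text)} outside {min_chars}-{max_chars} characters")

    return violations


story = "Mia found a magic key. Mia smiled. Mia ran home."
for problem in check_constraints(story, ["magic"], "Mia", 2, 10, 200):
    print(problem)
```

Constraints such as "no clichés" have no simple mechanical check and still need human review.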
The Difficulty of Instruction Compliance
These research results show that expecting “AI will follow rules if you write them” is overly optimistic. Even state-of-the-art models have clear limitations in handling compound constraints, and compliance rates decrease as the number and complexity of constraints increase.
Why More Constraints Lower Compliance
Context Rot
Chroma’s research team reported a phenomenon called “Context Rot” [3].
The key question is: what degrades?
| Type of Degradation | Tolerance | Impact on Work |
|---|---|---|
| Processing speed | High | Just longer wait times |
| Token consumption | High | Costs increase but predictable |
| Instruction compliance | Low | “It doesn’t do what I asked” |
| Output quality | Low | Review burden increases |
The “performance degradation” that this research describes is not about speed or cost; it is declining instruction compliance. The relevant findings:
- Longer inputs lead to performance degradation
- “Lost in the Middle” problem: Instructions at the beginning and end are followed better; rules in the middle are more likely to be forgotten [4]
- More complex tasks see worse degradation
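One way to act on the positional finding above is to order a rule list so the most important rules sit at the start and the end. The sketch below is an inference from that finding, not a technique described in the cited papers, and `place_at_edges` is a hypothetical helper.

```python
def place_at_edges(rules_by_priority: list[str]) -> list[str]:
    """Order rules so the highest-priority ones sit at the start and the end.

    The input is sorted from most to least important. Rules are dealt
    alternately to the front and the back, so the least important rules
    end up in the middle of the returned list.
    """
    front: list[str] = []
    back: list[str] = []
    for index, rule in enumerate(rules_by_priority):
        if index % 2 == 0:
            front.append(rule)
        else:
            back.append(rule)
    # The back half is reversed so the second most important rule comes last.
    return front + back[::-1]


print(place_at_edges(["A", "B", "C", "D", "E"]))
```

With five rules ranked A (most important) to E, the result is A, C, E, D, B: the two top rules occupy the first and last positions.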
The Comprehension-Competence Gap
There’s an even trickier problem. LLMs can verbalize (explain) rules yet fail to apply them in actual tasks [5].
This is called “computational split-brain syndrome,” indicating that understanding instructions and executing actions are functionally separated. In other words, “writing rules doesn’t guarantee they’ll be followed.”
Effective Approach: Name Known Concepts
CLAUDE.md Optimization Research Results
Arize’s research team quantitatively measured the effects of CLAUDE.md optimization [6]:
- Cross-repository testing: +5.19% accuracy improvement
- Same-repository testing: +10.87% accuracy improvement
The key insight from this research: effective rules should focus on specific patterns, conventions, and potential pitfalls of the codebase.
The “Name It” Approach
For concepts LLMs already know, naming them is sufficient. By detailing only project-specific parts, you save context and maintain instruction compliance.
❌ Bad example (verbose):
"Please follow Domain-Driven Design principles.
Domain-Driven Design is an approach to software design
that centers on the business domain,
using concepts like entities, value objects, aggregates,
repositories, and services.
Entities have unique identifiers..." (long explanation)
✅ Good example (efficient):
"Follow DDD.
Project-specific rules:
- User aggregate root is UserEntity
- Repositories go in /src/infrastructure/repositories
- Domain events publish to Kafka"
Why This Works
| Approach | Context Usage | Compliance Rate |
|---|---|---|
| Explain concept from scratch | High | Prone to decline |
| Name only | Low | High |
| Name + specific rules | Medium | High |
Re-explaining “what LLMs already know” wastes context and causes instruction dilution.
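The "name + specific rules" format can be produced by a small helper, which makes it hard to slip a concept explanation into the rule file by accident. This is a minimal sketch; `build_rule_block` is a hypothetical name, and the output format simply mirrors the good example above.

```python
def build_rule_block(concept: str, project_rules: list[str]) -> str:
    """Render a rule block that names a known concept and lists only project specifics."""
    lines = [f"Follow {concept}."]
    if project_rules:
        lines.append("Project-specific rules:")
        lines.extend(f"- {rule}" for rule in project_rules)
    return "\n".join(lines)


print(build_rule_block("DDD", [
    "User aggregate root is UserEntity",
    "Repositories go in /src/infrastructure/repositories",
    "Domain events publish to Kafka",
]))
```

The helper deliberately has no parameter for a concept description: if the concept needs explaining, that is a sign it belongs in supplied documentation instead of the rule block.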
Examples of Concepts LLMs Know
| Category | Examples | How to Instruct |
|---|---|---|
| Classic patterns | MVC, Layered Architecture | Name only |
| Established principles | SOLID, DRY, KISS | Name only |
| Mature architectures | DDD, CQRS, Clean Architecture | Name + specific application rules |
| Relatively new approaches | Vertical Slice Architecture | Name + detailed supplementation |
Important note: Even if LLMs know “concepts” like DDD or Clean Architecture, the “correct application” in a specific project is a separate matter. That’s why naming the concept and detailing only project-specific application rules is effective.
Handling New Concepts
Teaching Through Rules Is Counterproductive
LLM knowledge stops at the training data cutoff [7]. For libraries released in 2025, APIs after breaking changes, or the latest best practices, LLMs may simply “not know.”
“Can’t we teach by writing detailed explanations in skill files?” is a natural thought, but as mentioned earlier, more explanations = lower compliance rates. This can be counterproductive.
Effective Approaches
1. Include Documentation Directly in Context
# Effective approach
"Read the new API documentation and implement based on that"
# Ineffective approach
"This API changed significantly in v3. The new usage is..."
(writing long explanations in skills)
The former leverages LLMs’ “comprehension ability” to process new information. The latter bloats skills and risks lowering compliance with other rules.
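As a sketch of the first approach, the function below reads a documentation file and places its text in the prompt next to the task, so the long explanation never enters the skill file. The function name, the `<documentation>` wrapper, and the character limit are all assumptions made for illustration; tools such as Claude Code can also read files directly when asked.

```python
from pathlib import Path


def build_prompt_with_docs(task: str, doc_path: Path, max_chars: int = 20_000) -> str:
    """Put the documentation text in the prompt so the model reads it in context."""
    docs = doc_path.read_text(encoding="utf-8")
    # Assumed safeguard: cap the size so a large file does not flood the context.
    if len(docs) > max_chars:
        docs = docs[:max_chars] + "\n[documentation truncated]"
    return (
        "Read the API documentation below and implement the task based on it.\n\n"
        f"<documentation>\n{docs}\n</documentation>\n\n"
        f"Task: {task}"
    )
```

The documentation is attached only to the request that needs it, so the standing rule set stays short.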
2. Add Verification Layers
Whether LLMs correctly applied new concepts is unknown until you verify the output.
flowchart TB
A["Request LLM to use new API"]
B["Code generation"]
C["Run automated tests"]
D{Test result}
E["Success: Adopt"]
F["Failure: Feedback and regenerate"]
A --> B --> C --> D
D -->|Pass| E
D -->|Fail| F
F --> B
The AI Muse benchmark also concluded that “human review is essential” [2]. Since even state-of-the-art models fail to follow 3 out of 10 constraints, deploying to production without verification is risky.
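The generate, test, and regenerate loop in the flowchart can be sketched as follows. The `generate` and `run_tests` callables are placeholders for whatever LLM client and test runner a project uses. The `max_attempts` cap is an added assumption, since the flowchart itself has no exit on repeated failure.

```python
from typing import Callable, Optional


def generate_until_tests_pass(
    generate: Callable[[str], str],
    run_tests: Callable[[str], tuple[bool, str]],
    request: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Regenerate code with test feedback until tests pass or attempts run out."""
    feedback = ""
    for _ in range(max_attempts):
        # After a failure, append the test report so the next attempt can use it.
        prompt = request if not feedback else f"{request}\n\nPrevious attempt failed:\n{feedback}"
        code = generate(prompt)
        passed, report = run_tests(code)
        if passed:
            return code  # Success: adopt
        feedback = report  # Failure: feed back and regenerate
    return None  # No passing attempt: hand over to a human reviewer
```

Returning `None` instead of the last failing attempt keeps unverified code from being adopted silently.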
3. Use WebSearch/RAG
By using Claude Code’s WebSearch feature or a RAG system, LLMs can access the latest information that was not in their training data.
Practical Rule Design
Lessons from the AI Muse Benchmark
The AI Muse benchmark improved system prompt writing in 3 stages, resulting in significantly improved average scores [2]:
| Version | Average Score |
|---|---|
| S-0 (Initial) | 2.4 |
| S-1 (Full spec in one block) | 6.0 |
| S-2 (Optimized) | 6.3 |
Key finding: “Consolidating all rules as one continuous block in the system prompt” was effective. Partial or delayed constraints were consistently ignored.
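If rules currently live in several places, they can be merged into one continuous block before being placed in the system prompt. The sketch below assumes the rules have already been loaded into a dictionary keyed by source; the source names and the `consolidate_rules` helper are hypothetical.

```python
def consolidate_rules(sources: dict[str, list[str]]) -> str:
    """Merge rules from several locations into one block, dropping exact duplicates."""
    seen: set[str] = set()
    merged: list[str] = []
    for rules in sources.values():
        for rule in rules:
            # Compare rules case-insensitively with whitespace collapsed.
            normalized = " ".join(rule.split()).lower()
            if normalized and normalized not in seen:
                seen.add(normalized)
                merged.append(rule.strip())
    return "Rules:\n" + "\n".join(f"- {rule}" for rule in merged)


print(consolidate_rules({
    "CLAUDE.md": ["Use TypeScript strict mode", "Write tests for new code"],
    "skill file": ["write tests for new code", "Keep functions small"],
}))
```

Only exact duplicates (after normalization) are removed; contradicting rules still have to be found by reading the merged block.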
Rule Design Best Practices
❌ Avoid:
- Rules covering every edge case
- Exceptions to exceptions
- Vague expressions ("appropriately," "as needed")
- Rules scattered across multiple locations
✅ Effective:
- Limit to 5-10 most important rules
- Include concrete examples
- Eliminate contradicting rules
- Consolidate all rules in one block
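Two items on this checklist, the rule count and vague expressions, are easy to check automatically. The following is a minimal sketch of such a lint. The limit of 10 and the vague terms come from the list above, while `lint_rules` and its warning wording are illustrative.

```python
VAGUE_TERMS = ("appropriately", "as needed")
MAX_RULES = 10


def lint_rules(rules: list[str]) -> list[str]:
    """Return warnings for an oversized rule set and for vague wording."""
    warnings = []
    if len(rules) > MAX_RULES:
        warnings.append(f"{len(rules)} rules; consider keeping at most {MAX_RULES}")
    for number, rule in enumerate(rules, start=1):
        for term in VAGUE_TERMS:
            if term in rule.lower():
                warnings.append(f"rule {number} uses vague term '{term}'")
    return warnings


for warning in lint_rules(["Handle errors appropriately", "Name tests test_<unit>_<case>"]):
    print(warning)
```

A check like this could run in CI against the rule file, so bloat and vague wording are flagged when they are introduced.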
Regular Review
Skills and rules become stale over time. Regularly check:
- Contradicting rules: Do new rules conflict with old ones?
- Obsolete rules: Have any rules become meaningless as the project evolved?
- Rule bloat: Are you keeping only truly necessary rules?
Conclusion
What research shows:
- More constraints = lower compliance (ComplexBench: 0.881 → 0.626)
- Zero models achieved perfection with 10 constraints (AI Muse)
- Even state-of-the-art models have clear limits on compound constraints
Effective approaches:
- Name known concepts: Trust and leverage LLMs’ existing knowledge
- Detail only specific rules: Conserve context
- Keep rules few and consolidated: Prevent instruction dilution
Handling new concepts:
- Writing long explanations in skills is counterproductive
- Include documentation directly in context
- Add verification layers to check output
Ultimately, LLMs are “amplification tools,” not “universal tools.” Leveraging concepts LLMs already know and supplementing only the project-specific parts is currently the most reliable combination.
Related Articles
See also these related articles:
- Technical Limitations of LLM Code Generation - Technical aspects of hallucinations and inefficiencies
- AI’s “Overthinking” Problem - The comprehension-competence gap
- Using Skills Without Writing Skills - Effective skill utilization
- Meta-Prompting and Orchestrator Mindset - High-level instruction approaches
References
References corresponding to citation numbers in the text are listed in numerical order.
Additional References (not cited by number in text)
FollowBench: A Multi-level Fine-grained Constraints Following Benchmark - ACL 2024. [Reliability: High (Peer-reviewed)]
RECAST: Expanding the Boundaries of LLMs’ Complex Instruction Following - arXiv (2025). [Reliability: Medium-High (Preprint)]
Best Practices for Claude Code - Anthropic (2025). [Reliability: High]
About citation accuracy: The research cited in this article is primarily from academic databases (arXiv, ACL Anthology, EMNLP), official company blogs, and reliable technical media. Some preprints are papers before peer review, so reliability levels are explicitly stated.
[1] Benchmarking Complex Instruction-Following with Multiple Constraints Composition - arXiv (2024). [Reliability: Medium-High (Preprint)]
[2] System Prompts Versus User Prompts: Empirical Lessons from an 18-Model LLM Benchmark - AI Muse (2025). [Reliability: Medium]
[3] Context Rot: How Increasing Input Tokens Impacts LLM Performance - Chroma Research (2025). [Reliability: High]
[4] Lost in the Middle: How Language Models Use Long Contexts - Liu et al. (2024). [Reliability: High (TACL peer-reviewed)]
[5] Comprehension Without Competence: Architectural Limits of LLMs - arXiv (2025). [Reliability: Medium-High (Preprint)]
[6] CLAUDE.md: Best Practices Learned from Optimizing Claude Code with Prompt Learning - Arize (2025). [Reliability: Medium-High]
[7] Knowledge cutoff - Wikipedia (2025). [Reliability: Medium-High]