Why AI Makes Mistakes During Coding But Can Catch Them During Review: Research and Technical Explanation
This article was generated by AI. The accuracy of the content is not guaranteed, and we accept no responsibility for any damages resulting from use of this article. By continuing to read, you agree to the Terms of Use.
Overview
When AI writes code, errors and inefficient implementations sometimes slip in. Yet when the same AI is asked to review that code, it often finds the problems and suggests fixes itself. This article explains the mechanisms behind this phenomenon, drawing on recent research (Kamoi et al., 2024; Pan et al., 2024) and on how the Transformer architecture operates.
About This Article
This article combines findings from recent research on LLM self-correction (2024-2025) with an explanation of how the Transformer architecture operates. It uses both research-established facts and technical interpretation to explain practical usage methods.
Target Readers: Software engineers using AI in development work
Article Structure:
- Overview of the phenomenon and factors
- Technical explanation of the operating principles
- Key insights from latest research
- Practical usage methods
Overview of the Phenomenon: Why Does AI’s Accuracy Differ Between “Writing” and “Reviewing”?
First, let’s organize the main factors behind this phenomenon.
1. Context Window and Attention Distribution
During coding, AI is considering multiple elements simultaneously: architecture design, logic implementation, syntax accuracy, edge case handling, performance optimization, naming convention consistency, and more. In this multitasking state, it’s easy to overlook small mistakes.
On the other hand, during review, it can concentrate on analyzing existing code, allowing more careful observation of details.
2. Difference Between Generation Phase and Evaluation Phase
This is the most important point. The AI’s internal processing differs fundamentally when generating code versus evaluating it. Details follow below.
3. Prompt Explicitness
“Write code” is a relatively open-ended request, but “Review this code” has a clear purpose of finding errors and improvements. This purpose clarity focuses AI’s attention on problem discovery.
4. Difficulty Difference Between Verification and Generation Tasks
During generation, “creating correct code” is a difficult task, but during evaluation, “finding error patterns” is a relatively easier task. This is also supported by research discussed later.
5. Possibility of Step-by-Step Reasoning
During review, it can oversee the entire code and reason step by step:
- What is this code trying to do?
- What is it actually doing?
- Are there gaps or problems?
During generation, it progresses sequentially, making this bird's-eye perspective difficult to maintain.
Technical Explanation: Difference Between Generation and Evaluation Phases
The following sections explain the differences between generation and evaluation in terms of how the Transformer architecture operates.
Generation Phase Mechanism
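Language models of the GPT type generate text autoregressively, using the decoder side of the Transformer architecture (Vaswani et al., 2017). Autoregression is a property of this decoding setup rather than of Transformers in general.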
1. Sequential Prediction
One token at a time, the model scores candidates for the next word or symbol and emits a likely one (the most probable one under greedy decoding).
def calculate_sum( → 'numbers' is most likely next
→ ')' comes next
→ ':' comes next...
Already generated tokens become context for the next prediction.
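The loop described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not a real model: `NEXT_TOKEN_SCORES` is an invented lookup table, it conditions only on the last token (a real model conditions on the whole prefix), and `generate` always takes the top-scoring candidate (greedy decoding).

```python
# Toy stand-in for a language model: maps the last token to scores for the next one.
# The table and its scores are invented for illustration only.
NEXT_TOKEN_SCORES = {
    "def": {"calculate_sum": 0.9, "main": 0.1},
    "calculate_sum": {"(": 1.0},
    "(": {"numbers": 0.8, ")": 0.2},
    "numbers": {")": 0.9, ",": 0.1},
    ")": {":": 1.0},
}


def generate(prompt_tokens, max_new_tokens=10):
    """Greedy autoregressive decoding; prompt_tokens must be non-empty."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        candidates = NEXT_TOKEN_SCORES.get(tokens[-1])
        if not candidates:  # no known continuation: stop
            break
        next_token = max(candidates, key=candidates.get)
        tokens.append(next_token)  # the new token becomes context for the next step
    return tokens


print(generate(["def"]))
# ['def', 'calculate_sum', '(', 'numbers', ')', ':']
```

Note that nothing in the loop ever revisits a token once it has been appended, which is the property the following points build on.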
2. Path Dependency (Technical Interpretation)
Once a token is generated, that token becomes input for the next prediction. This creates “path dependency” where initial choices influence subsequent generation.
Example: If you start writing a variable name as total_sum, there’s a tendency to keep using that name in subsequent code. Even if partway through you “judge” that total_count would be more appropriate, you try to maintain consistency with already generated content.
3. Local Optimization (Technical Interpretation)
At each step, it makes “the choice that seems best at this moment,” but that’s not necessarily optimal overall. It’s prone to “seeing the trees but not the forest.”
4. Chain of Commitment (Example)
# Starting to write code with wrong assumption...
def process_data(items):
    result = []  # Committed to using list
    for item in items:
        result.append(item * 2)  # Continues with list assumption
    return result[0]  # Logic error here
The initial choice of result = [] dictates subsequent code structure. Even though processing all elements but returning only the first is contradictory, it gets pulled along by the already-generated flow.
Evaluation Phase Mechanism
On the other hand, a different approach becomes possible during review.
1. Complete Context
The entire code already exists, so all of it is available as context at once. The relationship between start and end, and between input and output, can be surveyed together.
2. Bidirectional Reasoning (Technical Interpretation)
Thought process during review:
"What does this function do?" → Look at function name and docstring
"Does implementation match intent?"→ Trace the logic
"What about edge cases?" → Check conditional branches
"Are variable names consistent?" → Scan the whole thing
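Because the whole function is already in context, the review can relate the end of the function back to its start (for example, checking the return statement against the signature), not only read front to back.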
3. Constraint Verification
There’s a clear goal of “this code should do X,” so you can evaluate against that. “Is this correct?” during evaluation is an easier judgment than “What should I write?” during generation.
4. Pattern Matching
You can concentrate on finding known bug patterns:
- off-by-one error
- null pointer exception
- type mismatch
- division by zero
Comparison with Concrete Examples
Failure Example in Generation Phase
Prompt: “Write a function that calculates the average of a list”
AI’s Processing (Simplified):
→ Need def
→ Function name... calculate_average
→ Arguments... numbers
→ Calculate sum... sum(numbers)
→ Divide by length... / len(numbers)
→ Return... return
→ Done!
In this process, the question “what if numbers is an empty list?” can be forgotten, because each step concentrates on generating “the correct next token.”
Discovery in Evaluation Phase
Prompt: “Review this code”
AI’s Processing:
→ Read entire function
→ Check input type: list
→ Think about edge cases: empty list?
→ If len(numbers) is 0, ZeroDivisionError!
→ Problem found
During evaluation, looking at the entire code and then searching for “what’s wrong” makes discovery easier.
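The two traces above correspond to something like the following pair of functions. The first is what the generation trace produces; the second is one possible outcome of the review. Raising `ValueError` for an empty list is an assumption made for this sketch (returning `None` or `0.0` would be equally defensible, depending on the caller), and the name `calculate_average_reviewed` is hypothetical.

```python
def calculate_average(numbers):
    """Version as it might come out of a first generation pass."""
    return sum(numbers) / len(numbers)


def calculate_average_reviewed(numbers):
    """Version after review: the empty-list case is handled explicitly."""
    if not numbers:
        raise ValueError("numbers must not be empty")
    return sum(numbers) / len(numbers)


print(calculate_average([1, 2, 3]))  # 2.0

try:
    calculate_average([])
except ZeroDivisionError as exc:
    print(f"unreviewed version fails: {exc}")
```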
From the Transformer Architecture Perspective
From a technical perspective, regarding language model internal operations (Vaswani et al., 2017):
- During generation: Causal attention is used. Each token can attend only to the tokens before it; tokens that have not been generated yet cannot influence it (this is what causal masking enforces)
- During evaluation: In decoder-only models such as GPT the mask is still causal, but the completed code now sits in the prompt, so every token of the review can attend to the entire code, from its first line to its last
Causal attention is the mechanism used in autoregressive models like GPT: it masks attention so that each token can only access the tokens before it. This is essential for keeping future information out of each prediction, but it also means that, while code is being written, no decision can take later lines into account.
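A minimal sketch of a causal mask in plain Python may make this concrete. The token lists are invented for illustration, and real models work on sub-word tokens and numeric attention weights rather than booleans. The sketch also shows one way to read the review case in a decoder-only model: the mask itself does not change, but review tokens come after the code and can therefore see all of it.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


code_tokens = ["return", "sum(numbers)", "/", "len(numbers)"]
review_tokens = ["Empty", "list", "?"]
sequence = code_tokens + review_tokens
mask = causal_mask(len(sequence))

# While the code was being generated, "return" could not see "len(numbers)".
print(mask[0][3])  # False

# The first review token comes after the code, so it can see all of it.
first_review = len(code_tokens)
print(all(mask[first_review][j] for j in range(len(code_tokens))))  # True
```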
Key Insights from Latest Research
Everything so far has been an explanation from operating principles, but it is just as important to understand what recent research shows.
The Truth About “Recognition is Easier Than Avoidance” Hypothesis
The survey by Kamoi et al. (2024), “When Can LLMs Actually Correct Their Own Mistakes?”, reports important findings about self-correction.
The hypothesis that recognizing errors is easier than avoiding them (Saunders et al., 2022) has long been foundational to self-correction research, but the survey concludes:
This hypothesis is only conditionally true.
Main Research Findings
Pure self-correction is limited
Without external tools or feedback, LLMs prompted simply to correct their own mistakes have not been shown to succeed on general tasks.
Only effective for tasks where verification is easier
Self-correction works only for specific tasks where “the verification task is significantly easier than the original task.”
External feedback is important
When external feedback exists—code execution results, unit test results, compiler errors, etc.—self-correction works effectively.
Fine-tuning effects
Large-scale fine-tuning can improve self-correction ability.
Reality of Self-Correction in Code Generation
Taken together, the work of Pan et al. (2024) and Chen et al. (2024) suggests the following picture for code generation:
Successful Cases:
- Unit test execution results exist
- Compiler error messages exist
- Linter feedback exists
- Runtime error traces exist
Limited Cases:
- Requesting review with simple prompts only (the phenomenon explained in this article)
- Obvious syntax errors or logic mistakes
Failure Cases:
- Complex logic errors
- Performance problems
- Security vulnerabilities
- Self-discovery without external feedback
Implications for Practice
From these research results, the following understanding is important for practice:
- Simple review requests have limited effect - The phenomenon explained in this article exists, but it’s not omnipotent
- Importance of external verification - Objective feedback from test execution, compilation, static analysis is essential
- Value of iteration - Quality improves by cycling through generate → verify → fix
Practical Usage Methods
Here are workflows usable in practice, based on research insights and technical understanding.
Basic Workflow
1. Generate code
↓
2. Run automated tests/static analysis
↓
3. Feed back results to AI
↓
4. Request AI review + fix
↓
5. Repeat as needed
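The workflow above can be sketched as a bounded loop. Everything model-related here is a hypothetical stand-in: `fake_generate` and `fake_fix` replace real model calls so that the sketch runs on its own, and `verify_average` is a deliberately small external check. `max_rounds` is an assumption added so the loop cannot cycle forever.

```python
def generate_verify_fix(generate, verify, fix, max_rounds=3):
    """Cycle generate -> verify -> fix until verify passes or rounds run out.

    generate() -> str, verify(code) -> error text or None, fix(code, feedback) -> str.
    Returns (code, feedback); feedback is None when the last verification passed.
    """
    code = generate()
    feedback = verify(code)
    rounds = 0
    while feedback is not None and rounds < max_rounds:
        code = fix(code, feedback)
        feedback = verify(code)
        rounds += 1
    return code, feedback


# Stand-ins so the sketch runs without a model; all of them are hypothetical.
BUGGY = "def calculate_average(numbers):\n    return sum(numbers) / len(numbers)\n"
FIXED = (
    "def calculate_average(numbers):\n"
    "    if not numbers:\n"
    "        raise ValueError('numbers must not be empty')\n"
    "    return sum(numbers) / len(numbers)\n"
)


def fake_generate():
    return BUGGY


def fake_fix(code, feedback):
    # A real model would read the feedback; this stub just swaps in the fixed version.
    return FIXED if "ZeroDivisionError" in feedback else code


def verify_average(code):
    """External check: execute the code and probe two inputs.

    exec() on untrusted model output is unsafe; use real isolation in practice.
    """
    namespace = {}
    try:
        exec(code, namespace)
        average = namespace["calculate_average"]
        assert average([2, 4]) == 3
        try:
            average([])
        except ValueError:
            pass  # expected behaviour for empty input in this sketch
        else:
            raise AssertionError("expected ValueError for empty input")
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


final_code, remaining_feedback = generate_verify_fix(fake_generate, verify_average, fake_fix)
print(remaining_feedback)  # None
```

Returning the remaining feedback instead of raising leaves the decision to the caller: if it is still not `None` after `max_rounds`, that is the signal to hand the problem to a human or to reconsider the requirements.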
Implementation Methods by Level
Level 1: Basic Review Request (Limited Effect)
Prompt example:
"Review this code and point out problems.
Especially from edge case, error handling, and performance perspectives."
Problems it can be expected to catch:
- Obvious syntax errors
- Simple logic mistakes
- Basic edge case oversights
Problems it cannot be expected to catch:
- Complex logic errors
- Performance problems
- Security vulnerabilities
Level 2: External Feedback Utilization (Recommended)
# ai.generate, ai.improve and run_tests are placeholders, not a real library API
# 1. Code generation
code = ai.generate("Function to calculate list average")

# 2. Run tests
test_result = run_tests(code)

# 3. Feed back results
improved_code = ai.improve(
    code=code,
    feedback=f"""
Test results:
{test_result}
If there are errors, please fix them.
""",
)
Problems it can be expected to catch:
- Bugs that the tests actually exercise
- Compile errors
- Runtime errors
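The `run_tests` helper in the Level 2 snippet is left undefined. One possible shape for it, using only the standard library, is sketched below. The signature is an assumption: it takes the tests as a second argument, which the snippet above does not show, and it returns plain text so the result can be pasted into a prompt. A subprocess is not a sandbox, so generated code should still be run in an isolated environment.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tests(code: str, test_code: str, timeout: float = 10.0) -> str:
    """Run code followed by test_code in a fresh interpreter; return feedback text."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code + "\n" + test_code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return f"TIMEOUT after {timeout} seconds"
    if result.returncode == 0:
        return "PASSED"
    # The traceback on stderr is the external feedback the model needs.
    return "FAILED\n" + result.stderr


code = "def calculate_average(numbers):\n    return sum(numbers) / len(numbers)\n"
tests = "assert calculate_average([2, 4]) == 3\ncalculate_average([])\n"
print(run_tests(code, tests))
```

For the example input the returned text starts with `FAILED` and contains the `ZeroDivisionError` traceback, which is exactly the kind of objective signal a plain review request lacks.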
Level 3: CI/CD Pipeline Integration (Full Production)
# GitHub Actions example (assumes ai-generate and ai-review CLIs exist on the runner)
- name: Generate Code
  run: ai-generate --prompt "$"
- name: Run Tests
  id: tests
  run: pytest tests/ --junitxml=test-results.xml
  continue-on-error: true
- name: Static Analysis
  run: |
    pylint src/ > pylint-results.txt
    mypy src/
- name: AI Review and Fix
  # failure() alone would not fire for the test step, because continue-on-error hides its failure
  if: failure() || steps.tests.outcome == 'failure'
  run: |
    ai-review \
      --code src/ \
      --test-results test-results.xml \
      --lint-results pylint-results.txt \
      --auto-fix
- name: Re-run Tests
  run: pytest tests/
Effective Prompt Design
Effective prompt composition based on research results:
[Context]
The following code is a function to execute ○○.
[Code]
{generated code}
[External Feedback] (Important!)
Test results: {test_output}
Static analysis results: {lint_output}
Execution results: {execution_output}
[Request]
1. Please analyze the above feedback
2. If there are problems, identify the cause
3. Provide fixed code
4. Briefly explain the fixes
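The template above can be filled in programmatically. The sketch below is one way to do it; `build_review_prompt`, its parameter names, and the `(not run)` default are assumptions introduced here, not part of any particular tool.

```python
PROMPT_TEMPLATE = """[Context]
The following code is a function to {purpose}.

[Code]
{code}

[External Feedback]
Test results: {test_output}
Static analysis results: {lint_output}
Execution results: {execution_output}

[Request]
1. Please analyze the above feedback
2. If there are problems, identify the cause
3. Provide fixed code
4. Briefly explain the fixes
"""


def build_review_prompt(
    purpose,
    code,
    test_output="(not run)",
    lint_output="(not run)",
    execution_output="(not run)",
):
    """Fill the template; missing feedback is marked explicitly rather than omitted."""
    return PROMPT_TEMPLATE.format(
        purpose=purpose,
        code=code,
        test_output=test_output,
        lint_output=lint_output,
        execution_output=execution_output,
    )


prompt = build_review_prompt(
    purpose="calculate the average of a list",
    code="def calculate_average(numbers):\n    return sum(numbers) / len(numbers)",
    test_output="ZeroDivisionError: division by zero (input: [])",
)
print(prompt)
```

Marking absent feedback as `(not run)` keeps the prompt honest about which checks were actually performed, so the model is not invited to assume a clean result.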
Anti-patterns
Usages to avoid:
❌ Simple review request only
"Is there any problem with this code?"
→ Only limited effect
❌ No external verification
Generate → Review → Done
→ Complex problems are missed
❌ Infinite loop
Generate → Review → Fix → Review → Fix → ...
→ If not improving, reconsider requirements
✅ Recommended pattern
Generate → Run tests → Feed back results → Review + Fix → Re-test
→ Include external verification
Tool Usage Examples
Combinations of tools actually usable:
Development Environment:
- GitHub Copilot / Cursor / Claude Code: Generation
- pytest / jest: Testing
- pylint / ESLint: Static analysis
- AI API: Review + Fix
Automation:
- GitHub Actions / GitLab CI: CI/CD
- pre-commit hooks: Pre-commit checks
- Renovate: Dependency updates
Summary
The phenomenon where AI makes mistakes during coding but can catch them during review is a characteristic derived from language model operational principles.
Technical Understanding
- Generation Phase: Autoregressive, sequential, local optimization
- Evaluation Phase: Holistic view, bidirectional reasoning, pattern matching
- Transformer: Constraints from causal attention
Insights from Research
- Pure self-correction capability has limits (Kamoi et al., 2024)
- External feedback is effective (Pan et al., 2024; Chen et al., 2024)
- Only effective for tasks where verification is easy
Practical Usage
Key points for effective workflow:
- Separate generation and evaluation - Execute with different prompts
- Make external verification mandatory - Tests, static analysis, execution confirmation
- Feedback loop - Improve based on results
- Understand limitations - Humans review complex problems
This is similar to the difference between writing and proofreading for humans. While writing, you go with the flow, but when reading back later, you notice obvious typos. Complex logical errors, however, may still be missed. In the same way, AI finds problems reliably only when it has some means of external verification.
Most Important Point for Practice: Don’t rely solely on simple review requests—by combining objective feedback from test execution and linting, you can maximize AI’s self-correction ability.
References
Academic Papers
When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs - Kamoi, R., Zhang, Y., Zhang, N., Han, J., & Zhang, R. (2024). Transactions of the Association for Computational Linguistics, 12, 1417–1440. [Reliability: High]
Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Automated Correction Strategies - Pan, L., Saxon, M., Xu, W., Nathani, D., Wang, X., & Wang, W. Y. (2024). Transactions of the Association for Computational Linguistics, 12, 484–506. [Reliability: High]
An Empirical Study on Self-correcting Large Language Models for Data Science Code Generation - Thai, T. T. Q., Duc, H. M., Tho, Q. T., & Nguyen-Duc, A. (2024). arXiv preprint arXiv:2408.15658. [Reliability: High]
Attention is all you need - Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Advances in Neural Information Processing Systems, 30. [Reliability: High]
Self-critiquing models for assisting human evaluators - Saunders, W., Yeh, C., Wu, J., Bills, S., Ouyang, L., Ward, J., & Leike, J. (2022). arXiv preprint arXiv:2206.05802. [Reliability: High]
Self-Refine: Iterative Refinement with Self-Feedback - Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Welleck, S., Majumder, B. P., Gupta, S., Yazdanbakhsh, A., & Clark, P. (2023). Advances in Neural Information Processing Systems, 36. [Reliability: High]
The content of this article is based on research as of October 2025. As AI technology and research are advancing rapidly, we recommend also referring to the latest papers.