Why AI Makes Mistakes During Coding But Can Catch Them During Review: Research and Technical Explanation

Posted Oct 23, 2025

AI-Generated Content

This article was generated by AI. The accuracy of the content is not guaranteed, and we accept no responsibility for any damages resulting from use of this article. By continuing to read, you agree to the Terms of Use.

Overview

When AI writes code, errors and inefficient implementations sometimes slip in. Interestingly, when you have the same AI review that code, it often discovers the problems and suggests fixes itself. This article explains the mechanisms behind this phenomenon based on the latest research findings (Kamoi et al., 2024; Pan et al., 2024) and the technical operational principles of Transformer architecture. Target readers are software engineers using AI in development work.

About This Article

This article is a technical explanation integrating insights from the latest research on LLM self-correction (2024-2025) with the technical operational principles of Transformer architecture. It combines research-established facts with technical interpretation to explain practical usage methods.

Target Readers: Software engineers using AI in development work

Article Structure:

Overview of the phenomenon and factors
Technical operational principle explanation
Key insights from latest research
Practical usage methods

Overview of the Phenomenon: Why Does AI’s Accuracy Differ Between “Writing” and “Reviewing”?

First, let’s organize the main factors behind this phenomenon.

1. Context Window and Attention Distribution

During coding, the model has to satisfy many requirements at once: architecture design, logic implementation, syntax accuracy, edge case handling, performance optimization, naming convention consistency, and more. With attention spread across all of them, small mistakes are easy to overlook.

On the other hand, during review, it can concentrate on analyzing existing code, allowing more careful observation of details.

2. Difference Between Generation Phase and Evaluation Phase

This is the most important point. The AI’s internal processing differs fundamentally when generating code versus evaluating it. Details follow below.

3. Prompt Explicitness

“Write code” is a relatively open-ended request, but “Review this code” has a clear purpose of finding errors and improvements. This purpose clarity focuses AI’s attention on problem discovery.

4. Difficulty Difference Between Verification and Generation Tasks

During generation, “creating correct code” is a difficult task, while during evaluation, “finding error patterns” is often an easier one. The research discussed later supports this, but only under certain conditions.

5. Possibility of Step-by-Step Reasoning

During review, it can oversee the entire code and reason step by step:

What is this code trying to do?
What is it actually doing?
Are there gaps or problems?

During generation, the model progresses sequentially, which makes this bird’s-eye perspective difficult to maintain.

Technical Explanation: Difference Between Generation and Evaluation Phases

From here, we’ll technically explain the differences between generation and evaluation based on Transformer architecture operational principles.

Generation Phase Mechanism

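Generative language models operate autoregressively. This is a characteristic of the decoder side of the Transformer architecture (Vaswani et al., 2017), which GPT-style models are built on; encoder-only models such as BERT do not generate text this way.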

1. Sequential Prediction

One token at a time, the model computes a probability distribution over the next token and picks one from it, often the most probable.

  
def calculate_sum(  →  'numbers' is most likely next
                   →  ')' comes next
                   →  ':' comes next...

Already generated tokens become context for the next prediction.
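The loop below is a minimal sketch of this sequential prediction. The probability table is made up for illustration and looks only at the last token, whereas a real model conditions on the entire prefix; the names NEXT_TOKEN_PROBS and greedy_decode are hypothetical.

```python
# Toy next-token table with invented probabilities.
# A real model conditions on the whole prefix, not just the last token.
NEXT_TOKEN_PROBS = {
    "def": {"calculate_sum": 0.9, "main": 0.1},
    "calculate_sum": {"(": 1.0},
    "(": {"numbers": 0.8, ")": 0.2},
    "numbers": {")": 0.9, ",": 0.1},
    ")": {":": 1.0},
    ":": {"<end>": 1.0},
}


def greedy_decode(prompt, max_tokens=10):
    """Append the most probable next token until <end> or the token limit."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        candidates = NEXT_TOKEN_PROBS.get(tokens[-1])
        if not candidates:
            break
        next_token = max(candidates, key=candidates.get)
        if next_token == "<end>":
            break
        # The chosen token becomes context for the next prediction.
        tokens.append(next_token)
    return tokens


print(greedy_decode(["def"]))
# ['def', 'calculate_sum', '(', 'numbers', ')', ':']
```

Each iteration only ever appends; nothing in the loop goes back to revise a token that has already been emitted.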

2. Path Dependency (Technical Interpretation)

Once a token is generated, that token becomes input for the next prediction. This creates “path dependency” where initial choices influence subsequent generation.

Example: If the model starts writing a variable name as total_sum, it tends to keep using that name in subsequent code. Even if total_count turns out to be more appropriate partway through, the model tends to stay consistent with what it has already generated.

3. Local Optimization (Technical Interpretation)

At each step, it makes “the choice that seems best at this moment,” but that’s not necessarily optimal overall. It’s prone to “seeing the trees but not the forest.”

4. Chain of Commitment (Example)

  
# Starting to write code with wrong assumption...
def process_data(items):
    result = []  # Committed to using list
    for item in items:
        result.append(item * 2)  # Continues with list assumption
    return result[0]  # Logic error here

The initial choice of result = [] dictates subsequent code structure. Even though processing all elements but returning only the first is contradictory, it gets pulled along by the already-generated flow.

Evaluation Phase Mechanism

On the other hand, a different approach becomes possible during review.

1. Complete Context

The entire code already exists, so the model can attend to all of it at once. The relationship between start and end, input and output, can be examined together.

2. Bidirectional Reasoning (Technical Interpretation)

Thought process during review:
"What does this function do?"     →  Look at function name and docstring
"Does implementation match intent?"→  Trace the logic
"What about edge cases?"          →  Check conditional branches
"Are variable names consistent?"  →  Scan the whole thing

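Because the full code is in context, the review can check a later line against an earlier one (for example, a return statement against the function name), rather than only proceeding front to back.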

3. Constraint Verification

There’s a clear goal of “this code should do X,” so the model can evaluate the code against that goal. “Is this correct?” during evaluation is often an easier judgment than “What should I write?” during generation.
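As a loose analogy (not an example taken from the cited research), checking a result against a stated goal can be much simpler than producing it. The sketch below verifies that a candidate list is a correctly sorted version of the input without implementing any sorting itself; is_valid_sort is a hypothetical name.

```python
from collections import Counter


def is_valid_sort(original, candidate):
    """Check that candidate is a sorted rearrangement of original."""
    # Same elements with the same multiplicities.
    same_items = Counter(original) == Counter(candidate)
    # Every element is <= its successor.
    in_order = all(a <= b for a, b in zip(candidate, candidate[1:]))
    return same_items and in_order
```

The checker needs only two simple conditions, while a generator would have to construct an ordering that satisfies both.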

4. Pattern Matching

The model can concentrate on finding known bug patterns:

off-by-one error
null pointer exception
type mismatch
division by zero

Comparison with Concrete Examples

Failure Example in Generation Phase

Prompt: “Write a function that calculates the average of a list”

AI’s Processing (Simplified):

→ Need def
→ Function name... calculate_average
→ Arguments... numbers
→ Calculate sum... sum(numbers)
→ Divide by length... / len(numbers)
→ Return... return
→ Done!

In this process, the question “what if numbers is an empty list?” can be missed, because each step concentrates on generating “the correct next token.”

Discovery in Evaluation Phase

Prompt: “Review this code”

AI’s Processing:

→ Read entire function
→ Check input type: list
→ Think about edge cases: empty list?
→ If len(numbers) is 0, ZeroDivisionError!
→ Problem found

During evaluation, looking at the entire code and then searching for “what’s wrong” makes discovery easier.
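The two stages can be sketched as follows. The first function is what the generation steps above would plausibly produce; the second shows one possible fix after review. Raising ValueError is an assumption made here for illustration, and returning a default value would be an equally valid design choice.

```python
def calculate_average(numbers):
    """As generated: correct for non-empty input only."""
    return sum(numbers) / len(numbers)  # ZeroDivisionError when numbers is empty


def calculate_average_reviewed(numbers):
    """After review: the empty-list case is handled explicitly."""
    if not numbers:
        raise ValueError("numbers must not be empty")
    return sum(numbers) / len(numbers)
```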

From the Transformer Architecture Perspective

From a technical perspective, regarding language model internal operations (Vaswani et al., 2017):

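During generation: Each new token is computed with causal attention, from left to right. Tokens that have not been written yet cannot influence the current choice (within the model this is enforced by causal masking)
During evaluation: The completed code is part of the input, so every token of the review can attend to every token of the code, from the first line to the last

Causal attention is the mechanism used in autoregressive models like GPT: a mask ensures each token can only attend to tokens before it. Note that a decoder-only model applies the same mask while reading the code under review, so attention is not literally bidirectional in that phase. The practical difference is that the review is written with the finished code already in context, whereas during generation the end of a function did not yet exist when its beginning was written.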
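A minimal pure-Python sketch of the causal mask itself, assuming a five-token sequence for illustration. Real implementations apply the mask to a matrix of attention scores; the function names here are hypothetical.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def visible_tokens(tokens):
    """For each position, list the tokens it is allowed to attend to."""
    mask = causal_mask(len(tokens))
    return [
        [tok for tok, allowed in zip(tokens, row) if allowed]
        for row in mask
    ]


tokens = ["def", "avg", "(", "xs", ")"]
for tok, seen in zip(tokens, visible_tokens(tokens)):
    print(f"{tok!r:8} attends to {seen}")
```

The first token sees only itself, while the last token sees the whole sequence. Tokens written after a piece of code, such as a review, are in the same position as that last token with respect to the code.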

Key Insights from Latest Research

What we’ve covered so far has been technical operational principle explanation, but understanding the latest research results is also important.

The Truth About “Recognition is Easier Than Avoidance” Hypothesis

The survey by Kamoi et al. (2024), “When Can LLMs Actually Correct Their Own Mistakes?”, reports important findings about self-correction.

The hypothesis that recognizing errors is easier than avoiding them (Saunders et al., 2022) has long been foundational to self-correction research, but the survey concludes:

This hypothesis is only conditionally true.

Main Research Findings

Pure self-correction is limited
Without external tools or feedback, prompting an LLM to correct its own mistakes has not been shown to work reliably on general tasks.
Only effective for tasks where verification is easier
Self-correction works only for specific tasks where “the verification task is significantly easier than the original task.”
External feedback is important
When external feedback exists—code execution results, unit test results, compiler errors, etc.—self-correction works effectively.
Fine-tuning effects
Large-scale fine-tuning can improve self-correction ability.

Reality of Self-Correction in Code Generation

Combining Pan et al. (2024) and Chen et al. (2024) research, in the code generation context:

Successful Cases:

Unit test execution results exist
Compiler error messages exist
Linter feedback exists
Runtime error traces exist

Limited Cases:

Requesting review with simple prompts only (the phenomenon explained in this article)
Obvious syntax errors or logic mistakes

Failure Cases:

Complex logic errors
Performance problems
Security vulnerabilities
Self-discovery without external feedback

Implications for Practice

From these research results, the following understanding is important for practice:

Simple review requests have limited effect - The phenomenon explained in this article exists, but it’s not omnipotent
Importance of external verification - Objective feedback from test execution, compilation, static analysis is essential
Value of iteration - Quality improves by cycling through generate → verify → fix

Practical Usage Methods

Here are workflows usable in practice, based on research insights and technical understanding.

Basic Workflow

1. Generate code
   ↓
2. Run automated tests/static analysis
   ↓
3. Feed back results to AI
   ↓
4. Request AI review + fix
   ↓
5. Repeat as needed
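The workflow above can be sketched as a bounded loop. The generate, verify, and fix callables are placeholders for whatever AI client and checking tools are in use (they are assumptions, not a real API), and max_rounds keeps the cycle from running indefinitely.

```python
from typing import Callable, NamedTuple


class CheckResult(NamedTuple):
    passed: bool
    report: str  # test output, linter messages, tracebacks, ...


def generate_verify_fix(
    task: str,
    generate: Callable[[str], str],
    verify: Callable[[str], CheckResult],
    fix: Callable[[str, str], str],
    max_rounds: int = 3,
) -> tuple[str, bool]:
    """Run generate -> verify -> fix until checks pass or rounds run out."""
    code = generate(task)                # step 1
    for _ in range(max_rounds):
        result = verify(code)            # step 2: tests / static analysis
        if result.passed:
            return code, True
        code = fix(code, result.report)  # steps 3-4: feed back results, request a fix
    # Out of rounds: report whether the last fix actually passes.
    return code, verify(code).passed
```

Returning the pass/fail flag alongside the code lets the caller escalate to a human when the loop ends without success.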

Implementation Methods by Level

Level 1: Basic Review Request (Limited Effect)

Prompt example:
"Review this code and point out problems.
Especially from edge case, error handling, and performance perspectives."

Problems this is likely to catch:

Obvious syntax errors
Simple logic mistakes
Basic edge case oversights

Problems this is unlikely to catch:

Complex logic errors
Performance problems
Security vulnerabilities

Level 2: External Feedback Utilization (Recommended)

  
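# `ai` and `run_tests` are placeholders for your AI client and test runner.

# 1. Code generation
code = ai.generate("Function to calculate list average")

# 2. Run tests and capture the output (pass/fail, error messages, tracebacks)
test_result = run_tests(code)

# 3. Feed the results back and ask for a fix
improved_code = ai.improve(
    code=code,
    feedback=f"""
    Test results:
    {test_result}

    If there are errors, please fix them.
    """
)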

Problems this is likely to catch:

Bugs that the tests actually exercise
Compile errors
Runtime errors
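One possible shape for a test-running helper is sketched below, under the assumption that tests are plain assert statements appended to the generated code. It runs them in a fresh interpreter and returns the captured output so it can be pasted into a follow-up prompt. Generated code is untrusted, so in practice this should run inside a sandbox or container.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tests(code: str, test_code: str, timeout: float = 30.0) -> dict:
    """Run code plus assert-based tests in a fresh interpreter and capture the outcome."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        script.write_text(code + "\n\n" + test_code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return {"passed": False, "output": f"timed out after {timeout} s"}
    # A non-zero exit code means an assertion failed or an exception escaped.
    return {"passed": proc.returncode == 0, "output": proc.stdout + proc.stderr}
```

The traceback in the output field is the kind of external feedback the research above identifies as the signal that makes self-correction effective.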

Level 3: CI/CD Pipeline Integration (Full Production)

  
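# GitHub Actions example (steps fragment; ai-generate and ai-review are placeholder CLIs)
- name: Generate Code
  run: ai-generate --prompt "$"

- name: Run Tests
  id: tests
  # Write a JUnit report so the review step has a file to read
  run: pytest tests/ --junitxml=test-results.xml
  continue-on-error: true

- name: Static Analysis
  id: lint
  continue-on-error: true
  run: |
    pylint src/ > pylint-results.txt
    mypy src/

- name: AI Review and Fix
  # failure() stays false for steps that set continue-on-error, so check outcomes
  if: steps.tests.outcome == 'failure' || steps.lint.outcome == 'failure'
  run: |
    ai-review \
      --code src/ \
      --test-results test-results.xml \
      --lint-results pylint-results.txt \
      --auto-fix

- name: Re-run Tests
  run: pytest tests/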

Effective Prompt Design

Effective prompt composition based on research results:

[Context]
The following code is a function to {purpose}.

[Code]
{generated code}

[External Feedback] (Important!)
Test results: {test_output}
Static analysis results: {lint_output}
Execution results: {execution_output}

[Request]
1. Please analyze the above feedback
2. If there are problems, identify the cause
3. Provide fixed code
4. Briefly explain the fixes
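A small helper can fill this template so that the external feedback section is never silently left out. This is a sketch: build_review_prompt and its parameter names are hypothetical, and feedback that was not collected is marked explicitly rather than omitted.

```python
def build_review_prompt(purpose, code, test_output="", lint_output="", execution_output=""):
    """Fill the review template; uncollected feedback is marked as not available."""
    def or_missing(text):
        return text.strip() or "(not available)"

    return "\n".join([
        "[Context]",
        f"The following code is a function to {purpose}.",
        "",
        "[Code]",
        code.strip(),
        "",
        "[External Feedback]",
        f"Test results: {or_missing(test_output)}",
        f"Static analysis results: {or_missing(lint_output)}",
        f"Execution results: {or_missing(execution_output)}",
        "",
        "[Request]",
        "1. Please analyze the above feedback",
        "2. If there are problems, identify the cause",
        "3. Provide fixed code",
        "4. Briefly explain the fixes",
    ])


prompt = build_review_prompt(
    purpose="calculate the average of a list",
    code="def avg(xs):\n    return sum(xs) / len(xs)",
    test_output="ZeroDivisionError: division by zero (input: [])",
)
print(prompt)
```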

Anti-patterns

Usages to avoid:

❌ Simple review request only

"Is there any problem with this code?"
→ Only limited effect

❌ No external verification

Generate → Review → Done
→ Complex problems are missed

❌ Infinite loop

Generate → Review → Fix → Review → Fix → ...
→ If not improving, reconsider requirements

✅ Recommended pattern

Generate → Run tests → Feed back results → Review + Fix → Re-test
→ Include external verification

Tool Usage Examples

Combinations of tools actually usable:

Development Environment:

GitHub Copilot / Cursor / Claude Code: Generation
pytest / jest: Testing
pylint / ESLint: Static analysis
AI API: Review + Fix

Automation:

GitHub Actions / GitLab CI: CI/CD
pre-commit hooks: Pre-commit checks
Renovate: Dependency updates

Summary

The phenomenon where AI makes mistakes during coding but can catch them during review is a characteristic derived from language model operational principles.

Technical Understanding

Generation Phase: Autoregressive, sequential, local optimization
Evaluation Phase: Holistic view, bidirectional reasoning, pattern matching
Transformer: Constraints from causal attention

Insights from Research

Pure self-correction capability has limits (Kamoi et al., 2024)
External feedback is effective (Pan et al., 2024; Chen et al., 2024)
Only effective for tasks where verification is easy

Practical Usage

Key points for effective workflow:

Separate generation and evaluation - Execute with different prompts
Make external verification mandatory - Tests, static analysis, execution confirmation
Feedback loop - Improve based on results
Understand limitations - Humans review complex problems

This is similar to the difference between writing and proofreading text as a human. While writing, you go with the flow; when reading back later, you notice obvious typos, yet complex logical errors may still be missed. Likewise, AI finds problems far more reliably when it has external means of verification.

Most Important Point for Practice: Don’t rely solely on simple review requests—by combining objective feedback from test execution and linting, you can maximize AI’s self-correction ability.

Part 2: Code Review in the AI Coding Era: Organizational-Level Challenges and Countermeasures

References

Academic Papers

When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs - Kamoi, R., Zhang, Y., Zhang, N., Han, J., & Zhang, R. (2024). Transactions of the Association for Computational Linguistics, 12, 1417–1440. [Reliability: High]
Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Automated Correction Strategies - Pan, L., Saxon, M., Xu, W., Nathani, D., Wang, X., & Wang, W. Y. (2024). Transactions of the Association for Computational Linguistics, 12, 484–506. [Reliability: High]
An Empirical Study on Self-correcting Large Language Models for Data Science Code Generation - Thai, T. T. Q., Duc, H. M., Tho, Q. T., & Nguyen-Duc, A. (2024). arXiv preprint arXiv:2408.15658. [Reliability: High]
Attention is all you need - Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Advances in Neural Information Processing Systems, 30. [Reliability: High]
Self-critiquing models for assisting human evaluators - Saunders, W., Yeh, C., Wu, J., Bills, S., Ouyang, L., Ward, J., & Leike, J. (2022). arXiv preprint arXiv:2206.05802. [Reliability: High]
Self-Refine: Iterative Refinement with Self-Feedback - Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Welleck, S., Majumder, B. P., Gupta, S., Yazdanbakhsh, A., & Clark, P. (2023). Advances in Neural Information Processing Systems, 36. [Reliability: High]

The content of this article is based on research as of October 2025. As AI technology and research are advancing rapidly, we recommend also referring to the latest papers.

This post is licensed under CC BY 4.0 by the author.