Why AI Makes Mistakes During Coding But Can Catch Them During Review: Research and Technical Explanation

Overview

When AI writes code, errors and inefficient implementations sometimes slip in. Interestingly, when the same AI is asked to review that code, it often finds the problems and suggests fixes. This article explains the mechanisms behind this phenomenon, drawing on recent research (Kamoi et al., 2024; Pan et al., 2024) and on how the Transformer architecture operates.

About This Article

This article is a technical explanation that combines findings from recent research on LLM self-correction (2024-2025) with the operating principles of the Transformer architecture. It uses both research-established facts and technical interpretation to explain practical usage.

Target Readers: Software engineers using AI in development work

Article Structure:

  • Overview of the phenomenon and factors
  • Technical operational principle explanation
  • Key insights from latest research
  • Practical usage methods

Overview of the Phenomenon: Why Does AI’s Accuracy Differ Between “Writing” and “Reviewing”?

First, let’s organize the main factors behind this phenomenon.

1. Context Window and Attention Distribution

During coding, the model has to satisfy many concerns at once: architecture design, logic implementation, syntax accuracy, edge case handling, performance optimization, naming convention consistency, and more. With so many competing concerns, small mistakes are easy to overlook.

During review, on the other hand, the only task is analyzing existing code, which allows closer attention to details.

2. Difference Between Generation Phase and Evaluation Phase

This is the most important point. The AI’s internal processing differs fundamentally when generating code versus evaluating it. Details follow below.

3. Prompt Explicitness

“Write code” is a relatively open-ended request, but “Review this code” has a clear purpose of finding errors and improvements. This purpose clarity focuses AI’s attention on problem discovery.

4. Difficulty Difference Between Verification and Generation Tasks

During generation, “creating correct code” is a difficult task, but during evaluation, “finding error patterns” is a relatively easier task. This is also supported by research discussed later.

5. Possibility of Step-by-Step Reasoning

During review, it can oversee the entire code and reason step by step:

  1. What is this code trying to do?
  2. What is it actually doing?
  3. Are there gaps or problems?

During generation, the model progresses sequentially, which makes this bird's-eye perspective difficult to maintain.

Technical Explanation: Difference Between Generation and Evaluation Phases

From here, we’ll technically explain the differences between generation and evaluation based on Transformer architecture operational principles.

Generation Phase Mechanism

Language models such as GPT generate text autoregressively: each token is predicted from the tokens before it. This is how decoder-style Transformer models (Vaswani et al., 2017) produce output.

1. Sequential Prediction

The model produces one token at a time, assigning probabilities to candidate next tokens (words or symbols) and picking a likely one.

def calculate_sum(    'numbers' is most likely next
                     ')' comes next
                     ':' comes next...

Already generated tokens become context for the next prediction.
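
The loop below is a minimal sketch of this sequential prediction. The lookup table and its probabilities are invented for illustration, and it conditions only on the last token, whereas a real model conditions on the whole prefix.

```python
# Toy next-token table with made-up probabilities (an assumption for this
# sketch). A real language model conditions on the entire prefix.
NEXT_TOKEN_PROBS = {
    "def": {"calculate_sum": 0.9, "main": 0.1},
    "calculate_sum": {"(": 1.0},
    "(": {"numbers": 0.8, ")": 0.2},
    "numbers": {")": 0.9, ",": 0.1},
    ")": {":": 1.0},
}


def generate(prompt, max_new_tokens=10):
    """Append the most probable next token until no continuation is known."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        candidates = NEXT_TOKEN_PROBS.get(tokens[-1])
        if candidates is None:  # no known continuation: stop
            break
        # Each new token is chosen using what has already been generated.
        tokens.append(max(candidates, key=candidates.get))
    return tokens


print(generate(["def"]))
# ['def', 'calculate_sum', '(', 'numbers', ')', ':']
```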

2. Path Dependency (Technical Interpretation)

Once a token is generated, that token becomes input for the next prediction. This creates “path dependency” where initial choices influence subsequent generation.

Example: If you start writing a variable name as total_sum, there’s a tendency to keep using that name in subsequent code. Even if partway through you “judge” that total_count would be more appropriate, you try to maintain consistency with already generated content.

3. Local Optimization (Technical Interpretation)

At each step, it makes “the choice that seems best at this moment,” but that’s not necessarily optimal overall. It’s prone to “seeing the trees but not the forest.”
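
A small numeric sketch can make the gap between step-wise and overall choices concrete. The probabilities below are made up, and real decoding usually involves sampling rather than a strict argmax, so treat this as an illustration of the idea only.

```python
from itertools import product

# Made-up probabilities for a two-token sequence (illustrative only).
FIRST = {"A": 0.6, "B": 0.4}
SECOND = {
    "A": {"x": 0.5, "y": 0.5},
    "B": {"x": 0.9, "y": 0.1},
}


def sequence_prob(first, second):
    """Probability of the whole two-token sequence."""
    return FIRST[first] * SECOND[first][second]


def greedy_choice():
    """Pick the best token at each step, never revisiting earlier picks."""
    first = max(FIRST, key=FIRST.get)
    second = max(SECOND[first], key=SECOND[first].get)
    return first, second


def best_overall():
    """Compare every complete sequence and keep the most probable one."""
    return max(product(FIRST, ["x", "y"]), key=lambda pair: sequence_prob(*pair))


greedy = greedy_choice()
best = best_overall()
print(greedy, round(sequence_prob(*greedy), 2))
print(best, round(sequence_prob(*best), 2))
```

Here the step-wise choice commits to A because it looks best at the first step and ends with a sequence probability of 0.30, while the sequence starting with B reaches 0.36.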

4. Chain of Commitment (Example)

# Starting to write code with wrong assumption...
def process_data(items):
    result = []  # Committed to using list
    for item in items:
        result.append(item * 2)  # Continues with list assumption
    return result[0]  # Logic error here

The initial choice of result = [] shapes the code that follows. The function processes every element but returns only the first, which is contradictory, yet each line stays locally consistent with what was already generated.

Evaluation Phase Mechanism

On the other hand, a different approach becomes possible during review.

1. Complete Context

The entire code already exists, and you can make the whole thing the subject of attention at once. You can survey the relationship between start and end, input and output simultaneously.

2. Bidirectional Reasoning (Technical Interpretation)

Thought process during review:
"What does this function do?"     →  Look at function name and docstring
"Does implementation match intent?"→  Trace the logic
"What about edge cases?"          →  Check conditional branches
"Are variable names consistent?"  →  Scan the whole thing

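Because the whole code is already in context, the review can relate a late line (such as the return statement) back to an early one (such as the function name), rather than only reasoning from front to back.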

3. Constraint Verification

There’s a clear goal of “this code should do X,” so you can evaluate against that. “Is this correct?” during evaluation is an easier judgment than “What should I write?” during generation.
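
As a loose analogy for why checking can be easier than producing, consider sorting: confirming that a candidate output is a valid sort needs only a single pass and a multiset comparison. The function name below is hypothetical and chosen for this sketch.

```python
from collections import Counter


def is_valid_sort(original, candidate):
    """Check a claimed sort result without sorting anything ourselves."""
    # Same elements with the same multiplicities.
    same_items = Counter(original) == Counter(candidate)
    # Every neighbouring pair is in non-decreasing order.
    in_order = all(a <= b for a, b in zip(candidate, candidate[1:]))
    return same_items and in_order


print(is_valid_sort([3, 1, 2], [1, 2, 3]))  # True
print(is_valid_sort([3, 1, 2], [1, 3, 2]))  # False: not in order
print(is_valid_sort([3, 1, 2], [1, 2]))     # False: an element is missing
```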

4. Pattern Matching

You can concentrate on finding known bug patterns:

  • off-by-one error
  • null pointer exception
  • type mismatch
  • division by zero

Comparison with Concrete Examples

Failure Example in Generation Phase

Prompt: “Write a function that calculates the average of a list”

AI’s Processing (Simplified):

→ Need def
→ Function name... calculate_average
→ Arguments... numbers
→ Calculate sum... sum(numbers)
→ Divide by length... / len(numbers)
→ Return... return
→ Done!

In this process, the case where numbers is an empty list can be forgotten, because each step concentrates on generating a plausible next token.

Discovery in Evaluation Phase

Prompt: “Review this code”

AI’s Processing:

→ Read entire function
→ Check input type: list
→ Think about edge cases: empty list?
→ If len(numbers) is 0, ZeroDivisionError!
→ Problem found

During evaluation, looking at the entire code and then searching for “what’s wrong” makes discovery easier.
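
In code, the two stages of this example might look as follows. Raising ValueError for an empty list is one possible fix chosen for this sketch; returning a default value would be equally defensible depending on the requirements.

```python
def calculate_average(numbers):
    # Version a first generation pass might produce:
    # raises ZeroDivisionError when numbers is empty.
    return sum(numbers) / len(numbers)


def calculate_average_checked(numbers):
    # Version after review: the empty-list case is handled explicitly.
    if not numbers:
        raise ValueError("numbers must not be empty")
    return sum(numbers) / len(numbers)


print(calculate_average_checked([1, 2, 3]))  # 2.0
```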

From the Transformer Architecture Perspective

From the Transformer perspective (Vaswani et al., 2017), the difference lies in what each newly produced token can attend to:

  • During generation: Each code token is predicted with causal attention over the tokens to its left only. The code that will follow does not exist yet, so it cannot influence the current choice
  • During evaluation: The completed code sits in the prompt, so every token of the review can attend to all of it, from the first line to the last

Causal attention is the mechanism used in autoregressive models such as GPT: a mask ensures each token can only access the tokens before it. The same mask still applies when the model reads code during review, so the model does not switch to bidirectional attention. What changes is that the whole program now lies before the tokens being generated, rather than partly in the future.
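
The masking itself can be sketched in a few lines of plain Python. The uniform scores are a simplifying assumption; real attention scores come from learned query and key projections.

```python
import math


def causal_mask(n):
    # mask[i][j] is True when position i may attend to position j (j <= i).
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Softmax over each row, giving zero weight to masked-out positions."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        exps = [
            math.exp(score) if allowed else 0.0
            for score, allowed in zip(row_scores, row_mask)
        ]
        total = sum(exps)
        weights.append([value / total for value in exps])
    return weights


scores = [[0.0] * 4 for _ in range(4)]  # uniform scores, for illustration
weights = masked_softmax(scores, causal_mask(4))
for row in weights:
    print([round(w, 2) for w in row])
```

The first position can only attend to itself, while the last position spreads its attention over all four. A review token generated after the code is in the position of that last row with respect to the entire program.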

Key Insights from Latest Research

What we’ve covered so far has been technical operational principle explanation, but understanding the latest research results is also important.

The Truth About “Recognition is Easier Than Avoidance” Hypothesis

Kamoi et al. (2024), in “When Can LLMs Actually Correct Their Own Mistakes?”, report important findings about self-correction.

The hypothesis that recognizing errors is easier than avoiding them (Saunders et al., 2022) has long been foundational to self-correction research, but this more recent work shows:

The hypothesis holds only conditionally.

Main Research Findings

  1. Pure self-correction is limited

    Without external tools or feedback, simply prompting an LLM to correct its own mistakes has not been shown to work on general tasks.

  2. Only effective for tasks where verification is easier

    Self-correction works only for specific tasks where “the verification task is significantly easier than the original task.”

  3. External feedback is important

    When external feedback is available (code execution results, unit test results, compiler errors, and so on), self-correction works effectively.

  4. Fine-tuning effects

    Large-scale fine-tuning can improve self-correction ability.

Reality of Self-Correction in Code Generation

Combining Pan et al. (2024) and Chen et al. (2024) research, in the code generation context:

Successful Cases:

  • Unit test execution results exist
  • Compiler error messages exist
  • Linter feedback exists
  • Runtime error traces exist

Limited Cases:

  • Requesting review with simple prompts only (the phenomenon explained in this article)
  • Obvious syntax errors or logic mistakes

Failure Cases:

  • Complex logic errors
  • Performance problems
  • Security vulnerabilities
  • Self-discovery without external feedback

Implications for Practice

From these research results, the following understanding is important for practice:

  1. Simple review requests have limited effect - The phenomenon explained in this article exists, but it’s not omnipotent
  2. Importance of external verification - Objective feedback from test execution, compilation, static analysis is essential
  3. Value of iteration - Quality improves by cycling through generate → verify → fix

Practical Usage Methods

Here are workflows usable in practice, based on research insights and technical understanding.

Basic Workflow

1. Generate code
   ↓
2. Run automated tests/static analysis
   ↓
3. Feed back results to AI
   ↓
4. Request AI review + fix
   ↓
5. Repeat as needed
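
The workflow above can be sketched as a bounded loop. The generate, verify, and fix callables are placeholders for whatever model call and test runner you use; their names and signatures are assumptions of this sketch, not a specific tool's API.

```python
def refine(generate, verify, fix, max_rounds=3):
    """Generate code, then alternate verification and fixing, at most max_rounds times.

    verify(code) must return (ok, feedback); fix(code, feedback) returns new code.
    Returns the final code and whether it passed verification.
    """
    code = generate()
    for _ in range(max_rounds):
        ok, feedback = verify(code)
        if ok:
            return code, True
        code = fix(code, feedback)
    ok, _ = verify(code)  # check the result of the last fix
    return code, ok


# Stub callables standing in for a model and a test runner.
def fake_generate():
    return "v0"


def fake_verify(code):
    return code == "v2", f"{code} fails the tests"


def fake_fix(code, feedback):
    return "v" + str(int(code[1:]) + 1)


print(refine(fake_generate, fake_verify, fake_fix))  # ('v2', True)
```

The max_rounds cap is a design choice: if the loop runs out of rounds without passing, the caller gets the failure back and can escalate to a human.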

Implementation Methods by Level

Level 1: Basic Review Request (Limited Effect)

Prompt example:
"Review this code and point out problems.
Especially from edge case, error handling, and performance perspectives."

Problems this is likely to catch:

  • Obvious syntax errors
  • Simple logic mistakes
  • Basic edge case oversights

Problems this is unlikely to catch:

  • Complex logic errors
  • Performance problems
  • Security vulnerabilities

Level 2: Feeding Back Test Results

# 1. Code generation
code = ai.generate("Function to calculate list average")

# 2. Run tests
test_result = run_tests(code)

# 3. Feed back results
improved_code = ai.improve(
    code=code,
    feedback=f"""
    Test results:
    {test_result}

    If there are errors, please fix them.
    """
)
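
The ai object and run_tests above are placeholders. As one hedged sketch of what run_tests could look like, the version below executes the candidate code and a list of assertion snippets in-process and collects failure messages as feedback text. The tests parameter and the expectation that an empty list averages to 0 are assumptions of this sketch, and exec on model-generated code should only be done inside a sandbox.

```python
import traceback


def run_tests(code, tests):
    """Execute candidate code, then each test snippet; return failure reports."""
    namespace = {}
    try:
        exec(code, namespace)  # only run untrusted code in a sandbox
    except Exception:
        return ["Code failed to load:\n" + traceback.format_exc()]
    failures = []
    for test in tests:
        try:
            exec(test, namespace)
        except Exception as exc:
            failures.append(f"{test!r} raised {type(exc).__name__}: {exc}")
    return failures


candidate = (
    "def calculate_average(numbers):\n"
    "    return sum(numbers) / len(numbers)\n"
)
tests = [
    "assert calculate_average([1, 2, 3]) == 2",
    "assert calculate_average([]) == 0",  # assumed requirement for this sketch
]

feedback = run_tests(candidate, tests)
for line in feedback:
    print(line)
```

The resulting text names the failing test and the exception type, which is the kind of objective signal the research findings above identify as effective.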

Problems this is likely to catch:

  • Bugs that the test suite exercises
  • Compile errors
  • Runtime errors

Level 3: CI/CD Pipeline Integration (Full Production)

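# GitHub Actions example (ai-generate and ai-review are placeholder commands)
- name: Generate Code
  run: ai-generate --prompt "$"

- name: Run Tests
  id: tests
  # Write a JUnit report so the review step has results to read
  run: pytest tests/ --junitxml=test-results.xml
  continue-on-error: true

- name: Static Analysis
  run: |
    pylint src/ > pylint-results.txt
    mypy src/

- name: AI Review and Fix
  # continue-on-error hides the test failure from failure(),
  # so the outcome of the test step is checked as well
  if: failure() || steps.tests.outcome == 'failure'
  run: |
    ai-review \
      --code src/ \
      --test-results test-results.xml \
      --lint-results pylint-results.txt \
      --auto-fix

- name: Re-run Tests
  run: pytest tests/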

Effective Prompt Design

Effective prompt composition based on research results:

[Context]
The following code is a function that does {purpose}.

[Code]
{generated code}

[External Feedback] (Important!)
Test results: {test_output}
Static analysis results: {lint_output}
Execution results: {execution_output}

[Request]
1. Please analyze the above feedback
2. If there are problems, identify the cause
3. Provide fixed code
4. Briefly explain the fixes
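
A small helper can assemble this prompt from collected tool output. The function name, parameter names, and the "(not run)" default are assumptions made for this sketch.

```python
REVIEW_TEMPLATE = """[Context]
{context}

[Code]
{code}

[External Feedback]
Test results: {test_output}
Static analysis results: {lint_output}
Execution results: {execution_output}

[Request]
1. Please analyze the above feedback
2. If there are problems, identify the cause
3. Provide fixed code
4. Briefly explain the fixes
"""


def build_review_prompt(context, code, test_output="(not run)",
                        lint_output="(not run)", execution_output="(not run)"):
    """Fill the review template; missing feedback is marked explicitly."""
    return REVIEW_TEMPLATE.format(
        context=context,
        code=code,
        test_output=test_output,
        lint_output=lint_output,
        execution_output=execution_output,
    )


prompt = build_review_prompt(
    context="The following code is a function that averages a list.",
    code="def calculate_average(numbers):\n    return sum(numbers) / len(numbers)",
    test_output="calculate_average([]) raised ZeroDivisionError",
)
print(prompt)
```

Marking absent feedback as "(not run)" keeps the reviewer from assuming a clean result where no check was executed.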

Anti-patterns

Patterns to avoid:

Simple review request only

"Is there any problem with this code?"
→ Only limited effect

No external verification

Generate → Review → Done
→ Complex problems are missed

Infinite loop

Generate → Review → Fix → Review → Fix → ...
→ If not improving, reconsider requirements

Recommended pattern

Generate → Run tests → Feed back results → Review + Fix → Re-test
→ Include external verification

Tool Usage Examples

Example tool combinations:

Development Environment:

  • GitHub Copilot / Cursor / Claude Code: Generation
  • pytest / jest: Testing
  • pylint / ESLint: Static analysis
  • AI API: Review + Fix

Automation:

  • GitHub Actions / GitLab CI: CI/CD
  • pre-commit hooks: Pre-commit checks
  • Renovate: Dependency updates

Summary

The phenomenon where AI makes mistakes during coding but can catch them during review is a characteristic derived from language model operational principles.

Technical Understanding

  • Generation Phase: Autoregressive, sequential, local optimization
  • Evaluation Phase: Holistic view, bidirectional reasoning, pattern matching
  • Transformer: Constraints from causal attention

Insights from Research

  • Pure self-correction capability has limits (Kamoi et al., 2024)
  • External feedback is effective (Pan et al., 2024; Chen et al., 2024)
  • Only effective for tasks where verification is easy

Practical Usage

Key points for effective workflow:

  1. Separate generation and evaluation - Execute with different prompts
  2. Make external verification mandatory - Tests, static analysis, execution confirmation
  3. Feedback loop - Improve based on results
  4. Understand limitations - Humans review complex problems

This is similar to the difference between writing and proofreading for humans. While writing, you go with the flow; when reading back later, you notice obvious typos, though complex logical errors may still be missed. Likewise, AI finds problems more reliably when it has an external means of verification.

Most Important Point for Practice: Don’t rely solely on simple review requests. Combining them with objective feedback from test execution and linting gets the most out of AI’s self-correction ability.

References

Academic Papers


The content of this article is based on research as of October 2025. As AI technology and research are advancing rapidly, we recommend also referring to the latest papers.

This post is licensed under CC BY 4.0 by the author.