Is "AI Does It, So Checking Can Be Light" True? The More You Delegate to AI, the Heavier Checking Gets

  • Who this is for: Engineers unsure how far they should check AI output, and the tech leads, EMs, and QA leads designing review and QA workflows around an AI-first reality
  • Assumed knowledge: Some hands-on exposure to generative AI for code generation and review assistance
  • Reading time: about 18 min

Overview

“AI does the work, so a quick once-over is enough.” In the AI era, this attitude is spreading everywhere. And it isn’t entirely wrong. If you scrutinized every line AI produces in full detail, you would erase the speed gain that made AI worth using in the first place. Keeping checking light is, by itself, a reasonable demand.

But here is the conclusion. The more you can delegate to AI, the heavier and harder checking actually gets. What AI made cheap is only one step, “implementation” (writing the code), while the work of judging whether the result is fit to ship stays entirely on the human side. The substance of that checking changes, too. This article frames it as four questions: what to check (object), how deep (depth), how (the questions you ask), and who (responsibility). All four require an “axis,” meaning deep expertise.

The most overlooked of the four is “what.” AI doesn’t only produce code. It also produces requirements, specs, and plans. Everyone worries about verifying code, but the verification of the “what to build” that AI hands you slips by unexamined—and getting that wrong does the most damage.

This article is one of three. Whether you can succeed broad-but-shallow without ever building an axis is covered in the main article, Why the Axis-less Generalist Hits a Ceiling. The general “AI means you don’t need an axis” argument is examined in the sibling article, Where “AI Means You Don’t Need an Axis” Falls Apart.

Checking can’t be skipped—cut corners and you crash, check everything and nothing moves

First, notice that checking is squeezed between two traps.

Cut corners and you crash. This is not a matter of weak willpower. Scrutiny slackens in the presence of automation because of how human attention works. Automation research calls this “automation bias” and “complacency.” According to an integrative review, the tendency appears even in experts and cannot be fully prevented by training or warnings [1]. In a classic flight-simulator experiment, having an automated aid that was “highly but not perfectly reliable” actually produced worse monitoring performance than having no aid at all [2]. In clinical decision support, too, incorrect advice flipped 5.2% of decisions that had been correct to incorrect [3]. In software, subjects using an AI assistant wrote more vulnerable code and were more likely to mistakenly believe they had written it securely [4]. As Bainbridge put it in 1983 in “Ironies of Automation,” the more a person is relegated to the monitoring role, the weaker their ability to catch rare anomalies becomes [5].

But you can’t check everything deeply, either. In an RCT with experienced developers, using AI made them 19% slower to finish. The effort of re-verifying AI output against their own standards ate up the speed of generation [6]. In developer surveys, the biggest frustration is fixing output that is “almost right but subtly off” [7]; in another survey, four in ten said reviewing AI code is more work than reviewing human code, and fewer than half always verify it [8]. “Look carefully at everything” is an ideal; rules that can’t be kept get skipped in practice, and you’re right back to cutting corners.

The two traps face each other. Cut corners and you miss things. Try to see everything and nothing moves. So checking is not a question of “do it or don’t”; it becomes a problem of design: what, how deep, how, and who. Let’s take each in turn.

① What to check—not just code, but the “requirements and specs” AI produces

The first question is the most overlooked. The object of checking is not just code.

AI now handles the steps before writing code—organizing requirements, drafting specs, proposing an implementation plan. It comes back with “let’s build this feature,” “let’s go with this design.” The problem is that accepting it without verification means building the wrong thing correctly—being correctly wrong.

Errors in these upstream steps are dangerous for two reasons. First, upstream errors get amplified downstream. A bug in code is usually a malfunction in a single feature, but an error in requirements or specs sends the whole product running in the wrong direction. This is old wisdom in software engineering. Second, upstream work is far harder to verify. Code can be partially verified by “does it run” and “do the tests pass.” But no test or CI pipeline holds a standard for the “correctness” of requirements or specs; that standard lives only inside the person reviewing (the axis). Throw a vague instruction at AI and it fills the gaps with its own assumptions, returning a plausible but off-target result [9]. Only someone who understands the domain and the product can ask “is this requirement really what users want?” or “is this edge case covered by the spec?”

This is where the “axis at both ends” picture from the sibling article comes in. Verifying AI-generated code requires a technical axis; verifying AI-generated requirements, specs, and plans requires a product/domain axis. Once AI produces both ends, both ends need checking, and each needs a different axis. Everyone worries about reviewing code. But the review of the “what to build” that AI hands you is the most overlooked, and missing it does the most damage.

② How deep to check—allocate depth by risk

Even once the object is set, you can’t examine everything at the same depth (as noted above, that doesn’t scale). So allocate depth according to risk. This is not a novel idea; it is standard in the regulatory world. The EU AI Act sorts systems into four risk tiers and graduates obligations according to risk [10]. FDA guidance on software used in medical-device production and quality systems has likewise shifted from uniform validation to a risk-based approach [11].

But when you try to apply this on the ground in software, you hit an awkward fact. “Depth of checking” and “magnitude of risk” do not map cleanly onto each other. A change to authentication or payments—plainly high-risk—can need only shallow checking if it’s “one line guarded by the type system.” Conversely, even a mere documentation update can take down production if a sample snippet meant to be copy-pasted is wrong. The label and the danger actually carried often diverge.

As a starting point, here are signals to raise your depth of checking. I deliberately avoid a complete answer key—for reasons given below.

  • It can’t be undone: deletes, money transfers, sends, rewrites of production data
  • It touches money, authentication, authorization, or personal data
  • It feeds external input into processing without validation
  • An AI-suggested library or API you don’t recognize (is a nonexistent dependency slipped in?)
  • It changes the behavior of a broad existing surface (shared modules, configuration)

Conversely, a change that is guarded by types and tests, has no side effects, is local, and is instantly reversible can be waved through. Of a hundred lines AI produced, only the few that trip a signal above need a deep read, and the ability to narrow it down like that is what lets speed and safety coexist.
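To make the signals concrete, here is a minimal Python sketch of a triage helper. Everything in it is an assumption made for illustration: the `Change` fields, the three depth labels, and the rule that any single raised signal suggests a deep read. It mirrors the bullets above and produces a suggestion that the reviewer is expected to override, not a verdict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """Hypothetical description of one AI-generated change, filled in by the reviewer."""
    # Signals that raise depth (one per bullet above).
    irreversible: bool = False                # deletes, transfers, sends, production rewrites
    touches_sensitive_area: bool = False      # money, authentication, authorization, personal data
    unvalidated_external_input: bool = False
    unfamiliar_dependency: bool = False
    broad_surface: bool = False               # shared modules, configuration
    # Conditions that together allow a light pass.
    guarded_by_types_and_tests: bool = False
    side_effect_free: bool = False
    local: bool = False
    instantly_reversible: bool = False


RAISE_SIGNALS = (
    "irreversible",
    "touches_sensitive_area",
    "unvalidated_external_input",
    "unfamiliar_dependency",
    "broad_surface",
)
WAVE_THROUGH_CONDITIONS = (
    "guarded_by_types_and_tests",
    "side_effect_free",
    "local",
    "instantly_reversible",
)


def suggest_depth(change: Change) -> tuple[str, list[str]]:
    """Return a suggested depth and the signals that triggered it."""
    raised = [name for name in RAISE_SIGNALS if getattr(change, name)]
    if raised:
        return "deep", raised
    if all(getattr(change, name) for name in WAVE_THROUGH_CONDITIONS):
        return "light", []
    return "normal", []


rename = Change(guarded_by_types_and_tests=True, side_effect_free=True,
                local=True, instantly_reversible=True)
refund = Change(irreversible=True, touches_sensitive_area=True,
                guarded_by_types_and_tests=True)
print(suggest_depth(rename))  # ('light', [])
print(suggest_depth(refund))  # ('deep', ['irreversible', 'touches_sensitive_area'])
```

Filling in the fields honestly is the hard part: deciding whether a change really is side-effect free or really trusts external input still requires reading the code.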

But a strong caveat. This is not a checklist; it’s a sample of the thinking. Even in the same situation, where you should look deeply shifts with the product’s scale, its risk tolerance, and how thick the existing tests are. For instance, if you “had AI write a user-registration feature,” what you read deeply is password storage, input validation, behavior on error, and “who can read this data”—while you wave through variable names and log wording. If you’re “recommended an unfamiliar library,” you check whether it actually exists, whether it’s maintained, and its license. But whether the change in front of you “trusts external input” is something you can’t even notice unless you can read the code. Signals are a starting point, not an endpoint.
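For the unfamiliar-library case, part of the first pass can be mechanized. The sketch below uses only Python's standard library to list the top-level modules that a piece of AI-generated code imports and to flag any that are neither in the standard library nor in a team-maintained set of approved names. The `approved` set and the module name `fastjsonx` are hypothetical. The sketch does not contact any package registry, so it cannot confirm that a flagged name exists, is maintained, or has an acceptable license; it only tells you which names to look up.

```python
import ast
import sys


def imported_top_level_modules(source: str) -> set[str]:
    """Collect the top-level module names imported anywhere in the source."""
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 means a relative import inside the project itself.
            if node.level == 0 and node.module:
                modules.add(node.module.split(".")[0])
    return modules


def unrecognized_imports(source: str, approved: set[str]) -> set[str]:
    """Return imports that are neither standard library nor approved."""
    # sys.stdlib_module_names exists on Python 3.10+; on older versions the
    # fallback is empty, so standard-library modules would be flagged as well.
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    return imported_top_level_modules(source) - approved - set(stdlib)


generated = """
import requests
from fastjsonx.core import parse
"""
print(sorted(unrecognized_imports(generated, approved={"requests"})))  # ['fastjsonx']
```

An import name is not always the same as the package name on the registry, so a flagged name is a prompt for a human lookup, not proof of a problem.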

③ How to check—pose questions at the critical points

So what, concretely, is checking? It’s posing questions at the critical points. “Will this authorization check be bypassed if someone impersonates another role?” “Will this routine avoid an N+1 as the row count grows?”—throwing such questions back at the AI output, one by one, is the substance of checking.
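One way to throw such questions back at the output is to write each one as a small executable check. The Python sketch below does this for the two example questions. The functions `can_delete_any_post` and `list_authors`, the `SERVER_SIDE_ROLES` table, and the `CountingStore` class are all hypothetical stand-ins for whatever the AI actually produced; the point is the shape of the question, not these implementations.

```python
# Hypothetical server-side record of roles; a real system would query a database.
SERVER_SIDE_ROLES = {"alice": "admin", "bob": "member"}


def can_delete_any_post(authenticated_user: str, claimed_role: str) -> bool:
    """Decide from the server-side record; the client-supplied role is ignored."""
    return SERVER_SIDE_ROLES.get(authenticated_user) == "admin"


class CountingStore:
    """Toy data store that counts round trips, standing in for a database."""

    def __init__(self, author_by_post: dict[int, str]) -> None:
        self._author_by_post = author_by_post
        self.round_trips = 0

    def authors_for(self, post_ids: list[int]) -> dict[int, str]:
        self.round_trips += 1  # one batched lookup, however many ids
        return {post_id: self._author_by_post[post_id] for post_id in post_ids}


def list_authors(store: CountingStore, post_ids: list[int]) -> dict[int, str]:
    """The routine under review: fetch authors for many posts."""
    return store.authors_for(post_ids)


def round_trips_for(row_count: int) -> int:
    store = CountingStore({i: f"user{i}" for i in range(row_count)})
    list_authors(store, list(range(row_count)))
    return store.round_trips


def test_role_cannot_be_impersonated() -> None:
    # Question: is the check bypassed if someone claims another role?
    assert can_delete_any_post("bob", claimed_role="admin") is False
    assert can_delete_any_post("mallory", claimed_role="admin") is False
    assert can_delete_any_post("alice", claimed_role="member") is True


def test_round_trips_do_not_grow_with_rows() -> None:
    # Question: does the routine avoid an N+1 as the row count grows?
    assert round_trips_for(10) == round_trips_for(1000)


test_role_cannot_be_impersonated()
test_round_trips_do_not_grow_with_rows()
```

Writing the check is mechanical. Knowing that impersonation and query counts are the questions worth asking is not.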

And a good question can only be posed because you know where the mines are buried. A classic large-scale study of code review showed that the core of review lies less in “finding defects” itself than in understanding the change [12]. Review without understanding degrades into rubber-stamping. A person without an axis looks at the same code and no question arises; they pass it with “it works, so it’s fine.” A person with an axis naturally feels a question rise where things are shaky.

That the presence or absence of an axis splits outcomes has been observed in other contexts too. In a field experiment giving AI to entrepreneurs, the strong performers improved their results while the struggling ones actually got worse. The difference lay not in “the advice they received” but in judging what to delegate to AI and where to keep their own judgment [13]. Translate this into checking: a person with an axis correctly draws the line between “wave this through” and “look at this deeply,” while a person without an axis can’t draw it, misjudging the most dangerous spot as “good enough” and waving it by. “AI does it, so checking can be light” is sensible labor-saving when a person with an axis says it, and a shortcut to disaster when a person without one says it. The same words mean opposite things depending on whether the axis is there. So the depth of checking is, in the end, the sharpness and number of questions you can pose.

Is review itself unnecessary?—the answer is the exact opposite depending on the object

Here, let me address a more extreme claim: “code review and technical verification aren’t needed in the first place.” You get the answer wrong if you don’t distinguish what is being reviewed.

  • Heavyweight external approval of human-written code (approval gates outside the development flow, like change advisory boards) has no confirmed effect on reducing change failures and instead slows delivery. If the code was written by a pair, it has already passed a second pair of eyes. Within this scope, “you can skip it” is valid [14].
  • But verification of AI-generated code, requirements, and specs cannot be skipped. It in fact balloons as AI produces more. “AI reviews it, so humans aren’t needed” doesn’t hold either: in one study, AI code review (GPT-4) caught only about one-tenth of the quality problems humans find, remaining a complement, not a replacement [15]. In regulated domains like payments, a separation of duties (you can’t approve your own change) is required as a matter of policy [16].

The tricky part is that many who preach “review is unnecessary” use the correctness of the former (heavy approval is wasteful) as grounds to also skip the latter (verifying AI output) along with it. What you can skip because humans write less is the former; what increases because AI writes more is the latter. What should be skipped and what must not be skipped have gotten swapped.
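The separation-of-duties rule from the second bullet can be expressed as a small merge gate. The Python sketch below is a hedged illustration, not the wording of any standard: the `Approval` type, the `may_merge` function, and the choice that approvals from an AI reviewer do not count toward the required total are assumptions, the last one chosen to reflect the point that AI review is a complement and not a replacement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approval:
    """Hypothetical record of one approval on a change."""
    reviewer: str
    is_human: bool  # False for an automated or AI reviewer


def independent_human_approvers(author: str, approvals: list[Approval]) -> set[str]:
    """Human reviewers other than the author who approved the change."""
    return {a.reviewer for a in approvals if a.is_human and a.reviewer != author}


def may_merge(author: str, approvals: list[Approval], required: int = 1) -> bool:
    """Allow the merge only with enough independent human approvals."""
    return len(independent_human_approvers(author, approvals)) >= required


self_approved = [Approval("dana", is_human=True)]
ai_only = [Approval("review-bot", is_human=False)]
peer_approved = [Approval("review-bot", is_human=False), Approval("eli", is_human=True)]

print(may_merge("dana", self_approved))  # False
print(may_merge("dana", ai_only))        # False
print(may_merge("dana", peer_approved))  # True
```

A gate like this only enforces who approved. Whether the approver actually understood the change is the part no gate can check.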

④ Who decides, and who bears responsibility

The last question moves from the individual to the organization. If you introduce “checking can be light” while leaving it ambiguous who decides how deep to check and who is responsible, a different problem arises.

A study examining 41 policies that mandated “human oversight” of government algorithms reached an ironic conclusion. Humans could not carry out the oversight expected of them, and as a result the oversight requirement could instead serve as a tool to legitimize the adoption of flawed systems and to blur where responsibility lies [17]. A further concept is the “moral crumple zone”: when an automated system errs, moral and legal responsibility is pushed onto the human at the end of the chain, who in reality had almost no control [18].

Where this structure is most exposed is the phrasing you often hear in the AI era: “verification isn’t needed, but you’re still responsible.” It sounds plausible, but it doesn’t hold. To be responsible means being able to catch problems before they ship, and being able to explain afterward “why I judged this acceptable.” Both presuppose the means to verify, and an axis with which to verify. To be saddled with responsibility while verification is taken away is not responsibility—it is a scapegoat offered up when an accident happens. So if you’re going to say “be responsible,” you must hand over, as a set, the means to verify and the axis to wield them.

If you’re going to say “checking can be light” in an organization, there are things to decide as a set. Who draws the bar for depth? Who bears responsibility for that call? Leave this ambiguous while only the slogan spreads, and what’s left on the ground is “checking for show” plus “passing the buck when it matters.”

Summary

  • “AI does it, so checking can be light” is backwards. The more you can delegate to AI, the heavier and harder checking gets. What AI made cheap is only implementation; the judgment of checking stays with humans.
  • Checking is squeezed between two traps. Cut corners and you crash (automation bias) [1][4]; check everything and nothing moves (verification cost) [6]. So design “what, how deep, how, and who.”
  • ① What: the object isn’t just code. The requirements, specs, and plans AI produces are the most overlooked and most fatal. Verifying upstream needs a product/domain axis; downstream needs a technical axis (axis at both ends) [9].
  • ② How deep: allocate depth by risk. But depth ≠ magnitude of risk [10][11]. Signals are a starting point; applying them takes judgment.
  • ③ How: checking is posing questions at the critical points. A good question only arises from an axis that knows where the mines are [12][13].
  • “Review is unnecessary” is the opposite depending on the object: heavyweight external approval can be skipped [14], but verifying AI-generated code and requirements cannot (AI review is only a complement [15]; regulation requires separation of duties [16]). Watch for the swap.
  • ④ Who: taking away verification while imposing responsibility is scapegoating (the moral crumple zone) [17][18]. Decide the bar for depth and where responsibility lies as a set.
  • In the end, the only person who can correctly allocate all four dimensions of checking (what, how deep, how, who) is the one who holds an axis.

References

References are listed in order, corresponding to the citation numbers in the text.

  1. Complacency and Bias in Human Use of Automation: An Attentional Integration - Parasuraman & Manzey, Human Factors 52(3) (2010). DOI: 10.1177/0018720810376055. An empirical integrative review showing that automation complacency/bias appears even in experts, cannot be fully prevented by training, and is pronounced under multitasking load. 【Reliability: High】

  2. Does automation bias decision-making? - Skitka, Mosier, Burdick, International Journal of Human-Computer Studies 51(5) (1999). DOI: 10.1006/ijhc.1999.0252. Monitoring performance degrades in the presence of a “highly but not perfectly reliable” automated aid; demonstrates both omission and commission errors. 【Reliability: High】

  3. Automation bias: a systematic review of frequency, effect mediators, and mitigators - Goddard, Roudsari, Wyatt, JAMIA 19(1) (2012). DOI: 10.1136/amiajnl-2011-000089. A review of clinical decision support; swayed by incorrect advice, 5.2% of prescribing decisions that had been correct flipped to incorrect. 【Reliability: High】

  4. Do Users Write More Insecure Code with AI Assistants? - Perry, Srivastava, Kumar, Boneh, ACM CCS (2023). The AI-assistant group wrote significantly more vulnerable code and was more likely to mistakenly believe it was secure. 【Reliability: High】

  5. Ironies of Automation - Lisanne Bainbridge, Automatica 19(6), 775–779 (1983). DOI: 10.1016/0005-1098(83)90046-8. The classic: the more automation advances, the more important and yet more difficult human monitoring, exception handling, and judgment become. 【Reliability: High】

  6. Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity - METR (2025). An RCT with 16 experienced OSS developers and 246 tasks; completion was 19% slower when using AI. Its small scale and specific context limit how far the result generalizes. 【Reliability: Medium–High】

  7. Stack Overflow Developer Survey 2025: AI - Stack Overflow (2025). About 49,000 respondents; 66% spend time fixing AI output that is “almost right but subtly off.” 【Reliability: Medium–High】

  8. Sonar data reveals critical verification gap in AI coding - Sonar State of Code Survey (2026). About four in ten say reviewing AI code is more work than reviewing human code, and fewer than half always verify it. Note this is a vendor survey. 【Reliability: Medium】

  9. Spec-driven development with AI: Get started with a new open-source toolkit - GitHub Blog (2025). Vague instructions get filled by AI’s own assumptions and the result collapses; frames the spec as the device that names decisions up front. Illustrates the importance of verifying upstream work (requirements/specs). 【Reliability: Medium (vendor development blog)】

  10. EU AI Act: High-level summary - European Union. A risk-based approach that sorts AI systems into four risk tiers (unacceptable/high/limited/minimal) and graduates obligations according to risk. 【Reliability: High (official primary source)】

  11. Computer Software Assurance for Production and Quality System Software - U.S. FDA, Federal Register (2022 draft / finalized September 2025). A shift from uniform validation to a risk-based approach that varies the rigor of assurance with the level of process risk. 【Reliability: High (official guidance)】

  12. Expectations, Outcomes, and Challenges of Modern Code Review - Bacchelli & Bird, ICSE (2013). An empirical study at Microsoft; the core of review is “understanding the change” more than finding defects per se, and it doesn’t work without that understanding. 【Reliability: High】

  13. The Uneven Impact of Generative AI on Entrepreneurial Performance - Otis, Clarke, Delecourt, Holtz, Koning, Harvard Business School Working Paper 24-042. High performers improved, low performers got worse; the difference lay in “judging what to delegate to AI and what to decide oneself.” 【Reliability: Medium–High】

  14. Streamlining Change Approval - Google DORA. External heavyweight change approval (CABs and the like) is not associated with lower change-failure rates and instead has a negative effect on delivery performance. Recommends peer review plus automation, and treats a second review as unnecessary when work is already paired. 【Reliability: Medium–High】

  15. Studying Quality Improvements Recommended via Manual and Automated Code Review - Crupi, Tufano, Bavota, ICPC (2026). An analysis of 240 PRs and 739 comments; AI (GPT-4) caught only about one-tenth of the quality problems humans find, remaining a complement to human review rather than a replacement. 【Reliability: Medium–High】

  16. PCI DSS Requirement 6 - PCI DSS 4.0.1 (commentary). Requires secure code review of custom code that touches cardholder data, and separation of duties (a developer cannot approve their own change). An example where skipping verification is not permitted in a regulated domain. 【Reliability: Medium–High】

  17. The Flaws of Policies Requiring Human Oversight of Government Algorithms - Ben Green, Computer Law & Security Review 45 (2022). An analysis of 41 policies requiring human oversight; oversight does not function and can become a tool to legitimize flawed systems and allow evasion of responsibility. 【Reliability: Medium–High】

  18. Moral Crumple Zones: Cautionary Tales in Human-Robot Interaction - Madeleine Clare Elish, Engaging Science, Technology, and Society 5 (2019). The structure (moral crumple zone) in which, when an automated system malfunctions, responsibility is pushed onto the human at the end of the chain who had almost no control. A conceptual proposal. 【Reliability: Medium】

This post is licensed under CC BY 4.0 by the author.