Why Juniors Should Still Write Code by Hand in the AI Era: A Practical Guide to Avoiding the 17-Point Comprehension Gap
This article was generated by AI. The accuracy of the content is not guaranteed.
- Target audience: Junior software engineers (0-5 years of experience) prioritizing long-term skill formation
- Prerequisites: Basic experience with GitHub Copilot, Cursor, Claude Code, etc.
- Reading time: 13 minutes
Overview
In a world where AI can spit out working code in seconds, is writing code by hand just a waste of time? From a pure productivity standpoint, the usual answer is “juniors should use AI the most”1. The joint MIT/Microsoft RCT reported a +27-39% productivity gain for less-experienced developers1. Looking only at short-term output, hand-coding really does look inefficient.
But in January 2026, Anthropic published an RCT (n=52, mostly juniors) that painted the opposite picture2. When learning a new Python library, the AI-assisted group scored 50% on a comprehension quiz versus 67% for the hand-coding group, a gap of 17 percentage points. The difference was largest on debugging questions, and the AI group’s speed advantage was not statistically significant. The reported effect size (Cohen’s d = 0.738, p = 0.01) sits between the conventional “medium” (0.5) and “large” (0.8) benchmarks, and here it shows up as a comprehension deficit in the AI group.
On top of that, Prather et al. at ICER 2024 observed 21 students and documented three novel metacognitive difficulties under AI assistance, which they call Interruption, Mislead, and a false sense of Progression3. Together these feed an “illusion of competence”: students could produce working code, but couldn’t explain why it worked.
This article, taking the position that long-term skill formation is the top priority, lays out a practical framework for keeping hand-coding at the center and treating AI as limited assistance. Read alongside its counterpart productivity-focused guide and the overall tradeoff breakdown to clarify your own priorities.
What the data says about “learning degradation under AI assistance”
The shock of the Anthropic 2026 RCT
Shen & Tamkin (Anthropic, published January 29, 2026) studied 52 software engineers implementing features in Python’s async library Trio, randomly assigning them to AI-assisted and hand-coding groups2. Participants were Python users with at least one year of weekly usage who were unfamiliar with Trio — the design mimics the real-world situation of “learning a new library on the job.”
The results were clear.
| Metric | AI group | Hand-coding group | Difference |
|---|---|---|---|
| Quiz score (comprehension) | 50% | 67% | −17 points |
| Productivity (completion time) | ~2 min faster | — | Not significant |
| Debugging ability | Largest gap | — | — |
An effect size of Cohen’s d = 0.738 sits just below the conventional 0.8 threshold for a “large” effect. Anthropic describes the gap as the equivalent of nearly two letter grades. And the productivity gain wasn’t statistically significant, meaning this isn’t even the familiar “trade skills for speed” tradeoff. That is the harshest implication of the RCT.
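To make the effect size concrete, here is a minimal sketch of the arithmetic behind Cohen's d, using only the figures quoted in this article (50%, 67%, d = 0.738). The equal group sizes of 26 and the equal standard deviations are assumptions made for illustration, and the quiz means are rounded, so the implied spread is approximate.

```python
import math

def cohens_d(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)

def label_effect(d):
    """Cohen's conventional benchmarks: 0.2 small, 0.5 medium, 0.8 large."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"

# Figures as quoted in the article (quiz scores in percent).
hand_mean, ai_mean, reported_d = 67.0, 50.0, 0.738

# Spread implied by the reported d, assuming both groups share one SD.
implied_sd = (hand_mean - ai_mean) / reported_d

# Relative shortfall of the AI group compared with the hand-coding group.
relative_gap = (hand_mean - ai_mean) / hand_mean

# Round trip: group sizes of 26 each are an assumption (n = 52 split evenly).
check_d = cohens_d(hand_mean, ai_mean, implied_sd, implied_sd, 26, 26)

print(f"implied pooled SD: {implied_sd:.1f} points")
print(f"relative gap: {relative_gap:.1%}")
print(f"d = {check_d:.3f} ({label_effect(check_d)})")
```

Read this way, the 17-point gap is roughly three quarters of a standard deviation, and the AI group scored about a quarter lower in relative terms.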
“How you use it” is what splits the outcomes
The most actionable finding from the Anthropic study isn’t the averages — it’s the cluster analysis2. When they categorized AI usage patterns within the high-scoring group (65%+) and low-scoring group (under 40%), a sharp pattern emerged.
```mermaid
flowchart TB
    A["AI usage pattern"] --> B["High scorers 65%+<br>Conceptual inquiry n=7"]
    A --> C["Low scorers under 40%<br>Code delegation n=7"]
    B --> D["Debug errors themselves"]
    C --> E["Paste AI output as-is"]
    classDef good stroke:#2ea44f,stroke-width:3px
    classDef bad stroke:#cf222e,stroke-width:3px
    class B,D good
    class C,E bad
```
The conceptual-inquiry cluster among the high scorers never asked AI to write code; they used it only for conceptual questions. When errors came up, they debugged on their own, treating the AI as “a textbook you can query in natural language.” The code-delegation cluster among the low scorers, by contrast, kept asking AI to “fix it” and pasted the returned code directly into their work.
In other words, what matters isn’t whether you use AI but which mode you use it in, and the learning outcome can flip entirely depending on that mode. In this study, the “vibe-coding” style of offloading whole problems was the pattern tied to the lowest scores.
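As a rough self-check on which mode you tend to fall into, you could tag your own prompt history. The sketch below is a hypothetical keyword heuristic, not the coding scheme used in the study: the verb list, the function names, and the sample log are all assumptions for illustration.

```python
import re

# Hypothetical heuristic: prompts that open with a build/fix verb count as delegation.
# The verb list is an assumption; tune it to your own phrasing.
DELEGATION_OPENERS = re.compile(
    r"^\s*(please\s+)?(write|implement|fix|generate|refactor|create|build)\b",
    re.IGNORECASE,
)

def classify_prompt(prompt: str) -> str:
    """Return 'delegation' if the prompt opens with a build/fix verb, else 'inquiry'."""
    return "delegation" if DELEGATION_OPENERS.match(prompt) else "inquiry"

def delegation_share(prompts: list[str]) -> float:
    """Fraction of prompts tagged as delegation (0.0 for an empty log)."""
    if not prompts:
        return 0.0
    tagged = [classify_prompt(p) for p in prompts]
    return tagged.count("delegation") / len(prompts)

# Hypothetical prompt log for one working session.
session_log = [
    "Fix it",
    "What does this error message mean?",
    "Why would a nursery cancel its other tasks here?",
    "Write the whole function for me",
]
share = delegation_share(session_log)
print(f"delegation share: {share:.0%}")
```

A keyword match will misjudge plenty of prompts, so treat the share as a nudge to look at your habits rather than as a measurement.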
ICER 2024’s “widening gap”
Prather et al. (ICER 2024) followed 21 novice programmers with observation and eye-tracking3. What they saw was a bifurcation between students who accelerated and students who stalled. The stalled group exhibited three distinct metacognitive difficulties:
- Interruption: AI suggestions break the flow of thought, robbing students of the moment they would have asked themselves “what would I write here?”
- Mislead: AI confidently points in a subtly wrong direction, and students can’t catch it on their own
- False sense of progression: code grows and appears to work, but understanding isn’t keeping up
These difficulties compound into the “illusion of competence” — the feeling of having understood. Students look at the AI-assisted code and feel they get it, but can’t reproduce the same functionality from scratch.
This lines up with classic findings from learning science. Karpicke & Roediger (2008, Science) showed that learners who re-studied material expected to remember as much as those who practiced retrieval, yet retrieval practice (active recall) produced far better long-term retention4. AI code generation puts you in a cognitive mode much closer to passive rereading than to retrieval.
A related Japanese study at Nara College
Research in Japan shows a partly similar pattern. Kawamura & Uchida (Nara College of Technology, 2025) compared ChatGPT-assisted and hand-coded work among students and reported that the AI group finished faster with lower variance, but showed no significant difference in comprehension test scores5. They also noted qualitatively that “AI may reduce opportunities for thinking and exploration.”
Across these studies, finishing the task (sometimes faster) did not translate into better understanding, and in the Anthropic RCT understanding was measurably worse.
Why the act of writing matters — a cognitive science view
Generation effect and desirable difficulties
Learning science has a concept from Bjork & Bjork called “desirable difficulties”6. Retrieval practice, spacing, interleaving, and — critically — generation (producing the answer yourself) all lower short-term performance but strengthen long-term memory and transfer.
Writing code by hand is generation. Choosing variable names, designing data structures, composing control flow — each is an act of producing the answer yourself, and each serves as cognitive strength training that deepens memory and understanding.
When AI takes over the generation, the task feels more fluent in the moment, but less is retained. The 17-point gap in the Anthropic RCT is consistent with what the desirable-difficulties framework predicts.
Cognitive offloading erodes critical thinking
Gerlich (2025, Societies, n=666) measured the relationship between AI usage and cognitive offloading (delegating judgment to an external system)7. The numbers are sobering.
- AI usage vs. cognitive offloading: r = +0.72 (strong positive correlation)
- Cognitive offloading vs. critical thinking: r = −0.75 (strong negative correlation)
- Younger users show higher AI dependence and lower critical thinking scores
One plausible reading is that forming the habit of using AI while young crowds out the chance to build critical thinking. The data is correlational, not causal, but as a signal it’s strong enough to justify juniors being especially careful.
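For a sense of scale, a correlation coefficient can be squared to estimate how much variance two measures share under a simple linear relationship. The snippet below is a small illustration using the two coefficients quoted above; it says nothing about direction of cause.

```python
# Correlations as quoted in this article for Gerlich (2025).
r_usage_offloading = 0.72
r_offloading_thinking = -0.75

def shared_variance(r: float) -> float:
    """Coefficient of determination for a simple linear relationship: r squared."""
    return r * r

pairs = [
    ("AI usage vs. cognitive offloading", r_usage_offloading),
    ("cognitive offloading vs. critical thinking", r_offloading_thinking),
]
for label, r in pairs:
    print(f"{label}: r = {r:+.2f}, shared variance = {shared_variance(r):.0%}")
```

Both relationships account for roughly half of the variance, which is strong for survey data but still leaves the other half unexplained.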
Why “debugging muscle” fails to develop
The largest gap in the Anthropic RCT was in debugging2, and there’s a structural reason.
Writing code is a process of “externalizing the model in your head.” Debugging is the reverse: “inferring internal state from the code’s behavior.” The latter is precisely the training where you reconstruct the code inside your head — and when you hand it off to AI, the muscle for reading code atrophies.
Over a career of 5 or 10 years, what really pays off is the ability to read and reason through other people’s massive codebases. That ability doesn’t come from writing practice — it comes from reading, tracing, and inferring. And AI quietly takes those opportunities away.
Practice — the four-step “hand-coding base + limited AI assist”
Step 1: Think on your own for 5 minutes before opening AI
Before opening AI, spend just 5 minutes sketching your own solution in pseudocode or notes. It doesn’t need to be complete — just a skeleton of “roughly this flow” is enough.
```mermaid
flowchart TB
    A["Receive task"] --> B["5 min solo thinking<br>Pseudocode / sketch"]
    B --> C["Your solution v1"]
    C --> D["Get AI's alternative"]
    D --> E["Diff against your own"]
    E --> F["Write it yourself or<br>incorporate AI's good parts"]
    classDef think stroke:#2ea44f,stroke-width:3px
    class B,C,E think
```
Those 5 minutes are when you form a hypothesis of what you would write. Once the AI’s output comes back, learning happens in the diff between your hypothesis and its answer. Without a hypothesis, there is no diff — just a vague “huh, I see” that leaves nothing behind.
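One way to make that diff literal is to keep your five-minute sketch as plain text and compare it with the assistant's plan. The example below is a minimal sketch using Python's standard `difflib`; both plans and the `diff_plans` helper are hypothetical.

```python
import difflib

# Hypothetical 5-minute sketch written before opening the AI tool.
my_sketch = """\
read all lines
for each line: parse into record
collect records in a list
sort list by timestamp
""".splitlines()

# Hypothetical alternative returned by the assistant.
ai_alternative = """\
read lines lazily
for each line: parse into record
skip records that fail validation
sort list by timestamp
""".splitlines()

def diff_plans(mine, theirs):
    """Return only the added and removed lines of a unified diff between two plans."""
    diff = difflib.unified_diff(mine, theirs, fromfile="mine", tofile="ai", lineterm="")
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]

changes = diff_plans(my_sketch, ai_alternative)
for line in changes:
    print(line)
```

Each `-`/`+` pair is a question to answer yourself before accepting anything: here, why lazy reading, and why a validation step you did not think of?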
Step 2: Open AI in “conceptual inquiry” mode only
This is exactly what the top-scoring cluster in the Anthropic RCT did2. Don’t ask it to write code; use it as something you ask questions to.
[Bad: code generation mode]
"Write a React component that implements this feature."
[Good: conceptual inquiry mode]
"What's the criterion for splitting state between useState and useReducer?"
"Give me three reasons this useMemo is needed."
"List three scenarios where this error message can appear."

Take the answers and write the code yourself. Treat AI as “a senior engineer you can query in text.” That alone prevents most of the “illusion of competence.”
Step 3: Debug on your own for at least 30 minutes
Given that debugging had the largest gap in the Anthropic RCT2 and that Prather’s Mislead phenomenon specifically affects debugging3, debugging is the place you should rely on AI least.
The rule is simple:
- First 30 minutes: figure out the cause without AI. Use print, log, a debugger, and an MCVE (minimal, complete, verifiable example).
- Still stuck after 30 minutes: ask AI for hints only. “Where should I look?” and “What does this error mean?” are OK. “Fix it” is not.
- Once you have a hint: go back to fixing it yourself.
It’s slower at first, but the practice compounds: the more bugs you trace yourself, the faster you get at the next one. Over several years, this is likely to be one of the biggest gaps between you and a peer who leans on AI for every fix.
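As an illustration of the first 30 minutes, here is a minimal sketch of a reproducible example with logging. The bug (a shared mutable default argument) and the function names are hypothetical, chosen only because the whole failure fits in a few lines.

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("mcve")

def add_tag_buggy(tag, tags=[]):
    """Hypothetical buggy helper: the default list is created once and shared."""
    tags.append(tag)
    # Logging the object id makes the shared list visible across calls.
    log.debug("add_tag_buggy(%r) -> %r (id=%d)", tag, tags, id(tags))
    return tags

def add_tag_fixed(tag, tags=None):
    """Fixed version: a fresh list is created on every call without an argument."""
    if tags is None:
        tags = []
    tags.append(tag)
    return tags

# Minimal reproduction: two independent calls, nothing else from the real codebase.
first = add_tag_buggy("a")
second = add_tag_buggy("b")
print("same object:", first is second)
print("second call returned:", second)
print("fixed version:", add_tag_fixed("a"), add_tag_fixed("b"))
```

Shrinking the problem to two calls and one log line is usually where the cause becomes visible. If you do end up asking for a hint, a reproduction this small also makes the question much sharper.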
Step 4: Trace AI-generated code line by line
Taking AI-written code as-is is the most dangerous habit. Before accepting it, always trace each line with your finger and explain out loud (or in writing) why that line is there.
- “Why is this `if` here?” → Because there’s a case that needs a null check.
- “What is this `useMemo` preventing?” → An expensive computation when the parent re-renders.
- “Why is this exception handled at this layer?” → …
If there’s a line you can’t explain out loud or in writing, that’s a line you don’t understand. Drop it or investigate until you do.
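Written down, a trace can be as simple as a "why" comment per line. The function below is a hypothetical stand-in for AI-generated code, and the assumption that negative values are timeout markers is invented for the example. Note how the trace surfaces one line that cannot be justified.

```python
def average_latency_ms(samples):
    # Why this guard? It looks like protection against dividing by zero,
    # but the later check already covers the empty case, so this line is
    # redundant: a candidate to drop once the trace is done.
    if not samples:
        return None
    # Why filter? Negative values are assumed to be "timeout" markers, not latencies.
    valid = [s for s in samples if s >= 0]
    # Why check again? Filtering can leave nothing behind even for non-empty input.
    if not valid:
        return None
    # Why sum/len? The list is already in memory, so a single pass is enough.
    return sum(valid) / len(valid)

print(average_latency_ms([10, 20, -1]))
print(average_latency_ms([-1]))
print(average_latency_ms([]))
```

The code was correct either way; the point of tracing is that you now know which lines carry weight and which are just along for the ride.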
Yes, this is slow. But remember that in the Anthropic RCT, the AI group’s productivity gain was not statistically significant2. In that setting at least, the intuition that “AI makes you faster” wasn’t backed by the data, so tracing carefully probably doesn’t cost you much in final completion time.
Replies to common objections
“Won’t I fall behind while everyone else uses AI?”
Stack Overflow 2025 reports that 84% of respondents use or plan to use AI tools8, but favorable sentiment toward those tools has fallen to 60% (down from over 70% in 2023-24). “Everyone is using it” is true. “Everyone trusts it fully” is not.
Juniors who can code without AI and also wield AI well in conceptual-inquiry mode stand out. Some interviews and take-home assignments test fundamentals with AI disabled, so this approach also fits that kind of hiring process.
“Won’t I be rated low if my output is slower?”
On short-term metrics (PR count, commit count, tickets closed), a peer leaning on AI may temporarily rate higher. But as discussed in The “almost right” code trap, that cost comes back in 3-5 years as review load, bug rates, and maintenance cost.
Long-term strengths such as design ability, debugging, legacy code maintenance, and mentoring rest on the muscle you built by hand. Spending your first few years on fundamentals is, in financial terms, “an investment that compounds.” The feedback loop discussed in The AI deskilling paradox hits hardest during the junior phase.
“Won’t I fail to even qualify for jobs without AI skills?”
In the JetBrains 2025 survey, 68% of developers expect “AI proficiency to become a job requirement”9. That prediction is likely correct. But the definition of “AI proficiency” is still unsettled.
“Using AI well for conceptual questions,” “reviewing AI output critically,” “fixing it when AI gets it wrong” — all of these are also AI proficiency, and hand-coding practice is actually what sharpens them. Hand-coding isn’t anti-AI. It’s the foundation for using AI correctly.
Closing — an investment in your 5-year-out self
Read calmly, the data points in a consistent direction. The Anthropic RCT’s 17-point gap, ICER 2024’s “illusion of competence,” and Gerlich’s negative correlation with critical thinking all suggest that AI assistance can degrade learning, while the Nara College study found faster completion with no gain in comprehension. These are small or correlational studies, so treat them as converging signals rather than settled fact. Read alongside the opposing view in AI as a “skill equalizer”, and the boundary of “when and where it helps” starts to come into focus.
The productivity-focused guide’s claim that “AI done well is fast” and this article’s claim that “AI overused makes understanding shallow” aren’t in conflict. Both are true. The split depends on what you’re optimizing for.
A junior’s first few years are a window for building assets that compound. Sacrifice a little short-term productivity, and use hand-coding to build debugging, reading, and design muscle — and five years in, you’ll be the real engineer who can be trusted to judge AI-generated code. Writing code yourself is not a waste of time. In the AI era, it’s the smartest long-term investment you can make.
The companion productivity-focused guide lays out the approach if you’re optimizing for short-term output instead. Combined with the overall tradeoff breakdown, pick what fits your stage.
Footnotes
1. Demirer, M., Cui, Z., Musolff, L., Jaffe, S., Peng, S., & Salz, T. (2024). “The Effects of Generative AI on High Skilled Work: Evidence from Three Field Experiments with Software Developers.” SSRN Working Paper ID 4945566. Article version: MIT Sloan, November 4, 2024. +27-39% for less-experienced developers, +8-13% for experienced ones.
2. Shen, J. H., & Tamkin, A. (2026). “How AI assistance impacts the formation of coding skills.” Anthropic, published January 29, 2026. 52 participants (mostly junior), learning-RCT on a new Python library (Trio). Quiz scores: AI group 50% vs. hand-coding group 67% (Cohen’s d=0.738, p=0.01). Largest gap on debugging. Cluster analysis: top scorers are “conceptual inquiry only” type, bottom scorers are “code delegation” type. https://www.anthropic.com/research/AI-assistance-coding-skills, preprint arXiv:2601.20245
3. Prather, J., et al. (2024). “The Widening Gap: The Benefits and Harms of Generative AI for Novice Programmers.” ICER ’24. Observation + eye-tracking study of 21 students. Reported “illusion of competence” and multiple metacognitive difficulties (interruption, mislead, false sense of progression). https://arxiv.org/abs/2405.17739
4. Karpicke, J. D., & Roediger, H. L. (2008). “The Critical Importance of Retrieval for Learning.” Science, 319(5865), 966–968. DOI: 10.1126/science.1152408. Classic study showing retrieval practice beats rereading for long-term retention.
5. Kawamura, T., & Uchida, S. (2025). “Effects of programming with generative AI on learning outcomes.” Nara College of Technology. AI group finished faster with lower variance, but no significant difference on comprehension tests. https://www.jsise.org/wp-content/uploads/2025/02/2024_kansai_p09.pdf
6. Bjork, E. L., & Bjork, R. A. (2011). “Making Things Hard on Yourself, But in a Good Way: Creating Desirable Difficulties to Enhance Learning.” UCLA Bjork Learning and Forgetting Lab. https://bjorklab.psych.ucla.edu/wp-content/uploads/sites/13/2016/04/EBjork_RBjork_2011.pdf
7. Gerlich, M. (2025). “AI Tools in Society: Impacts on Cognitive Offloading and the Future of Critical Thinking.” Societies, 15(1), 6. n=666. AI usage vs. cognitive offloading r=+0.72; offloading vs. critical thinking r=−0.75. Younger participants show higher AI dependence and lower critical thinking scores. https://www.mdpi.com/2075-4698/15/1/6. A correction concerning Table 4 was issued in September 2025 as Societies 15(9), 252; the author states the main scientific conclusions are unchanged.
8. Stack Overflow. (2025). “2025 Developer Survey.” 84% of respondents use or plan to use AI tools; favorable sentiment toward AI tools is 60% (down from over 70% in 2023-24). https://survey.stackoverflow.co/2025/ai
9. JetBrains. (2025). “The State of Developer Ecosystem 2025.” n=24,534 across 194 countries. 68% expect “AI proficiency to become a job requirement.” https://devecosystem-2025.jetbrains.com/artificial-intelligence