The Cost Structure of Evidence-Based Writing — The 100 Hours AI Saves and the 200 Hours It Can't

Posted Feb 3, 2026

AI-Generated Content

This article was generated by AI. The accuracy of the content is not guaranteed, and we accept no responsibility for any damages resulting from use of this article. By continuing to read, you agree to the Terms of Use.

Target audience: Engineers interested in evidence-based information sharing
Prerequisites: Basic experience with AI tools (ChatGPT, Claude, etc.)
Reading time: 17 minutes

Overview

“With AI, you can write evidence-based articles without being able to read academic papers” — this expectation is half right and half wrong.

AI dramatically accelerates the search, comprehension, summarization, and citation generation of academic papers. It can compress work that took humans over 100 hours into just a few hours. However, verifying whether AI-generated citations are accurate, whether a paper is appropriate to cite in a given context, and whether effect size interpretations are valid — these still require research literacy on the human side.

This article breaks down the actual cost structure of writing evidence-based articles, clarifying what AI can and cannot accelerate. It then presents a realistic path forward: by abandoning the perfectionism of “it’s useless unless 100% accurate,” evidence becomes far more accessible.

The Actual Cost of a Single Article

Track Record with AI

This blog continuously publishes evidence-based articles using AI. Here are the results from the most recent six articles.

Article	Lines	Citations	Mermaid diagrams	Estimated time
VS Code Native Tabs	259	7	0	20–40 min
Japanese AI Dev Organizations	371	25	3	30–60 min
Science of Education & Learning	616	17	0	30–60 min
Claude Code Skill Delegation	299	8	2	20–40 min
METR Study Limitations	287	9	2	20–40 min
Review Transformation	346	12	4	25–50 min

6 articles total: estimated 2–5 hours (including both JP/EN versions)

Estimates for a Human Writing at Equivalent Quality

What if the same articles were written solo by a human with research literacy?

Let’s break down the time for each stage.

Research Phase

Using academic papers as evidence requires more than just reading abstracts.

According to Tenopir & King’s longitudinal study spanning from 1977, researchers spend an average of 31–48 minutes reading a single paper¹. However, these are figures for domain experts.

Research by Nelms & Segura-Totten (2019) demonstrates a decisive gap between expert and novice paper comprehension². Experts possess complex schemas (knowledge structures) that, as explained by cognitive load theory, reduce the burden on working memory. Novices take several times longer to read the same paper, with shallower understanding.

Including the “search” phase for papers to cite, the actual time to finalize a single citation is:

Stage	Expert	Non-expert
Paper search (skim 5–10 candidates)	30–60 min	1–3 hours
Close reading & comprehension	30–60 min	2–4 hours
Citation context judgment	10–20 min	30–60 min
Total (per citation)	1–2.5 hours	3.5–8 hours

Article #2 (Japanese AI Dev Organizations)

25 citations, cross-disciplinary (economic policy, LLM architecture, organizational theory, cultural psychology):

Stage	Estimate
Paper/report search (scanning 50+ candidates)	8–12 hours
Close reading & comprehension (25 × 1–2 hours)	15–25 hours
Structure & writing (371 lines, 25 sections)	6–10 hours
3 mermaid diagram designs	1–2 hours
English version creation	3–4 hours
Total	33–53 hours

6 Articles Total

Article	Human estimate
VS Code Tabs (primarily technical docs)	5–8 hours
Japanese AI Dev Orgs (cross-disciplinary, 25 citations)	33–53 hours
Science of Education (psychology meta-analysis, 616 lines)	35–57 hours
Skill Delegation (primarily technical docs)	7–12 hours
METR Study Critique (research methodology required)	15–25 hours
Review Transformation (CS papers + industry reports)	20–32 hours
Total	115–187 hours
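
The totals row can be rechecked by summing the low and high ends of each per-article range. The following is a minimal sketch; the dictionary keys are shortened labels chosen here for illustration, and the 8-hour business day is an assumption used only to convert hours to days.

```python
# Per-article human estimates in hours (low, high), copied from the table above.
human_estimates = {
    "VS Code Tabs": (5, 8),
    "Japanese AI Dev Orgs": (33, 53),
    "Science of Education": (35, 57),
    "Skill Delegation": (7, 12),
    "METR Study Critique": (15, 25),
    "Review Transformation": (20, 32),
}

def sum_ranges(ranges):
    """Add up the low ends and the high ends separately."""
    ranges = list(ranges)
    low = sum(lo for lo, _ in ranges)
    high = sum(hi for _, hi in ranges)
    return low, high

total_low, total_high = sum_ranges(human_estimates.values())
print(f"Total: {total_low}-{total_high} hours")

# Assumption: one business day = 8 working hours.
HOURS_PER_DAY = 8
days_low = total_low / HOURS_PER_DAY
days_high = total_high / HOURS_PER_DAY
print(f"About {days_low:.1f}-{days_high:.1f} business days")
```

Under that 8-hour assumption the result is about 14.4 to 23.4 days, which rounds up to the 15–24 business days quoted in the comparison diagram.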

Cost Comparison

flowchart TB
    subgraph AI["AI Pipeline"]
        direction TB
        A1["6 articles × JP/EN"]
        A2["Total: 2–5 hours"]
        A1 --> A2
    end

    subgraph Human["Human Solo"]
        direction TB
        H1["6 articles × JP/EN"]
        H2["Total: 115–187 hours<br>(15–24 business days)"]
        H1 --> H2
    end

    AI --> Ratio["~30–50x difference"]
    Human --> Ratio

Roughly estimated, the AI pipeline produces evidence-based articles at 30–50x the speed of a solo human. (Note that the AI side is an estimate of actual work time, while the human side is an effort estimate; neither is a precise measurement.)
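
As a rough check on where 30–50x sits, the article's own totals can be divided against each other. The sketch below (variable names are illustrative) compares the extremes of the two ranges and their midpoints.

```python
ai_hours = (2, 5)         # AI pipeline total, from the track record above
human_hours = (115, 187)  # human solo total, from the estimate table

# Most conservative case: fastest human estimate vs. slowest AI estimate.
ratio_min = human_hours[0] / ai_hours[1]
# Most optimistic case: slowest human estimate vs. fastest AI estimate.
ratio_max = human_hours[1] / ai_hours[0]
# Ratio of the two range midpoints.
ratio_mid = (sum(human_hours) / 2) / (sum(ai_hours) / 2)

print(f"min {ratio_min:.1f}x, midpoint {ratio_mid:.1f}x, max {ratio_max:.1f}x")
```

The extremes span roughly 23x to 94x and the midpoint ratio is about 43x, so 30–50x is best read as a rounded central band, not a strict bound.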

This figure comes with an important caveat.

What AI Is Actually Shortening

In an experiment published in Science by Noy & Zhang (2023), 453 professionals performed writing tasks with ChatGPT, resulting in a 40% reduction in task time and 18% improvement in quality³. The effect was particularly large for lower-skilled participants, compressing productivity gaps.

What AI dramatically shortens is the following working time:

Paper search: Comprehensively listing relevant papers from keywords
Summarization & extraction: Presenting key points in structured form
Translation: Converting English papers to Japanese, expanding articles to JP/EN
Structuring: Logically organizing multiple sources
Draft generation: Composing coherent text

These are fundamentally information processing tasks — AI’s strong suit.

Not Just “Writing” but “Scan Range” Is Compressed

One point is often overlooked: AI isn’t just shortening writing time. The biggest impact is compressing the information scan range.

When a human writes an evidence-based article from scratch, the typical process looks like:

Search by keywords, listing 50+ candidate papers
Skim 20–30 of those at the abstract level
Close-read the most relevant 10–15
End up citing only 5–10

In this process, the time spent reading 40+ papers that weren’t cited doesn’t directly show in the final output. Of course, the act of judging “not relevant” has value, but the bulk of effort goes into this elimination work.

When verifying AI-generated output, the work is fundamentally different:

Confirm that the 10 citations AI chose actually exist via DOI/databases
Verify that cited content matches the original paper’s claims
Judge whether the citation is appropriate in context

The required skill level is the same, but the scan range shrinks from 50+ candidate papers to about 10 citations, that is, to 1/5 or less. This is the structural reality of “AI can’t write papers perfectly, but verifying its output is overwhelmingly faster.”

As a practical technique, you can also have AI maintain a list of “papers consulted but not cited during research.” This lets you check why AI didn’t select specific papers and catch potential oversights.
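
One way to keep such a list is a small structured log that records every paper the AI consulted, whether it was cited, and why. The sketch below is only an illustration: the `ConsultedPaper` fields and the sample entries are hypothetical, not a format this blog prescribes.

```python
from dataclasses import dataclass

@dataclass
class ConsultedPaper:
    title: str
    cited: bool
    reason: str  # why it was cited or set aside

def not_cited(log):
    """Return the papers that were consulted but left out of the article."""
    return [paper for paper in log if not paper.cited]

# Hypothetical entries for illustration only.
log = [
    ConsultedPaper("Paper A", cited=True, reason="directly measures the claim"),
    ConsultedPaper("Paper B", cited=False, reason="different population"),
    ConsultedPaper("Paper C", cited=False, reason="preprint superseded by Paper A"),
]

for paper in not_cited(log):
    print(f"{paper.title}: {paper.reason}")
```

Reviewing the `reason` column is where oversights tend to surface, for example a paper set aside for a reason that does not actually apply.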

What AI Can’t Skip

The Verification Wall

The problem is that humans need the ability to verify the citations and claims AI generates.

In a study published in the Journal of Medical Internet Research, Chelli et al. (2024) measured hallucination rates of LLM-generated academic citations⁴:

Model	Hallucination rate
GPT-3.5	39.6%
GPT-4	28.6%
Bard	91.4%

Additionally, Buchanan et al. (2024) found in their economics domain verification that even GPT-4 produced over 20% fabricated citations⁵. The rate of fabricated citations increased significantly when prompts shifted from general topics to specific questions.

In other words, when AI writes “according to this paper…”, more than 1 in 5 may reference a paper that doesn’t exist.

However, note that these are figures for single-shot generation.

Hallucination Rates Can Be Reduced — But Not to Zero

The studies above all measured results from generating citations with an LLM just once. In practice, building a multi-stage pipeline where AI reviews its own output can dramatically lower the effective hallucination rate.

For example:

AI generation: Output the article and citations (hallucination rate at this stage: 20–40%)
AI verification: Use a separate prompt (or a different model) to check “do these citations exist?” and “are claims consistent with citations?”
Automated verification: Mechanically confirm paper existence via DOI lookup or a bibliographic database API such as Crossref (Google Scholar offers no official API)
Human meta-review: Final human judgment

This blog uses a three-stage process: generation → review → fact-check. The single-shot hallucination rate doesn’t carry through unchanged to the final output.
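
A staged flow of this kind can be sketched as a chain of checks that each citation passes through before it reaches a human. Everything below is a simplified illustration: the `Citation` type, the stage functions, and their toy rules are assumptions made for the example and do not describe this blog's actual tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Citation:
    title: str
    doi: str
    flags: List[str] = field(default_factory=list)

# A stage inspects one citation and returns a problem description, or None.
Stage = Callable[[Citation], Optional[str]]

def suspicious_title(c: Citation) -> Optional[str]:
    """Toy stand-in for an AI review pass."""
    return "title is empty" if not c.title.strip() else None

def missing_doi(c: Citation) -> Optional[str]:
    """Toy stand-in for an automated existence check."""
    return "no DOI given" if not c.doi else None

def run_pipeline(citations: List[Citation], stages: List[Stage]) -> List[Citation]:
    """Run every stage on every citation and record the problems found."""
    for c in citations:
        for stage in stages:
            problem = stage(c)
            if problem:
                c.flags.append(problem)
    return citations

def needs_human_review(citations: List[Citation]) -> List[Citation]:
    """Flagged items go to the human reviewer first."""
    return [c for c in citations if c.flags]

drafts = [
    Citation("Example study", "10.1234/example"),  # hypothetical DOI
    Citation("Unverifiable study", ""),
]
checked = run_pipeline(drafts, [suspicious_title, missing_doi])
for c in needs_human_review(checked):
    print(c.title, "->", c.flags)
```

The design choice worth noting is that stages only add flags and never silently drop a citation, so the human reviewer sees what each stage objected to.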

The key point is that even with multiple stages, it won’t reach zero. There’s always the risk that AI generates a fabricated paper and another AI validates it as “correct.” This is precisely why human verification at the final stage — especially DOI searches and database existence checks — remains essential.
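
A back-of-the-envelope model shows why extra stages help but never reach zero. Suppose each stage independently catches a fixed share of the fabricated citations that reach it. Both the independence assumption and the catch rates below are hypothetical (the studies cited here give no such values), and one AI endorsing another's fabrication is exactly a case where independence breaks down.

```python
def residual_rate(initial_rate, catch_rates):
    """Fraction of the original citations that are fabricated and still undetected.

    initial_rate: single-shot fabrication rate (0.20 to 0.40 in the studies above)
    catch_rates:  per-stage probability of catching a fabricated citation (assumed)
    """
    rate = initial_rate
    for catch in catch_rates:
        rate *= (1 - catch)
    return rate

# Hypothetical catch rates: AI review 70%, automated DOI lookup 90%.
for start in (0.20, 0.40):
    after = residual_rate(start, [0.70, 0.90])
    print(f"start {start:.0%} -> after two stages {after:.1%}")
```

With these assumed rates the residual falls to around 1% but stays above zero for any catch rate below 100%, which is the gap the final human check is meant to cover.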

The capabilities needed for this final verification:

Confirming paper existence (DOI search, database cross-referencing)
Judging whether cited content matches the original paper’s claims
Evaluating whether citing the paper in a given context is appropriate
Verifying that effect sizes and statistical indicators are correctly interpreted

The Statistical Literacy Wall

Statistics comprehension is particularly critical.

A study by Lytsy et al. (2022) published in the Upsala Journal of Medical Sciences is striking⁶. When doctoral students and statisticians/epidemiologists were asked about p-value interpretation:

Only 10.7% of doctoral students answered correctly
Even among statisticians and epidemiologists, only 12.5%

Even among people who work with statistics professionally, most did not interpret p-values correctly.

A survey by Haller & Krauss (2002), cited in Gigerenzer (2004), showed similar results⁷. Among 44 psychology students, zero answered all questions correctly. Among 39 faculty not teaching statistics, only 4 did, and among 30 faculty teaching statistics (professors, lecturers, and TAs), only 6 could correctly answer all p-value questions.

When AI cites “significant at p < 0.05” from a paper, judging whether that claim is contextually valid requires understanding the correct interpretation of p-values, the relationship to effect sizes, and the influence of sample size. This judgment is difficult to delegate to AI.
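
The interplay between effect size, sample size, and significance can be seen with a small calculation. The sketch below uses a two-sample z-test approximation with equal group sizes and unit variance, a simplification chosen for illustration; the effect size and sample sizes are made-up inputs, not values from any cited study.

```python
import math

def two_sided_p(cohens_d, n_per_group):
    """Approximate two-sided p-value for a standardized mean difference.

    Assumes two equal-sized groups with known unit variance (z-test).
    """
    z = cohens_d * math.sqrt(n_per_group / 2)
    return math.erfc(abs(z) / math.sqrt(2))

tiny_effect = 0.05  # far below Cohen's conventional "small" benchmark of 0.2
for n in (100, 10_000):
    p = two_sided_p(tiny_effect, n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"d={tiny_effect}, n={n} per group: p={p:.4f} ({verdict})")
```

The same tiny effect is far from significant with 100 participants per group and clearly significant with 10,000, so “p < 0.05” alone says little about whether an effect is large enough to matter.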

The Cost of Acquiring Research Literacy

Three Levels

Research literacy is a skill that builds incrementally — there are no shortcuts.

flowchart TB
    L0["Level 0: AI + Common Sense<br>No investment needed"]
    L1["Level 1: Can Find Papers<br>20–40 hours"]
    L2["Level 2: Can Read Papers<br>+100–200 hours"]
    L3["Level 3: Can Cite Correctly<br>+100–200 hours"]
    L0 --> L1
    L1 --> L2
    L2 --> L3
    L3 --> Total["Total: 220–440 hours"]

Level 0 is the starting point. Ask AI questions and judge direction by the consistency of multiple responses. No learning investment is needed, but you can’t detect fabricated citations or statistical misuse. Most people are here without realizing it.

Level 1: Can “Find” Papers (20–40 hours)

Using academic databases: Google Scholar, PubMed, IEEE Xplore, etc.
Judging quality by citation count and journal impact factor
Distinguishing preprints (arXiv, etc.) from peer-reviewed papers
Choosing appropriate search keywords

Engineers already have strong search skills, so learning what to search for enables relatively quick progress.

Level 2: Can “Read” Papers (cumulative 120–240 hours)

This is the biggest hurdle. ACRL’s “Framework for Information Literacy for Higher Education” identifies six threshold concepts required for information literacy acquisition⁸.

Key learning items:

Paper structure (IMRaD format) and efficient reading techniques
Statistical literacy: p-values, effect sizes (Cohen’s d, r, odds ratios), confidence intervals, the difference between statistical significance and practical significance — this alone requires one introductory textbook (40–60 hours)
Reading meta-analyses: heterogeneity (I²), publication bias (funnel plots), forest plot interpretation
Evaluating research design: evidence level differences among RCTs, quasi-experiments, observational studies, and qualitative research
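
Two of the quantities in this list have compact definitions that are worth seeing once. The sketch below computes Cohen's d with a pooled standard deviation and the I² heterogeneity statistic from Cochran's Q, using the standard textbook formulas; the sample numbers are invented purely to exercise the functions.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

def i_squared(q, num_studies):
    """I² = max(0, (Q - df) / Q) as a percentage, with df = number of studies - 1."""
    df = num_studies - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100

# Invented data for illustration.
treated = [12, 14, 15, 17, 18]
control = [10, 11, 13, 14, 15]
print(f"Cohen's d = {cohens_d(treated, control):.2f}")
print(f"I² = {i_squared(q=20.0, num_studies=11):.0f}%")
```

Working through a formula once by hand or in code makes it easier to notice when an AI summary reports an effect size on the wrong scale or describes high heterogeneity as agreement.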

For reference, US graduate programs require 12–18 credits for a research methods certificate⁹. The “Level 2” envisioned here corresponds to a basic subset of such graduate coursework — completing the entire program is not necessary.

Level 3: Can “Cite Correctly” (cumulative 220–440 hours)

Contextual judgment: Evaluating whether research findings can be applied to a different context
Stating limitations: Distinguishing “has been proven” from “results have been reported”
Avoiding secondary citations: The habit of checking original sources rather than relying on secondary sources
Searching for counterevidence: The intellectual honesty to seek and acknowledge contradicting research, not just papers supporting your position

This level is about “judgment” rather than “knowledge,” and can only be developed through practice.

The Structural Problem of Self-Study

In graduate school, feedback comes from advisors and peer review. When self-studying, there’s no one to point out misunderstandings — a structural problem.

As the Lytsy et al. (2022) results show, even doctoral students can’t correctly interpret p-values. Without formal education, misunderstandings risk persisting for even longer.

The Trap of “Useless Unless 100% Accurate”

After reading this far, you might feel that “evidence is useless without a 220–440 hour learning investment.” But this is the trap of perfectionism.

Before questioning AI’s accuracy, how accurate are “human-written articles” — the comparison target?

Is “Written by Humans” Actually Accurate?

In a large-scale survey of 14 US newspapers and 4,800 articles, Maier (2005) found that 48% of newspaper articles contained factual errors¹⁰. Articles containing any kind of error reached 61%. In the roughly 70 years since the first such survey in 1936 (approximately 50%), the factual error rate has not meaningfully improved. These figures describe published articles, that is, articles that had already passed through editors’ checks and proofreading.

In scientific papers as well, an analysis by Fang et al. (2012) published in PNAS found that 67.4% of retractions were attributable to misconduct (fraud or suspected fraud 43.4%, duplicate publication 14.2%, plagiarism 9.8%)¹¹. Retractions due to honest error accounted for only 21.3%.

The critical point here is fairness of comparison. The 20–40% cited as AI hallucination rates are single-shot generation figures — without review or correction. However, in AI-assisted article creation, a pipeline of generation → AI review → fact-check is the de facto standard. Human articles also go through editorial processes before publication. For a fair comparison, both should be compared at the quality of their published output.

Target	Process	Post-publication error rate
Human newspaper articles	Reporter → Editor → Proofreader	48–61% (factual errors)
Human academic papers	Author → Peer review → Publication	67.4% of retractions attributed to misconduct (a share of retractions, not an overall error rate)
AI articles (single-shot)	Generation only	20–40% (citation hallucinations)
AI articles (multi-stage pipeline)	Generation → AI review → Verification	No quantitative data (significantly reduced from single-shot)

What’s also frequently overlooked is bias through deference (sontaku). Human articles can contain selective omission of information driven by sponsor considerations, organizational dynamics, or political positions. This doesn’t appear in error rate surveys because it isn’t “factual error,” but it’s equally or more harmful in distorting readers’ judgment.

AI has no incentive for this kind of intentional omission. Of course, if a human instructs “write an article favorable to this product,” AI will write biased content too — but that’s human bias manifesting through AI, not a problem inherent to AI. The errors AI spontaneously generates are random hallucinations, fundamentally different in nature from systematic bias. Random errors are easier to detect; systematic omissions are hard to catch.

“AI makes mistakes” is a valid criticism. But the human comparison target makes mistakes at equal or greater rates — and in forms that are harder to detect.

Required Accuracy Varies by Use Case

That said, required accuracy differs fundamentally by use case.

flowchart TB
    subgraph High["High Accuracy Required"]
        direction TB
        H1["Academic paper writing<br>Conducting meta-analyses"]
        H2["Required literacy: Level 3<br>220–440 hours"]
        H1 --> H2
    end

    subgraph Mid["Moderate Accuracy"]
        direction TB
        M1["Evidence-based<br>blog articles"]
        M2["Required literacy: Level 1–2<br>20–240 hours"]
        M1 --> M2
    end

    subgraph Low["Directional Correctness Sufficient"]
        direction TB
        L1["Career decisions<br>Technology selection"]
        L2["Required literacy: Level 0–1<br>AI + common sense"]
        L1 --> L2
    end

Writing academic papers requires correctly understanding effect size interpretation and meta-analysis methodology. Level 3 literacy is essential for this.

However, for decisions like “Should we adopt AI coding tools?” or “How should I think about my career direction?”, what matters is whether the direction is correct.

“Noy & Zhang (2023) reported a 40% reduction in task time and 18% quality improvement with AI writing tools” — whether the effect size is precise to the decimal point doesn’t matter here. Knowing the direction that “AI writing tools produce significant productivity gains” is sufficient as decision-making input.

Evidence “Usage” Also Has Gradients

Use case	Required accuracy	Required investment
Meta-analysis / academic papers	Precise effect size interpretation, research design evaluation	220–440 hours
Evidence-based blog articles	Claim-citation consistency, basic statistical understanding	20–240 hours
Technology selection / decision reference	Directional validity, no major contradictions	AI + common sense
Personal learning / career decisions	Directional understanding, multi-source consistency	AI + common sense

What matters is being aware of which accuracy level you’re operating at. Treating AI-generated research results as precise meta-analysis is dangerous. But judging that “multiple studies point in the same direction, so the direction is probably valid” is rational.
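
The “same direction” heuristic can be made explicit as a simple sign count, sometimes called vote counting. This is a deliberately crude sketch: it ignores study quality, sample size, and publication bias, which is why it suits directional decisions and not meta-analysis. The effect values and the 80% agreement threshold are hypothetical.

```python
def direction_summary(effects, threshold=0.8):
    """Report whether most reported effects share a sign.

    effects:   signed effect estimates from different studies
    threshold: share of studies that must agree (assumed cutoff, not a standard)
    """
    nonzero = [e for e in effects if e != 0]
    if not nonzero:
        return "no signal"
    positive_share = sum(e > 0 for e in nonzero) / len(nonzero)
    negative_share = sum(e < 0 for e in nonzero) / len(nonzero)
    if positive_share >= threshold:
        return "consistently positive"
    if negative_share >= threshold:
        return "consistently negative"
    return "mixed"

# Hypothetical effects reported by five studies.
print(direction_summary([0.40, 0.18, 0.25, 0.10, -0.05]))
```

A “mixed” result is itself useful information: it signals a field with conflicting research, where relying on direction alone is no longer safe.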

Perfectionism Blocks Evidence Use

“If you can’t read papers accurately, you shouldn’t engage with evidence” — this mindset ultimately justifies decisions made without evidence.

In reality:

Referencing evidence even just for direction is better than deciding without evidence
Even without 100% accuracy, simply knowing that “multiple studies show productivity gains” is superior to “deciding based solely on surrounding impressions”
If you’re aware of the precision limitations, even imperfect evidence has substantial value

Note that this blog itself is a practical example of this philosophy. Articles go through an AI generation → AI review → fact-check pipeline, but since the creator’s learning is the primary purpose, academic-paper-level complete verification is not performed. The approach prioritizes directional correctness and uses evidence within practical limits.

Where Literacy Still Matters

Having abandoned perfectionism, there are still situations where research literacy becomes important.

When You Can’t Take AI Citations at Face Value

As mentioned, in single-shot generation the GPT models in the studies cited above fabricated roughly 20–40% of their citations. In the following situations, verification ability is essential:

Published article citations — You have a responsibility to make them reader-verifiable
Organizational decision-making evidence — When directional errors carry large costs
Fields with conflicting research — Risk that AI presents only one perspective

Incremental Investment Is Realistic

A graduated approach that deepens as needed is most efficient. Following Levels 0–3 described above, here are concrete first steps.

Level	Action you can take today
0 (AI + common sense)	Ask the same question to multiple AIs and judge direction by response consistency
1 (Search and select)	Build a habit of checking paper existence and DOI on Google Scholar
2 (Reading and evaluation)	Start with Udemy’s free statistics literacy course¹² and practice by verifying AI output
3 (Academic level)	Invest when needed. No need to aim for this from the start
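
The Level 1 habit of checking DOIs can be partly automated. The sketch below first checks that a string looks like a DOI, then asks the public Crossref REST API whether a record exists at `https://api.crossref.org/works/{doi}`. The regular expression follows the pattern Crossref has suggested for modern DOIs and will miss some older ones. The function names are illustrative, and the lookup is passed in as a parameter so the logic can be exercised without network access.

```python
import re
import urllib.error
import urllib.parse
import urllib.request

# Pattern for modern DOIs (assumption: older, irregular DOIs are out of scope).
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)

def looks_like_doi(text):
    """Cheap offline syntax check."""
    return bool(DOI_PATTERN.match(text.strip()))

def crossref_has_record(doi, timeout=10):
    """Return True if Crossref serves a record for this DOI, False on 404."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi, safe="/")
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # rate limits or outages should not be read as "missing"

def check_citation(doi, lookup=crossref_has_record):
    """Classify a DOI as 'malformed', 'not found', or 'found'."""
    if not looks_like_doi(doi):
        return "malformed"
    return "found" if lookup(doi) else "not found"

# Offline demonstration with stub lookups; use the default lookup for real checks.
print(check_citation("not-a-doi", lookup=lambda d: True))
print(check_citation("10.1234/hypothetical.5678", lookup=lambda d: False))
```

Not every DOI is registered with Crossref, so a “not found” result is a prompt for manual checking in another database, not proof of fabrication. A “found” result only confirms existence; whether the paper says what the article claims still needs a human read.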

AI Itself Accelerates Learning

Paradoxically, dialogue with AI can also accelerate research literacy acquisition.

Have it explain paper structure
Ask for intuitive explanations of statistical concepts
Discuss “Is it appropriate to cite this paper in this context?”

However, final verification must use AI-independent means like database existence checks and DOI searches. Asking AI “Is this citation correct?” may just get you an AI confirming a fabricated paper as “correct.”

Summary

Organizing the cost structure of evidence-based writing:

What AI can shorten (working time):

Paper search & summarization: hours → minutes
Structuring & draft generation: days → tens of minutes
Multilingual expansion: hours → minutes
Total: 30–50x productivity improvement

What AI can’t skip (learning time):

Ability to find papers: 20–40 hours
Ability to read papers: 100–200 hours
Judgment to cite correctly: 100–200 hours
Total: 220–440 hours of learning investment

However, 220–440 hours aren’t always necessary.

To begin with, even human newspaper articles that went through editorial processes contain factual errors in 48% and errors of any kind in 61%. Comparing fairly against AI’s post-pipeline quality, the premise that “human-written means accurate” doesn’t hold.

If you’re using evidence as reference for career decisions or technology selection, using AI-presented evidence as “directional indicators” is sufficient. If writing blog articles, being able to filter out fabricated AI citations with Level 1 literacy (20–40 hours) alone makes a significant difference.

Perfectionism — the belief that it’s useless unless 100% accurate — is the greatest enemy of evidence. Referencing evidence directionally while being aware of precision limitations always leads to better decisions than deciding without evidence at all.

And the return on investment in research literacy has grown dramatically with AI’s emergence. A 20-hour investment enables citation verification; a 200-hour investment enables writing multiple evidence-based articles per month. You don’t have to aim for perfection. Starting within your current capabilities is the most important first step.

For more on this topic, see these related articles:

The True Value of AI: Multi-Dimensional Value Assessment Beyond Time Savings - A framework for evaluating AI value beyond time reduction
Blog Writing Guide for Engineers Who Struggle with Articulation - Practical methods for organizing thoughts through AI dialogue
The Truth Behind Experts’ “Seeming” AI Delegation - Meta-knowledge hidden in expert AI delegation
Effective Learning Methods Based on Scientific Evidence - The evidence foundation of learning science

References

References corresponding to citation numbers in the text are listed in numerical order at the end of this section, after the additional references.

Additional References (not cited by number in text)

Measuring Total Reading of Journal Articles - King, D.W. et al. (2006). D-Lib Magazine. Trends in researchers’ annual paper reading volume. [Reliability: High]
Scientists Reading Fewer Papers for First Time in 35 Years - Scientific American (2014). Changes in scientists’ reading patterns based on Tenopir & King’s research. [Reliability: Medium-High]
The impact of generative AI on academic reading and writing: a synthesis of recent evidence (2023–2025) - Frontiers in Education (2025). Synthesis review of AI’s impact on academic reading and writing. [Reliability: High]
Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations - Greenland, S. et al. (2016). European Journal of Epidemiology, 31(4), 337–350. Comprehensive guide on p-value misinterpretations. [Reliability: High]

On citation accuracy: The research cited in this article has been verified through the following methods:

Confirmation via academic databases (PubMed, Google Scholar, ScienceDirect, etc.)
Verification of paper information on official journal websites
Cross-verification through multiple independent sources (academic media, official institutional announcements, etc.)

For some papers, direct access to full-text PDFs may be restricted, but paper abstracts, DOIs, author information, and key findings have been confirmed through official academic databases and reliable secondary sources.

1. Electronic Journals and Changes in Scholarly Article Seeking and Reading Patterns - Tenopir, C. & King, D.W. (2008). D-Lib Magazine. Longitudinal study from 1977–2005 measuring researchers’ paper reading time trends. [Reliability: High]
2. Expert–Novice Comparison Reveals Pedagogical Implications for Students’ Analysis of Primary Literature - Nelms, A.A. & Segura-Totten, M. (2019). CBE—Life Sciences Education, 18(4). Expert-novice paper comprehension comparison based on cognitive load theory. [Reliability: High]
3. Experimental evidence on the productivity effects of generative artificial intelligence - Noy, S. & Zhang, W. (2023). Science, 381, 187–192. n=453, pre-registered RCT. [Reliability: High]
4. Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis - Chelli, M. et al. (2024). Journal of Medical Internet Research, 26, e53164. Analysis of 471 citations. [Reliability: High]
5. ChatGPT Hallucinates Non-existent Citations: Evidence from Economics - Buchanan, J., Hill, S. & Shapoval, O. (2024). The American Economist, 69(1), 80–87. Measurement of fabricated citation rates for GPT-3.5/4 in economics. [Reliability: High]
6. Misinterpretations of P-values and statistical tests persists among researchers and professionals working with statistics and epidemiology - Lytsy, P., Hartman, M. & Pingel, R. (2022). Upsala Journal of Medical Sciences. n=139 (75 doctoral students + 64 statisticians). [Reliability: High]
7. Mindless statistics - Gigerenzer, G. (2004). Journal of Socio-Economics, 33(5), 587–606. Survey on p-value misunderstandings and statistical education issues. [Reliability: High]
8. Framework for Information Literacy for Higher Education - Association of College and Research Libraries (2015). Information literacy framework based on six threshold concepts. [Reliability: High]
9. Advanced Research Methods Certificate - Texas A&M University. 12-credit research methods certificate program. 12–18 credits is standard across universities. [Reliability: Medium-High]
10. Accuracy Matters: A Cross-Market Assessment of Newspaper Error and Credibility - Maier, S.R. (2005). Journalism & Mass Communication Quarterly, 82(3), 533–551. Accuracy survey of 14 US newspapers, 4,800 articles. [Reliability: High]
11. Misconduct accounts for the majority of retracted scientific publications - Fang, F.C., Steen, R.G. & Casadevall, A. (2012). PNAS, 109(42), 17028–17033. Analysis of 2,047 retracted publications from PubMed. [Reliability: High]
12. Statistics literacy for non-statisticians - Udemy. Free statistics literacy course for non-statisticians. As an introduction to basic concepts. [Reliability: Medium]

This post is licensed under CC BY 4.0 by the author.