The Cost Structure of Evidence-Based Writing — The 100 Hours AI Saves and the 200 Hours It Can't
This article was generated by AI. The accuracy of the content is not guaranteed, and we accept no responsibility for any damages resulting from use of this article. By continuing to read, you agree to the Terms of Use.
- Target audience: Engineers interested in evidence-based information sharing
- Prerequisites: Basic experience with AI tools (ChatGPT, Claude, etc.)
- Reading time: 17 minutes
Overview
“With AI, you can write evidence-based articles without being able to read academic papers” — this expectation is half right and half wrong.
AI dramatically accelerates the search, comprehension, summarization, and citation generation of academic papers. It can compress work that took humans over 100 hours into just a few hours. However, verifying whether AI-generated citations are accurate, whether a paper is appropriate to cite in a given context, and whether effect size interpretations are valid — these still require research literacy on the human side.
This article breaks down the actual cost structure of writing evidence-based articles, clarifying what AI can and cannot accelerate. It then presents a realistic path forward: by abandoning the perfectionism of “it’s useless unless 100% accurate,” evidence becomes far more accessible.
The Actual Cost of a Single Article
Track Record with AI
This blog continuously publishes evidence-based articles using AI. Here are the results from the most recent six articles.
| Article | Lines | Citations | Mermaid diagrams | Estimated time |
|---|---|---|---|---|
| VS Code Native Tabs | 259 | 7 | 0 | 20–40 min |
| Japanese AI Dev Organizations | 371 | 25 | 3 | 30–60 min |
| Science of Education & Learning | 616 | 17 | 0 | 30–60 min |
| Claude Code Skill Delegation | 299 | 8 | 2 | 20–40 min |
| METR Study Limitations | 287 | 9 | 2 | 20–40 min |
| Review Transformation | 346 | 12 | 4 | 25–50 min |
6 articles total: estimated 2–5 hours (including both JP/EN versions)
Estimates for a Human Writing at Equivalent Quality
What if the same articles were written solo by a human with research literacy?
Let’s break down the time for each stage.
Research Phase
Using academic papers as evidence requires more than just reading abstracts.
According to Tenopir & King’s longitudinal study, which covers 1977–2005, researchers spend an average of 31–48 minutes reading a single paper [1]. However, these figures are for domain experts.
Research by Nelms & Segura-Totten (2019) demonstrates a decisive gap between expert and novice paper comprehension [2]. Experts possess complex schemas (knowledge structures) that, as explained by cognitive load theory, reduce the burden on working memory. Novices take several times longer to read the same paper, with shallower understanding.
Including the “search” phase for papers to cite, the actual time to finalize a single citation is:
| Stage | Expert | Non-expert |
|---|---|---|
| Paper search (skim 5–10 candidates) | 30–60 min | 1–3 hours |
| Close reading & comprehension | 30–60 min | 2–4 hours |
| Citation context judgment | 10–20 min | 30–60 min |
| Total (per citation) | 1–2.5 hours | 3.5–8 hours |
Article #2 (Japanese AI Dev Organizations)
25 citations, cross-disciplinary (economic policy, LLM architecture, organizational theory, cultural psychology):
| Stage | Estimate |
|---|---|
| Paper/report search (scanning 50+ candidates) | 8–12 hours |
| Close reading & comprehension (25 × roughly 0.6–1 hour) | 15–25 hours |
| Structure & writing (371 lines, 25 sections) | 6–10 hours |
| 3 mermaid diagram designs | 1–2 hours |
| English version creation | 3–4 hours |
| Total | 33–53 hours |
6 Articles Total
| Article | Human estimate |
|---|---|
| VS Code Tabs (primarily technical docs) | 5–8 hours |
| Japanese AI Dev Orgs (cross-disciplinary, 25 citations) | 33–53 hours |
| Science of Education (psychology meta-analysis, 616 lines) | 35–57 hours |
| Skill Delegation (primarily technical docs) | 7–12 hours |
| METR Study Critique (research methodology required) | 15–25 hours |
| Review Transformation (CS papers + industry reports) | 20–32 hours |
| Total | 115–187 hours |
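The total row can be reproduced by summing the per-article ranges. The following is a minimal sketch of that arithmetic; the 8-hour working day used for the business-day conversion is an assumption, and the helper name `sum_ranges` is illustrative only.

```python
import math

# Per-article human estimates (low, high) in hours, copied from the table above.
HUMAN_ESTIMATES_HOURS = {
    "VS Code Tabs": (5, 8),
    "Japanese AI Dev Orgs": (33, 53),
    "Science of Education": (35, 57),
    "Skill Delegation": (7, 12),
    "METR Study Critique": (15, 25),
    "Review Transformation": (20, 32),
}

HOURS_PER_WORKDAY = 8  # assumption: one business day = 8 working hours


def sum_ranges(ranges):
    """Add up (low, high) pairs component-wise."""
    ranges = list(ranges)
    low = sum(lo for lo, _ in ranges)
    high = sum(hi for _, hi in ranges)
    return low, high


total_low, total_high = sum_ranges(HUMAN_ESTIMATES_HOURS.values())
# Round up, since a partly used day is still a day on the calendar.
days_low = math.ceil(total_low / HOURS_PER_WORKDAY)
days_high = math.ceil(total_high / HOURS_PER_WORKDAY)

print(f"Total: {total_low}-{total_high} hours")
print(f"About {days_low}-{days_high} business days")
```

Summing lows with lows and highs with highs assumes the six estimates move together; if they were independent, the realistic spread of the total would be somewhat narrower.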
Cost Comparison
flowchart TB
subgraph AI["AI Pipeline"]
direction TB
A1["6 articles × JP/EN"]
A2["Total: 2–5 hours"]
A1 --> A2
end
subgraph Human["Human Solo"]
direction TB
H1["6 articles × JP/EN"]
H2["Total: 115–187 hours<br>(15–24 business days)"]
H1 --> H2
end
AI --> Ratio["~30–50x difference"]
Human --> Ratio
Roughly estimated, the AI pipeline produces evidence-based articles at 30–50x the speed of a solo human. (Note that the AI side is an estimate of actual working time, while the human side is an effort estimate. Neither is a precise measurement.)
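As a sanity check on that ratio, the sketch below divides the two hour ranges in three ways. The function name `speedup_bounds` and the choice of pairings are illustrative assumptions, not a method taken from the article's pipeline.

```python
# Totals quoted in this article, as (low, high) hours.
AI_HOURS = (2, 5)
HUMAN_HOURS = (115, 187)


def speedup_bounds(human, ai):
    """Return (narrow, wide) speedup ranges for two (low, high) hour estimates.

    narrow: pairs low with low and high with high (both efforts scale together).
    wide: pairs the extremes (fastest human vs slowest AI, and the reverse).
    """
    narrow = tuple(sorted((human[0] / ai[0], human[1] / ai[1])))
    wide = (human[0] / ai[1], human[1] / ai[0])
    return narrow, wide


narrow, wide = speedup_bounds(HUMAN_HOURS, AI_HOURS)
# Ratio of the two range midpoints, as a single representative value.
midpoint = (sum(HUMAN_HOURS) / 2) / (sum(AI_HOURS) / 2)

print(f"matched ends: {narrow[0]:.1f}x to {narrow[1]:.1f}x")
print(f"extremes:     {wide[0]:.1f}x to {wide[1]:.1f}x")
print(f"midpoints:    {midpoint:.1f}x")
```

Pairing matched ends gives roughly 37x to 58x, pairing the extremes gives 23x to 94x, and the midpoints give about 43x, so the quoted 30–50x is best read as an order-of-magnitude figure rather than a tight bound.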
This figure comes with an important caveat.
What AI Is Actually Shortening
In an experiment published in Science by Noy & Zhang (2023), 453 professionals performed writing tasks with ChatGPT, resulting in a 40% reduction in task time and 18% improvement in quality [3]. The effect was particularly large for lower-skilled participants, compressing productivity gaps.
What AI dramatically shortens is the following working time:
- Paper search: Comprehensively listing relevant papers from keywords
- Summarization & extraction: Presenting key points in structured form
- Translation: Converting English papers to Japanese, expanding articles to JP/EN
- Structuring: Logically organizing multiple sources
- Draft generation: Composing coherent text
These are fundamentally information processing tasks — AI’s strong suit.
Not Just “Writing” but “Scan Range” Is Compressed
An often-overlooked point: AI does more than shorten writing time. Its biggest impact is compressing the range of information a human has to scan.
When a human writes an evidence-based article from scratch, the typical process looks like:
- Search by keywords, listing 50+ candidate papers
- Skim 20–30 of those at the abstract level
- Close-read the most relevant 10–15
- End up citing only 5–10
In this process, the time spent reading 40+ papers that weren’t cited doesn’t directly show in the final output. Of course, the act of judging “not relevant” has value, but the bulk of effort goes into this elimination work.
When verifying AI-generated output, the work is fundamentally different:
- Confirm that the 10 citations AI chose actually exist via DOI/databases
- Verify that cited content matches the original paper’s claims
- Judge whether the citation is appropriate in context
The required skill level is the same, but the scan range shrinks to less than 1/5. This is the structural reality of “AI can’t write papers perfectly, but verifying its output is overwhelmingly faster.”
As a practical technique, you can also have AI maintain a list of “papers consulted but not cited during research.” This lets you check why AI didn’t select specific papers and catch potential oversights.
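One way to keep such a list is a small structured log that the AI fills in during research. The sketch below is only an illustration: the `Candidate` fields, the helper names, and the sample entries are all hypothetical placeholders, not real papers or a format this blog is known to use.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One paper the AI looked at during research (all fields illustrative)."""
    title: str
    doi: str
    cited: bool
    reason: str  # why it was cited or set aside, as reported by the AI


def uncited(log):
    """Return the candidates that were consulted but left out of the article."""
    return [c for c in log if not c.cited]


def scan_ratio(log):
    """Share of the candidate pool that a human verifier must actually check."""
    cited_count = sum(1 for c in log if c.cited)
    return cited_count / len(log)


# Placeholder entries with a dummy DOI prefix; these are not real papers.
log = [
    Candidate("Example paper A", "10.0000/example.a", True, "directly measures the claim"),
    Candidate("Example paper B", "10.0000/example.b", False, "different population"),
    Candidate("Example paper C", "10.0000/example.c", False, "preprint superseded by A"),
    Candidate("Example paper D", "10.0000/example.d", False, "off topic"),
    Candidate("Example paper E", "10.0000/example.e", False, "sample too small"),
]

for c in uncited(log):
    print(f"{c.title}: {c.reason}")
print(f"verification scan ratio: {scan_ratio(log):.2f}")
```

Reviewing the `reason` field for the set-aside papers is where a human can spot an exclusion that looks wrong, without rereading the whole candidate pool.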
What AI Can’t Skip
The Verification Wall
The problem is that humans need the ability to verify the citations and claims AI generates.
In a study published in the Journal of Medical Internet Research, Chelli et al. (2024) measured hallucination rates of LLM-generated academic citations [4]:
| Model | Hallucination rate |
|---|---|
| GPT-3.5 | 39.6% |
| GPT-4 | 28.6% |
| Bard | 91.4% |
Additionally, Buchanan et al. (2024) found in their economics domain verification that even GPT-4 produced over 20% fabricated citations [5]. The rate of fabricated citations increased significantly when prompts shifted from general topics to specific questions.
In other words, when AI writes “according to this paper…”, more than 1 in 5 of those citations may point to a paper that doesn’t exist.
However, note that these are figures for single-shot generation.
Hallucination Rates Can Be Reduced — But Not to Zero
The studies above all measured results from generating citations with an LLM just once. In practice, building a multi-stage pipeline where AI reviews its own output can dramatically lower the effective hallucination rate.
For example:
- AI generation: Output article and citations (hallucination rate: 20–40% at this stage)
- AI verification: Use a separate prompt (or a different model) to check “do these citations exist?” and “are claims consistent with citations?”
- Automated verification: Mechanically confirm paper existence by resolving each DOI or querying a bibliographic database (Google Scholar offers no official API, so a DOI registry such as Crossref is the more practical target)
- Human meta-review: Final human judgment
This blog uses a three-stage process: generation → review → fact-check. The single-shot hallucination rate doesn’t carry through unchanged to the final output.
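To see why stacking stages lowers the rate without eliminating it, consider a toy model in which each stage removes a fixed share of the remaining fabricated citations. The catch rates below are hypothetical numbers chosen purely for illustration; no study cited here measures them, and the independence assumption is a simplification.

```python
def residual_rate(single_shot_rate, catch_rates):
    """Fabricated-citation rate left after each stage removes its share.

    Assumes stages act independently, which real pipelines may not satisfy:
    a reviewing model can share the generator's blind spots.
    """
    rate = single_shot_rate
    for catch in catch_rates:
        rate *= (1.0 - catch)
    return rate


# Single-shot range reported by the studies cited above.
SINGLE_SHOT = (0.20, 0.40)
# HYPOTHETICAL catch rates for AI review and automated lookup; not measured values.
CATCH_RATES = [0.70, 0.90]

for p in SINGLE_SHOT:
    remaining = residual_rate(p, CATCH_RATES)
    print(f"single-shot {p:.0%} -> residual {remaining:.1%}")
```

Under these made-up rates the residual falls to around 1%, but it stays above zero for any catch rate below 100%.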
The key point is that even with multiple stages, it won’t reach zero. There’s always the risk that AI generates a fabricated paper and another AI validates it as “correct.” This is precisely why human verification at the final stage — especially DOI searches and database existence checks — remains essential.
The capabilities needed for this final verification:
- Confirming paper existence (DOI search, database cross-referencing)
- Judging whether cited content matches the original paper’s claims
- Evaluating whether citing the paper in a given context is appropriate
- Verifying that effect sizes and statistical indicators are correctly interpreted
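The first capability, confirming existence, is the easiest to automate. The sketch below checks DOI syntax, looks the DOI up through the public Crossref REST API, and compares the registered title with the one the AI claimed. The function names, the 0.8 similarity threshold, and the status labels are assumptions made for illustration. A miss is also not proof of fabrication, because some DOIs are registered with agencies other than Crossref (DataCite, for example).

```python
import json
import re
import urllib.error
import urllib.parse
import urllib.request
from difflib import SequenceMatcher

# Loose syntax check for modern DOIs ("10.<registrant>/<suffix>").
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def fetch_crossref_title(doi):
    """Look up a DOI in the Crossref REST API; return its title or None."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            record = json.load(resp)
    except urllib.error.HTTPError:
        return None  # e.g. 404 when Crossref does not know the DOI
    titles = record["message"].get("title") or []
    return titles[0] if titles else None


def check_citation(doi, claimed_title, fetch_title=fetch_crossref_title, threshold=0.8):
    """Classify a citation as 'malformed', 'not_found', 'title_mismatch', or 'ok'."""
    if not DOI_PATTERN.match(doi):
        return "malformed"
    actual = fetch_title(doi)
    if actual is None:
        return "not_found"
    similarity = SequenceMatcher(None, claimed_title.lower(), actual.lower()).ratio()
    return "ok" if similarity >= threshold else "title_mismatch"


# Offline demonstration with a stub lookup (placeholder DOI, not a real paper).
stub = {"10.0000/example.a": "Example Paper A"}.get
print(check_citation("10.0000/example.a", "Example paper A", fetch_title=stub))
print(check_citation("10.0000/missing", "Ghost paper", fetch_title=stub))
```

Passing the lookup function as a parameter keeps the check testable offline and lets a second registry be swapped in. The remaining three capabilities in the list still need a human reader.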
The Statistical Literacy Wall
Statistics comprehension is particularly critical.
A study by Lytsy et al. (2022) published in the Upsala Journal of Medical Sciences is striking [6]. When doctoral students and statisticians/epidemiologists were asked about p-value interpretation:
- Only 10.7% of doctoral students answered correctly
- Even among statisticians and epidemiologists, only 12.5%
Even among people who work with statistics professionally, only about one in eight interpreted the p-value correctly.
A survey by Haller & Krauss (2002), cited in Gigerenzer (2004), showed similar results [7]. Among 44 psychology students, zero answered all questions correctly. Among 39 faculty not teaching statistics, only 4 did, and among 30 faculty teaching statistics (professors, lecturers, and TAs), only 6 could correctly answer all p-value questions.
When AI cites “significant at p < 0.05” from a paper, judging whether that claim is contextually valid requires understanding the correct interpretation of p-values, the relationship to effect sizes, and the influence of sample size. This judgment is difficult to delegate to AI.
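The interaction between p-values, effect sizes, and sample size can be made concrete with a short calculation. The sketch below holds a tiny effect fixed and varies only the sample size. The means, standard deviation, and group sizes are hypothetical, and the normal approximation with a known common SD is a simplification of what real papers report.

```python
import math


def two_sample_z(mean_diff, sd, n_per_group):
    """Return (Cohen's d, two-sided p) for two equal-sized groups.

    Uses a z-test with a known common SD, which is a simplification;
    real studies usually report t-tests or regression models.
    """
    d = mean_diff / sd
    z = d * math.sqrt(n_per_group / 2)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return d, p


# Same hypothetical effect (d = 0.05, far below the conventional "small" 0.2),
# evaluated at three sample sizes.
for n in (50, 500, 10_000):
    d, p = two_sample_z(mean_diff=0.5, sd=10.0, n_per_group=n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n={n:>6} per group: d={d:.2f}, p={p:.4f} ({verdict})")
```

The effect size never changes, yet the result crosses the p < 0.05 line once the sample is large enough. A citation that reports only significance therefore says little about whether the effect matters in practice.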
The Cost of Acquiring Research Literacy
Three Levels
Research literacy is a skill that builds incrementally — there are no shortcuts.
flowchart TB
L0["Level 0: AI + Common Sense<br>No investment needed"]
L1["Level 1: Can Find Papers<br>20–40 hours"]
L2["Level 2: Can Read Papers<br>+100–200 hours"]
L3["Level 3: Can Cite Correctly<br>+100–200 hours"]
L0 --> L1
L1 --> L2
L2 --> L3
L3 --> Total["Total: 220–440 hours"]
Level 0 is the starting point. Ask AI questions and judge direction by the consistency of multiple responses. No learning investment is needed, but you can’t detect fabricated citations or statistical misuse. Most people are here without realizing it.
Level 1: Can “Find” Papers (20–40 hours)
- Using academic databases: Google Scholar, PubMed, IEEE Xplore, etc.
- Judging quality by citation count and journal impact factor
- Distinguishing preprints (arXiv, etc.) from peer-reviewed papers
- Choosing appropriate search keywords
Engineers already have strong search skills, so learning what to search for enables relatively quick progress.
Level 2: Can “Read” Papers (cumulative 120–240 hours)
This is the biggest hurdle. ACRL’s “Framework for Information Literacy for Higher Education” identifies six threshold concepts required for information literacy acquisition [8].
Key learning items:
- Paper structure (IMRaD format) and efficient reading techniques
- Statistical literacy: p-values, effect sizes (Cohen’s d, r, odds ratios), confidence intervals, the difference between statistical significance and practical significance — this alone requires one introductory textbook (40–60 hours)
- Reading meta-analyses: heterogeneity (I²), publication bias (funnel plots), forest plot interpretation
- Evaluating research design: evidence level differences among RCTs, quasi-experiments, observational studies, and qualitative research
For reference, US graduate programs require 12–18 credits for a research methods certificate [9]. The “Level 2” envisioned here corresponds to a basic subset of such graduate coursework; completing the entire program is not necessary.
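As an example of what reading a meta-analysis involves, the heterogeneity statistic I² listed above can be computed from per-study effect sizes and variances. This is a minimal sketch using the standard definitions of Cochran's Q and I²; the two sets of study results are hypothetical values invented for illustration.

```python
def i_squared(effects, variances):
    """Return (Cochran's Q, I^2 in percent) using inverse-variance weights."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2 = share of total variation attributed to between-study differences,
    # truncated at zero when Q falls below its degrees of freedom.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2


# HYPOTHETICAL effect sizes from four studies with equal variance.
VARIANCES = [0.01, 0.01, 0.01, 0.01]
consistent = [0.30, 0.32, 0.28, 0.31]
conflicting = [0.10, 0.80, -0.20, 0.60]

for label, effects in (("consistent", consistent), ("conflicting", conflicting)):
    q, i2 = i_squared(effects, VARIANCES)
    print(f"{label}: Q={q:.2f}, I^2={i2:.1f}%")
```

Both sets have a positive pooled effect, but in the conflicting set almost all of the variation comes from disagreement between studies, which is the situation where a single summary number is least trustworthy.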
Level 3: Can “Cite Correctly” (cumulative 220–440 hours)
- Contextual judgment: Evaluating whether research findings can be applied to a different context
- Stating limitations: Distinguishing “has been proven” from “results have been reported”
- Avoiding secondary citations: The habit of checking original sources rather than relying on secondary sources
- Searching for counterevidence: The intellectual honesty to seek and acknowledge contradicting research, not just papers supporting your position
This level is about “judgment” rather than “knowledge,” and can only be developed through practice.
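The cumulative figures in the level headings follow directly from the per-level increments. The sketch below reproduces them and converts the totals into calendar time. The study pace of 5 hours per week is an assumption added for illustration, not a figure from the sources.

```python
from itertools import accumulate

# Incremental study time per level as (name, low, high) hours, from the text above.
INCREMENTS = [("Level 1", 20, 40), ("Level 2", 100, 200), ("Level 3", 100, 200)]

HOURS_PER_WEEK = 5  # assumption: study pace alongside a full-time job

lows = list(accumulate(lo for _, lo, _ in INCREMENTS))
highs = list(accumulate(hi for _, _, hi in INCREMENTS))

for (name, _, _), lo, hi in zip(INCREMENTS, lows, highs):
    weeks_lo = lo / HOURS_PER_WEEK
    weeks_hi = hi / HOURS_PER_WEEK
    print(f"{name}: cumulative {lo}-{hi} hours, about {weeks_lo:.0f}-{weeks_hi:.0f} weeks")
```

At that pace Level 1 is reachable in a couple of months, while Level 3 takes on the order of one to two years.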
The Structural Problem of Self-Study
In graduate school, feedback comes from advisors and peer review. When self-studying, there’s no one to point out misunderstandings — a structural problem.
As the Lytsy et al. (2022) results show, only about one in ten doctoral students interpreted p-values correctly. Without formal education, misunderstandings risk persisting even longer.
The Trap of “Useless Unless 100% Accurate”
After reading this far, you might feel that “evidence is useless without a 220–440 hour learning investment.” But this is the trap of perfectionism.
Before questioning AI’s accuracy, how accurate are “human-written articles” — the comparison target?
Is “Written by Humans” Actually Accurate?
In a large-scale survey covering 14 US newspapers and 4,800 articles, Maier (2005) found that 48% of newspaper articles contained factual errors [10]. Articles containing any kind of error reached 61%. The factual error rate has not meaningfully improved in the roughly 70 years since the first such survey in 1936, which found a rate of approximately 50%. These figures describe published articles, which had already passed through editors’ checks and proofreading.
In scientific papers as well, an analysis by Fang et al. (2012) published in PNAS found that 67.4% of retractions were attributable to misconduct (fraud or suspected fraud 43.4%, duplicate publication 14.2%, plagiarism 9.8%) [11]. Retractions due to honest error accounted for only 21.3%.
The critical point here is fairness of comparison. The 20–40% cited as AI hallucination rates are single-shot generation figures — without review or correction. However, in AI-assisted article creation, a pipeline of generation → AI review → fact-check is the de facto standard. Human articles also go through editorial processes before publication. For a fair comparison, both should be compared at the quality of their published output.
| Target | Process | Reported figure for published output |
|---|---|---|
| Human newspaper articles | Reporter → Editor → Proofreader | 48% contain factual errors; 61% contain an error of any kind |
| Human academic papers | Author → Peer review → Publication | 67% of retractions attributed to misconduct (a share of retractions, not an error rate) |
| AI articles (single-shot) | Generation only | 20–40% (citation hallucinations) |
| AI articles (multi-stage pipeline) | Generation → AI review → Verification | No quantitative data (expected to be lower than single-shot) |
What’s also frequently overlooked is bias through deference (sontaku). Human articles can contain selective omission of information driven by sponsor considerations, organizational dynamics, or political positions. This doesn’t appear in error rate surveys because it isn’t “factual error,” but it’s equally or more harmful in distorting readers’ judgment.
AI has no incentive for this kind of intentional omission. Of course, if a human instructs “write an article favorable to this product,” AI will write biased content too — but that’s human bias manifesting through AI, not a problem inherent to AI. The errors AI spontaneously generates are random hallucinations, fundamentally different in nature from systematic bias. Random errors are easier to detect; systematic omissions are hard to catch.
“AI makes mistakes” is a valid criticism. But by the figures above, the human comparison target also errs at substantial rates, and sometimes in forms that are harder to detect.
Required Accuracy Varies by Use Case
That said, required accuracy differs fundamentally by use case.
flowchart TB
subgraph High["High Accuracy Required"]
direction TB
H1["Academic paper writing<br>Conducting meta-analyses"]
H2["Required literacy: Level 3<br>220–440 hours"]
H1 --> H2
end
subgraph Mid["Moderate Accuracy"]
direction TB
M1["Evidence-based<br>blog articles"]
M2["Required literacy: Level 1–2<br>20–240 hours"]
M1 --> M2
end
subgraph Low["Directional Correctness Sufficient"]
direction TB
L1["Career decisions<br>Technology selection"]
L2["Required literacy: Level 0–1<br>AI + common sense"]
L1 --> L2
end
Writing academic papers requires correctly understanding effect size interpretation and meta-analysis methodology. Level 3 literacy is essential for this.
However, for decisions like “Should we adopt AI coding tools?” or “How should I think about my career direction?”, what matters is whether the direction is correct.
“Noy & Zhang (2023) reported a 40% reduction in task time and 18% quality improvement with AI writing tools” — whether the effect size is precise to the decimal point doesn’t matter here. Knowing the direction that “AI writing tools produce significant productivity gains” is sufficient as decision-making input.
Evidence “Usage” Also Has Gradients
| Use case | Required accuracy | Required investment |
|---|---|---|
| Meta-analysis / academic papers | Precise effect size interpretation, research design evaluation | 220–440 hours |
| Evidence-based blog articles | Claim-citation consistency, basic statistical understanding | 20–240 hours |
| Technology selection / decision reference | Directional validity, no major contradictions | AI + common sense |
| Personal learning / career decisions | Directional understanding, multi-source consistency | AI + common sense |
What matters is being aware of which accuracy level you’re operating at. Treating AI-generated research results as precise meta-analysis is dangerous. But judging that “multiple studies point in the same direction, so the direction is probably valid” is rational.
Perfectionism Blocks Evidence Use
“If you can’t read papers accurately, you shouldn’t engage with evidence” — this mindset ultimately justifies decisions made without evidence.
In reality:
- Referencing evidence even just for direction is better than deciding without evidence
- Even without 100% accuracy, simply knowing that “multiple studies show productivity gains” is superior to deciding based solely on the impressions of people around you
- If you’re aware of the precision limitations, even imperfect evidence has substantial value
Note that this blog itself is a practical example of this philosophy. Articles go through an AI generation → AI review → fact-check pipeline, but since the creator’s learning is the primary purpose, academic-paper-level complete verification is not performed. The approach prioritizes directional correctness and uses evidence within practical limits.
Where Literacy Still Matters
Having abandoned perfectionism, there are still situations where research literacy becomes important.
When You Can’t Take AI Citations at Face Value
As mentioned, during single-shot generation, LLMs fabricate papers with a 20–40% probability. In the following situations, verification ability is essential:
- Published article citations — You have a responsibility to make them reader-verifiable
- Organizational decision-making evidence — When directional errors carry large costs
- Fields with conflicting research — Risk that AI presents only one perspective
Incremental Investment Is Realistic
A graduated approach that deepens as needed is most efficient. Following Levels 0–3 described above, here are concrete first steps.
| Level | Action you can take today |
|---|---|
| 0 (AI + common sense) | Ask the same question to multiple AIs and judge direction by response consistency |
| 1 (Search and select) | Build a habit of checking paper existence and DOI on Google Scholar |
| 2 (Reading and evaluation) | Start with Udemy’s free statistics literacy course [12] and practice by verifying AI output |
| 3 (Academic level) | Invest when needed. No need to aim for this from the start |
AI Itself Accelerates Learning
Paradoxically, dialogue with AI can also accelerate research literacy acquisition.
- Have it explain paper structure
- Ask for intuitive explanations of statistical concepts
- Discuss “Is it appropriate to cite this paper in this context?”
However, final verification must use AI-independent means like database existence checks and DOI searches. Asking AI “Is this citation correct?” may just get you an AI confirming a fabricated paper as “correct.”
Summary
Organizing the cost structure of evidence-based writing:
What AI can shorten (working time):
- Paper search & summarization: hours → minutes
- Structuring & draft generation: days → tens of minutes
- Multilingual expansion: hours → minutes
- Total: 30–50x productivity improvement
What AI can’t skip (learning time):
- Ability to find papers: 20–40 hours
- Ability to read papers: 100–200 hours
- Judgment to cite correctly: 100–200 hours
- Total: 220–440 hours of learning investment
However, 220–440 hours aren’t always necessary.
To begin with, even among human newspaper articles that went through editorial processes, 48% contain factual errors and 61% contain errors of any kind. Comparing fairly against AI’s post-pipeline quality, the premise that “human-written means accurate” doesn’t hold.
If you’re using evidence as reference for career decisions or technology selection, using AI-presented evidence as “directional indicators” is sufficient. If writing blog articles, being able to filter out fabricated AI citations with Level 1 literacy (20–40 hours) alone makes a significant difference.
Perfectionism — the belief that it’s useless unless 100% accurate — is the greatest enemy of evidence. Referencing evidence directionally while being aware of precision limitations always leads to better decisions than deciding without evidence at all.
And the return on investment in research literacy has grown dramatically with AI’s emergence. A 20-hour investment enables citation verification; a 200-hour investment enables writing multiple evidence-based articles per month. You don’t have to aim for perfection. Starting within your current capabilities is the most important first step.
Related Articles
For more on this topic, see these related articles:
- The True Value of AI: Multi-Dimensional Value Assessment Beyond Time Savings - A framework for evaluating AI value beyond time reduction
- Blog Writing Guide for Engineers Who Struggle with Articulation - Practical methods for organizing thoughts through AI dialogue
- The Truth Behind Experts’ “Seeming” AI Delegation - Meta-knowledge hidden in expert AI delegation
- Effective Learning Methods Based on Scientific Evidence - The evidence foundation of learning science
References
References corresponding to citation numbers in the text are listed in numerical order at the end of this section, after the additional references.
Additional References (not cited by number in text)
Measuring Total Reading of Journal Articles - King, D.W. et al. (2006). D-Lib Magazine. Trends in researchers’ annual paper reading volume. [Reliability: High]
Scientists Reading Fewer Papers for First Time in 35 Years - Scientific American (2014). Changes in scientists’ reading patterns based on Tenopir & King’s research. [Reliability: Medium-High]
The impact of generative AI on academic reading and writing: a synthesis of recent evidence (2023–2025) - Frontiers in Education (2025). Synthesis review of AI’s impact on academic reading and writing. [Reliability: High]
Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations - Greenland, S. et al. (2016). European Journal of Epidemiology, 31(4), 337–350. Comprehensive guide on p-value misinterpretations. [Reliability: High]
On citation accuracy: The research cited in this article has been verified through the following methods:
- Confirmation via academic databases (PubMed, Google Scholar, ScienceDirect, etc.)
- Verification of paper information on official journal websites
- Cross-verification through multiple independent sources (academic media, official institutional announcements, etc.)
For some papers, direct access to full-text PDFs may be restricted, but paper abstracts, DOIs, author information, and key findings have been confirmed through official academic databases and reliable secondary sources.
[1] Electronic Journals and Changes in Scholarly Article Seeking and Reading Patterns - Tenopir, C. & King, D.W. (2008). D-Lib Magazine. Longitudinal study from 1977–2005 measuring researchers’ paper reading time trends. [Reliability: High]
[2] Expert–Novice Comparison Reveals Pedagogical Implications for Students’ Analysis of Primary Literature - Nelms, A.A. & Segura-Totten, M. (2019). CBE—Life Sciences Education, 18(4). Expert-novice paper comprehension comparison based on cognitive load theory. [Reliability: High]
[3] Experimental evidence on the productivity effects of generative artificial intelligence - Noy, S. & Zhang, W. (2023). Science, 381, 187–192. n=453, pre-registered RCT. [Reliability: High]
[4] Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis - Chelli, M. et al. (2024). Journal of Medical Internet Research, 26, e53164. Analysis of 471 citations. [Reliability: High]
[5] ChatGPT Hallucinates Non-existent Citations: Evidence from Economics - Buchanan, J., Hill, S. & Shapoval, O. (2024). The American Economist, 69(1), 80–87. Measurement of fabricated citation rates for GPT-3.5/4 in economics. [Reliability: High]
[6] Misinterpretations of P-values and statistical tests persists among researchers and professionals working with statistics and epidemiology - Lytsy, P., Hartman, M. & Pingel, R. (2022). Upsala Journal of Medical Sciences. n=139 (75 doctoral students + 64 statisticians). [Reliability: High]
[7] Mindless statistics - Gigerenzer, G. (2004). Journal of Socio-Economics, 33(5), 587–606. Survey on p-value misunderstandings and statistical education issues. [Reliability: High]
[8] Framework for Information Literacy for Higher Education - Association of College and Research Libraries (2015). Information literacy framework based on six threshold concepts. [Reliability: High]
[9] Advanced Research Methods Certificate - Texas A&M University. 12-credit research methods certificate program. 12–18 credits is standard across universities. [Reliability: Medium-High]
[10] Accuracy Matters: A Cross-Market Assessment of Newspaper Error and Credibility - Maier, S.R. (2005). Journalism & Mass Communication Quarterly, 82(3), 533–551. Accuracy survey of 14 US newspapers, 4,800 articles. [Reliability: High]
[11] Misconduct accounts for the majority of retracted scientific publications - Fang, F.C., Steen, R.G. & Casadevall, A. (2012). PNAS, 109(42), 17028–17033. Analysis of 2,047 retracted publications from PubMed. [Reliability: High]
[12] Statistics literacy for non-statisticians - Udemy. Free statistics literacy course for non-statisticians. As an introduction to basic concepts. [Reliability: Medium]