
Why ADRs Are Being Reappraised in the AI Era: AI Manages the Ledger, Humans Verify the Why

  • Audience: Readers who finished How to Use ADRs in DDD Development and now want to understand the reasoning and the evidence behind “why ADRs, in the age of AI.” This is for people interested in the background, the structure, and the evidence—not the step-by-step practice.
  • Prerequisites: You know what an ADR (Architecture Decision Record) is. A grasp of basic DDD concepts helps.
  • Reading time: about 18 min

Overview

How to use ADRs in DDD development is covered as a procedure in a separate article (the practical guide). This article is the backstory—an examination of why ADRs are being reappraised right now, specifically in the context of AI-assisted development: the structure, the evidence, and the limits.

There are three key points. (1) Paper ADRs tended to be documents that “got written but never read”—but AI reads them. For the first time, the reasons behind decisions are reliably referenced. (2) AI doesn’t just reference ADRs; it can run the management itself—filing them, generating first drafts, handling supersede chains, checking consistency. (3) But once AI starts mass-producing decisions, human verification becomes the bottleneck. That’s why AI needs to write ADRs in a form that’s easy to review.

And the heart of this article comes at the end: the strength of the evidence. Most of the support for these claims is either very recent 2026 preprints or analogies from adjacent areas like PR descriptions and code review. Research that directly and at scale validates ADRs-plus-AI is still thin. This article is a design argument (“running it this way is coherent”), not a report of empirically demonstrated effects. I want to draw that line up front.

The “unread document” gets flipped

ADRs have long been considered, in DORA terms, “an elite-team practice.” Their value is acknowledged, but they were often left sitting unread after being written. Recording the reasons behind a decision is the right instinct—but a record that’s never referenced is just a dead file.

AI flips this dynamic. A framing discussed on Zenn puts it crisply: “paper ADRs were often left unread, but AI reads all of them and answers instantly” [1]. The decision ledger that humans skimmed past, AI references reliably. Ask “why did we go with X?” and the AI pulls up the relevant ADR and answers. The same article also says: “a design decision truly comes alive not when it is written, but when it is referenced afterward.” And AI performs that referencing far more dutifully than humans do.

Chris Swan likewise predicts that ADRs will become boilerplate, that is, standard equipment, when using AI coding assistants [2]. An ADR is an ideal format for an LLM: “structured enough to capture the key points, but still natural language.” He further notes that in agent swarm development, where multiple AI agents are coordinated like a team, several actors need to share the same decision ledger, which is precisely the scenario ADRs were originally designed to support.

But “reads all of them” isn’t literal

Here, a caveat is already required. The rhetoric of “AI reads all of them” is not technically accurate. A study evaluating automated ADR generation with LLMs [3] reports that handing over all past ADRs (All-History) is not the best approach: narrowing to the most recent 3–5 strikes the best balance of quality and efficiency, and an elaborate retrieval-based selection (RAFG) produced no statistically significant difference in ordinary operation. What dominates is not model size but what you supply as context (context engineering).

So the correct reading of “AI reads them” is: “unlike a human, it doesn’t leave them sitting—it always references what’s needed.” Loading everything into context on every turn is not optimal. From here, the operating picture follows.
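
As a rough illustration of the "most recent 3–5" idea, the following minimal sketch picks the latest few ADRs as generation context instead of the whole history. The `Adr` record, its field names, and the `select_recent_context` helper are hypothetical names invented for this example; the cited study does not prescribe this code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Adr:
    # Hypothetical minimal ADR record used only for this sketch.
    number: int
    title: str
    decided_on: date
    body: str


def select_recent_context(adrs: list[Adr], k: int = 5) -> list[Adr]:
    """Return the k most recently decided ADRs, ordered oldest first."""
    if k < 1:
        raise ValueError("k must be at least 1")
    newest_first = sorted(adrs, key=lambda a: (a.decided_on, a.number), reverse=True)
    return list(reversed(newest_first[:k]))


# Eight illustrative ADRs decided on consecutive days.
ledger = [
    Adr(n, f"Decision {n}", date(2026, 1, n), f"Body of ADR {n}") for n in range(1, 9)
]
context = select_recent_context(ledger, k=3)
print([a.number for a in context])  # [6, 7, 8]
```

Keeping `k` as a parameter leaves room to tune the window; the 3–5 range is the study's finding, not a constant this sketch validates.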

Push and pull: how to feed AGENTS.md vs. ADRs

The text an AI reads comes in two kinds with different natures.

flowchart TB
    AGENTS["AGENTS.md / CLAUDE.md<br>current constraints (What)"]
    ADR["ADR ledger<br>the why & rejected options (Why)"]
    AI["AI coding"]
    AGENTS -->|"push: supplied automatically every turn"| AI
    ADR -->|"pull: pulled only when needed"| AI

AGENTS.md (or CLAUDE.md) is push-type. Place it at the repo root and it loads automatically every turn, continuously feeding the AI the “current constraints (What)”—the forbidden terms in the ubiquitous language, the invariants, and so on.

ADRs are pull-type. You don’t supply them constantly; you pull in “why we did it that way (Why)” only when it’s needed. As noted above, continuously pushing every ADR is not the best approach for either quality or efficiency [3].

Mixing the two causes accidents. Write the decision history into AGENTS.md and the file bloats, squeezing context; conversely, write current constraints into ADRs and they can’t keep pace with the code and go stale. Present-tense constraints get pushed; the reasons behind decisions get pulled—this separation is the foundation of ADR practice in the AI era.
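
The separation can be sketched as a small prompt builder: the constraints text is always included, while ADR text is added only when the caller names the ADRs it needs. This is an illustration under assumptions, not how any particular coding agent assembles its context; `build_prompt`, the section headings, and the sample ADR IDs and texts are all made up for the example.

```python
def build_prompt(
    task: str,
    agents_md: str,
    adr_index: dict[str, str],
    needs_why: list[str],
) -> str:
    """Assemble a prompt: constraints are always pushed, ADRs are pulled by ID."""
    parts = ["## Current constraints (pushed every turn)", agents_md.strip()]
    # Pull only the ADRs explicitly requested for this task.
    pulled = [adr_index[adr_id] for adr_id in needs_why if adr_id in adr_index]
    if pulled:
        parts.append("## Decision background (pulled for this task)")
        parts.extend(text.strip() for text in pulled)
    parts += ["## Task", task.strip()]
    return "\n\n".join(parts)


# Hypothetical contents, for illustration only.
agents_md = "Use the term 'Order', never 'Purchase'."
adr_index = {
    "ADR-0007": "ADR-0007: Order uses event sourcing; plain CRUD was rejected.",
    "ADR-0012": "ADR-0012: Payments call the gateway synchronously.",
}

routine = build_prompt("Rename a local variable.", agents_md, adr_index, needs_why=[])
design = build_prompt("Why is Order event-sourced?", agents_md, adr_index, needs_why=["ADR-0007"])
print(design)
```

The routine task carries only the constraints, while the design question additionally carries the single ADR it needs, so the ledger never inflates the everyday prompt.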

AI “manages” the ledger: how far has it come?

AI’s role doesn’t stop at reading (referencing) ADRs. We’ve reached the point where AI can run the management of the ledger itself.

  • Filing and first-draft generation: A workflow that feeds a PR’s diff (the changes, the commit messages) to an LLM, generates an ADR draft, and commits it back to the same branch is already implemented [4].
  • Standardization and metadata: An extension proposal called Agent Decision Records (AgDR), which records autonomous decisions made by AI agents, sets the author to the AI and makes the model identifier, session ID, and timestamp required metadata, automated via hooks or skills [5].
  • Division-of-labor agents: A single monolithic prompt is insufficient because of context limits and the dispersed nature of architectural knowledge. AgenticAKM, which coordinates specialized agents for extraction, retrieval, generation, and verification, reports that it generated better ADRs than a naive single prompt in a user study across 29 repositories [6].
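
To make the "required metadata" point concrete, here is a minimal sketch of a check a hook could run on an agent-authored record. The key names (`author`, `model`, `session_id`, `timestamp`) and the ISO 8601 timestamp requirement are assumptions made for illustration; they are not taken from the AgDR proposal itself.

```python
from datetime import datetime

# Hypothetical key names; the real AgDR field names may differ.
REQUIRED_FIELDS = ("author", "model", "session_id", "timestamp")


def metadata_problems(record: dict[str, str]) -> list[str]:
    """Return a list of problems found in an agent-authored record's metadata."""
    problems = [
        f"missing: {name}"
        for name in REQUIRED_FIELDS
        if not record.get(name, "").strip()
    ]
    stamp = record.get("timestamp", "").strip()
    if stamp:
        try:
            datetime.fromisoformat(stamp)
        except ValueError:
            problems.append("timestamp is not ISO 8601")
    return problems


record = {
    "author": "ai-agent",
    "model": "example-model-1",  # placeholder identifier
    "session_id": "session-42",  # placeholder identifier
    "timestamp": "2026-02-01T09:30:00+00:00",
}
print(metadata_problems(record))  # []
```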

At this point, an ADR shifts from “a document that gets neglected because it’s a chore to write” to “a ledger that AI maintains automatically.” Routine work—filing, numbering, the cross-links of supersedes, consistency checks—is faster and more complete when AI does it.
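
One of those routine checks, the consistency of supersede links, can be sketched in a few lines. The ledger shape used here (ADR number mapped to a `status` string and an optional `supersedes` number) is a hypothetical simplification; real ADR tooling may store these links differently.

```python
def check_supersedes(ledger: dict[int, dict]) -> list[str]:
    """Report broken or inconsistent supersede links in a ledger.

    ledger maps an ADR number to {"status": str, "supersedes": int or None}.
    """
    errors = []
    for number, adr in sorted(ledger.items()):
        target = adr.get("supersedes")
        if target is None:
            continue
        if target not in ledger:
            errors.append(f"ADR-{number:04d} supersedes missing ADR-{target:04d}")
        elif ledger[target]["status"] != "superseded":
            errors.append(
                f"ADR-{target:04d} is superseded by ADR-{number:04d} "
                f"but has status {ledger[target]['status']!r}"
            )
    # A record marked superseded should be pointed at by some newer record.
    targets = {a["supersedes"] for a in ledger.values() if a.get("supersedes") is not None}
    for number, adr in sorted(ledger.items()):
        if adr["status"] == "superseded" and number not in targets:
            errors.append(f"ADR-{number:04d} is marked superseded but nothing supersedes it")
    return errors


ledger = {
    1: {"status": "superseded", "supersedes": None},
    2: {"status": "accepted", "supersedes": 1},
    3: {"status": "accepted", "supersedes": 2},  # ADR 2 was never re-labelled
    4: {"status": "accepted", "supersedes": 9},  # dangling link
}
for problem in check_supersedes(ledger):
    print(problem)
```

A check like this only establishes that the links are well-formed; it says nothing about whether the superseding decision was the right one.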

But there’s a line AI must not cross. What AI can write from a diff is limited to “what changed.” “Why this option was chosen and the others rejected” is a human judgment that does not appear in the diff; have AI write it and it fills the gap with guesses—there’s a risk that a plausible-but-false Why slips past review and gets pinned into the ledger. So the division of labor is this: management to the AI; strategic decisions and “verifying the validity of the Why” to humans.
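
This division of labor can be expressed as a simple gate: an AI-drafted record cannot enter the ledger until a named human has confirmed or corrected its Why. The sketch below is one possible shape for such a gate, not an established mechanism; `AdrDraft`, `can_enter_ledger`, and `verify_why` are hypothetical names, and the sample texts are invented.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class AdrDraft:
    title: str
    what_changed: str  # derivable from the diff, so the AI may fill it in
    why: str  # a human judgment; AI-written text here is only a guess
    why_verified_by: Optional[str] = None  # human who confirmed the Why


def can_enter_ledger(draft: AdrDraft) -> bool:
    """Allow a draft into the ledger only once a human has verified its Why."""
    return bool(draft.why.strip()) and bool((draft.why_verified_by or "").strip())


def verify_why(draft: AdrDraft, reviewer: str, corrected_why: Optional[str] = None) -> AdrDraft:
    """Record the reviewer and, if the AI guessed wrong, replace the Why."""
    if not reviewer.strip():
        raise ValueError("a named human reviewer is required")
    new_why = draft.why if corrected_why is None else corrected_why
    return replace(draft, why=new_why, why_verified_by=reviewer.strip())


draft = AdrDraft(
    title="Switch the order cache to Redis",
    what_changed="Replaced the in-process cache with a Redis client.",
    why="Probably for performance.",  # AI guess inferred from the diff
)
print(can_enter_ledger(draft))  # False

approved = verify_why(draft, "alice", "Cache state must be shared across instances.")
print(can_enter_ledger(approved))  # True
```

The gate records that a human looked; it cannot tell whether that look was careful, which is why the quality of verification still matters.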

The human-review bottleneck, and how to relieve it

Once you settle on “management by AI, verification by humans,” human verification becomes the new bottleneck. The more decisions AI mass-produces, the more ADRs humans must check for validity. In evaluating AI-generated output, the point that “human review is the gold standard, but it is costly and slow” has been made repeatedly in the medical-documentation domain [7], which is exactly why mitigations like LLM-as-a-Judge, where a separate LLM does first-pass evaluation, are being studied.

The reason human review jams up is cognitive load. A classic large-scale code-review study (SmartBear / Cisco, 2,500 reviews, 3.2 million lines) showed that once the review target exceeds 200–400 lines, the rate at which defects slip through rises sharply [8]. Cognitive-load research in Empirical Software Engineering likewise reports that when the target is large or complex, reviewers spend their time on “navigation between files,” and their assessment becomes mechanical (i.e., rubber-stamping) rather than analytical [8]. That is about code, but the structure is the same for documents: a verbose ADR gets read, but its problems are missed.

The way to relieve it is to have AI write in a form that’s easy to review. There’s evidence for this. A study analyzing 18,256 AI-generated PR descriptions reports that descriptions written by AI and then completed by a human had shorter review times and were more likely to be merged [9], and a separate study showed that differences in AI agents’ description styles correlate with differences in reviewer response time and merge outcomes [10]. Meanwhile, left to their own devices LLMs tend to drift toward verbosity (verbosity bias), and simply asking them to “be concise” is unreliable; conciseness is more dependably enforced through structure (a template) [11].

A prime example of that “structure” is Olaf Zimmermann’s Y-statement [12]. It is a format that compresses a decision’s context, chosen option, rejected options, and trade-offs into a single sentence. The “Y” in the name stands for “why”: the structure makes the why unavoidable. It was born from the requirement “can you fit each decision onto one slide?”, so it is a format designed to avoid verbosity. Have AI write in this form, and a human can judge the validity of the Why by reading just the one clause for “against (rejected options).” It focuses the reviewer’s eyes on a single point. I leave the concrete instructions on how to have AI write it to the practical guide.
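
A small sketch shows how a template can enforce that structure mechanically. It uses only the four elements named above (context, chosen option, rejected options, trade-offs); the clause wording, the `YStatement` class, and the 60-word budget are assumptions of this example, not a quotation of Zimmermann's template.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class YStatement:
    context: str
    chosen: str
    against: str  # the rejected options, which is where the reviewer looks
    tradeoff: str

    def render(self) -> str:
        """Render the decision as one sentence with a fixed clause order."""
        return (
            f"In the context of {self.context}, "
            f"we decided for {self.chosen} and against {self.against}, "
            f"accepting {self.tradeoff}."
        )


def review_problems(stmt: YStatement, max_words: int = 60) -> list[str]:
    """Flag structural problems before a human spends time on the Why."""
    problems = [
        f"empty clause: {name}" for name, value in asdict(stmt).items() if not value.strip()
    ]
    words = len(stmt.render().split())
    if words > max_words:
        problems.append(f"too long: {words} words (limit {max_words})")
    return problems


stmt = YStatement(
    context="the order service needing an audit trail",
    chosen="event sourcing",
    against="plain CRUD with an audit table",
    tradeoff="a steeper learning curve",
)
print(stmt.render())
print(review_problems(stmt))  # []
```

Because the rejected options have their own field, an empty or evasive `against` clause is caught before review, and the word budget stands in for the "one slide" constraint.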

How strong is the evidence—honestly

Let me honestly lay out how strong the evidence supporting the argument so far actually is. This is the most important section of the article.

| Claim | Supporting evidence | Strength |
| --- | --- | --- |
| Humans miss things when the review target is large/verbose | SmartBear/Cisco (large-scale), cognitive-load research [8] | Strong (but it’s code review; applying it to ADRs is by analogy) |
| Structured / AI-generated-plus-human-completed descriptions are faster to review | Analysis of 18,256 PR descriptions [9], description-style study [10] | Medium–strong (but it’s PR descriptions, not ADRs) |
| AI division-of-labor agents produce better ADRs | AgenticAKM, user study across 29 repos [6] | Medium (small-scale, recent 2026 research) |
| Narrowing the context beats handing over everything for generation quality | Context Matters [3] | Medium (2026 preprint) |
| LLMs get verbose when left alone | Body of output-length-control research [11] | Medium–strong |
| ADRs are reappraised / become boilerplate in the AI era | Kosk [1], Chris Swan [2] | Practitioner wisdom (observation and prediction by practitioners) |
| The Y-statement lets you write concisely | Zimmermann [12] | Practitioner wisdom (established format, but no RCT) |

As you can see, no research yet directly and at scale validates ADRs-plus-AI. The strong evidence comes from adjacent areas—code review and PR descriptions—and applying it to ADRs is by analogy. The research closest to ADRs (AgenticAKM, Context Matters) consists of very recent 2026 preprints and workshop papers, and they’re small in scale. The rest rests on practitioner observation and established practitioner wisdom like the Y-statement.

On top of that, a structural gap remains. The risk noted earlier (“AI fills in rejection reasons by guessing, and a false Why gets pinned into the ledger”) has no technical solution beyond a human verifying it in practice. The more you entrust ADR management to AI, the more the quality of that verification determines the trustworthiness of the whole system.

So the claim of this article is, at most, the design argument that “running it this way is coherent.” The effect—that “having AI manage ADRs makes development faster / raises quality”—has not been measured in a controlled experiment. Please receive it with that distinction in mind.

Summary

ADRs are being reappraised in the AI era because of three reversals.

  1. The unread document becomes a read document. AI references the decision ledger dutifully. But it doesn’t “read all of it”—the accurate picture is supplying push (current constraints) and pull (reasons when needed) separately.
  2. AI manages the ledger. AI runs filing, generation, supersedes, and consistency checks, while humans hold on to only strategic judgment and verifying the Why.
  3. Human verification becomes the bottleneck, so solve it with how things are written. Have AI write in a structured, concise form (like the Y-statement) and focus human eyes on the single point of Why-verification.

But—the evidence supporting all of this is centered on analogies from adjacent areas and very recent preprints; direct empirical validation of ADRs-plus-AI is thin. The “fabricated Why” gap also remains. The view that ADRs, far from going obsolete in the AI era, are being reappraised, is itself plausible—but it is still at the stage of a “coherent design argument,” not a “demonstrated effect.” For how to actually use them in practice, see the practical guide with that premise in mind.


References

References corresponding to the citation numbers in the body, listed in numerical order.


  1. A Prescription for SDD Fatigue: Recording Architectural Decisions with ADRs in the AI Era - Kosk, Zenn (2026). “Paper ADRs went unread, but AI reads all of them and answers instantly”; “a design decision comes alive when it is referenced.” [Reliability: Medium] (practitioner blog)

  2. Using Architecture Decision Records (ADRs) with AI coding assistants - Chris Swan (2025-07-10). ADRs are an ideal format for LLMs; from DORA elite practice to standard boilerplate; especially effective in agent swarms. [Reliability: Medium] (practitioner blog)

  3. Context Matters: Evaluating Context Strategies for Automated ADR Generation Using LLMs - Aviral Gupta, Rudra Dhar, Daniel Feitosa, Karthik Vaidhyanathan, EASE 2026 research track (arXiv:2604.03826). The most recent 3–5 ADRs beat handing over all history; retrieval-based (RAFG) showed no significant difference in linear operation; context engineering dominates over model size. [Reliability: High] (preprint, peer-reviewed conference track)

  4. From Stale Docs to Living Architecture: Automating ADRs with GitHub + LLM - Iraj Hedayati, Medium (2025-09-14). A workflow where an LLM generates an ADR first draft from a PR diff and a human reviews it. [Reliability: Medium] (practitioner blog)

  5. Agent Decision Records (AgDR) - me2resh, GitHub (2025). An extension standard for recording autonomous AI-agent decisions as ADRs. Author = AI, required metadata (model identifier, session ID, timestamp), automated via hooks/skills. [Reliability: Medium]

  6. AgenticAKM: Enroute to Agentic Architecture Knowledge Management - Rudra Dhar, Karthik Vaidhyanathan, Vasudeva Varma, AGENT '26 (arXiv:2602.04445). Coordinating specialized agents for extraction, retrieval, generation, and verification to generate ADRs from code. Reported “better ADRs” in a user study across 29 repositories. Note: the ACM proceedings DOI is unresolved as of writing, so arXiv is the primary reference. [Reliability: High] (peer-reviewed workshop paper; small-scale but empirical)

  7. Evaluating clinical AI summaries with large language models as judges - npj Digital Medicine (2025). In evaluating AI-generated documents, “human review is the gold standard but costly and slow.” LLM-as-a-Judge as a mitigation. [Reliability: Medium–high] (medical context; applying to ADRs is by analogy)

  8. Code Review at Cisco Systems (SmartBear) / Do explicit review strategies improve code review performance? (cognitive-load research) - SmartBear/Cisco (2,500 reviews, 3.2M lines; defects spike past 200–400 lines) + Empirical Software Engineering (the more complex the change, the higher the cognitive load and the more mechanical the review). [Reliability: High] (code review; applying to ADRs is by analogy)

  9. Generative AI for Pull Request Descriptions: Adoption, Impact, and Developer Interventions - Tao Xiao, Hideaki Hata, Christoph Treude, Kenichi Matsumoto, PACMSE (2024, DOI: 10.1145/3643773). Analysis of 18,256 AI-generated PR descriptions. Descriptions written by AI and completed by a human had shorter review times and were more likely to merge. Note: these are PR descriptions, not ADRs. [Reliability: High]

  10. How AI Coding Agents Communicate: A Study of Pull Request Description Characteristics and Human Review Responses - Kan Watanabe et al. (2026, arXiv:2602.17084). Differences in AI agents’ description styles correlate with differences in reviewer engagement, response time, and merge outcomes. Note: at the abstract level; specifics of the optimal form need checking against the body. [Reliability: Medium–high]

  11. Concise Thoughts: Impact of Output Length on LLM Reasoning (and other output-length-control research) - (2024-). LLMs have a verbosity bias and follow length instructions poorly. Conciseness is better enforced through structure than through “be concise.” [Reliability: Medium–high]

  12. Architecture Decision Record Template: Y-Statements - Olaf Zimmermann (SATURN 2012). An ADR format that compresses a decision into a single sentence (context, chosen option, rejected options, trade-offs). The “Y” stands for “why.” [Reliability: Medium–high]

This post is licensed under CC BY 4.0 by the author.