[Proposal] Multi-Tier Harness Engineering — A 3-Layer Model and a Data-First Investment Strategy
- Intended readers: software engineers, engineering managers, AI enablement leads, and platform engineers building agent infrastructure across an organization
- Assumed background: hands-on use of coding agents such as Claude Code, Codex, or Cursor in real work
- Reading time: about 15 minutes
About this article: This is a design proposal grounded in publicly available sources, not a report from the author’s own large-scale implementation. Drawing on OpenAI’s published case study and related research from inside and outside Japan, it presents a framework for designing harnesses around AI coding agents at the organizational scale. Because it is a proposal, the final sections deliberately stress-test the costs, counter-evidence, and conditions under which it should not be adopted.
Overview
In the early days of agent adoption, a single CLAUDE.md is usually enough. Once an organization runs three or five projects in parallel and grows past ten engineers, a familiar set of symptoms appears: each project’s harness (the bundle of mechanisms that steer an agent) evolves on its own, consistency erodes, no shared assets accumulate, and governance has nowhere to live.
OpenAI’s 2026 internal write-up1 reports five months of work: roughly one million lines of code across some 1,500 PRs, none of them written by hand. That outcome is attributed less to Codex’s raw intelligence and more to a centrally engineered harness — golden principles, custom linters, a fixed architecture, and observability — wrapped around it.
This article responds to that gap with a design proposal: “multi-tier harness engineering.” The proposal is built from three claims:
- Structural claim: split the harness into three layers — organizational harness (RAG plus governance), shared harness (platform-provided), and project harness (domain-specific)
- Investment claim: do not build all three layers evenly. Concentrate investment in the data layer (RAG) and in organization-specific rules for external tools, and keep generic workflow implementations thin
- Implementation claim: encode external-tool rules as CLI tools (e.g. mycorp-pr) so they can be hard-enforced. AI itself is collapsing the cost of building those CLIs, which keeps shifting the ROI calculation in their favor
The proposal is not unconditional. Platform engineering has well-documented anti-patterns — the “golden cage,” the central-team bottleneck2. Industry reports estimate that more than 70% of enterprise RAG deployments fail to reach production3. AI coding agents themselves face an organization-level productivity paradox: 92.6% adoption with only ~10% organizational productivity gains4.
Article structure: §1 reframes the harness problem and shows why single-tier operation fails at organizational scale; §2 lays out the 3-layer model; §3 — the heart of this article — presents the investment strategy; §4 gives an implementation playbook; §5 evaluates the proposal from multiple angles; §6 names its limits.
1. The Harness Problem — Why Single-Tier Operation Breaks at Scale
The term “harness” gained currency in early 2026 through OpenAI’s and Martin Fowler’s writing15. The literal sense is the harness on a draft horse; here it stands for the collection of mechanisms that steer the output of a coding agent and let it self-correct.
Fowler organizes the picture this way5. Coding agents already ship with an “inner harness” — the system prompt, code-search machinery, and so on. On top of that, users build an outer harness with two purposes:
- raise the probability that the agent gets it right on the first try (prompts, rules, context supply)
- run a feedback loop that surfaces and self-corrects problems before they reach production (linters, tests, verifier agents)
In Japan, nogataka frames harness work as a three-stage evolution: CLAUDE.md → AGENTS.md/rules → harness engineering6. Only at the third stage, when execution, verification, and memory are integrated, can violations be detected structurally. cvusk contributes the implementation side: an Explore→Plan→Implement workflow, a minimal-tool philosophy, and the LangChain case in which leaving the model untouched and optimizing only the harness moved the team from 30th to 5th on Terminal Bench 2.07.
These discussions share a hidden assumption: the harness is treated as a per-project artifact. At organizational scale that assumption produces three distinct failure modes.
Pattern 1: Knowledge silos — Each project’s “this is how we write things in our domain” calcifies as tacit knowledge that never crosses repository boundaries. Agents see only the context of the repo they are pointed at, so they cannot draw on the organization’s broader experience.
Pattern 2: No shared assets — Skills, hooks, templates, and verifier linters scatter across projects. A 2026 platform-engineering trends survey8 reports that 73% of platform teams have already integrated AI assistants into developer workflows, but without a distribution mechanism for organizational context, every team keeps reinventing the same wheel.
Pattern 3: Governance has nowhere to live — Security, compliance, and ethics rules need to be enforceable across the organization, but CLAUDE.md carries no enforcement weight. Gartner predicts that by 2026 more than 70% of enterprise generative-AI projects will require structured retrieval pipelines (RAG) to mitigate hallucination and compliance risk9.
Read the OpenAI case1 in reverse and the architecture that avoids these failures comes into view: a centrally engineered harness applied as a constraint to every agent execution. Background Codex tasks periodically scan for drift and open refactor PRs automatically.
2. Structure of the Proposal — A 3-Layer Model
Split a harness into just two layers (organization vs. project) and a real organization will inevitably grow a shared layer in the middle: backend-team commons, tech-stack commons, security-team review skills — assets wider than a single project but narrower than the whole company. This article therefore proposes three layers.
```mermaid
flowchart TB
  subgraph ORG["Organizational Harness — Knowledge Supply & Governance"]
    direction TB
    ORG1["RAG Platform<br>internal docs, ADRs, incident history"]
    ORG2["Minimal-Access Skills<br>RAG search, citation, source attribution"]
    ORG3["Must-Level Linters<br>security & compliance"]
  end
  subgraph SHARED["Shared Harness — Platform-Provided"]
    direction TB
    SH1["Generic Workflows<br>EPI, review, PR creation"]
    SH2["Org-Convention CLIs<br>mycorp-pr, mycorp-deploy"]
    SH3["Tech-Stack Commons<br>templates, hooks, CI"]
  end
  subgraph PRJ["Project Harness — Configuration & Domain Specifics"]
    direction TB
    PRJ1["Domain Rules<br>business-logic constraints"]
    PRJ2["Workflow Configuration<br>exploration paths, verify commands"]
    PRJ3["Memory Management<br>progress.md, etc."]
  end
  PRJ ==>|mandatory reference| ORG
  PRJ -->|selective import| SHARED
  PRJ -.->|promotion PR| SHARED
  SHARED -.->|promotion PR| ORG
```
2-1. Organizational Harness (Outermost) — Knowledge Supply and Governance
The organizational harness has two functions: a supply mechanism for organizational knowledge and an enforcement mechanism for organizational norms.
RAG platform: In 2026, RAG is no longer “document search.” It is now framed as a knowledge runtime that integrates retrieval, verification, reasoning, access control, and audit logging10. The dominant pattern is agentic RAG, where the agent itself calls RAG as part of its reasoning loop10. The platform becomes a cross-cutting search layer over internal ADRs, incident postmortems, design documents, and API catalogs. Building access control and audit logging in from day one is also how the layer satisfies compliance requirements.
Must-level linters: Reserve this slot for “things no part of the organization may break” — security, compliance, and ethics. The OpenAI case1 shows custom lints whose error messages embed the corrective steps for the agent to follow, structuring the self-correction loop directly inside the lint output. Linters at the organizational layer should be written for agents, not just for humans.
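To make “written for agents” concrete, here is a minimal sketch of a Must-level lint whose error message embeds the corrective steps for the agent to follow, in the spirit of the OpenAI case. The rule ID, pattern, and fix steps are all illustrative, not drawn from any real ruleset.

```python
import re

# Hypothetical Must-level rule: raw secrets must never appear in source.
# The violation message is addressed to the agent: it spells out the
# exact remediation steps, so the self-correction loop is structured
# directly inside the lint output.
SECRET_PATTERN = re.compile(r"(?i)(api[_-]?key|secret|token)\s*=\s*['\"][^'\"]+['\"]")

def lint_no_hardcoded_secrets(path: str, source: str) -> list[str]:
    """Return agent-readable violation messages for hardcoded secrets."""
    violations = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if SECRET_PATTERN.search(line):
            violations.append(
                f"{path}:{lineno}: MUST-SEC-001 hardcoded secret.\n"
                "  Fix: (1) move the value to the secrets manager, "
                "(2) read it via os.environ at startup, "
                "(3) re-run this linter to confirm."
            )
    return violations
```

The design choice worth copying is not the regex but the message shape: a stable rule ID for audit logs, plus numbered steps the agent can execute without guessing.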
Skills at the organizational layer should be kept minimal — RAG search, attributed source presentation, secret detection, and not much more. Push richer domain logic into the lower layers.
2-2. Shared Harness (Middle Layer) — Platform-Provided
The middle layer is the platform-engineering layer. In Team Topologies terms, a platform team provides low-cognitive-load self-service to stream-aligned teams. In the agent era this becomes “curate skills, templates, and golden paths centrally; let each project install and consume them.”
QCon London 2026 surfaced “bounded agency” as a design principle for AI-era team design11 — agents’ authority should be deliberately constrained by rules and guardrails. The shared harness is the layer where bounded agency is implemented in code.
Claude Code supports this middle layer natively. Skills come in personal, project, and organization-wide flavors, with organization-level Skills released on December 18, 202512. Plugin marketplaces are git-hosted via marketplace.json13, and SKILL.md is being standardized across more than 30 agents14 — meaning investment in the shared layer accumulates as an organizational asset rather than as vendor lock-in.
What belongs in the middle layer is “anything reusable across projects but not severe enough to enforce organization-wide”:
- Generic workflows: Explore→Plan→Implement, code review, ADR drafting, PR creation, deployment
- Org-convention CLIs for external tools: CLIs that hard-enforce your company’s conventions on top of GitHub, Slack, Jira (see §3-3)
- Tech-stack commons: API skeletons, naming conventions, testing strategy, shared hooks and CI
A company is not limited to one shared layer. It can run backend-shared, frontend-shared, security-team-shared layers in parallel. The “Learning to Share” paper15 argues, on the research side, that for parallel agent systems selective sharing — neither “share everything” nor “share nothing” — is the efficient policy, which lines up well with how shared layers should be designed.
2-3. Project Harness (Innermost) — Configuration and Domain Specifics
The innermost layer should hold only domain-specific knowledge and the configuration values needed to drive the shared workflows.
- Domain rules: business-logic invariants, industry-specific regulatory handling
- Workflow configuration: exploration paths, plan-template content, verification commands (the workflow body itself is inherited from the shared layer)
- Memory management: progress.md, scratchpads, cross-session memory
A useful intuition: the innermost layer should feel thin. Most of what tries to live there can be pushed up. If the project harness feels heavy, that is usually a sign that the shared layer is underdeveloped.
2-4. Layer Relationships — Composition, Not Inheritance
Drawing the 3-layer model as a single inheritance chain (Project → Shared → Org) misrepresents real organizations. In practice, the project harness composes the organizational and shared harnesses in parallel (composition, not inheritance).
- Organizational harness → flows through every project (mandatory): the RAG platform and Must-level linters bypass the shared layer; every project obeys them directly. Internal RAG queries also originate directly from each project.
- Shared harness → selectively imported (optional): backend-shared skills are “use them if they fit.” A project whose nature does not match can decline.
- Promotion PRs flow both ways: a generic skill born in a project gets promoted to the shared layer; a convention that has stabilized in the shared layer gets promoted to the organizational layer.
A simple framing: the organizational harness is “infrastructure,” the shared harness is “platform service.” Infrastructure is not optional; platform services are.
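The composition-not-inheritance relationship can be sketched in a few lines of code. This is an illustrative data model only — every class and field name here is hypothetical, not an API of any agent product: the project harness holds a mandatory reference to the organizational harness and an optional list of shared harnesses.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class OrgHarness:
    """Infrastructure: every project obeys it directly."""
    must_linters: tuple[str, ...]
    rag_endpoint: str

@dataclass(frozen=True)
class SharedHarness:
    """Platform service: imported selectively."""
    name: str
    skills: tuple[str, ...]

@dataclass
class ProjectHarness:
    org: OrgHarness                                            # mandatory reference
    shared: list[SharedHarness] = field(default_factory=list)  # selective import
    domain_rules: list[str] = field(default_factory=list)

    def active_skills(self) -> list[str]:
        # Skills come only from the shared layers the project opted into.
        return [s for layer in self.shared for s in layer.skills]

    def active_linters(self) -> list[str]:
        # Must-level linters flow through unconditionally.
        return list(self.org.must_linters)
```

Note that `ProjectHarness` composes the other two rather than subclassing them: a project can decline a shared layer, but it cannot override or drop the organizational one.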
3. Investment Strategy — Where to Concentrate the Bet
This is the core of the article. If you build all three layers from scratch evenly, you tend to finish the build only to discover that vendor features have already overtaken your shared- and project-layer code. The right move is to decide upfront which layer you invest heavily in and which layers you keep deliberately thin.
3-1. Per-Layer Longevity Assessment
Evaluating each layer by “how long will a hand-rolled implementation stay valuable” gives a usable axis.
| Layer / element | Longevity of a hand-rolled implementation | Rationale |
|---|---|---|
| Organizational RAG + data | High | Organization-specific data is an asset vendors cannot absorb. Access control, PII handling, and regulatory pressure also persist as external constraints. |
| External-tool org conventions (your GitHub/Slack/Jira house rules) | High | “We squash-merge,” “PRs need two CODEOWNERS approvals” — vendors do not know any of this. The underlying primitives (PRs, issues, channels) are also stable on a decade timescale. |
| Domain rules (business logic, industry regulation) | High | Vendors do not replace these. |
| Must-level lint (security, compliance) | Medium | Partially absorbed by agent-side security features; the org-specific portion remains. |
| Company-specific templates | Medium-high | The portions specialized to your naming conventions and stack persist. |
| Project-specific configuration (exploration paths, verify commands) | Medium | The truly project-specific parts persist. |
| Generic workflows (EPI, review, PR creation) | Low | Skill marketplaces (including Anthropic’s official one) are scaling rapidly12. “Install and use” is likely to become standard within one or two years. |
| External-tool API wrappers (raw GitHub API calls etc.) | Low | Replaced by official integrations. Claude Code already ships GitHub integration13. |
The strategy follows directly: invest heavily in the three “high” rows (data, external-tool org conventions, domain rules); keep the two “low” rows (generic workflows, API wrappers) thin.
3-2. The Data Layer (RAG) — Justify Investment Through a Dual ROI
The data layer is the longest-lived investment, but justifying it on agent ROI alone is hard — organization-level productivity gains from AI agents top out near 10%4. It is worth being explicit: this article does not claim that the 3-layer model automatically closes that 10% gap. As §5-2 argues, the right move before investing is to measure your own organization’s productivity gap. This section is a conditional claim: if you adopt the 3-layer model, the data layer is the highest-leverage starting point.
Crucially, the same RAG platform delivers value directly to humans.
- Internal knowledge search engine: a “search you can talk to” that replaces or augments existing wikis and search. Used daily for new-hire onboarding, post-incident research, and looking up past design decisions.
- A learning sparring partner: “Walk me through the background of this architectural decision” or “summarize past similar implementations” — a force multiplier for junior engineers.
- A decision-support partner: when stuck on a design question, you can consult it against internal ADRs and incident history.
If you design the human interface in from day one, the organizational RAG layer carries independent ROI as a knowledge-management platform, not just as agent infrastructure. The opposite framing — “this is for the agents only” — means nobody feels the value until the agents themselves pay off, which usually means the platform team’s budget gets cut first. Microsoft’s strategy16 points the same direction: a shift from search-based to governed agent workflows.
Three implementation pillars:
- Concentrate the first budget on the data-ingest backbone: build the continuous ingestion pipelines for ADRs, postmortems, design docs, API catalogs, Slack and email archives first. This part does not become obsolete.
- Build access control and audit logging in from day one: regulatory pressure (e.g. EU AI Act) is increasing3. Retrofitting these is expensive.
- Run a search UI / Slack bot / CLI alongside: book ROI before the agents themselves start delivering.
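The first two pillars can be sketched as a single ingest record. This is a schematic, not a real pipeline: the point it illustrates is that ownership, access-control, and provenance metadata travel with every document from day one, so query-time filtering and audit logging never have to be retrofitted. All names are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class IngestRecord:
    source_uri: str              # e.g. an ADR or postmortem URI
    owner_team: str              # document ownership, required at ingest time
    acl_groups: tuple[str, ...]  # who may retrieve this content
    text: str
    ingested_at: str
    checksum: str                # provenance: detect silent content drift

def ingest(source_uri: str, owner_team: str,
           acl_groups: tuple[str, ...], text: str) -> IngestRecord:
    # Governance gate: documents without an owner are the classic cause
    # of stale, unaccountable RAG content, so refuse them outright.
    if not owner_team:
        raise ValueError("refusing to ingest a document without an owner")
    return IngestRecord(
        source_uri=source_uri,
        owner_team=owner_team,
        acl_groups=acl_groups,
        text=text,
        ingested_at=datetime.now(timezone.utc).isoformat(),
        checksum=hashlib.sha256(text.encode()).hexdigest(),
    )

def retrieve(records: list[IngestRecord], query: str,
             caller_groups: set[str]) -> list[IngestRecord]:
    # Query-time access control: filter BEFORE ranking. This filter is
    # also the natural place to emit an audit-log entry per query.
    visible = [r for r in records if caller_groups & set(r.acl_groups)]
    return [r for r in visible if query.lower() in r.text.lower()]
```

A real deployment would replace the substring match with embedding retrieval, but the ACL-before-ranking ordering and the mandatory-owner gate are the parts that address the governance failures cited above.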
3-3. External-Tool Org Conventions — Hard-Enforce Them via CLIs
Encoding GitHub, Slack, and Jira house rules “as a skill file” is too weak. A skill is a request the agent interprets, and it can be ignored. The proposal here: encode org conventions as CLI tools — mycorp-pr create, mycorp-deploy, mycorp-incident open — that hard-enforce label assignment, CODEOWNERS checks, pre-deploy approvals, and incident-channel creation.
Four advantages of the CLI implementation:
- Reusable across agents: Claude Code, Codex, Cursor — they can all call the same CLI. You do not have to wait for skill formats to standardize; the CLI is a working abstraction now.
- True hard enforcement: the CLI can refuse to create an unlabeled PR. A skill has to persuade the agent; a CLI either passes or rejects.
- Humans use the same tool: the same dual-ROI structure as §3-2’s RAG. When a human types mycorp-pr create locally, the same guardrails apply. CLIs are a natural interface for both humans and agents.
- Observable: exit codes, structured logs, and JSON output integrate cleanly with audit infrastructure.
Division of labor with skills: skills teach the agent when and why to call a CLI; the CLI mechanically enforces the convention. Skill files stay small while the CLI carries the heavy logic. The thinner the skill, the cheaper the eventual migration to a vendor’s official skill.
The classic objection — “building a CLI is a heavy investment” — weakens in the AI era. You can scaffold the first version with Claude Code or Codex and tune it to your conventions, so time-to-first-version is much shorter than writing it from scratch. Maintenance cost is comparable to maintaining a generic GitHub-API wrapper as a skill. Once you weigh in the longevity benefits — “you can hard-enforce,” “humans can use it too,” “it is reusable across multiple agents” — the ROI math comes out in favor of the CLI more often than it used to.
A caveat on hard enforcement: the CLI only delivers it in combination with a CI/audit-side prohibition on bypassing it (e.g. a GitHub Actions check that blocks PRs not created via mycorp-pr). Just providing the CLI leaves a back door open.
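As a concrete illustration, here is a minimal sketch of such a CLI gate. Everything here is hypothetical — mycorp-pr is this article’s placeholder name, and the label rule is invented; a real implementation would delegate to the platform API (e.g. `gh pr create`) after validation.

```python
import argparse
import sys

REQUIRED_LABELS = {"team", "risk"}  # illustrative house rule

def validate(labels: list[str]) -> list[str]:
    """Return agent/human-readable errors; an empty list means the PR may proceed."""
    prefixes = {label.split(":", 1)[0] for label in labels if ":" in label}
    missing = REQUIRED_LABELS - prefixes
    return [
        f"missing required label '{m}:<value>'. Fix: re-run with --label {m}:<value>."
        for m in sorted(missing)
    ]

def main(argv: list[str]) -> int:
    parser = argparse.ArgumentParser(prog="mycorp-pr")
    sub = parser.add_subparsers(dest="command", required=True)
    create = sub.add_parser("create")
    create.add_argument("--label", action="append", dest="labels")
    args = parser.parse_args(argv)

    labels = args.labels or []
    errors = validate(labels)
    if errors:
        for e in errors:
            print(e, file=sys.stderr)
        return 1  # hard refusal: the non-conforming PR is never created
    # A real implementation would call the platform API here with the
    # validated labels; this sketch just reports success.
    print("PR created with labels:", ", ".join(labels))
    return 0
```

Like the Must-level lint example, the error messages double as instructions: an agent that sees the exit code 1 and the “Fix: re-run with …” line can self-correct without any skill-file persuasion.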
This shares a lineage with Spotify’s Honk system7, which achieved strong control by limiting agents to three tools. Pin the interface to a stable CLI abstraction, and the implementation underneath stays free to change.
3-4. Design Generic Workflows Thin, with “Official Replacement” as the Default Assumption
When you place generic workflows like EPI, code review, ADR drafting, and PR creation into the shared layer, design them on the explicit assumption that they will eventually be replaced by official Anthropic / OpenAI / Cursor skills.
- Cleanly separate company-specific parts (naming conventions, internal templates, hooks into organizational knowledge) from generic parts (mode transitions, plan-document structure)
- Have a migration plan from day one: when the official skill ships, throw away the generic part and keep only the company-specific glue
- Treat these as “interim implementations to be re-evaluated next quarter for official replacement” rather than “we’ll write our own because we can today” (the cadence here reflects how quickly official skills are maturing1214)
§4-2 makes this concrete using the EPI workflow as a worked example.
3-5. The “Where to Hand-Build” Boundary
| Invest in hand-built (long-lived) | Defer to official / OSS (will be absorbed) |
|---|---|
| Organization-specific data ingestion and access control | LLM models, agent runtimes |
| CLI-ification of external-tool org conventions (mycorp-pr etc.) | Generic external-tool API wrappers (official GitHub integration etc.) |
| Linters for company-specific naming and architecture | Standard per-language linters (ESLint, Ruff, etc.) |
| Domain-specific skills | Generic workflows (EPI, review, PR creation) |
| Regulatory handling (industry-specific compliance) | Cross-platform standards (SKILL.md, AGENTS.md) |
Drawing this line up front is what lets you avoid the worst-ROI pattern: scratch-building all three layers.
4. Implementation Playbook
4-1. Incremental Build Order — Project → Shared → Org
Trying to build all three layers at once tends to fail. The principle is to build out the layer where pain is showing, one stage at a time.
Step A: Mature the project harness
Get CLAUDE.md in shape and try Explore→Plan→Implement following cvusk’s guide7. Use progress.md to secure cross-session memory (addressing nogataka’s gap6). Even without an organizational RAG or shared layer, a per-project outer harness is useful on its own.
Step B: Stand up a shared harness when duplication appears
When you start copy-pasting the same skills and conventions across repositories, that is the cue to build a shared layer. Concretely:
- Create an internal marketplace.json repository13
- Build out the generic workflows first: EPI, code review, ADR drafting, PR creation, deployment (see §4-2)
- In parallel, AI-scaffold the first version of your external-tool org-convention CLIs (mycorp-pr etc.) (see §3-3)
- Allow per-team / per-stack shared layers (backend-shared, frontend-shared, data-team-shared). This is consistent with the bounded-agency principle11
- Stand up a promotion-PR flow. A heuristic like “if a skill is referenced by three or more projects, propose promoting it to the shared layer” is enough. The JP Morgan “friendly FOMO” pattern11 — natural diffusion through shared success rather than mandate — tends to work better in large enterprises
Step C: Build the organizational harness when compliance pressure appears
Once the organization passes ten to thirty people and security/compliance risk becomes concrete, work on the organizational layer:
- Design the RAG platform as a knowledge runtime10: ingest internal ADRs, postmortems, design docs, API catalogs. Integrate access control and audit logging from the start.
- Run Must-level linters at the organizational layer: limit to three categories — security, compliance, ethics. Write the error messages for agents1.
- Audit infrastructure: aggregate execution logs, RAG access logs, and lint-violation logs.
The OpenAI case1 did not start finished either; the golden principles were polished iteratively through the cycle of “scheduled scan → violation detected → automated PR.”
4-2. Worked Example — Layering the Explore→Plan→Implement Workflow
The abstract layer model can be hard to translate into “how do I actually slice this for my organization?” Here it is concretely on the EPI workflow7.
EPI has three stages: Explore (understand the codebase), Plan (write an editable plan), Implement (implement, test, verify). It looks project-specific, but the skill itself is a generic pattern that depends on neither the tech stack nor the domain, and runs with the same skeleton in a Spotify-style backend or a healthcare-SaaS frontend. What changes between projects is configuration, not workflow.
| Element | Layer | Content |
|---|---|---|
| Mode-transition logic (Explore→Plan→Implement) | Shared | Agent state management, allowed tools per mode |
| Plan-document structure (section layout) | Shared | Template: “summary / impact / steps / verification” |
| How to invoke the post-implementation verification hook | Shared | Run verification command, re-inject result into the agent |
| Exploration path (which directory to read from) | Project | Is it src/, app/, or packages/foo/ |
| Content of the plan template | Project | Domain-specific checks (PII handling, billing flow, etc.) |
| Verification command | Project | npm test, pytest, go test ./... |
| Organizational knowledge consulted during exploration | Org (via RAG) | Past ADRs, similar-implementation incident history, design principles |
The principle that falls out of this decomposition:
Put “what” (the workflow body) in the shared layer; leave only “how this project does it concretely” (configuration) in the project layer.
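The table’s decomposition can be sketched as code: a shared workflow body parameterized by a project-owned configuration object. All names here are illustrative — this is one way to slice it, not the API of any agent product.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class EPIConfig:
    """Project layer: configuration only, no workflow logic."""
    exploration_paths: tuple[str, ...]  # e.g. ("src/",) or ("packages/foo/",)
    plan_checks: tuple[str, ...]        # domain-specific checklist items
    verify_command: str                 # e.g. "pytest" or "go test ./..."

def run_epi(config: EPIConfig, run_cmd: Callable[[str], int]) -> list[str]:
    """Shared layer: the workflow skeleton, identical for every project.

    Returns a trace of the stages executed; run_cmd is injected so the
    sketch stays testable without shelling out.
    """
    trace = []
    # Explore: read only the paths the project points at.
    trace.append(f"explore:{','.join(config.exploration_paths)}")
    # Plan: the shared template, extended with the project's own checks.
    sections = ("summary", "impact", "steps", "verification") + config.plan_checks
    trace.append(f"plan:{'/'.join(sections)}")
    # Implement + verify: run the project's verification command and
    # feed the result back into the loop (here, just recorded).
    status = run_cmd(config.verify_command)
    trace.append(f"verify:{config.verify_command}:{'ok' if status == 0 else 'fail'}")
    return trace
```

Swapping a Spotify-style backend for a healthcare-SaaS frontend changes only the `EPIConfig` instance; `run_epi` itself never forks per project, which is exactly the property that makes it promotable to the shared layer.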
The same logic applies to code review, ADR drafting, PR creation, and workflow-shaped skills generally. Organizations whose project harnesses are full of duplicated EPI-style workflows should make promotion to the shared layer their first improvement target.
That said, as §3-4 noted, design these generic workflows as “interim implementations that need to last only three months”, on the assumption that an official skill will replace them.
4-3. Operations — Evolve the Harness Itself with Agents
A 3-layer harness, once built, will rot without active maintenance. The way to keep it alive is to fold harness maintenance into agent work itself.
- Weekly review: open at least one update PR per layer per week. Have an agent aggregate the last week’s violation patterns and propose adding, removing, or relaxing rules
- Promotion / demotion decisions: do an inventory once per quarter. Triggers like “shared skills referenced by three or more projects” or “organizational linters that have not fired in six months”
- Observability dashboard: skill invocation counts, frequent violations, RAG query patterns
Once design → operate → improve becomes a semi-automated PDCA loop driven by agents, the harness shifts from “thing you built” to “asset that grows.”
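The quarterly promotion/demotion triggers above reduce to two small inventory checks. The thresholds and data shapes here are hypothetical, taken from the heuristics named in this section (“referenced by three or more projects”, “has not fired in six months”):

```python
from datetime import date, timedelta

PROMOTE_THRESHOLD = 3          # projects referencing a skill
DEMOTE_QUIET_DAYS = 182        # roughly six months without a violation

def promotion_candidates(skill_refs: dict[str, set[str]]) -> list[str]:
    """skill_refs maps skill name -> set of projects referencing it.

    Skills referenced widely enough are candidates for promotion to
    the shared layer (proposed via PR, not promoted automatically).
    """
    return sorted(s for s, projects in skill_refs.items()
                  if len(projects) >= PROMOTE_THRESHOLD)

def demotion_candidates(last_fired: dict[str, date], today: date) -> list[str]:
    """last_fired maps linter name -> date of its most recent violation.

    Linters that have been silent for ~six months are candidates for
    relaxation or removal at the quarterly review.
    """
    cutoff = today - timedelta(days=DEMOTE_QUIET_DAYS)
    return sorted(l for l, d in last_fired.items() if d < cutoff)
```

An agent running the weekly review would feed these lists into draft PRs for humans to accept or reject, keeping the loop semi-automated rather than fully autonomous.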
5. Multi-Angle Evaluation
5-1. Expected Benefits (Public Evidence)
- Faster onboarding: golden-path success metrics include shorter onboarding time, higher deployment frequency, and improved developer satisfaction17
- Reusability: Anthropic’s official marketplace and others distribute many skills12, and SKILL.md runs across Claude Code, Cursor, Codex, Copilot, and similar platforms14 — meaning vendor lock-in is weak at the skill layer (the RAG platform and org CLIs still demand bespoke implementation)
- Governance: aligns with RAG-as-knowledge-runtime10, Microsoft’s governed-agent-workflow strategy16, and emerging compliance requirements like the EU AI Act3
- Platform engineering effectiveness: Gartner predicts that by 2026, 80% of large software organizations will have established platform engineering teams, up from 45% in 20222. Golden-path implementation guidance consistently identifies natural adoption (rather than mandate) as the success condition17
5-2. Risks and Counter-Evidence
Bottlenecking the platform team: the canonical failures are “ticket-based workflow,” “template over-dependence,” and “Golden Cage”218. Once the central team is buried in manual requests, the platform itself becomes the new bottleneck.
“Them vs. Us” and organizational fracture: a clean adversarial line tends to form between the central platform and the “feature factory”2. This is a culture problem, not a technology problem.
Enterprise RAG operations are hard to land: industry reporting puts the failure-to-production rate above 70% for enterprise RAG3. The cause is rarely the retrieval algorithm; it is governance — missing document ownership, no query-time access control, no PII handling, no freshness management3.
Hallucinations do not vanish under RAG: Stanford HAI’s research on legal AI reports hallucination rates of 58–82% on complex legal queries19. Generative-AI citation fabrication is observed in the RAG community as well; one industry survey reports an 81% citation-fabrication rate in legal use cases20. Poisoned-document attacks like BadRAG21 and TrojanRAG22 are documented in the academic literature. Organizational RAG carries the brand of “trusted central source,” which can produce a psychological side effect that lowers the verification bar on its outputs.
Agent sprawl and marketplace fragmentation: Gartner predicts that by the end of 2026, 40% of enterprise applications will embed task-specific AI agents, up from less than 5% in 202523. The recommended pattern is “a small number of well-governed super-agents,” not “100 agents from 100 vendors”24. Open the shared-harness marketplace too far and you get duplication, waste, and security gaps.
The organization-level productivity paradox: this is the heaviest counter-evidence. 92.6% of developers use AI, but organization-level productivity gains land around 10%4. METR’s study finds that when developers felt “20% faster with AI,” they were measured at 19% slower25. Polishing the harness does not automatically lift organizational productivity. Before deciding to invest, the prior step is to measure where your organization’s AI productivity gap actually is.
Centralized vs. distributed: centralization gives “consistency, compliance, efficiency”; distribution gives “speed, relevance, creativity”26. The trade-off is real, and the recent consensus is that a federated model — central enforcement of ethics and safety rules with local execution autonomy — is the practical answer26. The 3-layer model in this article can be read as one implementation of that federated pattern, but the authority ratio between the organizational layer and the team layer varies enormously between organizations.
5-3. Adoption Conditions and Withdrawal Criteria
| Organization size | Recommended stance | Adoption signals |
|---|---|---|
| 1–2 people, 1 project | Project harness only. A single CLAUDE.md is enough. | No pain from “needing to share knowledge with another project” |
| 3–10 people, 1–2 projects | Time to mature the project harness. | You’re copy-pasting the same skill into another repo |
| 10–30 people, 3+ projects | Stand up the shared harness layer. Start the org layer with a minimal Must-lint set. | Security violations or compliance risk are surfacing |
| 30–100 people, multiple teams | The 3-layer model comes into its own. Stand up a platform team. Introduce organizational RAG. | Knowledge silos are showing up in hiring and retention conversations |
| 100+ people | Dedicated staffing for the organizational RAG platform and continuous operation of the middle layer become non-negotiable. | Regulatory compliance (e.g. EU AI Act) is a board-level concern |
Signals to back out:
- The platform team is buried in ticket processing → bottlenecked
- Voluntary adoption rate is below 50% → the harness itself is the problem (Golden Cage)
- Hallucinations do not drop after introducing organizational RAG → the governance layer is not actually being operated
- Hostility between the central team and feature teams → a culture problem, not a technology problem
- Developers in surveys report no felt productivity gain from AI → measure first, then invest in the 3-layer model
Harness design is continuous investment, not infrastructure-and-forget. “Just build all three layers” is a failure pattern; the principle is incremental build-out from the layer where pain is showing.
6. Limits and Caveats
- The author has not implemented this framework in a large organization. It is a design proposal informed by OpenAI’s published case and related research, not a case study.
- The 3-layer split is a simplification for the sake of discussion. Real organizations may run several middle layers in parallel (backend-shared, frontend-shared, security-shared). Drawing those lines is itself organizational politics.
- Mixed evidence quality. The cited evidence mixes peer-reviewed papers (METR etc.), official surveys, and industry-press articles. Reliability levels for each citation are spelled out in the references.
- Politics and culture are not solved by harness design. If psychological safety is low and people cannot push back on rules, tightening governance turns into suffocation.
- This article reflects what is known as of May 2026. Coding agents’ built-in harnesses are evolving fast. A year from now, half the items “you currently have to build as an organizational harness” may already be absorbed into the agent itself. Re-read on a six-month cadence.
Conclusion
This article is a design proposal for multi-tier harness engineering. Its three claims, summarized alongside their counter-evidence:
| Proposal | Content | Counter-evidence / risk |
|---|---|---|
| Structure | Split into three layers: organizational harness (RAG + governance) / shared harness (platform-provided) / project harness (domain-specific) | Central-team bottlenecking2 / organizational fracture |
| Investment | Invest heavily in the data layer, external-tool org conventions, and domain rules; keep generic workflows and API wrappers thin | >70% of enterprise RAG fails to reach production3, 10% organization-level productivity paradox4 |
| Implementation | Hard-enforce external-tool conventions via CLIs (mycorp-pr etc.). AI-accelerated CLI development lowers the entry barrier | Get the skill / CLI / hook boundary wrong and you create a maintenance nightmare |
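The “hard-enforce via CLIs” row can be illustrated with a thin wrapper. A hypothetical sketch: the `mycorp-pr` name and the `TEAM-123: summary` title convention are both illustrative, not from the article’s sources. The wrapper validates the org convention locally, then delegates to the real `gh` CLI:

```shell
#!/usr/bin/env sh
# Hypothetical sketch of a mycorp-pr wrapper. The title convention
# ("TEAM-123: summary") is illustrative, not from the cited sources.

# validate_pr_title: succeed iff the title matches the org convention.
validate_pr_title() {
  case "$1" in
    [A-Z]*-[0-9]*:\ ?*) return 0 ;;
    *) return 1 ;;
  esac
}

main() {
  title="${1:?usage: mycorp-pr <title>}"
  if ! validate_pr_title "$title"; then
    echo "mycorp-pr: title must look like 'TEAM-123: summary'" >&2
    exit 1
  fi
  # Delegate to the real GitHub CLI once the convention holds.
  gh pr create --title "$title" --fill
}

# main "$@"   # left commented so the sketch can be sourced in isolation
```

The point of the pattern is that the agent never calls `gh` directly; it only ever sees `mycorp-pr`, so the convention is enforced by construction rather than by prompt text.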
The one-line core message:
Split into three layers, then bet heavily on what vendors cannot absorb (organizational data, organizational conventions, domain rules), and keep what they will absorb (generic workflows, API wrappers) thin.
Whether to adopt this depends on organizational size. Consider it past 10 people and 3 projects; the 3-layer structure pays off at 30 people and multiple teams; dedicated staffing becomes non-negotiable past 100 people. But “just build all three layers” is a failure pattern — the call is to build incrementally from the layer where pain is showing.
One action you can try today: open your project’s CLAUDE.md and tag each item with one of Org (organization-wide), Shared (team-wide), or Project (specific to this project). For each item, ask whether the cost of running this centrally exceeds the cost of duplicating it in every project. Thirty minutes will surface both the layer boundaries and the adoption decision.
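Once items carry tags, the tally itself can be automated. A minimal sketch, assuming a hypothetical convention where each bullet in CLAUDE.md is prefixed with `[Org]`, `[Shared]`, or `[Project]`:

```python
"""Tally layer tags in a CLAUDE.md-style rule file.

Assumes a hypothetical [Org]/[Shared]/[Project] bullet-prefix
convention; plain bullets are counted as Untagged.
"""
import re
from collections import Counter

TAG_RE = re.compile(r"^\s*[-*]\s*\[(Org|Shared|Project)\]", re.IGNORECASE)
BULLET_RE = re.compile(r"^\s*[-*]\s")

def tally(lines):
    counts = Counter()
    for line in lines:
        m = TAG_RE.match(line)
        if m:
            counts[m.group(1).capitalize()] += 1
        elif BULLET_RE.match(line):  # a bullet with no layer tag yet
            counts["Untagged"] += 1
    return counts

sample = [
    "- [Org] Never commit secrets; use the credential vault",
    "- [Shared] Run the shared lint suite before every PR",
    "- [Project] Payment amounts are integer cents, never floats",
    "- Prefer small PRs",  # not yet assigned to a layer
]
print(dict(tally(sample)))
```

A high `Untagged` count after the thirty-minute exercise is itself a useful signal: the project has rules whose owner and blast radius nobody has thought about.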
Related Articles
- Bringing Context Engineering to the Organization — From Individual Craft to Organizational Capability — the RAG layer of the organizational harness is one implementation of “organizational context supply”
- EM × AI Skill Design — Separate Rules, Skills, and Hooks — the building blocks of each layer in detail
- LLM Knowledge Limits and the Skill / Rule Boundary — criteria for “what to put in RAG vs. what to write as a skill”
- Tips for Delegating Well to AI with Claude Code Skills — implementation guide for skill design
- Engineers’ Five-Layer Context — relationship between the context hierarchy and the 3-layer model in this article
References
1. Harness engineering: leveraging Codex in an agent-first world - OpenAI (2026). The internal case study describing five months, ~1M lines of code, and ~1,500 PRs delivered with zero hand-written code. Details the design of golden principles, custom linters, and a fixed architecture. 【Reliability: High】
2. 9 Platform Engineering Anti-Patterns That Kill Adoption - Jellyfish (2025). Catalogs anti-patterns including ticket-based workflows, template over-dependence, the Golden Cage, and Them-vs-Us dynamics. The Gartner forecast that “by 2026, 80% of large software organizations will have established platform engineering teams (up from 45% in 2022)” can also be confirmed via Gartner’s official commentary (reference 27). 【Reliability: Medium】
3. 70% of Enterprise RAG Deployments Fail Before Production. Here’s What Kills Them. - Gabriel Anhaia, dev.to (2025). Industry observation that more than 70% of enterprise RAG deployments fail to reach production, with causes traced to governance issues (document ownership, access control, etc.). 【Reliability: Medium】
4. 93% of Developers Use AI. Why Is Productivity Only 10%? - ShiftMag (2026). Reports the “AI productivity paradox”: 92.6% of developers use AI assistants but organization-level productivity gains stay around 10%. 【Reliability: Medium】
5. Harness engineering for coding agent users - Martin Fowler (April 2026). Frames the outer-harness concept around two purposes: raising first-attempt accuracy and running self-correction feedback loops. 【Reliability: High】
6. Introduction to Harness Engineering — the AI Agent Control Paradigm Coming After CLAUDE.md - nogataka (March 2026). Frames harness work as a three-stage evolution (CLAUDE.md → AGENTS.md/rules → harness engineering) and lays out the limits of each stage. 【Reliability: Medium】
7. AI Agents — A Practical Guide to Harness Engineering - cvusk (February 2026). Introduces the Explore→Plan→Implement workflow, context strategies, the LangChain Terminal Bench 2.0 case (30th → 5th place via harness optimization alone), and the Spotify Honk system. 【Reliability: Medium】
8. Platform Engineering Trends 2026: 11 Key Shifts - LeanOps (2026). Survey reporting that 73% of platform teams have already integrated AI assistants into developer workflows. 【Reliability: Medium】
9. Gartner forecast (cited via multiple secondary sources): in 2026, more than 70% of enterprise generative-AI projects will require structured retrieval pipelines (RAG) to mitigate hallucination and compliance risk. 【Reliability: Needs verification】
10. The Next Frontier of RAG: How Enterprise Knowledge Systems Will Evolve (2026-2030) - NStarX (2026). Reframes RAG as a “knowledge runtime” — a knowledge orchestration layer integrating retrieval, verification, reasoning, access control, and audit — and explains the shift to agentic RAG as the dominant pattern. 【Reliability: Medium】
11. Team Topologies as the ‘Infrastructure for Agency’ with AI - InfoQ / QCon London (March 2026). Introduces the bounded-agency principle, the JP Morgan friendly-FOMO case, and a knowledge-diffusion model. 【Reliability: Medium-High】
12. The Complete Guide to Building Skills for Claude - Anthropic (2026). Explains personal / project / organization skill levels and the organization-skill provisioning capability for Team and Enterprise plans (released December 18, 2025). 【Reliability: High】
13. Create and distribute a plugin marketplace - Claude Code Documentation (2026). Documents marketplace.json-based plugin distribution and the /plugin marketplace add reference flow. 【Reliability: High】
14. AI Agent Configuration Guide 2026: SKILL.md, Rules, and Configs - Agensi (2026). Explains the cross-platform standardization of Skills (Claude Code / Codex / Cursor / Gemini CLI / Copilot, 30+ platforms). 【Reliability: Medium】
15. Learning to Share: Selective Memory for Efficient Parallel Agentic Systems - arXiv:2602.05965 (February 2026). Selective-memory-sharing mechanism for parallel agents. Achieves both runtime reduction and performance preservation on AssistantBench / GAIA. 【Reliability: Medium-High】
16. Enterprise AI Knowledge Management 2026: Microsoft’s Shift from Search to Governed Agent Workflows - Windows News (2026). Reports Microsoft’s strategic shift from search-based knowledge management to governed agent workflows. 【Reliability: Medium】
17. What are golden paths? A guide to streamlining developer workflows - Platform Engineering (2025). Explains the purpose of golden paths, the importance of measuring developer satisfaction, and success metrics including onboarding time and deployment frequency. 【Reliability: Medium-High】
18. Platform Building Antipatterns: Slow, Low, and Just for Show - Daniel Bryant, Syntasso (2025). Three representative anti-patterns in platform construction. 【Reliability: Medium】
19. Hallucinating Law: Legal Mistakes with Large Language Models are Pervasive - Matthew Dahl et al., Stanford HAI (2024). Systematic survey of hallucination rates in legal AI. Reports 58–82% hallucination rates on complex legal queries. 【Reliability: High】
20. 7 Enterprise RAG Audit Failures You Should Know - Generation RAG (2026). Industry article citing observations from the RAG community, including a reported 81% citation-fabrication rate in legal use cases. 【Reliability: Medium】
21. BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models - Jiaqi Xue et al., arXiv:2406.00083 (2024). Proposes a poisoned-document injection attack against RAG databases that manipulates retrieval results. 【Reliability: High】
22. TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models - Pengzhou Cheng et al., arXiv:2405.13401 (2024). Proposes a backdoor attack via RAG. The attacker injects malicious text into the knowledge database, creating a hidden backdoor between retrieval and the retriever. 【Reliability: High】
23. Gartner Predicts 40 Percent of Enterprise Apps Will Feature Task-Specific AI Agents By 2026, Up From Less Than 5 Percent in 2025 - Gartner Press Release (August 26, 2025). Official forecast that task-specific AI agents will be embedded in 40% of enterprise applications. 【Reliability: High】
24. Agent Sprawl Is the New IT Sprawl, Here’s How to Control It - Dataiku (2026). Frames agent duplication, GPU resource waste, and security gaps as the “sprawl” problem and articulates the principle of “governing a small number of super-agents.” 【Reliability: Medium】
25. Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity - METR (2025). The “AI productivity gap” study showing that experienced OSS developers felt 20% faster with AI while measured at 19% slower. 【Reliability: High】
26. Centralized vs. Federated vs. Decentralized AI Governance - Sonika Sharma, InfosecTrain (2025). Compares the three governance models — centralized, federated, decentralized. The 3-layer model in this article can be read as one implementation of the federated pattern. 【Reliability: Medium】
27. Platform Engineering Empowers Developers to be Better, Faster, Happier - Gartner Experts. Gartner’s official commentary on the platform-engineering trend. 【Reliability: High】
Additional Reading (Not Numerically Cited)
- Harness Engineering - first thoughts - Martin Fowler (2026). The memo that became the basis for the formal article above (reference 5). 【Reliability: High】
- Unlocking the Codex harness: how we built the App Server - OpenAI (2026). A follow-up addressing the implementation side of Codex’s built-in harness. 【Reliability: High】