```
# Reverse-Engineer Project Documentation
You have access to a software project's codebase. The project has no
documentation. Your task is to create the full documentation set from
the source code.
Write all artifacts into `src/docs/`. All documentation in **English**,
**AsciiDoc format** (.adoc). Diagrams as **PlantUML** (embedded in
AsciiDoc). Reference workflow:
https://llm-coding.github.io/Semantic-Anchors/spec-driven-development
**Important:** Do not use `git log` or `git blame`. Work from the
current state of the code only.
## Artifacts to produce
Work through these in order. Each artifact builds on the previous one.
### 1. PRD
File: `src/docs/PRD/PRD-001.adoc`
Product Requirements Document with Vision, Problem Statement, Target
Audience, Functional Requirements (FR-IDs), Non-Functional Requirements
(NFR-IDs), Future Considerations, and Open Questions. Derive everything
from code, CLI UX, error messages, test scenarios, and go.mod
dependencies.
### 2. Specification
| Artifact | File | Format |
|----------|------|--------|
| Use Cases | `src/docs/spec/01_use_cases.adoc` | Cockburn format (UC-IDs, Business Rules as BR-IDs). Include PlantUML activity diagram per Use Case covering all flows. |
| CLI Specification | `src/docs/spec/02_cli_specification.adoc` | Derive from Cobra command definitions, flags, integration tests. |
| Data Models | `src/docs/spec/03_data_models.adoc` | Domain structs, JSON/JSONC schemas, file formats. Examples from test fixtures. |
| Acceptance Criteria | `src/docs/spec/04_acceptance_criteria.adoc` | Gherkin (Given/When/Then), referencing UC-IDs. Derive from test names and assertions. |
| Sync Specification | `src/docs/spec/05_sync_specification.adoc` | If sync logic exists: algorithm, conflict resolution, state management, edge cases. |
### 3. Architecture Documentation
**arc42** with all 12 chapters. Master file: `src/docs/arc42/arc42.adoc`
Chapter files in `src/docs/arc42/chapters/`. Visualization with
**C4 model** diagrams (Context, Container, Component levels in PlantUML).
Architecture decisions as **Nygard ADRs** in
`src/docs/arc42/ADRs/ADR-NNN-Title.adoc`. Each ADR includes a
**Pugh Matrix** (weighted, -1/0/+1 scale) evaluating at least 2-3
alternatives against quality goals.
### 4. Open Questions List
File: `src/docs/OPEN_QUESTIONS.adoc`
**This is the most important artifact.**
For every piece of information you could NOT determine from the code,
create an entry:
=== OQ-NNN: <Question>
Category:: <Business Context | Design Rationale | Quality Goals |
Stakeholder Context | Future Direction |
Domain Knowledge>
Confidence:: <Low | Medium | High>
Your Best Guess:: <what you think the answer might be>
Why You Can't Be Sure:: <what's missing from the code>
What Would Help:: <what information would answer this>
Be thorough. Every assumption you made while writing PRD, Spec, and
arc42 that you couldn't verify from code alone should appear here.
## How to work
1. Explore codebase structure, read go.mod, main entry point, CLI
commands
2. Read core domain types and interfaces
3. Read tests and test fixtures — richest source of behavioral
specification
4. Build your mental model, then write artifacts in order
5. For every statement: "Can I prove this from code, or am I guessing?"
If guessing, add an Open Question.
## Quality bar
- Every claim must be traceable to code. If you can't point to the
source, it's an Open Question.
- Prefer "I don't know" over a plausible guess.
- Completeness matters: if the code does it, the documentation should
cover it.
```
# Brownfield Experiment 1a: Report

## 1. Experiment Design

### 1.1. Background
The Spec-Driven Development workflow (https://llm-coding.github.io/Semantic-Anchors/spec-driven-development) has demonstrated that LLMs can generate maintainable code from specifications. The documentation artifacts produced in this workflow (PRD, Specification, arc42) appear to capture what Peter Naur described as the "theory" of a program [naur85] — the mental model that, according to Naur, cannot be fully documented. Whether or not Naur was right about human programmers, the Spec-Driven workflow shows that for LLM-generated code, this theory CAN be externalized in structured documentation.
### 1.2. Research Question

The open question is the Brownfield case. Legacy software typically has no specification, few tests, and insufficient architecture documentation. Can an LLM extract the necessary documentation from legacy code and thus enable further development using the Spec-Driven workflow?
Answering this directly by applying an LLM to real legacy software is difficult: the quality of the generated documentation is hard to assess without a ground truth to compare against. The evaluation itself would be time-consuming and subjective.
### 1.3. The Trick
This experiment uses a shortcut. We take an LLM-generated Greenfield project where we can assume that Spec, Tests, and arc42 documentation are of high quality. We transform it into a simulated legacy project by deleting this documentation, then ask an LLM to reconstruct it from the code alone. Because the original Greenfield documentation exists, we can objectively assess the quality of the generated output by comparing the two.
Since we know in advance that not everything can be extracted from code (decisions, rationale, business context), we instruct the LLM to maintain a list of Open Questions. This reveals precisely which information is genuinely missing from the code — and what a Brownfield project would need to provide before an LLM can work on it productively.
Because the experiment is reproducible (same code, same prompt, deterministic comparison), the extraction prompt can be improved step by step.
### 1.4. Method

- Take a Greenfield project with complete documentation (PRD, Specification, arc42, ADRs)
- Create a branch and delete all documentation files and the project’s CLAUDE.md
- In a fresh LLM session (no prior knowledge of the project), provide only the prompt shown at the beginning of this report
- Let the LLM read the code and generate the full documentation set
- Compare the generated documentation against the originals
### 1.5. Subject Project

Bausteinsicht — an architecture-as-code CLI tool that provides bidirectional synchronization between a JSONC architecture model and draw.io diagrams.

- Language: Go (~13,000 lines of code)
- Tests: 39 test files with ~400 tests (unit, integration, property-based, benchmarks)
- Original documentation: 47 files, ~13,800 lines (PRD, 8 Use Cases, 5 ADRs, 12 arc42 chapters, tutorials, security review, E2E test plan)
- Repository: https://github.com/docToolchain/Bausteinsicht
The original documentation was not human-written. It was LLM-generated following the Spec-Driven Development workflow from a requirements conversation.
### 1.6. Branch Preparation

On branch `brownfield`, the following files were deleted:

- `src/docs/` (all subdirectories: PRD, spec, arc42, security, manual, announcements, E2E reports)
- `CLAUDE.md` (project conventions, quality goals, package structure)

Kept intact: all source code, tests, test data, Makefile, go.mod, examples, templates, README.
### 1.7. Prompt

The prompt (reproduced at the beginning of this report) uses Semantic Anchors (established methodology terms like "Cockburn", "arc42", "Nygard ADR", "Pugh Matrix") instead of spelling out format definitions. It is 69 lines in total.
### 1.8. Evaluation Method
The generated artifacts in src/docs/ were compared against the originals from the main branch. Comparison was performed per artifact type (PRD, Spec files, arc42 chapters, ADRs) and per information category (functional requirements, design rationale, quality goals, etc.). Assessment is qualitative (good / partial / poor) based on manual review of content, not automated metrics.
## 2. Results at a Glance

| Metric | Original | Generated | Assessment |
|---|---|---|---|
| Total lines of docs | ~13,800 | 3,850 | 28% of original |
| PRD: Functional Requirements | 7 FRs | 21 FRs | Generated more granular |
| PRD: Non-Functional Requirements | 4 NFRs | 13 NFRs | Generated significantly more comprehensive |
| Use Cases | 8 (UC-1..8) | 9 (UC-001..009) | +1 (Validate as separate UC) |
| Acceptance Criteria | 40 Gherkin scenarios | 69 numbered ACs | Generated more testable |
| arc42 chapters | 12 (+ reviews, diagrams) | 12 (text only) | Structurally equivalent |
| ADRs | 5 (incl. rejected) | 6 (all Accepted) | Different topic selection |
| Open Questions | — | 33 questions | New artifact |
| PlantUML diagrams | 8 | 11 | Generated has more |
| Glossary | 2 entries (placeholder) | 31 entries | Generated complete |
## 3. What the LLM did well

### 3.1. Technical accuracy

Every claim is traceable to code. The LLM cites function names (`sanitizeID`, `applyRelSwap`, `StripJSONC`), test functions (`TestInitCreatesFiles`), constants (`MaxElementDepth = 50`, `MaxModelFileSize = 10 MiB`), and security codes (SEC-001, SEC-016). The original references no code.
### 3.2. Finer granularity

The original has 7 coarse Functional Requirements. The LLM produced 21 FRs in logical groups (Model, CLI, Sync, Views, Validation, Errors). It also produced 69 Acceptance Criteria instead of 40, each referencing test function names, which makes them directly verifiable against code.
### 3.3. Security documentation
The original mentions security only in passing. The LLM extracted 6 SEC codes with enforcement details, documented path traversal validation, and formalized security as an NFR. A positive surprise.
### 3.4. Formalized sync specification
The original describes the sync algorithm narratively. The LLM formalized the three-way diff as a truth table (M==S / D==S combinations), systematically listed 12 edge cases, and extracted layout constants (gap=60, min scope=400x300) from code.
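The decision logic behind such a truth table is small enough to show. The sketch below is illustrative only; the type and function names are invented, not taken from the Bausteinsicht code:

```go
// Illustrative three-way diff decision over a snapshot S of the last
// synced state. Names are hypothetical, not from the actual codebase.
package main

import "fmt"

type Action string

const (
	NoChange       Action = "no change"        // neither side diverged
	ApplyToDiagram Action = "apply to diagram" // only the model changed
	ApplyToModel   Action = "apply to model"   // only the diagram changed
	Conflict       Action = "conflict"         // both sides changed
)

// decide classifies one synced value by comparing the model (m) and the
// diagram (d) against the snapshot (s) taken at the last sync.
func decide(m, d, s string) Action {
	switch {
	case m == s && d == s:
		return NoChange
	case d == s: // m != s
		return ApplyToDiagram
	case m == s: // d != s
		return ApplyToModel
	default: // m != s && d != s
		return Conflict
	}
}

func main() {
	fmt.Println(decide("a", "a", "a")) // no change
	fmt.Println(decide("b", "a", "a")) // apply to diagram
	fmt.Println(decide("a", "b", "a")) // apply to model
	fmt.Println(decide("b", "c", "a")) // conflict
}
```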
### 3.5. Complete glossary
The original had only a placeholder (2 example terms). The LLM correctly defined 31 domain terms.
## 4. What the LLM could not reconstruct

### 4.1. Business context and vision
The original starts with a clear Problem Statement: "Structurizr and LikeC4 have limitation X, Y, Z." The LLM doesn’t know the competitors and cannot derive the strategic positioning. The vision remains generic.
> **Insight:** Code says WHAT was built, not WHY and not AGAINST WHOM.
### 4.2. Design rationale

The LLM wrote 6 ADRs, but with different topics than the original:

| Original ADR | Generated ADR |
|---|---|
| ADR-001: DSL Format (JSONC vs TypeScript vs Custom) | ADR-001: JSONC as DSL (correct, but fewer alternatives) |
| ADR-002: Implementation Language (Go vs Python vs Kotlin) | ADR-002: Cobra CLI Framework (different topic!) |
| ADR-003: Risk Classification (Vibe-Coding Risk Radar) | — missing entirely — |
| ADR-004: Sequence Diagram Export (rejected) | ADR-004: Conflict Policy (different topic!) |
| ADR-005: Auto-Layout Engine | ADR-005: etree XML Library (different topic!) |
| — | ADR-003: Three-Way Diff (new topic) |
| — | ADR-006: Embedded Templates (new topic) |
The LLM can see THAT Go was chosen, but not WHY Python and Kotlin were rejected. The Pugh Matrices in the generated document evaluate plausible but partly different alternatives than those actually evaluated. This aligns with [garcia24]: when given ADR context, LLMs can generate reasonable decisions, but reconstructing context from code alone is a harder, unsolved problem.
> **Insight:** Code is the result of decisions, not the decision itself. ADR context is fundamentally not derivable from code.
### 4.3. Quality goals and their prioritization
The original has three prioritized Quality Goals: Learnability (30-min onboarding), IDE Support (JSON Schema), LLM Friendliness. The LLM identified 6 Quality Goals but the prioritization is missing.
> **Insight:** Tests show what IS tested, not what SHOULD BE tested.
### 4.4. Stakeholder context
The original defines three stakeholders (Architect, Developer, LLM Agent) with their concerns. The LLM derives stakeholders from CLI UX, but cannot reconstruct skill levels, expectations, and concerns.
### 4.5. Aspirational features
UC-7 "Drill-Down Navigation" (zoom-based navigation on a single draw.io page) is described in the original but not fully implemented in code. The LLM did not mention it — it can only document what exists, not what was planned.
> **Insight:** Aspirational features (planned but not implemented) vanish completely during reverse engineering.
### 4.6. Narrative documents

Four files in the original have no counterpart:

| Missing document | Lines | Why not derivable |
|---|---|---|
| | 266 | Requires didactic preparation |
| | 322 | UX/design knowledge, not in code |
| | 55 | Strategic decision |
| | 409 | Test design, not test code |
### 4.7. Performance metrics
The original documents: Startup <10ms, Sync <100ms, Binary 10-15MB. The LLM found benchmarks but no thresholds — because thresholds are decisions, not code facts.
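The contrast is easy to see in code. Both snippets below are hypothetical, not taken from the project:

```go
// Hypothetical contrast: a benchmark measures, but only an explicit
// assertion turns a threshold into a code fact an LLM could extract.
package sync

import (
	"testing"
	"time"
)

func runSyncOnce() {} // stub standing in for the real sync operation

// A benchmark like this signals that sync performance matters,
// but says nothing about what "fast enough" means.
func BenchmarkSync(b *testing.B) {
	for i := 0; i < b.N; i++ {
		runSyncOnce()
	}
}

// Only a test like this would make the documented <100ms budget
// derivable from code. No such test existed in the project.
func TestSyncWithinBudget(t *testing.T) {
	start := time.Now()
	runSyncOnce()
	if d := time.Since(start); d > 100*time.Millisecond {
		t.Fatalf("sync took %v, budget is 100ms", d)
	}
}
```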
### 4.8. Architecture reviews
The original contains ATAM reviews (808 lines), LASR reviews, and review updates. These are historical artifacts not derivable from code.
## 5. arc42: Chapter-by-Chapter Assessment

| Ch. | Title | Derivable? | Rating | Detail |
|---|---|---|---|---|
| 1 | Introduction and Goals | partial | ⚠️ | Quality Goals found (6 instead of 3), but prioritization missing. Stakeholders derived from CLI UX, but concerns and skill levels missing. Competitor comparison completely gone. |
| 2 | Architecture Constraints | good | ✅ | Generated even better: 15 constraints instead of 5, with Go version, CGO_ENABLED=0, 6 platform targets. More specific and operationally useful. |
| 3 | Context and Scope | partial | ⚠️ | C4 Context diagram correctly generated. But original has detailed communication partner matrix with 6 interfaces — generated only 7 OS-level channels. Abstraction level is wrong. This confirms the "granularity mismatch" finding from [cabrera26]. |
| 4 | Solution Strategy | partial | ⚠️ | 5 strategic decisions correctly identified. But design patterns missing. Original explains HOW strategy addresses quality goals — generated stays at WHAT. |
| 5 | Building Block View | good | ✅ | More detailed than original: 8 components with responsibility statements instead of 4 coarse blocks. Level 2 decomposition correct (model: 5, sync: 9, drawio: 5). Consistent with ArchAgent’s F1=0.966 for structural recovery [archagent26]. |
| 6 | Runtime View | partial | ⚠️ | 5 scenarios with sequence diagrams (original: 4). Bonus: comment preservation and conflict resolution. But: LLM-Driven Modification scenario completely missing — aspirational, not in code. |
| 7 | Deployment View | poor | ❌ | Performance metrics completely missing. No installation instructions. No embedded resources concept. Only generic "static binary, goreleaser" description. |
| 8 | Crosscutting Concepts | mixed | ⚠️ | Security better (6 SEC codes). Test discipline more detailed. But: error handling, logging, version management, and configuration discovery completely missing. The ECSA 2025 study confirms that LLMs "struggle with complex abstractions such as class relationships and fine-grained design patterns" [ecsa25]. |
| 9 | Architecture Decisions | different | ⚠️ | 6 ADRs instead of 5, all with Pugh Matrix. But different topics. Code shows WHAT was decided, not WHY. |
| 10 | Quality Requirements | different | ⚠️ | 12 requirements instead of 6, evidence-based. But original has scenarios in stimulus/response format (ISO 25010) — generated has only a table. |
| 11 | Risks and Technical Debt | good | ✅ | 8 risks instead of 4, 6 technical debts. But "Non-Risks" section missing. ATAM review reference missing. |
| 12 | Glossary | very good | ✅✅ | 31 terms fully defined vs. 2 placeholders in original. Clear winner. |
### 5.1. Summary

Well derivable from code (4 chapters):

- Ch. 2 (Constraints) — technical facts directly from go.mod, Makefile, CI
- Ch. 5 (Building Block View) — package structure IS the architecture
- Ch. 11 (Risks) — error handling and edge-case tests reveal risks
- Ch. 12 (Glossary) — domain terms from struct names and package names

Partially derivable (6 chapters):

- Ch. 1 (Goals) — quality goals yes, prioritization and stakeholder concerns no
- Ch. 3 (Context) — system boundary yes, communication partner details no
- Ch. 4 (Strategy) — decisions yes, strategy-to-quality-goal mapping no
- Ch. 6 (Runtime) — implemented scenarios yes, aspirational ones no
- Ch. 8 (Concepts) — some yes (security, testing), others no (error handling, logging)
- Ch. 10 (Quality) — requirements yes, scenario format no

Poorly derivable (2 chapters):

- Ch. 7 (Deployment) — performance budgets and installation details are decisions
- Ch. 9 (Decisions/ADRs) — code shows results, not the decision process
## 6. Open Questions: Quality as a Brownfield Checklist

The LLM generated 33 open questions in 8 categories.

| Assessment | Count | Percent |
|---|---|---|
| Valid (genuinely not derivable from code) | 31 | 79% |
| Partially valid (inferable from code but not tested) | 6 | 15% |
| Should be closed (already answered) | 2 | 5% |
Strengths:

- Missing documentation correctly identified (schema file, user manual, tutorial, trust model)
- Design rationale systematically recognized as a gap
- "What Would Help" provides concrete action items

Gaps — what should have been asked:

- No question about open-source sustainability (who maintains this?)
- No question about the competitive landscape (which tools does this compete with?)
- No question about test coverage strategy (what is "good enough"?)
- No question about CI/CD platform support
- No question about the Node.js exclusion (stated in CLAUDE.md with rationale)
## 7. What Brownfield Projects Need for the Dark Factory

### 7.1. Derivable from code (LLM can do this itself)

- Functional requirements (WHAT the system does)
- Data models and interfaces
- CLI specification (commands, flags, exit codes)
- Acceptance criteria (from tests)
- Crosscutting concepts (error handling, security, atomicity)
- Glossary (domain terms)
- Building block view (package structure, dependencies)
### 7.2. NOT derivable from code (must be documented)

- Business context: Why does this project exist? Against whom? For whom?
- Design rationale: Why was alternative A chosen over B? (ADR context)
- Quality goal prioritization: What is most important and why?
- Stakeholder concerns: Who uses it, what is their skill level, what do they expect?
- Aspirational features: What is planned but not yet implemented?
- Performance budgets: What thresholds apply?
- Tutorials and guides: Didactic preparation requires humans
- Review results: Historical assessments and their consequences
### 7.3. The Brownfield Preparation Checklist

Before a legacy project can enter the Dark Factory, it needs at minimum:

- A Problem Statement with competitive context (1 page)
- ADR context sections for the top 5 decisions (1 paragraph "why" each)
- Prioritized Quality Goals (top 3 with rationale)
- Stakeholder profiles (who uses it, what they can do, what they expect)
- A "Not Implemented Yet" list (planned features)
- Performance budgets (measurable thresholds)

Everything else the LLM can reconstruct on its own — and in some areas does it better than the original (more FRs, more ACs, better security documentation, complete glossary).
## 8. Where the Generated Documentation is Genuinely Better
The generated docs are not just longer — in five areas they are substantively better. This reveals spec drift: things that were built but never documented.
### 8.1. Security: integrated instead of separated

The original relegates security to a separate review document. The generated version integrates security directly into the specification with traceable SEC-IDs (SEC-001 through SEC-018) woven through PRD, CLI spec, and acceptance criteria. A maintainer reading NFR-005 (Security — path containment) can immediately find the test (`TestRootCmd_RejectsModelPathTraversal`) and the enforcement code (`root.go:validatePathContainment`).

The original PRD has zero security NFRs. The code has six security mechanisms. That gap is spec drift.
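The sketch below shows one common shape for such a containment check in Go. It is a stand-in for illustration, not the project's actual `validatePathContainment`:

```go
// Hypothetical path-containment check; the project's real implementation
// (root.go:validatePathContainment) may differ in detail.
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// withinRoot reports whether path stays inside root after resolving
// ".." segments, rejecting traversal like "../../etc/passwd".
func withinRoot(root, path string) (bool, error) {
	absRoot, err := filepath.Abs(root)
	if err != nil {
		return false, err
	}
	absPath, err := filepath.Abs(filepath.Join(root, path))
	if err != nil {
		return false, err
	}
	rel, err := filepath.Rel(absRoot, absPath)
	if err != nil {
		return false, err
	}
	escaped := rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator))
	return !escaped, nil
}

func main() {
	ok, _ := withinRoot(".", "../../etc/passwd")
	fmt.Println(ok) // false: paths escaping the root are rejected
}
```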
### 8.2. Acceptance criteria: test-traceable instead of prose

The original uses Gherkin scenarios with no connection to actual tests. The generated version cites test function names inline: `AC-001-01: … // test: TestInitCreatesFiles`. An architect can verify each criterion against the test suite. The original is write-only documentation — readable by humans, unverifiable by machines.
### 8.3. Sync algorithm: formalized instead of narrative
The original describes the three-way diff in prose across multiple sections. The generated version formalizes it as a truth table (M==S / D==S combinations), lists 12 edge cases in a structured table, and names the exact functions. The truth table is verifiable; the prose is interpretable.
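Reconstructed from that description, the truth table has four rows (the generated document's exact wording may differ):

| M==S | D==S | Interpretation | Action |
|---|---|---|---|
| true | true | unchanged since last sync | nothing to do |
| false | true | model edited | propagate the model change to the diagram |
| true | false | diagram edited | propagate the diagram change to the model |
| false | false | both edited | conflict, resolved per policy |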
### 8.4. Building block view: actionable instead of descriptive

The original uses passive voice and generic descriptions. The generated version uses active, verb-first responsibility statements with explicit contracts: `patch.go | Byte-range patch operations on raw JSONC. PatchSave, PatchInsert. Preserves comments and indentation.`
### 8.5. NFRs: actual requirements instead of aspirational
The original has 4 NFRs written before implementation. The generated version found 12 — including security constraints, robustness bounds (10 MiB file limit, depth 50), benchmark mandates, and quality gates (gosec, nilaway, govulncheck) that were implemented but never added to the PRD. These are real requirements that govern the project’s CI pipeline.
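As an illustration of how such bounds surface as extractable code facts (the constant names follow the report; the enforcement code around them is invented):

```go
// Illustrative enforcement of the robustness bounds the LLM extracted.
// Constant names follow the report; everything else is hypothetical.
package model

import (
	"errors"
	"fmt"
	"os"
)

const (
	MaxModelFileSize = 10 << 20 // 10 MiB cap on the model file
	MaxElementDepth  = 50       // cap on element nesting
)

// CheckModelFile rejects model files above the size bound.
func CheckModelFile(path string) error {
	info, err := os.Stat(path)
	if err != nil {
		return err
	}
	if info.Size() > MaxModelFileSize {
		return fmt.Errorf("model file %s: %d bytes exceeds %d", path, info.Size(), MaxModelFileSize)
	}
	return nil
}

// Element is a minimal stand-in for the real domain type.
type Element struct{ Children []Element }

// CheckDepth rejects nesting deeper than MaxElementDepth.
func CheckDepth(e Element, depth int) error {
	if depth > MaxElementDepth {
		return errors.New("element nesting exceeds MaxElementDepth")
	}
	for _, c := range e.Children {
		if err := CheckDepth(c, depth+1); err != nil {
			return err
		}
	}
	return nil
}
```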
### 8.6. Spec drift is a structural property
All five areas share the same root cause: the spec was generated from a requirements conversation before implementation, and the code evolved beyond it. The original documentation was not human-written — it was LLM-generated following the Spec-Driven Development workflow. Yet spec drift happened anyway: during implementation, the LLM added security hardening, validation rules, edge cases, and performance tooling that were never part of the original requirements conversation.
This means spec drift is not a discipline problem. It is a structural property of the workflow: the implementation LLM discovers requirements that the specification LLM could not anticipate. Security constraints emerge from code review. Edge cases emerge from testing. Performance bounds emerge from benchmarks. None of these feed back into the spec automatically. The SDD paper [sdd26] defines this as "any divergence between declared system intent and observed system behavior" and identifies it as a core challenge for AI-assisted development.
## 9. Implications for the Dark Factory Workflow

### 9.1. Specs need periodic reconciliation
The Dark Factory workflow is spec-first: write PRD, write spec, generate code. But the experiment shows that even in a Greenfield project with rigorous documentation, the spec drifts from the code within weeks.
The fix is a spec reconciliation step: periodically run the Brownfield reverse-engineering prompt against the current code and diff the output against the existing spec (a minimal sketch follows the list). The diff reveals:

- New requirements implemented but not documented (security NFRs, validation rules)
- Changed behavior that diverged from the original spec
- Dead spec — requirements still documented but no longer in the code
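A minimal sketch of the deterministic core of such a reconciliation, classifying requirement IDs as NEW or DEAD. The file paths and the ID pattern are assumptions, and detecting changed behavior additionally requires a content-level comparison:

```go
// Classify requirement IDs as NEW (only in the regenerated spec) or
// DEAD (only in the existing spec). Paths and the ID pattern are
// illustrative, not the project's real layout.
package main

import (
	"fmt"
	"os"
	"regexp"
)

var idPattern = regexp.MustCompile(`\b(?:FR|NFR|UC|AC)-\d+\b`)

// idsIn extracts all requirement IDs found in one spec file.
func idsIn(path string) map[string]bool {
	data, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}
	ids := map[string]bool{}
	for _, id := range idPattern.FindAllString(string(data), -1) {
		ids[id] = true
	}
	return ids
}

func main() {
	existing := idsIn("src/docs/spec/04_acceptance_criteria.adoc")
	regenerated := idsIn("regenerated/spec/04_acceptance_criteria.adoc")
	for id := range regenerated {
		if !existing[id] {
			fmt.Println("NEW:", id) // implemented but undocumented
		}
	}
	for id := range existing {
		if !regenerated[id] {
			fmt.Println("DEAD:", id) // documented but no longer in code
		}
	}
}
```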
### 9.2. When to reconcile

Three natural trigger points:

- Before a release — ensure the spec matches what ships
- After a security review — security hardening often adds undocumented constraints
- Before onboarding — new team members (human or LLM) need accurate specs
The reconciliation is cheap: one LLM run, one diff. The cost of NOT doing it is higher: LLM agents working from stale specs produce code that contradicts the actual codebase.
### 9.3. The workflow becomes a loop

The original Dark Factory workflow is linear: Spec → Code → Ship. With reconciliation it becomes a loop:

    Spec (human: WHY) -> Code (LLM) -> Reconcile (LLM: WHAT changed?) -> Update Spec -> ...
The human writes the WHY once and maintains it. The LLM keeps the WHAT synchronized with the code. This division of labor matches what the experiment showed: humans are better at rationale, LLMs are better at completeness.
## 10. Implications for Semantic Anchors

### 10.1. Semantic Anchors work as prompt compression
The anchored prompt (69 lines) produced 3,850 lines of documentation with correct Cockburn format, all 12 arc42 chapters, and Pugh Matrices in the ADRs. The terms "Cockburn", "arc42", "Nygard ADR", "Pugh Matrix", "Gherkin", "C4 model" triggered the full knowledge from training data without the prompt spelling out what those formats contain.
This is empirical evidence that Semantic Anchors work. A single well-chosen term activates a complete methodology in the LLM’s weights. No definition needed, no examples needed. The anchor IS the definition. A systematic literature review [slr25] covering 18 papers on software architecture and LLMs found no prior study examining this compression effect.
### 10.2. Semantic Anchors define where human effort belongs

The experiment divides documentation into two categories:

| Category | Example | Human needed? |
|---|---|---|
| What the system does | FRs, data models, CLI spec, acceptance criteria | No — LLM derives from code |
| Why it was built this way | Business context, ADR rationale, quality goal priorities | Yes — not in code |
In the Spec-Driven Development workflow, human effort should concentrate on the why: Problem Statement, ADR context, quality goal prioritization, stakeholder concerns. The LLM handles the what: functional specs, data models, acceptance criteria, building block views.
This changes the workflow’s cost structure. Writing a PRD is no longer about listing features (the LLM does that better). It’s about capturing the competitive context and strategic intent that code cannot express.
### 10.3. Connection to Eichhorst’s Principle
In Shannon’s noisy-channel model, the documentation that an LLM cannot derive from code is exactly the signal that must still be transmitted. Business context and design rationale are that signal. Code is not a channel for it: code is the output of the decisions, not the decisions themselves.
The Brownfield Preparation Checklist (6 items above) defines the minimum information that must travel through the documentation channel before an LLM can work productively on a legacy codebase. Everything below this threshold means the LLM operates with insufficient channel capacity — it will guess at rationale, invent stakeholder concerns, and miss aspirational features. The error rate climbs exactly as Eichhorst’s Principle predicts.
## 11. Prompt Improvements After Experiment 1a

Four weaknesses in the prompt were identified and fixed (in both prompt variants):

| Problem | Root cause | Prompt change |
|---|---|---|
| UC-7 Drill-Down completely overlooked | The LLM documents only what is implemented. Aspirational features (traces: TODOs, unused interfaces, partial implementations) are lost. | PRD section: "Look for TODOs, commented code, unused interfaces, and partially implemented features. Document them as 'Planned but not implemented'." |
| ADR context guessed instead of flagged | The LLM writes a plausible "why" for decisions even though it cannot derive this from code. A wrong rationale is worse than "unknown". | ADR section: "Look for clues in code comments and naming patterns. If concrete evidence exists, use it. If not, flag as Open Question." |
| Performance budgets ignored | The LLM found benchmarks but derived no thresholds. Thresholds are decisions, not code facts. | Deployment chapter: "Derive performance thresholds from benchmarks if possible. If no pass/fail thresholds, flag as Open Question." |
| Open Questions not assignable | The generated Open Questions list gives no indication of who in the organization can answer each question. Without role assignment, the list sits as a monolith rather than actionable work items. | Open Questions template: added an `Ask::` field (Product Owner, Architect, Developer, Domain Expert, or Operations). |
## 12. Improved Prompt (v2)

Based on the four weaknesses identified above, the prompt was revised. Changes are marked with `// NEW` comments. This is the recommended version for future experiments.
```
# Reverse-Engineer Project Documentation
You have access to a software project's codebase. The project has no
documentation. Your task is to create the full documentation set from
the source code.
Write all artifacts into `src/docs/`. All documentation in **English**,
**AsciiDoc format** (.adoc). Diagrams as **PlantUML** (embedded in
AsciiDoc). Reference workflow:
https://llm-coding.github.io/Semantic-Anchors/spec-driven-development
**Important:** Do not use `git log` or `git blame`. Work from the
current state of the code only.
## Artifacts to produce
Work through these in order. Each artifact builds on the previous one.
### 1. PRD
File: `src/docs/PRD/PRD-001.adoc`
Product Requirements Document with Vision, Problem Statement, Target
Audience, Functional Requirements (FR-IDs), Non-Functional Requirements
(NFR-IDs), Future Considerations, and Open Questions. Derive everything
from code, CLI UX, error messages, test scenarios, and go.mod
dependencies.
// NEW: aspirational features
Look for TODOs, commented code, unused interfaces, and partially
implemented features. Document them as "Planned but not implemented" in
Future Considerations. These are easy to miss but critical — they
represent intent that only exists as traces in the code.
### 2. Specification
| Artifact | File | Format |
|----------|------|--------|
| Use Cases | `src/docs/spec/01_use_cases.adoc` | Cockburn format (UC-IDs, Business Rules as BR-IDs). Include PlantUML activity diagram per Use Case covering all flows. |
| CLI Specification | `src/docs/spec/02_cli_specification.adoc` | Derive from Cobra command definitions, flags, integration tests. |
| Data Models | `src/docs/spec/03_data_models.adoc` | Domain structs, JSON/JSONC schemas, file formats. Examples from test fixtures. |
| Acceptance Criteria | `src/docs/spec/04_acceptance_criteria.adoc` | Gherkin (Given/When/Then), referencing UC-IDs. Derive from test names and assertions. |
| Sync Specification | `src/docs/spec/05_sync_specification.adoc` | If sync logic exists: algorithm, conflict resolution, state management, edge cases. |
### 3. Architecture Documentation
**arc42** with all 12 chapters. Master file: `src/docs/arc42/arc42.adoc`
Chapter files in `src/docs/arc42/chapters/`. Visualization with
**C4 model** diagrams (Context, Container, Component levels in PlantUML).
Architecture decisions as **Nygard ADRs** in
`src/docs/arc42/ADRs/ADR-NNN-Title.adoc`. Each ADR includes a
**Pugh Matrix** (weighted, -1/0/+1 scale) evaluating at least 2-3
alternatives against quality goals.
// NEW: ADR rationale guidance
For ADRs: you can usually determine WHAT was decided from the code,
but rarely WHY alternatives were rejected. Look for clues in code
comments, naming patterns (e.g. `ModelWinsResolver` implies other
resolvers were considered), and interface designs that hint at
alternatives. If you find concrete evidence for the rationale, use it.
If not, flag the reasoning as Open Question rather than guessing a
plausible rationale. A wrong "why" is worse than an honest "unknown."
// NEW: performance budgets
For Chapter 7 (Deployment View): derive performance thresholds from
benchmark tests if possible. If benchmarks exist but define no
pass/fail thresholds, flag the missing budgets as Open Questions.
### 4. Open Questions List
File: `src/docs/OPEN_QUESTIONS.adoc`
**This is the most important artifact.**
For every piece of information you could NOT determine from the code,
create an entry:
=== OQ-NNN: <Question>
Category:: <Business Context | Design Rationale | Quality Goals |
Stakeholder Context | Future Direction |
Domain Knowledge>
// NEW: role assignment
Ask:: <Product Owner | Architect | Developer | Domain Expert |
Operations>
Confidence:: <Low | Medium | High>
Your Best Guess:: <what you think the answer might be>
Why You Can't Be Sure:: <what's missing from the code>
What Would Help:: <what information would answer this>
Be thorough. Every assumption you made while writing PRD, Spec, and
arc42 that you couldn't verify from code alone should appear here.
## How to work
1. Explore codebase structure, read go.mod, main entry point, CLI
commands
2. Read core domain types and interfaces
3. Read tests and test fixtures — richest source of behavioral
specification
4. Build your mental model, then write artifacts in order
5. For every statement: "Can I prove this from code, or am I guessing?"
If guessing, add an Open Question.
## Quality bar
- Every claim must be traceable to code. If you can't point to the
source, it's an Open Question.
- Prefer "I don't know" over a plausible guess.
- Completeness matters: if the code does it, the documentation should
cover it.
```
## 13. Threats to Validity and Future Work

### 13.1. No static analysis
ArchAgent [archagent26] achieves F1=0.966 by combining static analysis (dependency graphs, call graphs) with LLM synthesis. Our experiment uses a pure LLM approach: the model reads source files sequentially with no pre-computed structural information. A preprocessing step exporting dependency graphs, call graphs, or AST summaries could improve the Building Block View and Runtime View, where the LLM currently misses cross-package relationships.
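For a Go codebase, a cheap version of this preprocessing needs no custom tooling: `go list -json ./...` already exports the package dependency graph. A sketch of such an export step (the edge-list output format is our choice, not ArchAgent's):

```go
// Export the package dependency graph before the LLM run, so structural
// facts don't have to be inferred from sequential file reads.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"os/exec"
)

type pkg struct {
	ImportPath string
	Imports    []string
}

func main() {
	// `go list -json ./...` emits one JSON object per package.
	out, err := exec.Command("go", "list", "-json", "./...").Output()
	if err != nil {
		panic(err)
	}
	dec := json.NewDecoder(bytes.NewReader(out))
	for {
		var p pkg
		if err := dec.Decode(&p); err == io.EOF {
			break
		} else if err != nil {
			panic(err)
		}
		for _, imp := range p.Imports {
			fmt.Printf("%s -> %s\n", p.ImportPath, imp) // one dependency edge
		}
	}
}
```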
### 13.2. Zero-shot prompting (nuanced)
The largest architecture view study [cabrera26] shows that few-shot prompting reduces clarity failures by 9.2%. The user stories paper [userstories25] demonstrates that "a single example lets an 8B model match 70B performance." Our prompt provides no examples.
However, the impact varies by artifact type. Strong Semantic Anchors like "arc42", "Cockburn Use Cases", or "Nygard ADR" carry their definition in the LLM’s training data — books, conference talks, and thousands of documented examples. A few-shot example for arc42 would be redundant and might even constrain the output by biasing towards the example rather than the anchor’s full semantics. The experiment confirms this: all 12 arc42 chapters were generated in the correct structure without examples.
Where few-shot examples would likely help is for non-standard formats that have no anchor in the training data: our Open Questions list (OQ-NNN with Category, Confidence, Best Guess fields) and the Reconciliation Report (NEW/CHANGED/DEAD categories) are custom formats. A single example entry would reduce ambiguity about the expected output structure.
### 13.3. Git history fully blocked

We block `git log` and `git blame` because commit messages reference specification IDs from the original documentation. However, commit messages also contain design rationale ("chose X because Y", "rejected approach Z due to performance"). The SDD paper [sdd26] identifies commit history as a valuable signal channel. A more nuanced approach would allow git history but filter out spec-ID references, preserving the rationale signal while blocking the spec-structure signal.
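A sketch of that filter; the spec-ID pattern is an assumption about this project's ID conventions:

```go
// Expose git commit messages to the LLM, but drop lines that reference
// spec IDs so the documentation structure cannot leak.
package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"regexp"
	"strings"
)

var specRef = regexp.MustCompile(`\b(?:FR|NFR|UC|BR|AC|ADR|SEC|OQ)-\d+\b`)

func main() {
	// %s = subject, %n = newline, %b = body.
	out, err := exec.Command("git", "log", "--format=%s%n%b").Output()
	if err != nil {
		panic(err)
	}
	sc := bufio.NewScanner(strings.NewReader(string(out)))
	for sc.Scan() {
		line := sc.Text()
		if specRef.MatchString(line) {
			continue // drop lines that leak spec structure
		}
		fmt.Println(line) // rationale like "chose X because Y" survives
	}
}
```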
### 13.4. Single-shot, no self-reflection
The ECSA 2025 study [ecsa25] uses a Self-Reflection mechanism where the LLM reviews its own output. AgenticAKM [agenticakm26] shows that agentic approaches (iterative refinement with tool use) significantly improve ADR quality over simple LLM calls. Our prompt is a single-shot task with no feedback loop. An agentic workflow where the LLM generates, reviews, and refines its documentation could improve quality, particularly for ADRs and Quality Requirements.
### 13.5. Single LLM, single run
The referenced papers test multiple LLMs (GPT-4, GPT-3.5, Claude, Gemini, Flan-T5) and find significant quality differences between models. Our experiment uses one model (Claude) in one session. This means our results are specific to Claude’s capabilities and may not generalize. A multi-model comparison (same prompt, same codebase, different LLMs) would strengthen the findings. Additionally, a single run provides no statistical significance — repeating the experiment would reveal variance in output quality.
### 13.6. Qualitative evaluation only
The papers use formal metrics: F1 scores, precision, recall, BLEU scores. Our evaluation is manual and qualitative ("good / partial / poor"). For a publication, we would need a formal evaluation framework — for example, counting requirement coverage (what percentage of original FRs appear in the generated output) and measuring factual accuracy (what percentage of generated claims are correct).
### 13.7. Only one project
Bausteinsicht is a well-structured Go CLI tool with clear package boundaries, comprehensive tests, and a single-binary architecture. Results may differ for projects with less clean architecture, fewer tests, dynamic languages, or distributed systems. The Brownfield Preparation Checklist should be validated against projects of different sizes, languages, and architectural styles.
## References

- [naur85] Peter Naur. "Programming as Theory Building." Microprocessing and Microprogramming, 15(5):253-261, 1985. Argues that programming is not primarily about producing code but about building a "theory" — a mental model of how the problem domain maps to the solution. This theory, Naur claims, cannot be fully captured in documentation and dies when the original developers leave. Our experiment tests this claim in the context of LLM-generated code.
- [cabrera26] Cabrera et al. "LLM-based Automated Architecture View Generation: Where Are We Now?" arXiv:2603.21178, March 2026. Largest study (340 repos, 4,137 generated views). Key finding: LLMs "consistently exhibit granularity mismatches, operating at the code level rather than architectural abstractions." 22.6% clarity failure rate, 50% level-of-detail success rate. https://arxiv.org/abs/2603.21178
- [garcia24] Dhar, Vaidhyanathan, Varma. "Can LLMs Generate Architectural Design Decisions? — An Exploratory Empirical Study." arXiv:2403.01709, ICSA 2024. Evaluates GPT-4 and GPT-3.5 generating ADR Decision sections given Context. Finds LLMs can generate reasonable decisions but "further research is required to attain human-level generation." Key difference to our work: they provide the Context, we reconstruct both from code. https://arxiv.org/abs/2403.01709
- [ecsa25] "Automated Software Architecture Design Recovery from Source Code Using LLMs." ECSA 2025, Springer. Evaluates 4 LLMs on class diagrams, design patterns, architectural styles. Finds LLMs "struggle with complex abstractions such as class relationships and fine-grained design patterns." (URL not verified)
- [archagent26] "ArchAgent: Scalable Legacy Software Architecture Recovery with LLMs." arXiv:2601.13007, January 2026. Agent-based framework combining static analysis with LLM synthesis. Achieves F1=0.966 for structural recovery, outperforming DeepWiki (F1=0.860). Validates that building block views are well derivable from code. https://arxiv.org/abs/2601.13007
- [slr25] "Software Architecture Meets LLMs: A Systematic Literature Review." arXiv:2505.16697, May 2025. Analyzed 18 papers. Identifies "generating source code from architectural design, cloud-native computing, and checking conformance" as underexplored areas. Full arc42 reverse engineering is not covered by any of the 18 papers. https://arxiv.org/abs/2505.16697
- [sdd26] Piskala. "Spec-Driven Development: From Code to Contract in the Age of AI Coding Assistants." arXiv:2602.00180, February 2026. Defines spec drift as "any divergence between declared system intent and observed system behavior." Proposes spec-first workflows for AI coding assistants. https://arxiv.org/abs/2602.00180
- [hatahet25] Hatahet et al. "Generating Software Architecture Description from Source Code using Reverse Engineering and Large Language Model." arXiv:2511.05165, November 2025. Semi-automated approach for component and state machine diagrams from C++ code. https://arxiv.org/abs/2511.05165
- [userstories25] Ouf, Li, Zhang, Guizani. "Reverse Engineering User Stories from Code using Large Language Models." arXiv:2509.19587, September 2025. Achieves F1=0.8 for user story recovery from C++ snippets up to 200 NLOC. Function-level granularity only. https://arxiv.org/abs/2509.19587
- [draft25] "DRAFT-ing Architectural Design Decisions using LLMs." arXiv:2504.08207, April 2025. Two-phase approach: offline fine-tuning + online RAG for ADR generation. https://arxiv.org/abs/2504.08207
- [contextmatters26] "Context Matters: Evaluating Context Strategies for Automated ADR Generation Using LLMs." arXiv:2604.03826, April 2026. Finds small recency windows (Last-K, 3-5 records) yield near-optimal ADR generation quality. https://arxiv.org/abs/2604.03826
- [agenticakm26] "AgenticAKM: Enroute to Agentic Architecture Knowledge Management." arXiv:2602.04445, February 2026. Agentic approach significantly improves ADR quality over simple LLM calls. https://arxiv.org/abs/2602.04445
- [fuchss25] Fuchss et al. "Enabling Architecture Traceability by LLM-based Architecture Component Name Extraction." ICSA 2025. F1=0.86 with GPT-4o for linking architecture docs to code. Part of the ARDoCo project at KIT. Complementary to our work: traces existing docs to code rather than generating docs from code. (URL not verified)