Proving the Generative Cycle: From Hypothesis to Working Loop

We claimed an AI could turn natural language into valid page compositions and that an orchestration engine could drive the loop. Then we built the test harness and ran it. Here are the results.

Cover Image for Proving the Generative Cycle: From Hypothesis to Working Loop

Part of the Inner Agility thesis. This is the receipts — 25 test cases proving the generative cycle works, with explicit documentation of what it does NOT yet prove. Read the thesis →

In our Orchestrate post, we described an event-driven AI engine that closes the bidirectional loop between workflow stories and page compositions. The architecture was concrete — hook files, skills injection, chained workflows, a permission layer. But it was a vision. We had not run it.

Now we have.

This post is the receipts. We built two proofs of concept in a single session, tested them with real API calls against Claude Sonnet, and watched the generative cycle close for the first time. Here is what we found.

The Riskiest Assumption

The Orchestrate vision describes an elaborate system: file watchers, webhook endpoints, AI-powered hooks, chained workflows, quality gates, per-customer fork management. All of that is plumbing. The load-bearing claim underneath all of it is simpler:

Can an LLM, given a component catalog as context, reliably produce valid page compositions from natural language?

If the answer is no, no amount of orchestration infrastructure saves the vision. If the answer is yes, everything else is engineering. We tested the hard part first.

Defining What Did Not Exist

The Orchestrate post described the composition format as "TBD." Before we could test anything, we had to define it. The format needed to be diffable, validatable, and AI-interpretable — the three properties the vision doc identified as non-negotiable.

We settled on a flat JSON structure:

{
  "version": "1.0.0",
  "page": {
    "route": "/budgets",
    "title": "Budget Overview",
    "workflowOrigin": {
      "workflowId": "wf3-monthly-review",
      "stepId": 4,
      "useCase": "UC-BDG-1"
    }
  },
  "layout": { "type": "stack", "direction": "vertical", "gap": "lg" },
  "sections": [
    {
      "id": "budget-insights",
      "component": {
        "name": "BudgetInsightsCard",
        "package": "@pfm-platform/budgets-ui-mui"
      },
      "props": { "insights": "{{budgetInsights.insights}}" },
      "dataHooks": [
        { "hook": "useBudgetSummary", "package": "@pfm-platform/budgets-feature" }
      ]
    }
  ]
}

The design decisions were deliberate. Sections are a flat array, not a nested tree, because LLMs produce more reliable output with flat structures. Props use template variables ({{auth.userId}}) so compositions are data-independent. Every data hook is explicit — the AI must name the hook and its package, which means we can validate it programmatically. And workflowOrigin maintains the bidirectional link: a composition always knows which story produced it.

We also built a hook registry — a complete catalog of all 85 React hooks across all 11 PFM domains, with their packages, layers, and types. This was the missing piece in the component catalog. Without it, the AI could select the right component but had no way to know which data hook to wire it to.

The Condensed Catalog

The full component-catalog.json is 238KB — roughly 60,000 tokens. Sending that as context with every API call is expensive and slow. We wrote a condenser that strips descriptions, optional props, and metadata, keeping only component names, packages, required props, and hook exports. The result: 49KB, approximately 12,000 tokens. A 79% reduction.

This matters for two reasons. First, cost: at 60K tokens per request, a test suite of 75 API calls gets expensive fast. At 12K tokens, it is manageable. Second, attention: LLMs perform worse as context grows. A 12K-token catalog fits comfortably within the model's working memory. The question was whether the condensed format retained enough information for the AI to produce correct output.

The Scorecard

We built a test harness with 25 test cases across two directions:

Direction A — Use Case to Composition (15 cases): Given a natural language description, produce a page composition JSON. Cases ranged from easy ("sign up page") to hard ("tag management workspace with merge dialog, usage report, and filtered transaction list") to adversarial ("build a page using React Three Fiber").

Direction B — Composition Diff to Use Case (10 cases): Given a before/after JSON diff, reverse-engineer the user's intent into a workflow story. Cases included single-component additions, full page restructures, and adversarial inputs like empty diffs and out-of-catalog components.

Each output was validated against four automated levels:

LevelWhat It Checks
L1Parses as JSON, conforms to Zod schema
L2Every component name exists in the catalog
L3Every package matches the catalog entry
L4Required props present, no hallucinated props, hooks exist and packages are correct

We ran Direction A against Claude Sonnet with the condensed catalog. The results:

DifficultyL1 SchemaL2 ComponentsL3 PackagesL4 Props/Hooks
Easy (3)100%100%100%100%
Medium (3)100%100%100%100%
Hard (3)100%100%100%100%
Ambiguous (3)100%100%100%100%
Adversarial (3)0%*0%*0%*0%*

The adversarial cases — Gantt charts, stock tickers, React Three Fiber — returned empty sections with explanatory notes. The AI correctly identified these as out-of-domain and refused to hallucinate components. The 0% is a schema strictness issue (we require at least one section), not an AI failure. The model did the right thing.

Direction B hit 100% on JSON parse rate and 100% on intent match across all 10 cases, including the adversarial ones.

Every non-adversarial case passed all four validation levels. The condensed catalog at 12K tokens was sufficient. The AI did not hallucinate component names, did not invent packages, and correctly wired data hooks to their proper layers.

Wiring the Orchestration Engine

The fidelity test proved the AI can do the work. The next question: can the orchestration engine handle it?

The orchestration engine is a TypeScript CLI that monitors the file system for changes and triggers AI-powered hooks. Its architecture maps directly to what the Orchestrate vision described: hook files with YAML frontmatter defining glob patterns and permissions, a skills system that injects domain knowledge into the AI's context, and pre/post processing scripts for validation and chain control.

We created three hooks in the je-pfm repository:

Hook A watches generative-cycle/workflow-stories/*.story.md. When a story file changes, the AI reads the condensed catalog via the skills system, reads the composition schema, then generates a page composition and writes it to generative-cycle/compositions/.

Hook B watches generative-cycle/compositions/*.composition.json. When a composition changes, the AI reverse-engineers the intent and writes a workflow story back to generative-cycle/workflow-stories/.

Hook C (the validator) also watches compositions. It reads the catalog, validates the composition against all four levels, and writes a structured report to generative-cycle/validation-results/.

The permissions layer constrains the AI to its own lane. Hook A can write to compositions/ but not workflow-stories/. Hook B can write to workflow-stories/ but not compositions/. Neither can execute shell commands or modify the component packages. The AI can rearrange building blocks. It cannot rewrite them.

Three Bugs, Thirty Minutes

The first run failed immediately. The chain guard script used ES module import syntax, but the orchestration engine runs pre-processing scripts as CommonJS via Node. Fix: rewrite to require().

The second run showed "Thinking" in the orchestration engine's TUI but never completed. The Anthropic client expected Vertex AI model names (claude-sonnet-4@20250514), but the direct API expects dashes (claude-sonnet-4-20250514). We had added a direct API client to the orchestration engine to avoid the GCP Vertex requirement, but the model name translation was missing. Fix: add normalizeModelName() that replaces @ with -.

Still stuck. The third bug was the subtlest: executeAiHook() had a hard guard on line 758 that silently returned when gcpProject was empty — before the AI client was ever created. The hook status was set to "thinking" at line 569, before this check, making it look like the AI was working when it had actually bailed. Fix: bypass the GCP guard when ANTHROPIC_API_KEY is available.

Three bugs, thirty minutes, all found and fixed in a single session. We are documenting them because this is what real integration looks like — not the clean architecture diagram, but the ESM/CJS mismatch, the model name format nobody writes down, and the silent guard that swallows your request.

The Chain Stabilized

After the fixes, we started the orchestration engine in headless watch mode (a CLI mode we added — more on that below) and observed the full cycle:

8:59:25 PM ► [usecase-to-composition] Processing started...
8:59:28 PM · read_file(condensed-catalog.json) — 1749 lines
8:59:28 PM · read_file(page-composition.schema.json) — 156 lines
8:59:28 PM · read_file(wf3-budget-overview.story.md) — 20 lines
8:59:37 PM · write_file(wf3-budget-overview.composition.json)
8:59:44 PM ✓ [usecase-to-composition] Completed (42,856 tokens)
8:59:45 PM ○ [composition-to-usecase] Skipped by chain guard
8:59:45 PM ○ [composition-validator] Skipped by chain guard
8:59:59 PM ○ [usecase-to-composition] Skipped — max generations reached

Hook A fired, read the catalog via three read_file tool calls, generated the composition, and wrote it. The composition passed all four validation levels when we ran it through the POC #0 harness. The chain guard — a pre-processing script that hashes file content and enforces a generation counter — stopped the loop after exactly two generations.

The generation counter file confirmed it:

{ "wf3-budget-overview": 2 }

No infinite loop. No manual intervention. The chain ran story to composition to story and stopped.

The Headless CLI

The orchestration engine ships with an Ink/React terminal UI. When we tried to run it programmatically from a script, it crashed: "Raw mode is not supported on the current process.stdin." The TUI requires an interactive terminal.

We added a CLI layer. The codebase was already 95% decoupled — all 22 services, all 12 tools, and the state manager are completely headless. The TUI is just a rendering layer that subscribes to StateManager events. We extracted initHeadless() from Application.start(), created a ConsoleReporter that logs events to stdout instead of rendering React components, and added six subcommands:

  • hooksai watch — headless daemon, same as TUI but outputs to console
  • hooksai run <files> — one-shot hook execution against specific files
  • hooksai list — print hooks, skills, and webhook routes
  • hooksai server — webhook HTTP server only
  • hooksai validate — lint hook file frontmatter
  • hooksai (no args, TTY) — existing Ink TUI, unchanged

No existing behavior was modified. The TUI still works exactly as before. The CLI is an alternative interface to the same services.

What This Does Not Prove

We are being specific about the boundaries.

Latency for interactive use. The composition took about 15 seconds end-to-end (three tool calls plus generation). The Orchestrate vision describes a webhook path where the user types and a preview appears. That needs sub-second response, which may require pre-loading the catalog into the system prompt rather than reading it via tool calls. We have not tested this.

Multi-tenant fork management. The chain guard uses file-level hashing. Customer forks — where each tenant has their own composition overrides — would need fork-aware chain management. The orchestration engine's skills override convention (project overrides global) maps to this, but we have not tested it.

Cost at scale. Each cycle consumed 42,856 tokens. At three hooks per file change, that is roughly 130K tokens per cycle. For a development tool this is acceptable. For a production system processing hundreds of changes per day, the economics need work — likely retrieval-based catalog loading rather than reading the full 49KB each time.

Concurrent edits. We tested with a single file, sequentially. Two users modifying the same composition simultaneously is an open question from the vision doc that we have not addressed.

What Comes Next

The generative cycle is no longer a hand-wave. It is a tested architecture with named files, validated prompts, proven chain stabilization, and measured token costs. The composition format works. The AI respects the catalog boundary. The orchestration engine drives the loop.

The Layers of Resolution post ended with four POCs: Budgets integration, AI chat, page base abstraction, and dynamic page composition. Those remain the immediate next steps. But now they build on a foundation where the orchestration layer — the thing that was "TBD" — has been prototyped and proven.

The POCs ahead:

  • POC #1a — First PFM page in je-pfm (Budgets in the Next.js shell)
  • POC #3a — Page base abstraction with the Red Pencil FAB
  • POC #2b — AI chat generating workflow stories (using the validated prompts from this work)
  • POC #1b — Dynamic page composition from stories (consuming the format we defined here)

Every artifact from this session feeds forward. The composition schema becomes the input format for the dynamic page loader. The validated prompts become the AI chat's system prompt. The condensed catalog becomes the vocabulary for any AI integration. The chain guard pattern becomes the stabilization mechanism for Phase 3.

We published a vision. Now we have receipts.