Optional clean
Strict completion indicator for the realize phase's two-tier success model.
true only when compiled.type === "success" AND
operationFailures.length === 0 — i.e. the TypeScript compiler reported
no errors and every endpoint produced a provider. A partial realize run
(some functions generated, one endpoint failed) sets success: true but
cleanCompletion: false.
Absent for phases other than realize (where the success model is
all-or-nothing) and for legacy summaries produced before this field was
introduced.
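The two-tier model described above can be sketched as follows. This is a minimal illustration, not the archiver's actual code: the RealizeOutcome shape and the realizeFlags helper are hypothetical, and only compiled.type, operationFailures, functions, success, and cleanCompletion are names taken from the descriptions on this page.

```typescript
// Hypothetical input shape; element types are left opaque on purpose.
interface RealizeOutcome {
  compiled: { type: "success" | "failure" | "exception" };
  functions: unknown[];
  operationFailures: unknown[];
}

// Derives the loose and the strict indicator for a realize run.
function realizeFlags(outcome: RealizeOutcome): {
  success: boolean;
  cleanCompletion: boolean;
} {
  return {
    // Loose tier: at least one provider function was generated.
    success: outcome.functions.length > 0,
    // Strict tier: compiler reported success AND no endpoint failed.
    cleanCompletion:
      outcome.compiled.type === "success" &&
      outcome.operationFailures.length === 0,
  };
}

// A partial run: some functions generated, one endpoint failed.
const partialRun = realizeFlags({
  compiled: { type: "success" },
  functions: ["providerA", "providerB"],
  operationFailures: ["failedEndpoint"],
});
console.log(partialRun); // { success: true, cleanCompletion: false }
```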
Count of generated elements by category for this phase.
Provides the number of key artifacts generated during the phase execution. Each phase produces different types of elements:
Keys represent element type names (e.g., "actors", "operations") and values indicate the count of each element type generated, offering insights into the scope and complexity of the generated application.
Optional compile
Optional compile outcome for phases whose success is decided by an
IAutoBeTypeScriptCompileResult (database, interface, test,
realize).
reason answers "why did the agent stop", but a phase can also fail
because the agent finished and its output did not compile, or because an
exception was thrown inside the compiler itself. In those cases the agent
never hits a reason-eligible event, so without this field success: false
shows up with reason: null, and an external reader of the archive ledger
cannot tell whether they are looking at a compile failure (type: "failure"
with diagnostics), a compiler exception (type: "exception"), or a
still-running phase that has not been recorded yet.
Absent when the phase succeeded or when the underlying history is not a
compile-result-bearing phase (analyze, multiLingual).
Optional cost
Phase-level cost signals derived from the token aggregates.
Surfaces "where did this phase bleed time / money" at sidecar level so a
reviewer does not have to walk aggregates.total.tokenUsage and the
per-sub-loop attempt / validationFailure breakdown manually. The three
fields cover the dominant cost levers observed on real archives:
- cacheHitRate: aggregates.total.tokenUsage.input.cached / input.total. 0 when the vendor / SDK did not return any cache hits. The 2026-05-14 qwen/qwen3.6-flash todo run hit 0% across all phases (23M input tokens in interface alone, none cached); that is the single biggest performance finding, and the previous ledger could only surface it by decompressing every aggregates payload.
- failureRate: aggregates.total.metric.validationFailure / aggregates.total.metric.attempt. Tells a reviewer whether the phase's retry loop is converging or thrashing. realize on the same run was 99 / 291 = 34%, but a per-sub-loop view (deferred to a separate field) showed realizeCorrect hitting 54%: the correction loop itself was misfiring half the time.
- inputTokensTotal / inputTokensCached: kept as raw counters so a reader can compute their own ratios without re-fetching the aggregates payload.
Absent when the phase did not run, when the aggregates payload is missing (legacy archive predating the field), or when no attempts were recorded (rate-zero denominator).
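The two ratios above can be computed as in the sketch below. The TokenInput, AttemptMetric, and CostSignals shapes and the deriveCost helper are hypothetical stand-ins for the relevant slices of the aggregates payload; the zero-total guard on cacheHitRate is an assumption, and the token total in the usage line is an illustrative figure, not a number from a real archive.

```typescript
// Hypothetical slices of aggregates.total.tokenUsage.input and
// aggregates.total.metric.
interface TokenInput {
  total: number;
  cached: number;
}
interface AttemptMetric {
  attempt: number;
  validationFailure: number;
}
interface CostSignals {
  cacheHitRate: number;
  failureRate: number;
  inputTokensTotal: number;
  inputTokensCached: number;
}

function deriveCost(
  input?: TokenInput,
  metric?: AttemptMetric,
): CostSignals | undefined {
  // Absent when the aggregates payload is missing or no attempts exist.
  if (input === undefined || metric === undefined) return undefined;
  if (metric.attempt === 0) return undefined;
  return {
    // Assumption: report 0 rather than NaN when no input tokens were seen.
    cacheHitRate: input.total > 0 ? input.cached / input.total : 0,
    failureRate: metric.validationFailure / metric.attempt,
    // Raw counters are kept so readers can compute their own ratios.
    inputTokensTotal: input.total,
    inputTokensCached: input.cached,
  };
}

// 99 validation failures over 291 attempts, no cached input tokens.
const realizeCost = deriveCost(
  { total: 1000000, cached: 0 },
  { attempt: 291, validationFailure: 99 },
);
console.log(realizeCost?.failureRate.toFixed(2)); // 0.34
console.log(realizeCost?.cacheHitRate); // 0
```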
Time taken to complete this phase in milliseconds.
Measures the duration from phase initiation to completion, including all AI interactions, validations, and corrections. This metric helps identify performance characteristics of individual phases.
Optional error
Compact excerpt of the error that classified this phase's reason.
Present only when the archiver caught a failure (the path that sets
success: false + reason). Lets an external reader of
summary.json.gz tell what actually went wrong without having to find
and decompress the sibling {phase}.error.json.gz artifact — critical
for reason: "unknown" (the classifier couldn't recognize the error) but
also useful for the other reasons where the message carries the failing
sub-step (e.g. which analyzeWriteSection step exhausted its
function-call retries).
Intentionally compact: message is capped at 1 KB by the archiver so a
runaway message cannot bloat the summary. The full {name, message, stack} stays available in the on-disk {phase}.error.json.gz for deep
forensic work. Absent when the phase succeeded.
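A compact excerpt of this kind might be produced as sketched below. The compactError helper and the ErrorExcerpt shape are hypothetical, and interpreting "1 KB" as 1024 characters (rather than bytes) is an assumption made for simplicity.

```typescript
// Hypothetical excerpt shape: the stack is deliberately left out, since the
// full {name, message, stack} lives in the on-disk error artifact.
interface ErrorExcerpt {
  name: string;
  message: string;
}

// Assumption: the 1 KB cap is applied as 1024 characters.
const MAX_MESSAGE_LENGTH = 1024;

function compactError(caught: unknown): ErrorExcerpt {
  // Normalize non-Error throwables so name and message always exist.
  const err = caught instanceof Error ? caught : new Error(String(caught));
  return {
    name: err.name,
    // Truncate so a runaway message cannot bloat the summary.
    message: err.message.slice(0, MAX_MESSAGE_LENGTH),
  };
}

const excerpt = compactError(new TypeError("x".repeat(5000)));
console.log(excerpt.name, excerpt.message.length); // TypeError 1024
```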
Optional last
Compact record of the most recent stagnation abort observed in this phase.
Set when the archiver's StagnationDetector latches and aborts the phase
(reason: "aborted_stagnation"). Carries enough signal — the looping
event type, the repeating signature, and the repeat count — for an
external reader of summary.json.gz or the archive-status sidecar to
recognize the pattern of failure without opening the per-attempt
failures/{started_at}/{phase}.stagnation.json.gz payload.
Closes the operator-side question observed on
qwen/qwen3.6-flash/shopping 2026-05-13: six consecutive failure
subdirectories each carried the same
interfaceOperation|process|target:|errors:1nawc4g signature (empty
target: field — a generator-side regression), but the archive surface
showed only reason: "aborted_stagnation". A reviewer had to decompress
the per-attempt failure folder to spot the empty-target pattern. With
this field the pattern is visible on the canonical summary.
Absent when the phase succeeded or when the abort cause was something other than a stagnation latch (rate-limit, timeout, predicate-false, compile failure).
Optional operation
Realize-specific per-operation outcome map.
One entry per write attempt (success or failure) so a reviewer sees the
failing endpoint and reason without cross-referencing
realize.functions[] (the successes) against
realize.operationFailures[] (the failures). The 2026-05-14
qwen/qwen3.6-flash todo realize phase wrote 35 successful providers plus
1 failure (patchTodoAppMemberTodosTrash -- unresolved query
placeholder); the prior ledger surfaced only the integer count via
predicateFailure.count. This field carries the full per-operation map
so the reader sees which endpoint failed and why.
Absent for phases other than realize (the per-operation concept only
applies to realize's per-endpoint write pipeline) and for legacy archives
whose history payload did not record the functions / operationFailures
arrays.
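The idea of one entry per write attempt can be illustrated with the sketch below. Every shape here is hypothetical: the page does not document the element types of realize.functions[] or realize.operationFailures[], so the name and reason properties, the OperationOutcome union, and the buildOperationOutcomes helper are assumptions for illustration only.

```typescript
// Hypothetical element shapes for the two history arrays.
interface RealizedFunction {
  name: string;
}
interface OperationFailure {
  name: string;
  reason: string;
}

type OperationOutcome =
  | { status: "success" }
  | { status: "failure"; reason: string };

// Merges successes and failures into a single per-operation map so a reader
// does not have to cross-reference the two arrays.
function buildOperationOutcomes(
  functions: RealizedFunction[],
  failures: OperationFailure[],
): Record<string, OperationOutcome> {
  const outcomes: Record<string, OperationOutcome> = {};
  for (const fn of functions) {
    outcomes[fn.name] = { status: "success" };
  }
  for (const failure of failures) {
    outcomes[failure.name] = { status: "failure", reason: failure.reason };
  }
  return outcomes;
}

const outcomeMap = buildOperationOutcomes(
  [{ name: "someSuccessfulProvider" }],
  [
    {
      name: "patchTodoAppMemberTodosTrash",
      reason: "unresolved query placeholder",
    },
  ],
);
console.log(Object.keys(outcomeMap).length); // 2
```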
Optional predicate
Optional predicate-side failure indicator.
Populated when the phase finished its agent loop without throwing and
without a compile diagnostic, but its post-execution predicate still
returned false. Today the predicate-false case covers:
- interface with missed.length > 0 (resolver could not bind some operations to method+path)
- realize with operationFailures.length > 0 (compile passed but one or more endpoints failed during write; this is the case observed on qwen/qwen3.6-flash/todo 2026-05-13, where summary.realize carried success: false, reason: null, and compile: null, so a reader had no way to tell what went wrong)
Without this field an external reader of summary.json.gz cannot tell a
predicate-false phase apart from a phase that simply forgot to record a
reason. The archive-status classifier classifyArchiveTerminalState surfaces
this field as failureReason: predicate:<kind> so the "why it stopped can be
read objectively from the outside" invariant holds: at least one of reason,
compile.kind, or predicateFailure.kind is non-null whenever success: false.
Absent when the phase succeeded, when the phase failed via a thrown error
path (reason already populated), or when the compile path captured the
failure (compile.kind already populated).
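The invariant can be checked by a reader along the lines of the sketch below. This is not the classifyArchiveTerminalState implementation: the PhaseFailureView shape and resolveFailureReason helper are hypothetical, the "compile:<kind>" label is an assumed analogue of the documented "predicate:<kind>" form, and the kind strings used in the usage lines are made-up placeholders.

```typescript
// Hypothetical minimal view of the three failure-bearing fields.
interface PhaseFailureView {
  success: boolean;
  reason?: string | null;
  compile?: { kind: string } | null;
  predicateFailure?: { kind: string } | null;
}

// Returns a single label explaining why the phase failed, or null when the
// phase succeeded or the invariant is violated (nothing was recorded).
function resolveFailureReason(phase: PhaseFailureView): string | null {
  if (phase.success) return null;
  // Thrown-error path: reason is already populated.
  if (phase.reason != null) return phase.reason;
  // Compile path. Assumption: labelled by analogy with the predicate form.
  if (phase.compile != null) return `compile:${phase.compile.kind}`;
  // Predicate path: the form documented for the archive-status surface.
  if (phase.predicateFailure != null) {
    return `predicate:${phase.predicateFailure.kind}`;
  }
  return null;
}

console.log(
  resolveFailureReason({
    success: false,
    reason: null,
    compile: null,
    predicateFailure: { kind: "placeholderKind" },
  }),
); // predicate:placeholderKind
```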
Optional reason
Optional reason for failure, classified from the captured error or emitted by the harness.
- "rate_limit": vendor API refused the request (OpenRouter 403/429 or "Key limit exceeded" messages).
- "connection_error": the vendor SDK could not establish or maintain a network connection (e.g. OpenAI SDK APIConnectionError, getaddrinfo ENOTFOUND, mid-request socket reset). Treated as harness/infra: the model never had a chance to respond, so the run is excluded from model-quality scoring rather than being charged with "unknown".
- "timeout": a single LLM conversation exceeded its per-call timeout (AutoBeTimeoutError from TimedConversation). Treated as a model failure, not infra: a model that cannot respond inside the configured per-call window is unusable for the benchmark.
- "function_call_failed": the model failed to emit the required tool call even after the archive runner's enforcement prompts. Treated as a model failure.
- "validation_giveup": Agentica's function-call retry budget (3 by default) was exhausted because the model could not produce arguments that satisfy the tool's typia validator (e.g. repeatedly emitting an empty properties: {} for a non-empty entity DTO). Treated as a model-quality failure, distinct from "function_call_failed" (no tool call at all) and from "unknown" (uncategorized error).
- "unknown": caught error that does not match a known classification.
- "aborted_stagnation": the stagnation detector aborted a sub-loop after observing a repeating failure signature.
- "aborted_cumulative_loop": the cumulative loop breaker latched because one sub-loop (or the aggregate across all sub-loops) crossed a hard observation count cap. Distinct from "aborted_stagnation" because the stagnation detector requires a single signature to repeat inside a sliding window; the cumulative breaker catches the case where many distinct but related corrections accumulate into a runaway loop without any single signature dominating. Typical trigger: the preliminaryRewrite/jsonValidateError volume that drained the qwen weekly key in archive runs from 2026-05.
- "interrupted": an operator or parent process asked the archive run to stop (SIGINT/SIGTERM or an external abort signal). The phase may have partial snapshots/checkpoints for diagnosis.
- "archive_timeout": the archive-wide hard cap (see runWithArchiveTimeout) fired and aborted the in-flight phase. Opt-in via --archive-timeout-ms / ARCHIVE_TIMEOUT_MS; off by default because phase-level progress alone is not a reliable signal of runaway. Distinct from "interrupted" so an operator-initiated abort stays separable from the timeout safety net.
- "rate_limit_mid_run": the RFC-9 OpenRouter quota tracker observed a remaining-quota header at or below AutoBeConfigConstant.OPENROUTER_QUOTA_FLOOR and aborted the run before the next call could consume more budget. Distinct from "rate_limit" (preflight 403/429 / "Key limit exceeded") because the key still had budget when the run started; the trip happened mid-run from header observation, not from a vendor rejection.
- "upstream_unhealthy": the RFC-9 quota tracker counted cumulative OpenRouter upstream-error (HTTP-200 body with error.code) responses beyond AutoBeConfigConstant.OPENROUTER_5XX_RETRY_CAP. Treated as harness/infra: the upstream provider is degraded enough that further retries are wasted budget.
Absent when the phase succeeded. "aborted_stagnation",
"aborted_cumulative_loop", "interrupted", "archive_timeout",
"rate_limit", "rate_limit_mid_run", "upstream_unhealthy", and
"connection_error" values are harness/infra outcomes and should be
excluded from model-quality scoring; "timeout",
"function_call_failed", "validation_giveup", and "unknown" are
model-quality failures.
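A scoring tool could encode the harness/infra versus model-quality split as in the following sketch. The reason strings are the ones listed above; the PhaseFailureReason type name, the HARNESS_INFRA_REASONS set, and the reasonCountsAgainstModel helper are hypothetical names introduced here for illustration.

```typescript
// Union of the reason values documented above (type name is hypothetical).
type PhaseFailureReason =
  | "rate_limit"
  | "connection_error"
  | "timeout"
  | "function_call_failed"
  | "validation_giveup"
  | "unknown"
  | "aborted_stagnation"
  | "aborted_cumulative_loop"
  | "interrupted"
  | "archive_timeout"
  | "rate_limit_mid_run"
  | "upstream_unhealthy";

// Reasons documented as harness/infra outcomes.
const HARNESS_INFRA_REASONS: ReadonlySet<PhaseFailureReason> =
  new Set<PhaseFailureReason>([
    "aborted_stagnation",
    "aborted_cumulative_loop",
    "interrupted",
    "archive_timeout",
    "rate_limit",
    "rate_limit_mid_run",
    "upstream_unhealthy",
    "connection_error",
  ]);

// True only for reasons that should be charged to the model. A missing
// reason says nothing by itself, so it is not charged here.
function reasonCountsAgainstModel(
  reason: PhaseFailureReason | null | undefined,
): boolean {
  if (reason == null) return false;
  return !HARNESS_INFRA_REASONS.has(reason);
}

console.log(reasonCountsAgainstModel("validation_giveup")); // true
console.log(reasonCountsAgainstModel("rate_limit_mid_run")); // false
```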
Optional resumability
Optional resume-readiness report captured at the start of this phase.
When the archiver finds a prior {phase}.checkpoint.json.gz next to the
canonical archive root, it inspects the payload and records here what an
eventual resume implementation could pick up. Absent when the phase
started without a prior checkpoint visible. This field is
observation-only: the archiver does not yet seed agent histories or skip
stages from the checkpoint — see IPhaseResumability.
Optional runtime
Optional runtime execution digest.
Populated for the debug phase runtime validation. Legacy benchmark archives can also carry this digest on the realize phase when they were run with out-of-band runtime execution enabled. Summarizes the outcome of running the generated backend against its generated e2e test suite (see IAutoBeRealizeTestResult). Absent when runtime execution was not performed or the phase did not reach a runnable state.
Optional sub
Per-event-type cost breakdown within this phase.
Each entry summarizes a single sub-loop (identified by its
AutoBeEvent.Type) by event count and wall-clock duration so an external
reader of summary.json.gz can see where the phase spent its time
without re-decompressing the per-phase snapshot stream. Complements
aggregates (token usage + function-call metrics per event type):
aggregates answers "how much did each sub-loop cost in tokens", while
subLoops answers "how much did each sub-loop cost in real time, and how
many times did it fire".
Sub-loops appear here even when they are not LLM function-call events
(jsonValidateError, preliminaryRewrite, realizeOperationFailure,
etc.), so the field captures the thrashing patterns (repeated validation
errors and outer-retry rounds) that aggregates alone misses.
Sorted descending by elapsedMs, then by eventCount, so the costliest
sub-loop is at index 0. Absent on archives produced before this field
existed and on phases with no recorded snapshots; downstream readers must
treat absence as "no information", not "no sub-loops".
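The ordering rule can be reproduced with a two-key comparator, as sketched below. The SubLoopEntry shape and the sortSubLoops helper are hypothetical (only elapsedMs, eventCount, and the event-type identifier are named in the text), and treating the eventCount tie-break as descending is an assumption.

```typescript
// Hypothetical entry shape: one sub-loop keyed by its event type.
interface SubLoopEntry {
  type: string;
  eventCount: number;
  elapsedMs: number;
}

// Costliest sub-loop first: elapsedMs descending, then eventCount
// (assumed descending) as the tie-break. Returns a new array.
function sortSubLoops(entries: SubLoopEntry[]): SubLoopEntry[] {
  return [...entries].sort(
    (a, b) => b.elapsedMs - a.elapsedMs || b.eventCount - a.eventCount,
  );
}

// Absence means "no information", so readers should branch on undefined
// rather than defaulting to an empty list.
function costliestSubLoop(
  entries: SubLoopEntry[] | undefined,
): SubLoopEntry | undefined {
  if (entries === undefined) return undefined;
  return sortSubLoops(entries)[0];
}

const sortedLoops = sortSubLoops([
  { type: "jsonValidateError", eventCount: 40, elapsedMs: 1200 },
  { type: "realizeCorrect", eventCount: 12, elapsedMs: 9000 },
  { type: "preliminaryRewrite", eventCount: 55, elapsedMs: 1200 },
]);
console.log(sortedLoops.map((entry) => entry.type));
// [ 'realizeCorrect', 'preliminaryRewrite', 'jsonValidateError' ]
```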
Indicates whether the phase produced any usable output.
For the realize phase this is a loose indicator: true when at least one
provider function was generated, regardless of whether the full compile
passed or all endpoints succeeded. Use cleanCompletion for the
strict all-or-nothing check.
For all other phases the distinction does not apply and success carries
the same all-or-nothing semantics it always did.
Optional top
Top recurring jsonValidateError patterns observed in this phase's
snapshots, sorted descending by count and capped at a small N.
Each entry is one (path, expected) pair the model failed against
repeatedly. Surfaces the model's favorite validation mistakes (e.g.
$input.request.think undefined 16x) at sidecar level so a reviewer does
not have to walk the snapshot stream to find them. Three patterns
observed on the 2026-05-14 qwen/qwen3.6-flash todo realize phase:
- 17x $input.request undefined: model returned the wrong union variant entirely.
- 16x $input.request.think undefined: required field omission.
- 13x $input.request.draft: Prisma calls represented by __AUTOBE_QUERY_IR placeholders.
Each of these is a candidate for targeted validation feedback on the agent side; the ledger now points the prompt-engineering team at them without a per-archive snapshot dive.
expected is capped at 200 characters by the summarizer so a runaway
type-union expansion cannot bloat the field. Absent when the phase
recorded no validation errors or when snapshots are unavailable.
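The aggregation described above (group by (path, expected), cap expected at 200 characters, sort descending by count, keep a small N) might look like the sketch below. The ValidationErrorSample and TopPattern shapes, the summarizeTopPatterns helper, and the default limit of 5 are hypothetical; the page only says the list is capped at "a small N".

```typescript
// Hypothetical shape of one validation error pulled from a snapshot.
interface ValidationErrorSample {
  path: string;
  expected: string;
}
interface TopPattern {
  path: string;
  expected: string;
  count: number;
}

const EXPECTED_MAX_LENGTH = 200;

function summarizeTopPatterns(
  samples: ValidationErrorSample[],
  limit: number = 5, // assumption: the real cap is not documented here
): TopPattern[] {
  const byKey = new Map<string, TopPattern>();
  for (const sample of samples) {
    // Cap expected so a runaway type-union expansion cannot bloat the field.
    const expected = sample.expected.slice(0, EXPECTED_MAX_LENGTH);
    // JSON-encode the pair so path and expected cannot collide.
    const key = JSON.stringify([sample.path, expected]);
    const existing = byKey.get(key);
    if (existing !== undefined) {
      existing.count += 1;
    } else {
      byKey.set(key, { path: sample.path, expected, count: 1 });
    }
  }
  return Array.from(byKey.values())
    .sort((a, b) => b.count - a.count)
    .slice(0, limit);
}

const patterns = summarizeTopPatterns([
  { path: "$input.request.think", expected: "string" },
  { path: "$input.request", expected: "object" },
  { path: "$input.request.think", expected: "string" },
]);
console.log(patterns[0]);
// { path: '$input.request.think', expected: 'string', count: 2 }
```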
State and metrics for an individual development phase.
Captures the execution results and performance characteristics of a specific agent phase in the vibe coding pipeline. This granular data enables detailed analysis of each phase's contribution to the overall development process and helps identify bottlenecks or areas for improvement.