AutoBE
    Preparing search index...

    State and metrics for an individual development phase.

    Captures the execution results and performance characteristics of a specific agent phase in the vibe coding pipeline. This granular data enables detailed analysis of each phase's contribution to the overall development process and helps identify bottlenecks or areas for improvement.

    interface IPhaseState {
        aggregates: AutoBeProcessAggregateCollection;
        cleanCompletion?: boolean;
        commodity: Record<string, number>;
        compile?: IPhaseCompileFailure;
        costSignals?: IPhaseCostSignals;
        elapsed: number;
        error?: IPhaseError;
        lastStagnation?: IPhaseLastStagnation;
        operationOutcomes?: IOperationOutcome[];
        predicateFailure?: IPhasePredicateFailure;
        reason?:
            | "timeout"
            | "rate_limit"
            | "rate_limit_mid_run"
            | "upstream_unhealthy"
            | "connection_error"
            | "function_call_failed"
            | "validation_giveup"
            | "unknown"
            | "aborted_stagnation"
            | "aborted_cumulative_loop"
            | "interrupted"
            | "archive_timeout";
        resumability?: IPhaseResumability;
        runtime?: IPhaseRuntime;
        subLoops?: ISubLoopStat[];
        success: boolean;
        topValidationErrors?: IValidationErrorPattern[];
    }
    Index

    Properties

    cleanCompletion?: boolean

    Strict completion indicator for the realize phase's two-tier success model.

    true only when compiled.type === "success" AND operationFailures.length === 0 — i.e. the TypeScript compiler reported no errors and every endpoint produced a provider. A partial realize run (some functions generated, one endpoint failed) sets success: true but cleanCompletion: false.

    Absent for phases other than realize (where the success model is all-or-nothing) and for legacy summaries produced before this field was introduced.

    commodity: Record<string, number>

    Count of generated elements by category for this phase.

    Provides the number of key artifacts generated during the phase execution. Each phase produces different types of elements:

    • Analyze: actors, documents
    • Database: namespaces, models
    • Interface: operations, schemas
    • Test: functions
    • Realize: functions

    Keys represent element type names (e.g., "actors", "operations") and values indicate the count of each element type generated, offering insights into the scope and complexity of the generated application.

    {actors: 3, documents: 11} //   Analyze phase
    { operations: 34, schemas: 35 } // Interface phase

    Optional compile outcome for phases whose success is decided by an IAutoBeTypeScriptCompileResult (database, interface, test, realize).

    reason answers "why did the agent stop" — but a phase can also fail because the agent finished and its output did not compile, or because an exception was thrown inside the compiler itself. In those cases the agent never hit a reason-eligible event, so success: false showed up with reason: null and the external reader of the archive ledger had no way to tell whether they were looking at a compile failure (type: "failure" with diagnostics), a compiler exception (type: "exception"), or a still-running phase that just hadn't been recorded yet.

    Absent when the phase succeeded or when the underlying history is not a compile-result-bearing phase (analyze, multiLingual).

    costSignals?: IPhaseCostSignals

    Phase-level cost signals derived from the token aggregates.

    Surfaces "where did this phase bleed time / money" at sidecar level so a reviewer does not have to walk aggregates.total.tokenUsage and the per-sub-loop attempt / validationFailure breakdown manually. The three fields cover the dominant cost levers observed on real archives:

    • cacheHitRateaggregates.total.tokenUsage.input.cached / input.total. 0 when the vendor / SDK did not return any cache hits. The 2026-05-14 qwen/qwen3.6-flash todo run hit 0% across all phases (23M input tokens in interface alone, none cached) — that's the single-biggest perf finding the previous ledger could only surface by decompressing every aggregates payload.
    • failureRateaggregates.total.metric.validationFailure / aggregates.total.metric.attempt. Tells a reviewer whether the phase's retry loop is converging or thrashing. realize on the same run was 99 / 291 = 34%, but a per-sub-loop view (deferred to a separate field) showed realizeCorrect hitting 54% — the correction loop itself was misfiring half the time.
    • inputTokensTotal / inputTokensCached — kept as raw counters so a reader can compute their own ratios without re-fetching the aggregates payload.

    Absent when the phase did not run, when the aggregates payload is missing (legacy archive predating the field), or when no attempts were recorded (rate-zero denominator).

    elapsed: number

    Time taken to complete this phase in milliseconds.

    Measures the duration from phase initiation to completion, including all AI interactions, validations, and corrections. This metric helps identify performance characteristics of individual phases.

    error?: IPhaseError

    Compact excerpt of the error that classified this phase's reason.

    Present only when the archiver caught a failure (the path that sets success: false + reason). Lets an external reader of summary.json.gz tell what actually went wrong without having to find and decompress the sibling {phase}.error.json.gz artifact — critical for reason: "unknown" (the classifier couldn't recognize the error) but also useful for the other reasons where the message carries the failing sub-step (e.g. which analyzeWriteSection step exhausted its function-call retries).

    Intentionally compact: message is capped at 1 KB by the archiver so a runaway message cannot bloat the summary. The full {name, message, stack} stays available in the on-disk {phase}.error.json.gz for deep forensic work. Absent when the phase succeeded.

    lastStagnation?: IPhaseLastStagnation

    Compact record of the most recent stagnation abort observed in this phase.

    Set when the archiver's StagnationDetector latches and aborts the phase (reason: "aborted_stagnation"). Carries enough signal — the looping event type, the repeating signature, and the repeat count — for an external reader of summary.json.gz or the archive-status sidecar to recognize the pattern of failure without opening the per-attempt failures/{started_at}/{phase}.stagnation.json.gz payload.

    Closes the operator-side question observed on qwen/qwen3.6-flash/shopping 2026-05-13: six consecutive failure subdirectories each carried the same interfaceOperation|process|target:|errors:1nawc4g signature (empty target: field — a generator-side regression), but the archive surface showed only reason: "aborted_stagnation". A reviewer had to decompress the per-attempt failure folder to spot the empty-target pattern. With this field the pattern is visible on the canonical summary.

    Absent when the phase succeeded or when the abort cause was something other than a stagnation latch (rate-limit, timeout, predicate-false, compile failure).

    operationOutcomes?: IOperationOutcome[]

    Realize-specific per-operation outcome map.

    One entry per write attempt (success or failure) so a reviewer sees the failing endpoint and reason without cross-referencing realize.functions[] (the successes) against realize.operationFailures[] (the failures). The 2026-05-14 qwen/qwen3.6-flash todo realize phase wrote 35 successful providers plus 1 failure (patchTodoAppMemberTodosTrash -- unresolved query placeholder); the prior ledger surfaced only the integer count via predicateFailure.count. This field carries the full per-operation map so the reader sees which endpoint failed and why.

    Absent for phases other than realize (the per-operation concept only applies to realize's per-endpoint write pipeline) and for legacy archives whose history payload did not record the functions / operationFailures arrays.

    predicateFailure?: IPhasePredicateFailure

    Optional predicate-side failure indicator.

    Populated when the phase finished its agent loop without throwing and without a compile diagnostic, but its post-execution predicate still returned false. Today the predicate-false case covers:

    • interface with missed.length > 0 (resolver could not bind some operations to method+path)
    • realize with operationFailures.length > 0 (compile passed but one or more endpoints failed during write — the case observed on qwen/qwen3.6-flash/todo 2026-05-13 where summary.realize carried success: false, reason: null, and compile: null so a reader had no way to tell what went wrong)

    Without this field an external reader of summary.json.gz cannot tell a predicate-false phase apart from a phase that simply forgot to record a reason. The classifyArchiveTerminalState archive-status surfaces this field as failureReason: predicate:<kind> so the "왜 멈췄는지 외부에서 객관적으로 읽힌다" invariant holds: at least one of reason, compile.kind, or predicateFailure.kind is non-null whenever success: false.

    Absent when the phase succeeded, when the phase failed via a thrown error path (reason already populated), or when the compile path captured the failure (compile.kind already populated).

    reason?:
        | "timeout"
        | "rate_limit"
        | "rate_limit_mid_run"
        | "upstream_unhealthy"
        | "connection_error"
        | "function_call_failed"
        | "validation_giveup"
        | "unknown"
        | "aborted_stagnation"
        | "aborted_cumulative_loop"
        | "interrupted"
        | "archive_timeout"

    Optional reason for failure, classified from the captured error or emitted by the harness.

    • "rate_limit": vendor API refused the request (OpenRouter 403/429 or "Key limit exceeded" messages).
    • "connection_error": the vendor SDK could not establish or maintain a network connection (e.g. OpenAI SDK APIConnectionError, getaddrinfo ENOTFOUND, mid-request socket reset). Treated as harness/infra: the model never had a chance to respond, so the run is excluded from model-quality scoring rather than being charged with "unknown".
    • "timeout": a single LLM conversation exceeded its per-call timeout (AutoBeTimeoutError from TimedConversation). Treated as a model failure, not infra — a model that cannot respond inside the configured per-call window is unusable for the benchmark.
    • "function_call_failed": the model failed to emit the required tool call even after the archive runner's enforcement prompts. Treated as a model failure.
    • "validation_giveup": Agentica's function-call retry budget (3 by default) was exhausted because the model could not produce arguments that satisfy the tool's typia validator (e.g. repeatedly emitting an empty properties: {} for a non-empty entity DTO). Treated as a model-quality failure — distinct from "function_call_failed" (no tool call at all) and from "unknown" (uncategorized error).
    • "unknown": caught error that does not match a known classification.
    • "aborted_stagnation": the stagnation detector aborted a sub-loop after observing a repeating failure signature.
    • "aborted_cumulative_loop": the cumulative loop breaker latched because one sub-loop (or the aggregate across all sub-loops) crossed a hard observation count cap. Distinct from "aborted_stagnation" because the stagnation detector requires a single signature to repeat inside a sliding window; the cumulative breaker catches the case where many distinct but related corrections accumulate into a runaway loop without any single signature dominating. Typical trigger: the preliminaryRewrite/jsonValidateError volume that drained the qwen weekly key in archive runs from 2026-05.
    • "interrupted": an operator or parent process asked the archive run to stop (SIGINT/SIGTERM or an external abort signal). The phase may have partial snapshots/checkpoints for diagnosis.
    • "archive_timeout": the archive-wide hard cap (see runWithArchiveTimeout) fired and aborted the in-flight phase. Opt-in via --archive-timeout-ms / ARCHIVE_TIMEOUT_MS; off by default because phase-level progress alone is not a reliable signal of runaway. Distinct from "interrupted" so an operator-initiated abort stays separable from the timeout safety net.
    • "rate_limit_mid_run": the RFC-9 OpenRouter quota tracker observed a remaining-quota header at or below AutoBeConfigConstant.OPENROUTER_QUOTA_FLOOR and aborted the run before the next call could consume more budget. Distinct from "rate_limit" (preflight 403/429 / "Key limit exceeded") because the key still had budget when the run started — the trip happened mid-run from header observation, not from a vendor rejection.
    • "upstream_unhealthy": the RFC-9 quota tracker counted cumulative OpenRouter upstream-error (HTTP-200 body with error.code) responses beyond AutoBeConfigConstant.OPENROUTER_5XX_RETRY_CAP. Treated as harness/infra: the upstream provider is degraded enough that further retries are wasted budget.

    Absent when the phase succeeded. "aborted_stagnation", "aborted_cumulative_loop", "interrupted", "archive_timeout", "rate_limit", "rate_limit_mid_run", "upstream_unhealthy", and "connection_error" values are harness/infra outcomes and should be excluded from model-quality scoring; "timeout", "function_call_failed", "validation_giveup", and "unknown" are model-quality failures.

    resumability?: IPhaseResumability

    Optional resume-readiness report captured at the start of this phase.

    When the archiver finds a prior {phase}.checkpoint.json.gz next to the canonical archive root, it inspects the payload and records here what an eventual resume implementation could pick up. Absent when the phase started without a prior checkpoint visible. This field is observation-only: the archiver does not yet seed agent histories or skip stages from the checkpoint — see IPhaseResumability.

    runtime?: IPhaseRuntime

    Optional runtime execution digest.

    Populated for the debug phase runtime validation. Legacy benchmark archives can also carry this digest on the realize phase when they were run with out-of-band runtime execution enabled. Summarizes the outcome of running the generated backend against its generated e2e test suite (see IAutoBeRealizeTestResult). Absent when runtime execution was not performed or the phase did not reach a runnable state.

    subLoops?: ISubLoopStat[]

    Per-event-type cost breakdown within this phase.

    Each entry summarizes a single sub-loop (identified by its AutoBeEvent.Type) by event count and wall-clock duration so an external reader of summary.json.gz can see where the phase spent its time without re-decompressing the per-phase snapshot stream. Complements aggregates (token usage + function-call metrics per event type): aggregates answers "how much did each sub-loop cost in tokens", while subLoops answers "how much did each sub-loop cost in real time, and how many times did it fire".

    Sub-loops appear here even when they are not LLM function-call events (jsonValidateError, preliminaryRewrite, realizeOperationFailure, etc.), so the field captures the thrashing patterns — repeated validation errors and outer-retry rounds that aggregates alone misses.

    Sorted descending by elapsedMs, then by eventCount, so the costliest sub-loop is at index 0. Absent on archives produced before this field existed and on phases with no recorded snapshots; downstream readers must treat absence as "no information", not "no sub-loops".

    success: boolean

    Indicates whether the phase produced any usable output.

    For the realize phase this is a loose indicator: true when at least one provider function was generated, regardless of whether the full compile passed or all endpoints succeeded. Use cleanCompletion for the strict all-or-nothing check.

    For all other phases the distinction does not apply and success carries the same all-or-nothing semantics it always did.

    topValidationErrors?: IValidationErrorPattern[]

    Top recurring jsonValidateError patterns observed in this phase's snapshots, sorted descending by count and capped at a small N.

    Each entry is one (path, expected) pair the model failed against repeatedly. Surfaces the model's favorite validation mistakes (e.g. $input.request.think undefined 16x) at sidecar level so a reviewer does not have to walk the snapshot stream to find them. Three patterns observed on the 2026-05-14 qwen/qwen3.6-flash todo realize phase:

    1. 17x $input.request undefined - model returned the wrong union variant entirely.

    2. 16x $input.request.think undefined - required field omission.

    3. 13x $input.request.draft: Prisma calls represented by placeholders

      • Direct Prisma calls instead of __AUTOBE_QUERY_IR placeholders.

    Each of these is a candidate for targeted validation feedback on the agent side; the ledger now points the prompt-engineering team at them without a per-archive snapshot dive.

    expected is capped at 200 characters by the summarizer so a runaway type-union expansion cannot bloat the field. Absent when the phase recorded no validation errors or when snapshots are unavailable.