Function Calling Harness
Qwen Meetup Korea, 2026-03-26
TL;DR
- AutoBe — AI backend auto-generation agent
  - Production-grade backend from natural language conversation
  - 4 AST types + 4-tier compiler validation + self-healing loops
  - Schema specs are the new prompts
- Typia — The infrastructure that turns 0% into 100%
  - A single type automates schema, parser, validator, and feedback generator
  - Lenient JSON parsing + schema-based type coercion + precise validation feedback
  - Combined with AutoBe to complete harness engineering
- In Praise of Function Calling
  - Types eliminate ambiguity; schemas constrain through absence
  - Model-neutral, mechanically verifiable, deterministically convergent
  - Applicable to all engineering domains with validators — semiconductors, chemical processes, control systems, etc.
- Qwen — Why small models are the best QA engineers
  - Smaller models are better at exposing system vulnerabilities
  - R&D cost reduction, vendor independence, open ecosystem virtuous cycle
- 6.75% is not failure — it's the first input to the loop
  - qwen3-coder-next scores 6.75% on first-try tool calling; AutoBe's self-healing harness turns that into 100% compilation success
  - If you can verify, you converge
1. Preface
6.75%.
That's the first-try function calling success rate when qwen3-coder-next is asked to generate API data types for a shopping mall backend. 93 out of 100 attempts produce invalid structured output.
This isn't surprising. NESTFUL (EMNLP 2025) measured GPT-4o at 28% accuracy on nested tool call sequences. JSONSchemaBench (ICLR 2025) tested constrained decoding frameworks on 10,000 real-world schemas and found 3–41% coverage on the hardest ones. BoundaryML went further, arguing that structured outputs actively degrade model reasoning — that forcing JSON format makes the model dumber. The consensus is clear: function calling works for flat, simple schemas. For anything with recursive nesting or deep structural complexity, don't bother.
But if you want to make AI output deterministic — parse it, validate it, and correct it in a loop until it converges — there is no alternative to structured output. Free-form text can't be mechanically verified. Natural language can't be compiled. Without structure, there's no feedback loop, and without a feedback loop, there's no guarantee. So we didn't have the luxury of giving up on function calling. We had to make it work on the exact kind of complex, recursive schemas the industry had written off.
AutoBe is the result. It's an open-source AI agent that takes a single natural language conversation and generates a complete backend — requirements analysis, database schema, API specification, E2E tests, and implementation code. Hook up that 6.75% model and what happens? Final compilation success rate: 100%. All four Qwen models.
The answer wasn't a better model or a smarter prompt. It was a harness — type schemas that constrain outputs, compilers that verify results, and structured feedback that pinpoints exactly where and why something went wrong so the LLM can correct itself. A deterministic loop wrapping a probabilistic model. The engineering outside the model, not inside, is what made the difference.
This talk dissects that engineering.
Chapter 2 examines AutoBe's architecture: a 5-phase pipeline running through 4 AST types and 4-tier compilers, with self-healing loops that systematically correct LLM mistakes.
Chapter 3 delves into Typia, the heart of that structure. The TypeScript compiler analyzes a single type from source code and generates schema, parser, validator, and feedback generator — all automatically. The concrete mechanism that flipped Qwen 3.5's 0% to 100% lives here.
Chapter 4 steps back to ask a bigger question. Does this pattern work beyond backends? Semiconductors, chemical processes, architecture, control systems — anywhere deterministic validators exist in engineering.
And Chapter 5 answers why this story belongs at Qwen Meetup. Small models aren't a weakness. They're the harness system's best QA engineers.
2. AutoBe — AI Backend Auto-Generation Agent
2.1. What AutoBe Does
AutoBe is an open-source AI agent that automatically generates production-grade backends from natural language, developed by Wrtn Technologies.
"Build me a shopping mall backend. I need product listings, shopping cart, orders, and payments." From this single sentence, AutoBe generates everything:
- Requirements analysis (SRS)
- Database schema (ERD)
- API specification (OpenAPI v3.2)
- E2E test code
- Complete implementation code
- Type-safe SDK
Every line of generated code compiles. The result is a fully functional backend built on TypeScript + NestJS.
Todo
| Phase | Output | Token Usage | Function Calls |
|---|---|---|---|
| Analyze | actors: 2, documents: 6 | 377.8K (in 308.6K / out 69.2K) | 47 / 47 (100.0%) |
| Database | namespaces: 2, models: 7 | 1.25M (in 1.03M / out 219.5K) | 37 / 38 (97.4%) |
| Interface | operations: 14, schemas: 28 | 11.65M (in 11.40M / out 242.6K) | 163 / 203 (80.3%) |
| Test | functions: 44 | 2.90M (in 2.76M / out 142.2K) | 102 / 107 (95.3%) |
| Realize | functions: 24 | 1.90M (in 1.78M / out 120.8K) | 71 / 82 (86.6%) |
| Phase | Output | Token Usage | Function Calls |
|---|---|---|---|
| Analyze | actors: 2, documents: 6 | 1.33M (in 1.07M / out 253.7K) | 99 / 105 (94.3%) |
| Database | namespaces: 6, models: 21 | 2.71M (in 2.53M / out 171.5K) | 60 / 63 (95.2%) |
| Interface | operations: 62, schemas: 80 | 67.76M (in 66.30M / out 1.46M) | 628 / 898 (69.9%) |
| Test | functions: 183 | 25.28M (in 24.14M / out 1.14M) | 608 / 624 (97.4%) |
| Realize | functions: 98 | 11.70M (in 11.03M / out 661.2K) | 286 / 320 (89.4%) |
Shopping
| Phase | Output | Token Usage | Function Calls |
|---|---|---|---|
| Analyze | actors: 3, documents: 6 | 3.83M (in 3.29M / out 541.3K) | 170 / 197 (86.3%) |
| Database | namespaces: 10, models: 30 | 5.01M (in 4.87M / out 148.1K) | 85 / 87 (97.7%) |
| Interface | operations: 148, schemas: 155 | 160.24M (in 157.56M / out 2.68M) | 1322 / 1764 (74.9%) |
| Test | functions: 429 | 84.24M (in 81.16M / out 3.08M) | 1403 / 1445 (97.1%) |
| Realize | functions: 207 | 32.63M (in 31.51M / out 1.12M) | 599 / 665 (90.1%) |
Erp
| Phase | Output | Token Usage | Function Calls |
|---|---|---|---|
| Analyze | actors: 2, documents: 6 | 1.45M (in 1.19M / out 252.5K) | 109 / 110 (99.1%) |
| Database | namespaces: 6, models: 22 | 2.27M (in 2.16M / out 109.5K) | 65 / 71 (91.5%) |
| Interface | operations: 86, schemas: 112 | 71.18M (in 69.64M / out 1.54M) | 822 / 1099 (74.8%) |
| Test | functions: 260 | 25.02M (in 23.83M / out 1.19M) | 644 / 725 (88.8%) |
| Realize | functions: 132, errors: 2 | 14.41M (in 13.44M / out 974.5K) | 414 / 453 (91.4%) |
2.2. LLMs Don't Write Code
Most AI coding agents tell the LLM "write this code" and save the returned text directly as source files. AutoBe is different.
AutoBe uses function calling. Instead of generating free-form text, the LLM fills in predefined structures — JSON Schema. It's filling out a form, not writing on a blank page. Once the LLM fills the form, compilers validate and transform it into actual code. The LLM fills structures; compilers write code.
This approach applies across the entire 5-phase waterfall pipeline.
| Phase | Structure the LLM Fills | Compiler Validation |
|---|---|---|
| Requirements | AutoBeAnalyze → Structured SRS | Structure check |
| Database | AutoBeDatabase → DB schema AST | AutoBeDatabase compiler |
| API Design | AutoBeOpenApi → OpenAPI v3.2 spec | AutoBeOpenApi compiler |
| Testing | AutoBeTest → 30+ expression types | AutoBeTest compiler |
| Implementation | Modularized code (Collector/Transformer/Operation) | TypeScript compiler |
Each AST strictly limits the range of values the LLM can generate — for example, AutoBeDatabase's field types are restricted to just 7 options: "boolean" | "int" | "double" | "string" | "uri" | "uuid" | "datetime", making it impossible for the LLM to generate arbitrary types like "varchar".
Over 40 specialized AI agents execute this pipeline. It's not a simple straight line — spiral loops run within each phase, automatically repeating generation and correction upon failure. Inter-phase dependencies are managed through the Step Counter pattern — when an upstream phase re-executes, downstream phases are automatically invalidated, so a database schema change triggers cascading regeneration from the API specification through the implementation code.
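To make the field-type restriction concrete, here is a minimal sketch (hypothetical names, not AutoBe's actual source) of how a closed union type becomes an enum in the function calling schema, so a value like "varchar" is simply unrepresentable:

```typescript
// Hypothetical sketch of an AutoBeDatabase-style closed field-type union.
type FieldType =
  | "boolean" | "int" | "double" | "string" | "uri" | "uuid" | "datetime";

const FIELD_TYPES: readonly FieldType[] =
  ["boolean", "int", "double", "string", "uri", "uuid", "datetime"];

// What the function calling schema would carry for this field:
// a closed enum, so the LLM cannot invent values outside it.
const fieldTypeSchema = { type: "string", enum: FIELD_TYPES };

// Validation reduces to set membership — no heuristics needed.
const isFieldType = (v: unknown): v is FieldType =>
  typeof v === "string" && (FIELD_TYPES as readonly string[]).includes(v);

console.log(isFieldType("uuid"));    // true
console.log(isFieldType("varchar")); // false
```

The design point: correctness becomes a mechanical membership check, which is exactly what a self-healing loop needs as its verdict.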
2.3. 4-Tier Compiler Validation
AutoBe's 100% compilation guarantee comes from its 4-tier validation system. Each tier validates at a different level of abstraction.
Tier 1: AutoBeDatabase Compiler
Validates the structural integrity of the database AST. Duplicate model/field detection, referential integrity (do foreign keys point to existing models?), naming convention compliance (models in snake_case, relations in camelCase), index validity (do indexed fields exist?), and relationship consistency. Upon passing validation, the AST is transformed into actual DB schema code and compiled again.
```typescript
// Structure of diagnostic information returned by the compiler
interface IError {
  path: string;          // location, e.g. "application.files[0]"
  table: string | null;  // target model, e.g. "shopping_customers"
  field: string | null;  // target field, e.g. "shopping_customer_id"
  message: string;       // detailed error description (cause)
}
```
Tier 2: AutoBeOpenApi Compiler
Validates the OpenAPI v3.2 specification. Checks consistency with the database schema β whether DTO fields correspond to actual model fields, whether all tables have API operations. Verifies path uniqueness and schema reference validity. Upon passing, generates NestJS templates, DTO types, and module configurations.
Tier 3: AutoBeTest Compiler
Validates the test AST. Verifies that E2E test code composed of 30+ IExpression variants has correct structure and is consistent with the API specification. Upon passing, generates actual TypeScript test code.
Tier 4: TypeScript Compiler
The final validation gate. Compiles in strict mode (strict null checks, no implicit any). Supports incremental compilation β reusing previous compilation results yields 15x performance improvement. Provides file/line/column-level precise diagnostics. Concurrent compilations are limited to 2 via semaphore to prevent system overload.
All four compilers, upon failure, return "where, what, and why it went wrong" in structured form. This diagnostic information enables the self-healing loops described in the next section.
2.4. Self-Healing Loops
Compilation failure is not the end. AutoBe's core mechanism is the self-healing loop.
Generate → Compile → Extract Diagnostics → Correct → Recompile → (repeat until success)
- The LLM generates structured data
- The compiler validates it
- On failure, diagnostics with exact locations and causes are extracted
- The Correct agent receives the original code + diagnostics and makes fixes
- Recompiles
- Repeats until success
These loops nest hierarchically. The most complex Realize (implementation) phase has 4 levels of retry:
- Inline retry — immediate single retry after generation
- Correction loop — recursive correction based on compilation error diagnostics
- Outer retry loop — reprocesses failed operations up to 2 times
- Selective reprocessing — if 38 of 40 APIs succeed, only the 2 failures are reprocessed
Successful code is preserved. Only the failed parts are corrected.
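The outer retry plus selective reprocessing can be sketched as follows (hypothetical structure and names; AutoBe's real pipeline is asynchronous and LLM-driven, simplified here to synchronous calls for brevity):

```typescript
// Hypothetical sketch of "preserve successes, retry only failures."
interface IResult {
  operation: string;
  success: boolean;
  code?: string;
}

function processAll(
  operations: string[],
  generate: (operation: string) => IResult,
  maxOuterRetries: number = 2,
): IResult[] {
  const completed = new Map<string, IResult>();
  let pending = operations;
  // Initial pass + up to maxOuterRetries reprocessing rounds
  for (let round = 0; round <= maxOuterRetries && pending.length > 0; round++) {
    for (const operation of pending) {
      const result = generate(operation);
      if (result.success) completed.set(operation, result); // preserve successes
    }
    pending = pending.filter((op) => !completed.has(op)); // only failures go around again
  }
  return operations.map(
    (op) => completed.get(op) ?? { operation: op, success: false },
  );
}
```

In the 38-of-40 scenario above, only the two failed operations re-enter generate() on the next round; the 38 successes are never regenerated.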
On top of this, Typia's validation feedback adds precise error correction at the function calling level. AutoBe's compilers handle final validation at the code level, while Typia handles structural validation at the function calling level. The combination of these two layers is the driving force behind 100% compilation. Typia's role is covered in detail in Chapter 3.
2.5. The Forms Are Not Simple
The structures the LLM must fill are far from simple.
During API design, the DTO schema types the LLM generates describe the data structure of API requests/responses — "a product's price is a positive integer, name is a string, category list is a string array." The IJsonSchema that defines these types is a recursive union of 10 variants:
```typescript
export type IJsonSchema =
  | IJsonSchema.IConstant
  | IJsonSchema.IBoolean
  | IJsonSchema.IInteger
  | IJsonSchema.INumber
  | IJsonSchema.IString
  | IJsonSchema.IArray // items: IJsonSchema → recursive
  | IJsonSchema.IObject // properties: Record<string, IJsonSchema> → recursive
  | IJsonSchema.IReference
  | IJsonSchema.IOneOf // oneOf: IJsonSchema[] → recursive
  | IJsonSchema.INull;
```
10 variants, infinitely recursive nesting. The first-try function calling success rate for this type is 6.75%.
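For intuition, here is one concrete value an LLM must produce against this union — written with standard JSON Schema property names, which may differ from AutoBe's exact AST field names. Every nesting level is another variant choice, and another chance to pick the wrong branch:

```typescript
// Illustrative instance of the recursive union: an IObject whose properties
// nest IInteger, IString, and an IArray of IString.
const productSchema = {
  type: "object", // IObject
  properties: {
    price: { type: "integer", minimum: 0 }, // IInteger
    name: { type: "string" }, // IString
    categories: {
      type: "array", // IArray → recursion begins
      items: { type: "string" }, // IString, one level down
    },
  },
  required: ["price", "name", "categories"],
};

console.log(productSchema.properties.categories.items.type); // "string"
```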
The testing phase raises complexity another level. E2E test code must express logic like "call this API, verify the response status is 200, and check that the body's items array length is greater than 0." The IExpression type that captures this:
```typescript
export type IExpression =
  | IBooleanLiteral | INumericLiteral | IStringLiteral // literals
  | IArrayLiteralExpression | IObjectLiteralExpression // compound literals
  | INullLiteral | IUndefinedKeyword // null/undefined
  | IIdentifier | IPropertyAccessExpression // accessors
  | IElementAccessExpression | ITypeOfExpression // access/operations
  | IPrefixUnaryExpression | IPostfixUnaryExpression // unary operations
  | IBinaryExpression // binary operations
  | IArrowFunction | ICallExpression | INewExpression // functions
  | IArrayFilterExpression | IArrayForEachExpression // array operations
  | IArrayMapExpression | IArrayRepeatExpression // array operations
  | IPickRandom | ISampleRandom | IBooleanRandom // random generation
  | IIntegerRandom | INumberRandom | IStringRandom // random generation
  | IPatternRandom | IFormatRandom | IKeywordRandom // random generation
  | IEqualPredicate | INotEqualPredicate // assertions
  | IConditionalPredicate | IErrorPredicate; // assertions
```
30+ variants with recursive nesting. Programming language-level complexity that must be handled in a single function call.
2.6. Schema Specs Are Prompts
In conventional AI agents, "prompts" are natural language instructions. "You are a backend development expert. Follow these rules when writing code…"
In AutoBe, what serves as prompts is not natural language but the schema specs themselves. AutoBeDatabase's stance enum tells the model what kinds of tables to create, IJsonSchema's 10 variants define how DTOs should be structured, and IExpression's 30+ variants specify what grammar test code should follow.
Natural language prompts are ambiguous, interpreted differently by each model, and impossible to verify compliance with. Schema specs are unambiguous, model-independent, and mechanically verifiable. This isn't to say system prompts are useless — it's that schema specs are a more powerful means of instruction than prompts.
Schema specs are the new prompts.
2.7. Four Qwen Models, All 100%
AutoBe currently tests against four Qwen models. All achieve successful compilation.
| Model | Active / Total Parameters | Characteristics |
|---|---|---|
| qwen/qwen3-coder-next | 3B / 80B | Coding-specialized |
| qwen/qwen3.5-397b-a17b | 17B / 397B | Largest MoE |
| qwen/qwen3.5-122b-a10b | 10B / 122B | Medium MoE |
| qwen/qwen3.5-35b-a3b | 3B / 35B | Small MoE |
From 397B to 35B. Even a small model with 3B active parameters generates a complete shopping mall backend. Same schema, same pipeline, same result.
3. Typia — The Infrastructure That Turns 0% into 100%
Chapter 2 described what AutoBe builds — but not how it survives 6.75%. Schema generation, broken JSON recovery, type coercion, precise error feedback — every piece of infrastructure that makes function calling work on complex types, despite the industry consensus that it can't. Who handles all of it?
Typia. Making function calling reliable on recursive union types required going deeper than runtime libraries can reach. Runtime reflection can't see TypeScript types — they're erased at compile time. Zod-style schema builders choke on recursive unions (Appendix A.4 explains why). The only path was to operate at the compiler level itself — analyze types directly from source code and generate every piece of infrastructure from that single source of truth.
That's what Typia is: a compiler library that directly leverages the TypeScript compiler's type analyzer to automatically generate JSON Schema, validators, parsers, and feedback generators at compile time. Define one type, and the compiler handles the rest. It's the result of choosing to solve the problem at the deepest layer available, because every shallower approach hit a wall.
Let's examine in detail how it turns qwen3-coder-next's 6.75% success rate and qwen3.5's 0% success rate into 100%.
3.1. From TypeScript Types to Function Calling Schemas
Function calling requires JSON Schema to tell the LLM "give me data in this structure." Normally, developers define types, separately write schemas, and keep the two synchronized forever.
Typia automates this process. Define a TypeScript type, and Typia automatically generates validation code and JSON Schema at compile time — not through runtime reflection, but by directly leveraging the TypeScript compiler's type analyzer.
Let's see the principle first. When you call typia.is<T>(), type information is analyzed at compile time and transformed into optimized validation code:
Before Compilation: TypeScript
```typescript
import typia, { tags } from "typia";

interface IMember {
  id: string & tags.Format<"uuid">;
  email: string & tags.Format<"email">;
  age: number &
    tags.Type<"uint32"> &
    tags.ExclusiveMinimum<19> &
    tags.Maximum<100>;
}

const check: boolean = typia.is<IMember>(input);
```
A single line — typia.is<IMember>(input) — transforms at compile time into optimized code containing UUID regex, email regex, integer checks, and range checks. It overcomes TypeScript's limitation of erasing type information at runtime through a compiler plugin.
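What does that generated code roughly look like? A hand-written approximation (not Typia's actual emitted output, which is machine-generated and more exhaustive):

```typescript
// Hand-written approximation of the validator Typia would generate for IMember.
const check = (input: any): boolean =>
  typeof input === "object" &&
  input !== null &&
  typeof input.id === "string" &&
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(input.id) &&
  typeof input.email === "string" &&
  /^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(input.email) &&
  typeof input.age === "number" &&
  Number.isInteger(input.age) && // tags.Type<"uint32">
  input.age > 19 &&              // tags.ExclusiveMinimum<19>
  input.age <= 100;              // tags.Maximum<100>
```

Every constraint in the type — the UUID format, the email format, the exclusive bound — reappears as a concrete runtime check, with no validation code written by hand.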
This principle applies directly to function calling. typia.llm.parameters<T>() generates JSON Schema through the same type analysis:
Before Compilation: TypeScript
```typescript
import typia, { tags } from "typia";

interface IMember {
  /**
   * Member's age.
   *
   * Only adults aged 19 or older can register.
   * This is the platform's legal age restriction.
   */
  age: number & tags.Type<"uint32"> & tags.ExclusiveMinimum<18>;
  email: string & tags.Format<"email">;
  name: string & tags.MinLength<1> & tags.MaxLength<100>;
}

const schema = typia.llm.parameters<IMember>();
```
JSDoc comments become description fields. The LLM reads these descriptions to decide what values to generate. Type constraints become validation rules. ExclusiveMinimum<18> becomes a "> 18" rule, and Format<"email"> becomes an email format check. A single type definition simultaneously generates LLM guidance and validation rules.
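The emitted schema has roughly this shape (illustrative only — exact keys and metadata depend on the Typia version and target schema dialect):

```typescript
// Roughly what typia.llm.parameters<IMember>() would produce.
const memberSchema = {
  type: "object",
  properties: {
    age: {
      type: "integer", // tags.Type<"uint32">
      exclusiveMinimum: 18, // tags.ExclusiveMinimum<18>
      description:
        "Member's age.\n\n" +
        "Only adults aged 19 or older can register.\n" +
        "This is the platform's legal age restriction.",
    },
    email: { type: "string", format: "email" }, // tags.Format<"email">
    name: { type: "string", minLength: 1, maxLength: 100 },
  },
  required: ["age", "email", "name"],
};
```

Note how the JSDoc text travels into description: the LLM's instructions and the validator's rules are a single artifact.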
At the class level, typia.llm.application<T>() can schematize an entire API:
```typescript
import { LlmJson } from "@typia/utils";
import typia from "typia";

declare class ShoppingOrderController {
  /** Creates an order */
  create(input: IShoppingOrderCreate): void;
}

const app = typia.llm.application<ShoppingOrderController>();
const func = app.functions[0];

// All public methods have built-in parse() and validate()
const data = func.parse(llmOutput); // broken JSON recovery + type coercion
const result = func.validate(data); // schema violation detection
if (result.success === false) {
  const feedback = LlmJson.stringify(result); // LLM-readable feedback generation
}
```
The type is the schema. The constraints the LLM sees and the constraints the validator applies are always identical — because they come from the same source.
This is the key point. The schema generated by the Typia compiler from source code types powers every runtime function that follows. The schema that parse() references when recovering broken JSON and coercing types, the schema that validate() uses as the comparison target when diagnosing errors — they're all the same schema, automatically generated from types at compile time. Because it's compiler output, not manually written, types and schemas can never diverge.
3.2. The Cause of 6.75%: Structural Complexity
The 10 variants of IJsonSchema and 30+ variants of IExpression from Chapter 2. Why is the first-try success rate so low?
Recursive union types cause combinatorial explosion. 10 variants nested 3 levels deep create 1,000 possible paths. With 30 variants, that's 27,000. The probability of the LLM choosing the correct path in one try is structurally low.
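The arithmetic is simple:

```typescript
// Paths through a union with v variants when choices nest d levels deep: v^d.
const paths = (variants: number, depth: number): number => variants ** depth;

console.log(paths(10, 3)); // 1000  — IJsonSchema: 10 variants, 3 levels
console.log(paths(30, 3)); // 27000 — IExpression: 30 variants, 3 levels
```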
Moreover, subtle errors are frequent in union types:
- Chose the correct variant but got the type of a sub-field wrong
- Confused variants at recursive depth
- Missing required fields
- Serialized objects as strings (double-stringify)
These errors are "structurally correct but semantically wrong," making it difficult to provide accurate feedback with simple JSON Schema validation.
6.75% is the natural result of this structural complexity. The issue isn't the first try — it's what happens after failure.
3.3. Lenient JSON Parsing: Recovering Broken JSON
LLMs don't produce perfect JSON. They are language models that generate text token by token, not JSON generators. They leave brackets unclosed, misplace commas, prepend "Here is your answer:" before the JSON, and wrap it in Markdown code blocks.
JSON.parse() rejects all of this. Typia's ILlmFunction.parse() handles every case:
| Problem | Example | Handling |
|---|---|---|
| Unclosed brackets | {"name": "John" | Auto-close |
| Trailing commas | [1, 2, 3, ] | Ignore |
| JavaScript comments | {"a": 1 /* comment */} | Remove |
| Unquoted keys | {name: "John"} | Allow |
| Incomplete keywords | {"done": tru | Complete to true |
| Description prefix | Here is your JSON: {"a": 1} | Skip |
| Markdown code blocks | ```json\n{"a": 1}\n``` | Extract inner content |
When you call func.parse() on actual LLM output where these problems occur simultaneously:
```typescript
import { dedent } from "@typia/utils";
import typia, { ILlmApplication, ILlmFunction, tags } from "typia";

const app: ILlmApplication = typia.llm.application<OrderService>();
const func: ILlmFunction = app.functions[0];

// LLM sometimes returns malformed JSON with wrong types
const llmOutput = dedent`
  > LLM sometimes returns some prefix text with markdown JSON code block.
  I'd be happy to help you with your order!

  \`\`\`json
  {
    "order": {
      "payment": "{\\"type\\":\\"card\\",\\"cardNumber\\":\\"1234-5678", // unclosed string & bracket
      "product": {
        name: "Laptop", // unquoted key
        price: "1299.99", // wrong type (string instead of number)
        quantity: 2, // trailing comma
      },
      "customer": {
        // incomplete keyword + unclosed brackets
        "name": "John Doe",
        "email": "john@example.com",
        vip: tru
  \`\`\` `;

const result = func.parse(llmOutput);
if (result.success) console.log(result);

interface IOrder {
  payment: IPayment;
  product: {
    name: string;
    price: number & tags.Minimum<0>;
    quantity: number & tags.Type<"uint32">;
  };
  customer: {
    name: string;
    email: string & tags.Format<"email">;
    vip: boolean;
  };
}
type IPayment =
  | { type: "card"; cardNumber: string }
  | { type: "bank"; accountNumber: string };

declare class OrderService {
  /**
   * Create a new order.
   *
   * @param props Order properties
   */
  createOrder(props: { order: IOrder }): { id: string };
}
```
There's a critical difference. Most JSON repair tools (jsonrepair, dirty-json, LangChain's parse_partial_json) operate at the string level — cleaning trailing commas, closing brackets, removing Markdown, then passing to JSON.parse(). A double-stringified value "{\"type\":\"card\"}" is already valid JSON (a string), so it passes through as-is. Without a schema, there's no way to know it should be an object.
Typia's parse() is different. It parses greedily while referencing the schema the compiler generated from types in Section 3.1. When it encounters a string where the schema expects an object, it recursively calls parse() on that string. Parsing and coercion recursively call each other — a schema-based recursive cycle — naturally unwinding layers of stringification, double or triple.
parse() performs not just JSON recovery but also schema-based type coercion. LLMs frequently get types wrong — "42" (string) where 42 (number) should be, "true" (string) where true (boolean) should be. Simple casting doesn't solve this. Whether "42" should become a number or stay a string depends entirely on the field's schema, which was automatically generated by the Typia compiler from TypeScript types.
| LLM Output | Schema Expected Type | parse() Coercion Result |
|---|---|---|
| "42" | number or integer | 42 |
| "true" / "false" | boolean | true / false |
| "null" | null | null |
| "{\"x\": 1}" | object | { x: 1 } (recursive parsing) |
| "[1, 2, 3]" | array | [1, 2, 3] (recursive parsing) |
3.4. Qwen 3.5's 0% Problem: Double-Stringify
There's an even more dramatic case with the qwen3.5 series.
From the Typia documentation's function calling example:
```typescript
import { tags } from "typia";

interface IOrder {
  payment: IPayment;
  product: {
    name: string;
    price: number & tags.Minimum<0>;
    quantity: number & tags.Type<"uint32">;
  };
  customer: {
    name: string;
    email: string & tags.Format<"email">;
    vip: boolean;
  };
}
type IPayment =
  | { type: "card"; cardNumber: string }
  | { type: "bank"; accountNumber: string };
```
```typescript
const llmOutput = `
  > LLM sometimes returns some prefix text with markdown JSON code block.
  I'd be happy to help you with your order!

  \`\`\`json
  {
    "order": {
      "payment": "{\"type\":\"card\",\"cardNumber\":\"1234-5678", // unclosed string & bracket
      "product": {
        name: "Laptop", // unquoted key
        price: "1299.99", // wrong type (string instead of number)
        quantity: 2, // trailing comma
      },
      "customer": {
        // incomplete keyword + unclosed brackets
        "name": "John Doe",
        "email": "john@example.com",
        vip: tru
  \`\`\``;
```
Markdown wrapping, description prefix, unquoted keys, trailing commas, tru (instead of true), unclosed brackets — and payment is double-stringified. Instead of {"type": "card", ...} (an object), it generated "{\"type\": \"card\", ...}" (a string containing JSON). Seven problems in a single output.
Double-stringification brings the success rate to 0%. Other errors are intermittent, but anyOf double-stringification is 100% consistent — on every anyOf field, every time. It's not a Qwen-only problem either. Anthropic's Claude exhibits the same behavior with oneOf. Every model family has a blind spot for union types.
Typia's parse() handles all of this in a single call — broken JSON recovery, type coercion, double-stringify unwinding. No model change. No prompt tuning. This is how Qwen 3.5 went from 0% to 100%.
3.5. Validation Feedback: Precise Error Feedback
Even after parsing and coercion, values themselves can be wrong. Negative prices, strings that aren't emails, decimals where integers should be.
Typia's ILlmFunction.validate() detects schema violations and tells you exactly where and why something is wrong:
```typescript
import { LlmJson } from "@typia/utils";
import typia, { ILlmApplication, ILlmFunction, IValidation, tags } from "typia";

const app: ILlmApplication = typia.llm.application<OrderService>();
const func: ILlmFunction = app.functions[0];

// LLM generated invalid data
const input = {
  order: {
    payment: { type: "card", cardNumber: 12345678 }, // should be string
    product: {
      name: "Laptop",
      price: -100, // violates Minimum<0>
      quantity: 2.5, // should be uint32
    },
    customer: {
      name: "John Doe",
      email: "invalid-email", // violates Format<"email">
      vip: "yes", // should be boolean
    },
  },
};

// Validate and format errors for LLM feedback
const result: IValidation = func.validate(input);
if (result.success === false) {
  const feedback: string = LlmJson.stringify(result);
  console.log(feedback);
}

interface IOrder {
  payment: IPayment;
  product: {
    name: string;
    price: number & tags.Minimum<0>;
    quantity: number & tags.Type<"uint32">;
  };
  customer: {
    name: string;
    email: string & tags.Format<"email">;
    vip: boolean;
  };
}
type IPayment =
  | { type: "card"; cardNumber: string }
  | { type: "bank"; accountNumber: string };

declare class OrderService {
  /**
   * Create a new order.
   *
   * @param props Order properties
   */
  createOrder(props: { order: IOrder }): { id: string };
}
```
"The price inside product inside order should be ≥ 0, but you gave -100."
LlmJson.stringify() renders these errors as // ← inline comments on top of the LLM's original JSON:
```jsonc
{
  "order": {
    "payment": {
      "type": "card",
      "cardNumber": 12345678 // ← [{"path":"$input.order.payment.cardNumber","expected":"string"}]
    },
    "product": {
      "name": "Laptop",
      "price": -100, // ← [{"path":"$input.order.product.price","expected":"number & Minimum<0>"}]
      "quantity": 2.5 // ← [{"path":"$input.order.product.quantity","expected":"number & Type<\"uint32\">"}]
    },
    "customer": {
      "name": "John Doe",
      "email": "invalid-email", // ← [{"path":"$input.order.customer.email","expected":"string & Format<\"email\">"}]
      "vip": "yes" // ← [{"path":"$input.order.customer.vip","expected":"boolean"}]
    }
  }
}
```
cardNumber should be a string but got a number. price should be ≥ 0. quantity should be a positive integer. email is not a valid email. vip should be a boolean. Five errors, each with exact path and expected type.
The LLM sees exactly where it went wrong on its own JSON. Instead of rewriting everything, it only needs to fix the 5 marked fields. Precise, structured, immediately actionable feedback.
3.6. The Complete Feedback Loop
Combining everything into a single loop:
```typescript
async function callWithFeedback(
  llm: LLM,
  func: ILlmFunction,
  prompt: string,
  maxRetries: number = 10,
): Promise<unknown> {
  let feedback: string | null = null;
  for (let i = 0; i < maxRetries; i++) {
    // 1. Request function call from LLM (including previous feedback)
    const rawOutput = await llm.call(prompt, feedback);

    // 2. Lenient JSON parsing + type coercion
    const parsed = func.parse(rawOutput);
    if (!parsed.success) {
      feedback = `JSON parsing failed: ${JSON.stringify(parsed.errors)}`;
      continue;
    }

    // 3. Schema validation
    const validated = func.validate(parsed.data);
    if (!validated.success) {
      // 4. Generate structured feedback (// ← inline comments)
      feedback = LlmJson.stringify(validated);
      continue;
    }

    // 5. Success
    return validated.data;
  }
  throw new Error("Maximum retry count exceeded");
}
```
parse() recovers broken JSON and performs initial type coercion. validate() catches schema violations. LlmJson.stringify() renders errors in a format the LLM can read. The LLM self-corrects and retries.
This is the complete loop that turns 6.75% into 100%.
3.7. Harness Engineering: The Union of AutoBe + Typia
This is where the concept of a harness is finally complete.
A climbing harness doesn't make you stronger — it makes your strength safe. A test harness doesn't make code correct — it makes bugs visible. A function calling harness doesn't make the LLM smarter — it makes the LLM's mistakes correctable.
The combination of AutoBe and Typia constitutes this harness. Each layer was added because the previous one wasnβt enough.
We started with raw JSON.parse(). It broke constantly β unclosed brackets, trailing commas, Markdown wrappers. So we built lenient parsing. That got us from 0% to βat least we can read the output.β
But parsed JSON had wrong types everywhere. "42" instead of 42, "true" instead of true. Without a schema, thereβs no way to know which is correct. So we built schema-based type coercion β the same schema the compiler generated from types now guided the parser.
Coercion fixed type mismatches, but values themselves were still wrong β negative prices, invalid emails, decimals where integers should be. The LLM had no idea what it got wrong. So we built validation feedback β // β inline comments showing exactly where and why each field failed.
Feedback fixed individual function calls, but the system as a whole still had consistency gaps: a valid DTO schema referencing a database field that didn't exist, valid test code calling an API endpoint with wrong parameters. So we built 4-tier compiler validation, each tier catching a different class of inconsistency.
No layer was planned in advance. Each was the minimum response to a specific failure mode:
Typia Layer (function calling level):
- Type → automatic schema generation
- Broken JSON recovery (lenient parsing)
- Schema-based type coercion
- Precise error feedback (validation feedback)
AutoBe Layer (system level):
- 4 AST types + 4-tier compiler validation
- Self-healing loops (diagnostics → correction → revalidation)
- Hierarchical orchestration (40+ agent collaboration)
- Batch processing + prompt caching optimization
Typia makes function calling I/O robust, and AutoBe ensures system-wide consistency. The combination of these two layers completes the deterministic loop wrapping a probabilistic model: the harness.
Define a single TypeScript type and the Typia compiler handles the rest:
- Compile time: source code analysis → analyzes TypeScript types to auto-generate JSON Schema, validators, and parser code
- Schema generation → typia.llm.parameters<T>(), typia.llm.application<T>()
- Parsing + type coercion → ILlmFunction.parse() (recovers broken JSON, coerces types, and unwinds double-stringify in one pass using compiler-generated schemas)
- Validation → ILlmFunction.validate() (detects violations using the same schema)
- Feedback generation → LlmJson.stringify() (LLM-readable // ← inline diagnostics)
The type is the schema, the validator, and the prompt. The harness is everything around it.
4. In Praise of Function Calling
Chapters 2 and 3 showed how it works. Now: why function calling is this powerful, and why the widespread skepticism, while accurate about the symptoms, misses the root cause.
"Structured outputs create false confidence." "Your agent demo works perfectly… then you deploy with 50 real endpoints and everything falls apart." These aren't strawmen; they're published findings and the lived experience of engineers who tried structured output on complex schemas and watched it crumble.
The criticism is accurate when you use structured output without a harness. Constrained decoding alone does degrade reasoning. Strict mode alone does fail on complex types. JSON.parse() alone does discard salvageable output. Every criticism describes what happens when you treat function calling as a feature to toggle on, rather than as infrastructure to build around.
This chapter argues that the failures the industry observed are not evidence against function calling, but evidence that the harness was missing.
4.1. Natural Language vs Types
Natural language evolved to be ambiguous. Metaphor, nuance, politeness, humor: all operate on top of ambiguity. "Just make it pretty" works between humans.
Programming languages were designed to eliminate ambiguity. "Just make it pretty" doesn't compile.
When people communicate in natural language, misunderstandings arise. When they communicate through types, there are none.
Expressing constraints through prompts:
"The age field should be a positive integer greater than 18. Don't use string types for number fields. All required fields must be present…"
Is "greater than 18" >18 or ≥18? You can't know whether the LLM followed this rule without manually inspecting the output. As schemas grow, these rules multiply endlessly.
Expressing constraints through types:
interface IMember {
/** Only adults 19+ can register */
age: number & Type<"uint32"> & ExclusiveMinimum<18>;
}

ExclusiveMinimum<18> is >18. It's an integer. It's required. No ambiguity, mechanically verifiable.
In domains requiring precision, type constraints provide certainty that natural language instructions cannot.
4.2. The Pink Elephant Problem
If you've built a prompt-based AI agent, you've written prohibition rules:
- "Don't create utility functions"
- "Don't use the any type"
- "Don't create circular dependencies"
"Don't think of a pink elephant." The first thing that comes to mind is a pink elephant. When you tell an LLM "don't do X," X gets placed at the center of attention. To avoid a forbidden pattern, the model must first recall that pattern, which paradoxically increases its generation probability. This is the essence of token prediction.
Even knowing this, you can't avoid prohibition rules in prompts. "Don't do X" is the only way natural language can express constraints.
With schemas, this problem disappears.
No need to say "don't use the any type": if any doesn't exist in the schema, the LLM physically cannot generate it. No need to say "don't create utility functions": if there's no slot for utility functions, that's the end of it. When field types are limited to "boolean" | "int" | "double" | "string" | "uri" | "uuid" | "datetime" (7 choices), there's no path for the LLM to write "varchar".
Not prohibition, but absence. Prompts prohibit what you don't want. Schemas allow only what you do want.
This is function calling's deepest advantage: instead of fighting the model's tendencies, it makes unwanted outputs structurally impossible.
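The closed enumeration of field types can be expressed directly. A small sketch (the isFieldType guard is an illustrative helper, not part of AutoBe):

```typescript
// The 7 allowed field types mentioned above; anything else (e.g. "varchar")
// simply has no representation in the type.
const FIELD_TYPES = ["boolean", "int", "double", "string", "uri", "uuid", "datetime"] as const;
type FieldType = (typeof FIELD_TYPES)[number];

// Runtime guard mirroring the compile-time union.
function isFieldType(value: string): value is FieldType {
  return (FIELD_TYPES as readonly string[]).includes(value);
}
```

At compile time, assigning "varchar" to a FieldType is a type error; at runtime, the same closed list drives validation. There is no rule saying "don't write varchar", only an absence of any slot for it.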
4.3. Model Neutrality
Prompt engineering is inherently model-dependent. A prompt optimized for GPT behaves differently on Claude, and differently again on Qwen. Rewriting prompts with each new model is routine.
Function calling-based approaches are model-neutral. JSON Schema means the same thing regardless of which model reads it. The validation feedback loop absorbs performance differences between models. Strong models converge in 1–2 attempts, weaker models take 3–4, but both reach 100%.
AutoBe runs Qwen, GLM, DeepSeek, and OpenAI models with the same schema and the same pipeline, achieving 100% compilation across all of them: proof of this neutrality. No model-specific prompt tuning was ever performed.
This changes the nature of model selection: from "Can this model do this task?" (a capability question) to "Which model is most cost-effective?" (a cost optimization problem: average retries × tokens per attempt × cost per token).
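That cost formula is simple enough to sketch. All figures here are hypothetical placeholders, purely to show the shape of the comparison:

```typescript
// Expected cost per successful call: average retries x tokens per attempt x cost per token.
// All numbers below are hypothetical, used only to illustrate the comparison.
interface IModelCost {
  averageRetries: number;   // attempts needed per eventual success
  tokensPerAttempt: number; // prompt + completion tokens per attempt
  costPerToken: number;     // USD per token
}

function expectedCostPerSuccess(m: IModelCost): number {
  return m.averageRetries * m.tokensPerAttempt * m.costPerToken;
}

// A strong model converging in ~1.5 attempts vs a cheap model taking ~4:
const strong = expectedCostPerSuccess({ averageRetries: 1.5, tokensPerAttempt: 8000, costPerToken: 10e-6 });
const cheap = expectedCostPerSuccess({ averageRetries: 4, tokensPerAttempt: 8000, costPerToken: 1e-6 });
```

Because the harness guarantees eventual convergence, extra retries are just extra tokens, and the cheaper model can win even with a lower first-try rate.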
4.4. The Core: Verifiability
A single thread runs through everything.
Function calling's fundamental advantage is that it brings LLM output into the domain of software engineering.
Free-form text output makes correctness an AI problem. Parsing is fuzzy. Validation is fuzzy. Correction is fuzzy.
Structured output makes correctness an engineering problem:
- Validation is deterministic: JSON Schema validation is a clear pass/fail
- Feedback is precise: "Field X should be type Y but you gave Z"
- Correction converges: precise feedback causes the model to fix only that part
The model is still probabilistic. It still makes mistakes. But because the structure wrapping the model is deterministic, the process converges to 100%.
Type schema + deterministic validator + structured feedback = harness
Prompt engineering tries to make the probabilistic part reliable. Function calling makes the deterministic part perfect. In domains requiring precision, the latter wins: 6.75% → 100%.
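As a rough lower bound on convergence, treat each retry as an independent draw at the 6.75% first-try rate. The real harness converges faster, since validation feedback raises the per-attempt probability, but even this pessimistic model converges:

```typescript
// Pessimistic convergence model: independent retries at a fixed per-attempt
// success rate p. The actual feedback loop does better than this, because
// each retry carries precise error feedback from the previous attempt.
function successWithin(p: number, attempts: number): number {
  return 1 - Math.pow(1 - p, attempts);
}

const p = 0.0675;                      // the first-try rate from the text
const within50 = successWithin(p, 50); // probability of success within 50 tries
```

Even with no learning between attempts, 50 retries put success above 96%; with structured feedback, observed convergence is far faster (1–4 attempts in practice, per Section 4.3).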
4.5. This Pattern Is Universal
Does this pattern only apply to code generation? No. It applies to every domain where output is mechanically verifiable.
4.5.1. Applicable Domains
AutoBe's Database, Interface, Test, and Realize phases all fall into this category. Compilers serve as validators, guaranteeing 100% correctness.
This isn't just about software. The same structure is possible in every field where "correct/incorrect" can be mechanically determined, with a natural hierarchy based on validation cost:
| Domain | Fast (ms) | Medium (sec) | Deep (min+) |
|---|---|---|---|
| Software | Type check | Compilation | Test execution |
| Semiconductor | DRC | LVS | SPICE simulation |
| Chemical Process | Mass balance | Energy balance | Process simulation |
| Interior Design | Dimensions/clearance | Building codes, collision detection | Lighting/HVAC simulation |
| Control Systems | Transfer function validity | Stability/margin analysis | Time-domain simulation |
Running the cheapest validator first, fixing errors, then moving to the next tier is the natural strategy.
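The cheapest-first strategy can be sketched as a generic runner. Validator names and costs below are illustrative, not AutoBe's actual tiers:

```typescript
// Sketch of the cheapest-first validation strategy: run validators in
// ascending cost order and stop at the first failure, so expensive tiers
// only run on artifacts that already passed the cheap ones.
interface IValidator<T> {
  name: string;
  costMs: number; // rough cost tier, used only for ordering
  check: (artifact: T) => string | null; // null = pass, string = error message
}

function runTiered<T>(
  artifact: T,
  validators: IValidator<T>[],
): { passed: string[]; failedAt: string | null; error: string | null } {
  const passed: string[] = [];
  for (const v of [...validators].sort((a, b) => a.costMs - b.costMs)) {
    const error = v.check(artifact);
    if (error !== null) return { passed, failedAt: v.name, error };
    passed.push(v.name);
  }
  return { passed, failedAt: null, error: null };
}

// A design with a negative dimension fails the fast check,
// so the expensive simulation tier never runs.
const result = runTiered({ width: -1 }, [
  { name: "simulation", costMs: 60000, check: () => null },
  { name: "dimension check", costMs: 1, check: (d: { width: number }) => (d.width > 0 ? null : "width must be positive") },
]);
```

The returned error string is exactly the material for the feedback loop: it names the tier and the violation, so the model can correct just that part before the next, more expensive tier runs.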
4.5.2. Concrete Types from Other Domains
The table above was an overview. Here's what this looks like as concrete types, each from a field where validators have been refined for decades.
Semiconductors. The physical rules of chip design are non-negotiable:
interface IChipLayout {
technology_node: "5nm" | "7nm" | "14nm" | "28nm";
blocks: IBlock[];
connections: IConnection[];
}
interface IBlock {
type: "logic" | "memory" | "io" | "analog" | "pll";
position: IPoint2D;
dimensions: IDimension;
sub_blocks: IBlock[]; // recursive hierarchy
}

DRC (fast), LVS (medium), SPICE simulation (slow). All deterministic.
Chemical Processes. Conservation laws are absolute validators:
interface IProcessStream {
temperature: number & Minimum<0>; // Kelvin
pressure: number & Minimum<0>; // Pa
composition: IComponent[]; // must sum to 1.0
phase: "liquid" | "vapor" | "solid" | "two_phase";
flow_rate: number & Minimum<0>; // kg/s
}
interface IUnitOperation {
type: "reactor" | "distillation" | "heat_exchanger"
| "compressor" | "pump" | "mixer" | "splitter";
inlet_streams: IProcessStream[];
outlet_streams: IProcessStream[]; // mass balance: Σin = Σout
energy_duty: number; // energy balance
}

Mass conservation (Σ inlet = Σ outlet), energy balance, thermodynamic consistency: these are laws of physics, not opinions. Tools like ASPEN and HYSYS have provided deterministic validation for over 40 years.
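As a sketch of what the fast mass-balance tier could look like (a deliberately simplified stream type, not a real process simulator):

```typescript
// Fast-tier mass-balance check for a unit operation, reduced to the one
// field needed here. A tolerance absorbs floating-point noise.
interface IStream {
  flow_rate: number; // kg/s
}

function massBalanceError(
  inlets: IStream[],
  outlets: IStream[],
  tolerance = 1e-6,
): string | null {
  const sum = (streams: IStream[]) => streams.reduce((acc, s) => acc + s.flow_rate, 0);
  const inFlow = sum(inlets);
  const outFlow = sum(outlets);
  return Math.abs(inFlow - outFlow) <= tolerance
    ? null
    : `mass balance violated: inlet ${inFlow} kg/s != outlet ${outFlow} kg/s`;
}
```

A returned message like "mass balance violated: inlet 5 kg/s != outlet 4 kg/s" is structurally the same kind of feedback as a compiler diagnostic: deterministic, localized, correctable.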
Interior Design. Rigid constraints define spaces beneath aesthetics:
interface IRoom {
type: "bedroom" | "living" | "kitchen" | "bathroom"
| "office" | "hallway" | "storage";
dimensions: IDimension3D;
openings: IOpening[];
fixtures: IFixture[];
}
interface IOpening {
type: "door" | "window" | "sliding_door" | "arch";
width: number & Minimum<0>; // door ≥ 900mm (accessibility)
height: number & Minimum<0>;
position: IPoint3D;
swing_direction?: "inward" | "outward" | "sliding";
}
interface IFixture {
type: "cabinet" | "counter" | "appliance"
| "furniture" | "lighting" | "plumbing";
position: IPoint3D;
dimensions: IDimension3D;
clearance_required: number; // minimum clearance (mm)
}

Minimum passage width (800mm), door width for accessibility (≥900mm), fire compartment regulations, emergency evacuation distances. BIM tools like Revit have provided collision detection for decades.
Control Systems. Stability is mathematically provable:
interface IControlLoop {
type: "PID" | "MPC" | "LQR" | "feedforward" | "cascade";
plant_model: ITransferFunction;
setpoint: number;
sampling_rate: number & Minimum<0>; // Hz
constraints: IConstraint[];
}
interface ITransferFunction {
numerator: number[]; // polynomial coefficients
denominator: number[]; // degree ≥ numerator
delay: number & Minimum<0>; // transport delay (sec)
}

Bode plots, Nyquist plots, pole placement: over 60 years of analysis tool history. Transfer function validity (fast) → stability/gain-phase margin (medium) → time-domain simulation (deep).
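The fast tier for this domain might look like the following sketch. The checks (proper transfer function, non-negative delay) are standard validity conditions from control theory, but the helper itself is illustrative:

```typescript
// Fast-tier validity check for the ITransferFunction sketch above: the system
// must be proper (denominator degree >= numerator degree) and causal (delay >= 0).
interface ITf {
  numerator: number[];   // polynomial coefficients
  denominator: number[];
  delay: number;         // transport delay (sec)
}

function transferFunctionError(tf: ITf): string | null {
  if (tf.denominator.length === 0) return "denominator must be non-empty";
  if (tf.denominator.length < tf.numerator.length)
    return "improper system: numerator degree exceeds denominator degree";
  if (tf.delay < 0) return "transport delay must be non-negative";
  return null;
}
```

Only candidates that pass this millisecond check would proceed to the medium tier (margin analysis) and the deep tier (time-domain simulation).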
Look at these types. They all have type fields with enumerated variants: "logic" | "memory" | ..., "reactor" | "distillation" | ..., "bedroom" | "living" | ..., "PID" | "MPC" | .... Several nest recursively. The same union + tree structure as AutoBe's IJsonSchema and IExpression. This is not coincidence; it's the nature of engineering data. Appendix A.3 explains why.
Note: The domain examples above were AI-recommended. All are engineering fields where deterministic validators have existed for decades, so the same structure applies in principle. However, as I'm a developer and not a domain expert, please treat the specific details as reference material.
4.5.3. Inapplicable Domains
This approach has clear limitations.
First, domains without deterministic validators: creative writing, emotional intelligence, strategic decision-making. There is no validator for "a good novel" or "a wise business decision." Without a validator, there's no feedback loop, and without a feedback loop, this structure doesn't hold.
Second, when the cost of building the structure exceeds the cost of tolerating uncertainty. Precise type design, compiler integration, and feedback formatting require upfront investment. For one-off tasks with loose accuracy requirements, a well-crafted prompt may be more appropriate. This structure shines when repeatable, verifiable accuracy is needed at scale: exactly the situation AutoBe faces in code generation.
This is not a universal solution. It's a solution for domains where accuracy is non-negotiable and mechanically verifiable. In those domains, nothing can match it.
5. Qwen β Small Models and QA Engineering
5.1. Function Calling Performance: Small/Medium Models Excel
AutoBe's entire pipeline is function calling. Whether a model writes good prose or holds natural conversations is irrelevant. The only criterion is how accurately it fills complex JSON Schemas.
Qwen isn't the only open-weight model that does function calling well. GLM, Kimi, and others show strong performance at large scale. But at the small/medium scale, Qwen was the only one that could handle function calling of this complexity.
Even small MoE models with 3B active parameters support tool choice and process complex schemas containing 10+ recursive union variants. The sections below explain why this small/medium-scale performance was decisive for AutoBe.
5.2. R&D Cost: Users vs Developers
For customers who use AutoBe, model cost is a non-issue. Even the most expensive model is cheaper than hiring a single backend developer.
For us developing AutoBe, it's different. Every time we add new type designs or validation rules, we must run the entire pipeline from start to finish. Thousands of generate-compile-feedback cycles. Using commercial models at this scale would be financial ruin.
Local models make R&D cycles possible. We experiment without limits, without cost concerns. The journey from 6.75% to 100% required hundreds of experimental cycles, possible only because models were local.
5.3. Small Models Are the Best QA Engineers
Large models make fewer mistakes. That's their advantage, and simultaneously their blind spot.
Even if there are vulnerabilities in our system that we haven't thought of, large models rarely trigger those failures. They "correctly guess" ambiguous parts of schemas and pass through. Our mistakes remain hidden.
Switch to a small model and the story changes:
| Model | Active / Total | Success Rate | What It Found |
|---|---|---|---|
| qwen3-30b-a3b-thinking | 3B / 30B | ~10% | Fundamental schema ambiguities, missing required fields |
| qwen3-next-80b-a3b-instruct | 3B / 80B | ~20% | Subtle type mismatches in complex nested relations |
The 10% success rate was the most valuable result. Every failure pointed to a system vulnerability, and each fix strengthened the pipeline for all models.
AI is probabilistic. Large models make mistakes less frequently, not never. Edge cases that surface with small models will eventually occur with large models too, just rarely. In production, "rarely" means outage.
When a system is robust enough that even a 35B model can't find vulnerabilities, the probability of any model failing approaches zero.
Small models are the ultimate stress testers. From a QA engineering perspective, weaker models are actually the more powerful verification tool.
5.4. No Vendor Lock-In
Commercial API pricing changes, model deprecations, and request limits are at the vendor's discretion. The model you use today could disappear tomorrow.
AutoBe's function calling schemas are model-neutral. No model-specific prompt tricks. JSON Schema and type-based validation are industry standards; the system remains unchanged even when models change.
5.5. Open Source + Open Weights: A Virtuous Cycle
AutoBe is open source (AGPL 3.0). Qwen is open-weight. Both are part of the open ecosystem.
This combination enabled thousands of experiments, edge case discoveries, and system hardening. This scale of experimentation would have been financially impossible with commercial models.
The open ecosystem creates a virtuous cycle:
- AutoBe strengthens its system using Qwen
- The strengthened system proves Qwenβs production-level viability
- Qwenβs improvements raise AutoBeβs overall performance
- AutoBe's discoveries (e.g., the double-stringify issue) can contribute to Qwen's improvement
6. Conclusion
We started at 6.75%. The industry said complex function calling doesn't work, and our results agreed.
But there was no alternative (deterministic AI output requires structured output), so we built the harness, one failure mode at a time. Lenient parsing because JSON broke. Type coercion because types were wrong. Validation feedback because values were wrong. Compiler pipelines because the system needed consistency.
AutoBe achieved 100% compilation across all four Qwen models. Not through better prompts, but through the accumulated engineering of every way things went wrong.
Three things: type schemas that constrain outputs, compilers that verify results, and structured feedback that corrects errors. These three form a deterministic loop wrapping probabilistic models.
This pattern is not limited to code generation. The same structure can be built in every engineering domain where deterministic validators exist: semiconductors, chemical processes, control systems.
Communicate through types and there are no misunderstandings. Constrain through schemas and there are no pink elephants. With a deterministic loop, even 6.75% becomes 100%.
6.75% is not a failure; it's the first input to the loop. If you can verify, you converge.
About AutoBe: AutoBe is an open-source AI agent developed by Wrtn Technologies. It generates production-grade backend applications from natural language.
About Typia: Typia is a compiler library that automatically generates runtime validators, JSON Schema, and function calling schemas from TypeScript types.
Appendix: Technical Deep Dive
Union types appear throughout this talk: IJsonSchema's 10 variants (Section 2.5), IExpression's 30+ variants (Section 2.5), Qwen 3.5's double-stringify problem (Section 3.4), type coercion (Section 3.3), validation feedback (Section 3.5). Sections A.1–A.4 explore why union types are the core challenge. Section A.5 explores a capability that schema-based parsing enables beyond validation.
A.1. What Is a Discriminated Union?
A union type represents "one of several kinds." If a payment method can be card or bank transfer:
type Payment =
| { type: "card"; cardNumber: string; cvc: string }
| { type: "bank_transfer"; bankCode: string; accountNumber: string }

A discriminated union has a discriminator field: a single field whose value determines the variant. Here, type is the discriminator. If type is "card", there are cardNumber and cvc; if "bank_transfer", there are bankCode and accountNumber. A single discriminator value determines the rest of the structure.
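In TypeScript, the discriminator is what lets the compiler narrow each branch. A minimal, self-contained example using the same Payment union:

```typescript
// One switch on the discriminator, and the compiler narrows every branch
// to the right variant: cardNumber is only visible in the "card" case.
type Payment =
  | { type: "card"; cardNumber: string; cvc: string }
  | { type: "bank_transfer"; bankCode: string; accountNumber: string };

function describe(payment: Payment): string {
  switch (payment.type) {
    case "card":
      return `card ending ${payment.cardNumber.slice(-4)}`;
    case "bank_transfer":
      return `transfer via bank ${payment.bankCode}`;
  }
}

const label = describe({ type: "card", cardNumber: "4242424242424242", cvc: "123" });
```

The same property that helps the compiler (one field determines the whole structure) is what makes LLM mistakes on union types identifiable and correctable.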
Why does this matter? When an LLM generates data for a union type and makes a mistake, correction requires knowing which variant was intended. With a discriminator, identification is simple: check one field and you know the variant. Without one, you must infer intent from the data's shape, which is harder, but possible with the right infrastructure.
AutoBe's IJsonSchema (10 variants) and IExpression (30+ variants) are all discriminated unions, and Typia's ability to structurally identify variants and generate per-field precise feedback is the core mechanism behind 6.75% → 100%.
A.2. Typia's x-discriminator: Adding Intelligence to anyOf
JSON Schema provides anyOf (match any) and oneOf (match exactly one) for unions. Neither carries "which field distinguishes variants"; they simply say "match one of these schemas."
OpenAPI v3.x has discriminator, but it's oneOf-only, and most LLMs can't reliably handle oneOf.
Typia solves this with x-discriminator. It uses anyOf, which LLMs broadly support, while attaching discriminator metadata:
{
"anyOf": [
{ "type": "object", "properties": { "type": { "const": "card" }, "cardNumber": { ... } } },
{ "type": "object", "properties": { "type": { "const": "bank_transfer" }, "bankCode": { ... } } }
],
"x-discriminator": {
"propertyName": "type",
"mapping": {
"card": "#/$defs/CardPayment",
"bank_transfer": "#/$defs/BankTransferPayment"
}
}
}

This serves a different purpose from Typia's internal processing. Typia's coercion and validation logic use structural analysis (matching property names, types, and shapes against each variant's schema) to identify the correct variant. They work regardless of whether a discriminator exists.
x-discriminator is for the LLM. It tells the model "use the type field to select a variant," reducing the probability of generating structurally ambiguous data in the first place.
The two work together:
- x-discriminator reduces errors at the source: the LLM reads the hint and generates clearer data
- Structural analysis handles the rest: parse() identifies the variant and applies variant-specific type coercion (including Qwen 3.5's double-stringify unwinding). validate() identifies the variant and generates per-field precise errors: not "none of the 10 variants matched," but "card variant's cardNumber should be string but you gave number"
x-discriminator makes the LLM more accurate. Structural analysis makes the system robust. This is why coercion and validation work reliably on union types.
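Structural variant identification can be illustrated with a toy scorer. This is not Typia's internal algorithm, only a sketch of the idea: match the data's keys against each variant and report errors against the best-scoring one:

```typescript
// Toy structural variant identification: score each variant by how many of
// the data's keys it explains, penalizing fields the variant requires but
// the data lacks. Real implementations also compare value types and shapes.
interface IVariant {
  name: string;
  fields: string[];
}

function identifyVariant(data: Record<string, unknown>, variants: IVariant[]): IVariant {
  const keys = Object.keys(data);
  let best = variants[0];
  let bestScore = -Infinity;
  for (const variant of variants) {
    const matched = keys.filter((k) => variant.fields.includes(k)).length;
    const missing = variant.fields.filter((f) => !keys.includes(f)).length;
    const score = matched - missing; // reward overlap, penalize absent fields
    if (score > bestScore) {
      bestScore = score;
      best = variant;
    }
  }
  return best;
}

// Even with a wrongly-typed discriminator, cardNumber/cvc identify "card".
const intended = identifyVariant(
  { type: 123, cardNumber: "4242", cvc: "999" },
  [
    { name: "card", fields: ["type", "cardNumber", "cvc"] },
    { name: "bank_transfer", fields: ["type", "bankCode", "accountNumber"] },
  ],
);
```

Once the intended variant is known, error messages can target that variant's schema instead of reporting "no variant matched", which is the prerequisite for convergent feedback.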
A.3. The World Is Made of Recursive Unions
Engineering manages complexity through hierarchical decomposition: breaking large systems into smaller parts, and those parts into even smaller parts. A chip is blocks, and blocks are sub-blocks. A plant is sections, and sections are units. A building is floors, and floors are rooms. This decomposition forms a tree. At each level, parts have different kinds: blocks can be logic, memory, or IO; units can be reactors, distillation columns, or heat exchangers. The moment tree nodes have kinds, it becomes a recursive union type.
All domains from Chapter 4 follow this pattern:
- Semiconductors: IBlock's sub_blocks: IBlock[] (chip → block → sub-block)
- Chemical processes: plant → section → unit → sub-unit (recursive process hierarchy)
- Interior design: building → floor → room → zone (recursive spatial decomposition)
- Control systems: cascade control, where the outer loop's output is the inner loop's setpoint (recursive nesting)
Structurally identical to AutoBe's IJsonSchema (10 variants) and IExpression (30+ variants). All are ASTs: Abstract Syntax Trees. Hierarchical decomposition is how engineers manage complexity, and hierarchical decomposition produces recursive union types. Every deterministic engineering domain shares this structure.
If the same structure applies to all domains with deterministic validators, and those domains all share recursive union data structures, then conquering union types is the prerequisite for building this structure.
If coercion doesn't work on unions, Qwen 3.5's double-stringify will appear in chip design too. If validation feedback doesn't work on unions, "none of 30 variants matched" makes convergence impossible. Without identifying the intended variant, correction is impossible.
Typia's structural variant identification, schema-based coercion, and per-field precise validation are the solution for this universal structure. AutoBe's 6.75% → 100% is not just a code generation achievement. It establishes reliability on the universal structure of recursive unions, an achievement transferable to every domain that shares this structure.
A.4. Why Not Zod?
Zod is the most popular runtime validation library in TypeScript. "Why not Zod?" is a frequent question.
Let's see what happens when you define AutoBe-scale 30+ variant recursive discriminated unions with Zod:
const ExpressionSchema: z.ZodType<IExpression> = z.lazy(() =>
z.discriminatedUnion("type", [
z.object({ type: z.literal("booleanLiteral"), value: z.boolean() }),
z.object({
type: z.literal("callExpression"),
expression: ExpressionSchema, // circular reference
arguments: z.array(ExpressionSchema), // circular reference
}),
// ... 28 more
])
);

Three problems.
First, you must define TypeScript types and Zod schemas separately.
Zod's documentation states this explicitly: "you can define a recursive schema in Zod, but because of a limitation of TypeScript, their type can't be statically inferred." Using z.lazy() breaks z.infer, so you define TypeScript interfaces separately and connect them with z.ZodType<T>:
// 1. Define TypeScript types first
type IExpression =
| { type: "booleanLiteral"; value: boolean }
| { type: "callExpression"; expression: IExpression; arguments: IExpression[] }
| { type: "binaryExpression"; left: IExpression; operator: string; right: IExpression }
// ... 27 more
// 2. Define Zod schema separately with manual type hint connection
const ExpressionSchema: z.ZodType<IExpression> = z.lazy(() =>
z.discriminatedUnion("type", [
z.object({ type: z.literal("booleanLiteral"), value: z.boolean() }),
z.object({ type: z.literal("callExpression"), expression: ExpressionSchema, arguments: z.array(ExpressionSchema) }),
z.object({ type: z.literal("binaryExpression"), left: ExpressionSchema, operator: z.string(), right: ExpressionSchema }),
// ... 27 more
])
);

For 30+ variant recursive unions, this dual definition runs to hundreds of lines. Over time the two diverge, and nothing catches the inconsistency.
Second, even accepting dual definitions, it won't compile.
As recursive union depth increases, you hit TypeScript's generic instantiation limit:
TS2589: Type instantiation is excessively deep and possibly infinite.
In native TypeScript types, recursive references are name lookups: pointers to the same definition. 30 variants referencing IExpression? 30 pointer lookups. O(N), linear.
In Zod, z.discriminatedUnion is a deeply nested generic. TypeScript must structurally expand each variant's output type through Zod's conditional types. z.lazy() forces re-entry over the entire union: N variants × K recursive fields, each triggering another expansion. At N=30, K=2, depth 3, that's 216,000 type resolutions. O((N·K)^d), exponential.
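The arithmetic behind those numbers is easy to check directly:

```typescript
// The expansion-count estimate from the text: N variants, each with K
// recursive fields, structurally expanded to depth d gives (N*K)^d work,
// versus a flat O(N) name lookup for native recursive type references.
const structuralExpansions = (n: number, k: number, d: number): number => Math.pow(n * k, d);

const zodStyle = structuralExpansions(30, 2, 3); // (30*2)^3 = 216,000
const nativeStyle = 30;                          // 30 pointer lookups
```

At depth 4 the same parameters give 12,960,000 expansions, which is why the blow-up surfaces as both compile errors and IDE freezes.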
This is the most repeatedly reported error on Zod's issue tracker. #577, #5064, #5256: all recursive schemas, all TS2589, unresolved even in Zod v4. Discussion #1459 shows the same error on complex discriminated unions that aren't even recursive; generic expansion alone is costly enough.
The practical impact extends to IDEs. TypeScript's language server runs the same type checker for autocomplete and hover types. A 30+ variant recursive Zod schema triggers the same exponential expansion: memory soars to gigabytes, and the IDE freezes not just on the schema file but on every file that imports it.
Third, even accepting all of this, you cannot build the feedback loop.
This is the decisive problem.
When validation fails on a union type, Zod cannot determine which variant was intended. On 10-variant unions, errors flood for all variants (#792), or a discriminator mismatch silently hides other field errors (#2202). Zod v4 regressed further: discriminator mismatch returns an empty error array and "No matching discriminator" (#4909, #5670).
From the LLM's perspective: if it intended the callExpression variant but got the arguments field type wrong, it needs feedback like "arguments should be an IExpression array but you gave string." What Zod gives is "none of the 10 variants matched." Feedback that doesn't tell you what to fix is not feedback. Without precise feedback, the loop doesn't converge.
Typia analyzes the data's shape to structurally identify the intended variant, then generates per-field precise errors against that variant's schema. This is the prerequisite for the feedback loop to work, and Zod completely lacks this mechanism.
Zod: dual definitions, compilation failure, feedback loop impossible. This structure cannot be built on Zod.
Typia needs just one interface:
const result = typia.validate<AutoBeTest.IExpression>(llmOutput);

It operates at the compiler level. No separate schema, no generic depth limits, no incomplete errors.
A.5. Beyond Token Limits: Incremental Structured Output
Function calling has an unspoken constraint: the entire JSON must fit in a single response. If the model's maximum output is 32K tokens but the target JSON is 100K tokens, the output gets cut off mid-JSON. To JSON.parse(), truncated JSON is failed JSON. The entire generation is wasted.
Typia's schema-based lenient parsing changes this dynamic. Because parse() automatically closes unclosed brackets, completes incomplete values, and recursively applies type coercion, truncated JSON is not a failure. The result is a DeepPartial<T>: completed fields are valid, and the missing fields are identifiable from the schema.
Turn 1: LLM generates 32K tokens → truncated mid-JSON
→ Typia parse() → DeepPartial<T>
→ Schema diff: "these fields are still missing"
Turn 2: "Please fill in the remaining fields" + previous DeepPartial<T>
→ LLM generates next chunk → Typia parse()
→ DeepPartial<T> updated, validate() on completed subtrees
Turn N: → All fields present → validate() passes → T

Each turn, parse() recovers truncated output and coerces types, while validate() can run on completed subtrees first. Errors surface incrementally, not at the end.
This is incremental compilation applied to structured output. Traditional function calling discards truncated output and retries from scratch. Typiaβs approach reuses every valid field and asks the LLM to fill in only the missing parts.
Function callingβs output size is no longer limited by max_output_tokens. A 200K-token JSON can be progressively built over multiple turns, with type safety maintained at every step. The schema knows what you have and what you need, and the lenient parser ensures nothing is wasted.
Once structured output can be built incrementally, the upper bound on what function calling can produce disappears.
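The bracket-closing idea behind lenient parsing can be demonstrated with a toy recoverer. This is a far simpler cousin of Typia's schema-based parse(), handling only unterminated strings, dangling commas, and unclosed brackets:

```typescript
// Toy recovery of truncated JSON: terminate a cut-off string, drop a
// dangling comma, then close unclosed brackets and braces in reverse order.
// (Real schema-based parsing also coerces types and tracks missing fields.)
function recoverTruncatedJson(text: string): unknown {
  const closers: string[] = [];
  let inString = false;
  let escaped = false;
  for (const ch of text) {
    if (escaped) { escaped = false; continue; }
    if (ch === "\\") { escaped = true; continue; }
    if (ch === '"') { inString = !inString; continue; }
    if (inString) continue;
    if (ch === "{") closers.push("}");
    else if (ch === "[") closers.push("]");
    else if (ch === "}" || ch === "]") closers.pop();
  }
  let repaired = text;
  if (inString) repaired += '"';            // terminate a cut-off string value
  repaired = repaired.replace(/,\s*$/, ""); // drop a dangling trailing comma
  while (closers.length > 0) repaired += closers.pop();
  return JSON.parse(repaired);
}

// Output cut off mid-string: the value, object, and array are all unclosed.
const recovered = recoverTruncatedJson('{"items": [{"name": "apple"}, {"name": "bana') as {
  items: { name: string }[];
};
```

Even this toy version preserves every completed field instead of discarding the whole generation; the schema-aware version additionally knows which fields remain to be requested in the next turn.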
A.6. Current Function Calling Success Rates
The 6.75% figure cited throughout this talk was an early estimate. Since then, OpenRouter introduced Exacto mode, a server-side enforcement of structured output, and success rates have improved noticeably. Here are the current measurements across six models and two of AutoBe's most complex function calling targets.
Important methodological note: the "1st success rate" column was not directly measured. What we observe is the total number of trials and successes across the entire self-healing loop. From these aggregate numbers, the first-try success probability is estimated using the following formula:
p_1 = \frac{1}{1 + \sqrt{\mu} \cdot (\mu - 1)}, \quad \mu = \frac{N_{trial}}{N_{success}}
where N_trial is the total number of function call attempts and N_success is the number that eventually produced valid output. μ represents the average number of attempts per success: when μ = 1, every attempt succeeds on the first try (p₁ = 100%); as μ grows, p₁ drops rapidly.
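The estimator is straightforward to implement and check against the tables below:

```typescript
// The first-try estimate from aggregate counts, exactly as in the formula:
// p1 = 1 / (1 + sqrt(mu) * (mu - 1)), where mu = trials / successes.
function estimateFirstTryRate(trials: number, successes: number): number {
  const mu = trials / successes;
  return 1 / (1 + Math.sqrt(mu) * (mu - 1));
}

// Reproduce two table rows: qwen3-coder-next (619/166) and gpt-5.4 (212/21).
const qwenEstimate = estimateFirstTryRate(619, 166); // ~15.95%
const gptEstimate = estimateFirstTryRate(212, 21);   // ~3.34%
```

Note the boundary behavior: with trials equal to successes (μ = 1), the estimate is exactly 100%, matching the qwen3.5-122b-a10b and claude-sonnet-4.6 rows in the endpoint review table.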
IAutoBeInterfaceSchemaRefineApplication.IProps
This is the function calling target for DTO schema generation: the 10-variant recursive IJsonSchema union from Section 2.5. The type that originally yielded the 6.75% estimate.
| Model | Trials | Successes | Overall Rate | Est. 1st Success Rate |
|---|---|---|---|---|
| qwen/qwen3-coder-next | 619 | 166 | 26.82% | 15.95% |
| qwen/qwen3.5-122b-a10b | 370 | 115 | 31.08% | 20.09% |
| moonshotai/kimi-k2.5 | 382 | 177 | 46.34% | 37.02% |
| z-ai/glm-5 | 169 | 96 | 56.80% | 49.78% |
| openai/gpt-5.4 | 338 | 144 | 42.60% | 32.64% |
| anthropic/claude-sonnet-4.6 | 360 | 151 | 41.94% | 31.88% |
qwen3-coder-next's estimated first-try rate rose from 6.75% to 15.95%, more than doubling. Exacto mode's structured output enforcement catches many of the malformed JSON issues (unclosed brackets, trailing commas) at the API level before they ever reach AutoBe's pipeline.
IAutoBeInterfaceEndpointReviewApplication.IProps
This is the function calling target for endpoint review: validating whether API endpoint designs are consistent with the database schema.
| Model | Trials | Successes | Overall Rate | Est. 1st Success Rate |
|---|---|---|---|---|
| qwen/qwen3-coder-next | 188 | 46 | 24.47% | 13.81% |
| qwen/qwen3.5-122b-a10b | 56 | 56 | 100.00% | 100.00% |
| moonshotai/kimi-k2.5 | 116 | 38 | 32.76% | 21.80% |
| z-ai/glm-5 | 67 | 53 | 79.10% | 77.10% |
| openai/gpt-5.4 | 212 | 21 | 9.91% | 3.34% |
| anthropic/claude-sonnet-4.6 | 24 | 24 | 100.00% | 100.00% |
Notable results: qwen3.5-122b-a10b and claude-sonnet-4.6 achieved 100% first-try success on endpoint review; every single attempt was valid. Meanwhile, gpt-5.4 scored the lowest at 3.34%, demonstrating that model size and brand do not predict function calling performance on complex schemas.
The core thesis holds: even with improved first-try rates, no model achieves 100% across all function calling targets. The harness remains essential. What changed is the starting point of the loop, and a higher starting point means fewer retries, lower cost, and faster convergence.
All experiments were conducted via OpenRouter with Exacto mode enabled. Raw results are available in the autobe-examples repository.