
Function Calling All-In

Qwen Meetup Korea — 2026.03.26


TL;DR

  1. AutoBe
    • A backend AI agent built entirely on function calling
    • The LLM never writes code — it fills typed structures, and the compiler converts them to code
    • 100% compilation success across all 4 Qwen models
  2. Typia
    • Infrastructure that automates the entire function calling lifecycle
    • Schema generation → lenient parsing → type coercion → validation feedback
    • qwen3-coder-next: 6.75% → 100%, qwen3.5 series: 0% → 100%
  3. The Case for Function Calling
    • A methodology for domains that demand precision
    • Constraints through structural absence, model-neutral, mechanically verifiable
  4. Why Qwen
    • Local models are essential for R&D
    • Small models make the best QA engineers
    • Open ecosystem
  5. The LLM doesn't need to be accurate — it just needs to be correctable

1. AutoBe

6.75%.

That’s the probability that qwen3-coder-next produces a valid result on its first attempt when asked to generate API data types (input/output structures for products, orders, payments, etc.) for a shopping mall backend. Out of 100 tries, 93 fail.

And yet AutoBe's final compilation success rate is 100%. Across all four Qwen models.

1.1. What AutoBe Does

AutoBe is an open-source AI agent that generates production-ready backends from natural language conversation. It's developed by Wrtn Technologies.

"Build me a shopping mall backend. I need product listings, a shopping cart, orders, and payments." — Say this, and AutoBe generates all of the following:

  • Requirements analysis (SRS)
  • Database schema (Prisma ERD)
  • API specification (OpenAPI 3.1)
  • E2E test code
  • Complete implementation code
  • Type-safe SDK

Every line of generated code compiles. What comes out is a real, working backend built on TypeScript + NestJS + Prisma.

Todo

qwen/qwen3.5-122b-a10b

| Phase | Output | Token Usage (in / out) | Function Calls |
|---|---|---|---|
| Analyze | actors: 2, documents: 6 | 377.8K (308.6K / 69.2K) | 47 / 47 (100.0%) |
| Database | namespaces: 2, models: 7 | 1.25M (1.03M / 219.5K) | 37 / 38 (97.4%) |
| Interface | operations: 14, schemas: 28 | 11.65M (11.40M / 242.6K) | 163 / 203 (80.3%) |
| Test | functions: 44 | 2.90M (2.76M / 142.2K) | 102 / 107 (95.3%) |
| Realize | functions: 24 | 1.90M (1.78M / 120.8K) | 71 / 82 (86.6%) |

✓ Function Calling Success Rate: 88.05%
⏱ Elapsed Time: 1h 18m 29s
🧠 Total Tokens: 18.08M (in: 17.29M, 0 cached / out: 794.4K)

Reddit

qwen/qwen3.5-122b-a10b

| Phase | Output | Token Usage (in / out) | Function Calls |
|---|---|---|---|
| Analyze | actors: 2, documents: 6 | 1.33M (1.07M / 253.7K) | 99 / 105 (94.3%) |
| Database | namespaces: 6, models: 21 | 2.71M (2.53M / 171.5K) | 60 / 63 (95.2%) |
| Interface | operations: 62, schemas: 80 | 67.76M (66.30M / 1.46M) | 628 / 898 (69.9%) |
| Test | functions: 183 | 25.28M (24.14M / 1.14M) | 608 / 624 (97.4%) |
| Realize | functions: 98 | 11.70M (11.03M / 661.2K) | 286 / 320 (89.4%) |

✓ Function Calling Success Rate: 83.63%
⏱ Elapsed Time: 3h 40m 14s
🧠 Total Tokens: 108.77M (in: 105.09M, 0 cached / out: 3.68M)

Shopping

qwen/qwen3.5-122b-a10b

| Phase | Output | Token Usage (in / out) | Function Calls |
|---|---|---|---|
| Analyze | actors: 3, documents: 6 | 3.83M (3.29M / 541.3K) | 170 / 197 (86.3%) |
| Database | namespaces: 10, models: 30 | 5.01M (4.87M / 148.1K) | 85 / 87 (97.7%) |
| Interface | operations: 148, schemas: 155 | 160.24M (157.56M / 2.68M) | 1322 / 1764 (74.9%) |
| Test | functions: 429 | 84.24M (81.16M / 3.08M) | 1403 / 1445 (97.1%) |
| Realize | functions: 207 | 32.63M (31.51M / 1.12M) | 599 / 665 (90.1%) |

✓ Function Calling Success Rate: 86.08%
⏱ Elapsed Time: 4h 55m 40s
🧠 Total Tokens: 285.94M (in: 278.38M, 0 cached / out: 7.56M)

Erp

qwen/qwen3.5-122b-a10b

| Phase | Output | Token Usage (in / out) | Function Calls |
|---|---|---|---|
| Analyze | actors: 2, documents: 6 | 1.86M (1.56M / 295.3K) | 161 / 162 (99.4%) |
| Database | namespaces: 7, models: 27 | 2.55M (2.43M / 120.9K) | 73 / 75 (97.3%) |
| Interface | operations: 101, schemas: 135 | 89.90M (87.74M / 2.17M) | 920 / 1313 (70.1%) |
| Test | functions: 34 | 6.61M (6.37M / 243.1K) | 208 / 215 (96.7%) |
| Realize | functions: 157, errors: 4 | 23.84M (22.46M / 1.39M) | 581 / 638 (91.1%) |

✓ Function Calling Success Rate: 80.86%
⏱ Elapsed Time: 3h 17m 10s
🧠 Total Tokens: 124.76M (in: 120.55M, 0 cached / out: 4.21M)

1.2. The LLM Never Writes Code Directly

Most AI coding agents tell the LLM "write this code," then save whatever text it outputs straight to a source file. AutoBe doesn't work that way.

Instead, AutoBe uses function calling. Rather than letting the LLM generate freeform text, it hands the LLM a predefined structure (a JSON Schema) and says "fill in the blanks." Think of it as giving someone a form and asking them to complete it.

Once the LLM fills in the form and returns structured data, AutoBe's compiler reads that data and converts it into actual code. The LLM fills structures; the compiler writes code.

The entire pipeline works this way:

| Phase | What the LLM fills | Compiler validation |
|---|---|---|
| Requirements | AutoBeAnalyze — structured SRS | Structure check |
| Database | AutoBeDatabase — Prisma schema structure | Prisma compiler |
| API design | AutoBeOpenApi — OpenAPI spec structure | OpenAPI compiler |
| Tests | AutoBeTest — 30+ expression types | TypeScript compiler |
| Implementation | Modular code (Collector/Transformer/Operation) | TypeScript compiler |

At every phase, the LLM fills a structure, and a compiler validates it. This is AutoBe's all-in function calling strategy.

1.3. What the LLM Has to Fill Is Far from Simple

The "forms" the LLM has to fill are anything but trivial. Two examples will give you a sense of the precision required.

First, the DTO schema type that the LLM must generate during API design. A DTO (Data Transfer Object) describes the data structures in API requests and responses — things like "a product's price is a positive integer, its name is a string, and its category list is an array of strings."

The type that defines these DTO schemas is IJsonSchema. It's a union of 10 distinct kinds (constant, boolean, integer, number, string, array, object…) with recursive nesting — arrays contain more IJsonSchema, objects map to more IJsonSchema:

```typescript
export type IJsonSchema =
  | IJsonSchema.IConstant
  | IJsonSchema.IBoolean
  | IJsonSchema.IInteger
  | IJsonSchema.INumber
  | IJsonSchema.IString
  | IJsonSchema.IArray     // items: IJsonSchema ← recursive
  | IJsonSchema.IObject    // properties: Record<string, IJsonSchema> ← recursive
  | IJsonSchema.IReference
  | IJsonSchema.IOneOf     // oneOf: IJsonSchema[] ← recursive
  | IJsonSchema.INull;
```

10 variants, infinitely recursive nesting. The 6.75% figure from earlier? That's the raw function calling success rate for this exact type.

The test phase takes complexity up another level. To generate E2E test code, the LLM has to express logic like "call this API, check that the response status is 200, verify that the body's items array has length greater than 0." The type that captures this is IExpression:

```typescript
export type IExpression =
  | IBooleanLiteral | INumericLiteral | IStringLiteral   // literals
  | IArrayLiteralExpression | IObjectLiteralExpression   // compound literals
  | INullLiteral | IUndefinedKeyword                     // null/undefined
  | IIdentifier | IPropertyAccessExpression              // accessors
  | IElementAccessExpression | ITypeOfExpression         // access/ops
  | IPrefixUnaryExpression | IPostfixUnaryExpression     // unary ops
  | IBinaryExpression                                    // binary ops
  | IArrowFunction | ICallExpression | INewExpression    // functions
  | IArrayFilterExpression | IArrayForEachExpression     // array ops
  | IArrayMapExpression | IArrayRepeatExpression         // array ops
  | IPickRandom | ISampleRandom | IBooleanRandom         // random generation
  | IIntegerRandom | INumberRandom | IStringRandom       // random generation
  | IPatternRandom | IFormatRandom | IKeywordRandom      // random generation
  | IEqualPredicate | INotEqualPredicate                 // assertions
  | IConditionalPredicate | IErrorPredicate;             // assertions
```

Over 30 variants, recursively nested. This is essentially programming-language-level complexity, and the LLM must generate it through a single function call.
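To make the recursion concrete, here is a deliberately tiny, hypothetical analogue of the same idea — two variants instead of 30+, but the same shape: a discriminated union where expressions nest inside expressions, walked by a deterministic evaluator. This is a sketch for intuition, not AutoBe's actual IExpression.

```typescript
// Miniature expression union: expressions nest recursively, and a
// deterministic evaluator walks the tree the LLM filled in.
type Expr =
  | { kind: "literal"; value: number | string | boolean }
  | { kind: "binary"; op: "===" | ">" | "&&"; left: Expr; right: Expr };

function evaluate(e: Expr): number | string | boolean {
  if (e.kind === "literal") return e.value;
  const l = evaluate(e.left);
  const r = evaluate(e.right);
  switch (e.op) {
    case "===": return l === r;
    case ">":   return (l as number) > (r as number);
    case "&&":  return Boolean(l) && Boolean(r);
  }
}

// "response status equals 200" as a filled structure, not source text
// (the left-hand literal stands in for an actual response value):
const check: Expr = {
  kind: "binary",
  op: "===",
  left: { kind: "literal", value: 200 },
  right: { kind: "literal", value: 200 },
};
console.log(evaluate(check)); // true
```

The point of the exercise: the LLM never emits `===` as text; it fills a `binary` node whose shape the schema fixes in advance.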

1.4. How 6.75% Becomes 100%

Given structures this complex, a 6.75% first-attempt success rate is no surprise. The real question is how to turn 6.75% into 100%.

The answer is a validation feedback loop — a cycle of verification, feedback, and correction.

When a function call fails, the system doesn't just say "wrong." Typia (a library we'll cover in detail shortly) takes the LLM's raw JSON output and inserts // ❌ inline annotations at every exact point where an error occurred. Here's an example from Typia's documentation:

```json
{
  "order": {
    "payment": {
      "type": "card",
      "cardNumber": 12345678 // ❌ [{"path":"$input.order.payment.cardNumber","expected":"string"}]
    },
    "product": {
      "name": "Laptop",
      "price": -100, // ❌ [{"path":"$input.order.product.price","expected":"number & Minimum<0>"}]
      "quantity": 2.5 // ❌ [{"path":"$input.order.product.quantity","expected":"number & Type<\"uint32\">"}]
    },
    "customer": {
      "name": "John Doe",
      "email": "invalid-email", // ❌ [{"path":"$input.order.customer.email","expected":"string & Format<\"email\">"}]
      "vip": "yes" // ❌ [{"path":"$input.order.customer.vip","expected":"boolean"}]
    }
  }
}
```

cardNumber should be a string, not a number. price must be ≥ 0. quantity must be an unsigned integer (uint32), not 2.5. email isn't a valid email. vip should be a boolean. Five errors, each with the exact path and expected type.

With this feedback, the LLM doesn't need to regenerate everything from scratch. It can precisely correct only the flagged fields and retry.

Compiler validation → precise diagnostics → LLM correction → revalidation. This loop repeats until it succeeds. Whether it takes one attempt or ten, the end result is 100%.
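As a back-of-the-envelope illustration (my numbers, not AutoBe telemetry): even under the pessimistic assumption that every retry is an independent fresh draw at p = 6.75%, repetition alone converges toward certainty. The feedback loop converges far faster in practice, because each retry corrects only the flagged fields rather than starting over.

```typescript
// Illustrative arithmetic only: probability of at least one success within
// n attempts, if each attempt were an independent draw at p = 6.75%.
// (Feedback-driven retries do much better than this worst case.)
const p = 0.0675;
const successWithin = (n: number): number => 1 - (1 - p) ** n;

console.log(successWithin(10).toFixed(3));  // "0.503"
console.log(successWithin(100).toFixed(3)); // "0.999"
```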

1.5. Qwen 3.5: From 0% to 100%

The qwen3.5 series presents an even more dramatic case.

Here's a function calling application from Typia's documentation:

```typescript
interface IOrder {
  payment: IPayment;
  product: {
    name: string;
    price: number & tags.Minimum<0>;
    quantity: number & tags.Type<"uint32">;
  };
  customer: {
    name: string;
    email: string & tags.Format<"email">;
    vip: boolean;
  };
}
type IPayment =
  | { type: "card"; cardNumber: string }
  | { type: "bank"; accountNumber: string };
```

And here's what the LLM actually returns:

```typescript
const llmOutput = `
> I'd be happy to help you with your order! 😊

\`\`\`json
{
  "order": {
    "payment": "{\\"type\\":\\"card\\",\\"cardNumber\\":\\"1234-5678",
    "product": {
      name: "Laptop",
      price: 1300,
      quantity: 2,
    },
    "customer": {
      "name": "John Doe",
      "email": "john@example.com",
      vip: tru
\`\`\``;
```

Markdown wrapping, explanation prefix, unquoted keys, trailing commas, tru instead of true, unclosed brackets — and payment is double-stringified because IPayment is an anyOf (JSON Schema's way of saying "one of several types"). Double-stringify means the LLM wrote the object as a JSON string inside a string — instead of {"type": "card", ...} (an object), it produced "{\"type\": \"card\", ...}" (a string containing JSON). Seven problems in a single output.

The double-stringify is the one that drives the success rate to 0%. The other errors are occasional; anyOf double-stringify is 100% consistent — every anyOf field, every time. This isn't Qwen-specific; Anthropic's Claude models do the same thing with oneOf. Every model family has its union-type blind spot.

Typia's parse() handles all of this in a single call — broken JSON recovery, type coercion, double-stringify unwrapping. No changes to the model. This is how Qwen 3.5 went from 0% to 100%.

1.6. Four Models, All at 100%

AutoBe currently tests against four Qwen models. All of them pass compilation.

| Model | Parameters (active / total) | Characteristics |
|---|---|---|
| qwen/qwen3-coder-next | 3B / 80B | Coding-focused, tool choice support |
| qwen/qwen3.5-397b-a17b | 17B / 397B | Largest MoE |
| qwen/qwen3.5-122b-a10b | 10B / 122B | Mid-size MoE |
| qwen/qwen3.5-35b-a3b | 3B / 35B | Compact MoE |

From 397B down to 35B. Even a compact model with just 3B active parameters can generate a complete shopping mall backend. Same pipeline, same schemas, same results.

1.7. It Runs Without System Prompts

One anecdote.

AI agents typically have system prompts — documents that instruct the LLM in natural language: "You are a backend development expert. Follow these rules when writing code…" In most AI agents, the system prompt is the crown jewel.

Once, we shipped a build where the system prompt was completely missing. The agent ran on nothing but function calling schemas and validation logic. No natural language instructions whatsoever.

Nobody noticed. Output quality was identical.

This wasn’t a one-time fluke. It happened multiple times, and the result was the same every time.

The types were the best prompt, and validation feedback was the best orchestration.


2. Typia — The Infrastructure Behind All of This

The things that kept appearing naturally throughout Section 1 — schema conversion, broken JSON recovery, type coercion, precise error feedback — who does all of that?

To use function calling in production, there's no shortage of problems to solve. How do you generate the JSON Schema to send to the LLM? What do you do when the LLM returns broken JSON? How do you correct wrong types? How do you communicate errors in a format the LLM can understand?

Typia handles all of this in a single library.

2.1. From TypeScript Types to Function Calling Schemas

Function calling requires a JSON Schema that tells the LLM "give me data in this structure." Normally, developers write these schemas by hand — define the type, write a matching schema separately, then make sure the two don't drift apart over time.

Typia automates this. Define a TypeScript type, and Typia automatically generates its JSON Schema at compile time. Not through runtime reflection, but by directly leveraging the TypeScript compiler's type analyzer:

```typescript
import typia, { tags } from "typia";

interface IMember {
  /**
   * The member's age.
   *
   * Only adults aged 19 or older can register.
   * This is the platform's legal age restriction.
   */
  age: number & tags.Type<"uint32"> & tags.ExclusiveMinimum<18>;
  email: string & tags.Format<"email">;
  name: string & tags.MinLength<1> & tags.MaxLength<100>;
}

const schema = typia.llm.parameters<IMember>();
// {
//   type: "object",
//   properties: {
//     age: {
//       type: "integer",
//       description: "The member's age.\n\nOnly adults aged 19 or older can register.\nThis is the platform's legal age restriction.",
//       exclusiveMinimum: 18
//     },
//     email: { type: "string", format: "email" },
//     name: { type: "string", minLength: 1, maxLength: 100 }
//   },
//   required: ["age", "email", "name"]
// }
```

Two things to note here.

First, JSDoc comments become description fields. The LLM reads these descriptions to decide what values to generate. "Only adults aged 19 or older can register" is automatically included in the schema, giving the LLM the context it needs.

Second, type constraints become validation rules. ExclusiveMinimum<18> becomes a "> 18" validation rule; Format<"email"> becomes an email format check. A single type definition produces both LLM guidance and validation rules simultaneously.

When schemas are written by hand, they inevitably drift from the types over time. Typia eliminates this problem entirely. The type is the schema.

At the class level, typia.llm.application<T>() converts all public methods into function calling schemas, with parse(), coerce(), and validate() methods automatically built into each function.

2.2. Lenient JSON Parsing: Cleaning Up the LLM's Broken JSON

LLMs don't produce perfect JSON. Why? Because an LLM is a language model that generates text token by token — not a JSON generator. It forgets to close brackets, misplaces commas, prepends "Here is your answer:" before the JSON, and wraps everything in Markdown code blocks.

JSON.parse() rejects all of these. Typia's ILlmFunction.parse() handles every case:

| Problem | Example | Resolution |
|---|---|---|
| Unclosed bracket | `{"name": "John"` | Auto-close |
| Trailing comma | `[1, 2, 3, ]` | Ignore |
| JavaScript comments | `{"a": 1 /* comment */}` | Strip |
| Unquoted keys | `{name: "John"}` | Allow |
| Incomplete keywords | `{"done": tru` | Complete to `true` |
| Explanation prefix | `Here is your JSON: {"a": 1}` | Skip |
| Markdown code block | `` ```json {"a": 1} ``` `` | Extract inner |
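For intuition about why such recoveries are mechanical, here is a toy sketch — emphatically not Typia's implementation — of just two entries from the table, fence extraction and trailing-comma removal, layered in front of JSON.parse():

```typescript
// Toy recovery sketch: extract a ```json fence if present, strip trailing
// commas before } or ], then delegate to JSON.parse().
// (Naive: the comma regex would also fire inside string values — a real
// lenient parser tokenizes instead of pattern-matching.)
function lenientParseSketch(raw: string): unknown {
  const fenced = raw.match(/```(?:json)?\s*([\s\S]*?)```/);
  const body = (fenced ? fenced[1] : raw).replace(/,\s*([}\]])/g, "$1");
  return JSON.parse(body);
}

const raw = 'Here is your JSON:\n```json\n{"items": [1, 2, 3, ]}\n```';
// JSON.parse(raw) throws; the sketch recovers { items: [1, 2, 3] }
console.log(lenientParseSketch(raw));
```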

In real LLM outputs, these problems occur simultaneously:

```typescript
const llmOutput = `
> I'd be happy to help you with your order!

\`\`\`json
{
  "order": {
    "payment": "{\\"type\\":\\"card\\",\\"cardNumber\\":\\"1234-5678",
    "product": {
      name: "Laptop",
      price: "1299.99",
      quantity: 2,
    },
    "customer": {
      "name": "John Doe",
      "email": "john@example.com",
      vip: tru
\`\`\``;

const result = func.parse(llmOutput);
// Markdown code block, explanation prefix, unquoted keys, trailing commas,
// double-stringify, string→number, incomplete keyword, unclosed brackets
// — 8 problems at once, all handled by a single parse() call.
```

Most JSON repair tools (jsonrepair, dirty-json, LangChain's parse_partial_json, etc.) work at the string level — fix trailing commas, close brackets, strip Markdown, then pass the result to JSON.parse(). The output is a syntactically valid JSON string. But a double-stringified value like "{\\"type\\":\\"card\\"}" passes through unscathed — it's already valid JSON (it's a string). Without knowing the schema, there's no way to know it should be an object.

Typia's parse() works differently. It doesn't repair a string and hand it off — it parses greedily while consulting the schema. When it encounters a string value where the schema expects an object, it re-enters parse() on that string, applying the same lenient recovery. The result feeds into type coercion, which may find another string-where-object-expected, triggering yet another round. Parsing and coercion call each other recursively, unwinding layers of stringify naturally — double, triple, however deep.

This is why Section 1.5's seven problems are solved in "a single parse() call." It's not seven separate fixes applied sequentially. It's a schema-driven recursive cycle where parsing and coercion are inseparable. And it's why double-stringify — the problem that made Qwen 3.5's success rate 0% — can't be solved by string-level repair. You need the schema to know what's supposed to be an object.

2.3. Schema-Based Type Coercion: Correction That Knows the Schema

LLMs frequently get types wrong, not just structure. They write "42" (a string) where 42 (a number) is expected, and "true" (a string) where true (a boolean) is expected. A human would see these as equivalent, but to a program they're completely different types.

Naive type casting can't solve this. Whether "42" should be a number or remain a string depends entirely on whether the schema for that field says number or string.

Typia's ILlmFunction.coerce() consults the JSON Schema and converts values to the type the schema expects:

| LLM output | Expected type | Result |
|---|---|---|
| `"42"` | number or integer | `42` |
| `"true"` / `"false"` | boolean | `true` / `false` |
| `"null"` | null | `null` |
| `"{\"x\": 1}"` | object | `{ x: 1 }` (recursive parsing) |
| `"[1, 2, 3]"` | array | `[1, 2, 3]` (recursive parsing) |
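The mechanics can be sketched in a few lines. This is a hypothetical toy, not Typia's coerce(): the key property is that conversion decisions come from the schema, and a string found where the schema expects an object is re-parsed and coerced recursively — the double-stringify case.

```typescript
// Toy schema-driven coercion: the schema, not the value, decides the
// target type; string-where-object-expected recurses.
type MiniSchema =
  | { type: "number" }
  | { type: "boolean" }
  | { type: "object"; properties: Record<string, MiniSchema> };

function coerceSketch(value: unknown, schema: MiniSchema): unknown {
  switch (schema.type) {
    case "number":
      return typeof value === "string" ? Number(value) : value;
    case "boolean":
      return typeof value === "string" ? value === "true" : value;
    case "object": {
      // The double-stringify unwrap: re-enter parsing on the string
      const obj = (typeof value === "string"
        ? JSON.parse(value)
        : value) as Record<string, unknown>;
      const out: Record<string, unknown> = {};
      for (const [key, sub] of Object.entries(schema.properties))
        out[key] = coerceSketch(obj[key], sub);
      return out;
    }
  }
}

const miniSchema: MiniSchema = {
  type: "object",
  properties: { price: { type: "number" }, vip: { type: "boolean" } },
};
console.log(coerceSketch('{"price":"1299.99","vip":"true"}', miniSchema));
// { price: 1299.99, vip: true }
```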

Here's what this looks like in practice:

```typescript
const fromLlm = {
  order: {
    payment: '{"type":"card","cardNumber":"1234-5678"}', // double-stringify
    product: {
      name: "Laptop",
      price: "1299.99", // string, but schema says number
      quantity: "2",    // string, but schema says integer
    },
    customer: {
      name: "John Doe",
      vip: "true", // string, but schema says boolean
    },
  },
};

const result = func.coerce(fromLlm);
// result.order.product.price === 1299.99 (number)
// result.order.product.quantity === 2 (integer)
// result.order.customer.vip === true (boolean)
// result.order.payment === { type: "card", cardNumber: "1234-5678" } (object)
```

For union types (structures where one of several types is selected), Typia structurally analyzes the data to identify the correct variant, then applies that variant's coercion rules — no discriminator required.

This is the mechanism behind the Qwen 3.5 series' 0% → 100% from Section 1. The model's tendency to double-stringify objects in union types was solved at the infrastructure level using schema information.

When the SDK has already parsed the JSON (Anthropic SDK, Vercel AI, LangChain, MCP, etc.), use coerce() instead of parse().

2.4. Validation and Precise Feedback

Even after parsing and type coercion, the values themselves can be wrong. A negative number for a price, a non-email string in an email field, a decimal where an integer is required.

Typia's ILlmFunction.validate() detects these schema violations and pinpoints not just that something is wrong, but exactly where and why:

```typescript
const result = func.validate(input);
// Error example:
// {
//   path: "$input.order.product.price",
//   expected: "number & Minimum<0>",
//   value: -100
// }
```

"The price inside product inside order must be a number ≥ 0, but you gave -100." That's the level of precision.
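The shape of those diagnostics is easy to reproduce in miniature. Here is a toy validator (mine, not Typia's) that emits the same { path, expected, value } triples by carrying the dotted path down the recursion:

```typescript
// Toy validator: diagnostics accumulate the path as recursion descends.
interface IDiagnostic { path: string; expected: string; value: unknown }

type Rule =
  | { type: "string" }
  | { type: "number"; minimum?: number }
  | { type: "object"; properties: Record<string, Rule> };

function validateSketch(value: unknown, rule: Rule, path = "$input"): IDiagnostic[] {
  switch (rule.type) {
    case "string":
      return typeof value === "string" ? [] : [{ path, expected: "string", value }];
    case "number":
      if (typeof value !== "number") return [{ path, expected: "number", value }];
      if (rule.minimum !== undefined && value < rule.minimum)
        return [{ path, expected: `number & Minimum<${rule.minimum}>`, value }];
      return [];
    case "object": {
      if (typeof value !== "object" || value === null)
        return [{ path, expected: "object", value }];
      const obj = value as Record<string, unknown>;
      return Object.entries(rule.properties).flatMap(([key, sub]) =>
        validateSketch(obj[key], sub, `${path}.${key}`),
      );
    }
  }
}

const diagnostics = validateSketch(
  { order: { product: { price: -100 } } },
  {
    type: "object",
    properties: {
      order: {
        type: "object",
        properties: {
          product: {
            type: "object",
            properties: { price: { type: "number", minimum: 0 } },
          },
        },
      },
    },
  },
);
console.log(diagnostics);
// [ { path: "$input.order.product.price", expected: "number & Minimum<0>", value: -100 } ]
```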

LlmJson.stringify() then inserts these errors as // ❌ inline annotations directly onto the LLM's original JSON output:

```json
{
  "order": {
    "payment": {
      "type": "card",
      "cardNumber": 12345678 // ❌ [{"path":"$input.order.payment.cardNumber","expected":"string"}]
    },
    "product": {
      "name": "Laptop",
      "price": -100, // ❌ [{"path":"$input.order.product.price","expected":"number & Minimum<0>"}]
      "quantity": 2.5 // ❌ [{"path":"$input.order.product.quantity","expected":"number & Type<\"uint32\">"}]
    },
    "customer": {
      "email": "invalid-email", // ❌ [{"path":"$input.order.customer.email","expected":"string & Format<\"email\">"}]
      "vip": "yes" // ❌ [{"path":"$input.order.customer.vip","expected":"boolean"}]
    }
  }
}
```

The LLM can see exactly where and why it went wrong, right on top of its own JSON. With this feedback, there's no need to rewrite everything — just correct the five flagged fields and retry.

2.5. The Full Loop: Parse → Coerce → Validate → Feedback → Retry

Combining everything introduced so far into a single loop, we get the complete picture of the validation feedback loop from Section 1:

```typescript
async function callWithFeedback(
  llm: LLM,
  func: ILlmFunction,
  prompt: string,
  maxRetries: number = 10,
): Promise<unknown> {
  let feedback: string | null = null;
  for (let i = 0; i < maxRetries; i++) {
    // 1. Request function call from LLM (with previous feedback)
    const rawOutput = await llm.call(prompt, feedback);

    // 2. Lenient JSON parsing + type coercion
    const parsed = func.parse(rawOutput);
    if (!parsed.success) {
      feedback = `JSON parsing failed: ${JSON.stringify(parsed.errors)}`;
      continue;
    }

    // 3. Schema validation
    const validated = func.validate(parsed.data);
    if (!validated.success) {
      // 4. Generate structured feedback (// ❌ inline annotations)
      feedback = LlmJson.stringify(validated);
      continue;
    }

    // 5. Success
    return validated.data;
  }
  throw new Error("Max retries exceeded");
}
```

parse() rescues broken JSON and performs first-pass type correction. validate() catches schema violations. LlmJson.stringify() renders errors in a format the LLM can read. The LLM reads this feedback and corrects itself. This is the complete engine that turns 6.75% into 100%.

2.6. One Type Does It All

To sum up: define a single TypeScript type, and Typia handles the rest:

  1. Generates the schema — typia.llm.parameters<T>(), typia.llm.application<T>()
  2. Parses — ILlmFunction.parse() (broken JSON recovery + type coercion)
  3. Coerces — ILlmFunction.coerce() (type coercion for SDK-parsed objects)
  4. Validates — ILlmFunction.validate() (schema violation detection)
  5. Generates feedback — LlmJson.stringify() (LLM-readable // ❌ inline diagnostics)

No other tool provides this complete pipeline. Individual pieces exist elsewhere — JSON repair libraries handle broken syntax, Pydantic offers validation, some frameworks have retry loops. But the schema-driven recursive cycle of parse ↔ coerce, combined with structural variant identification and inline error feedback, exists only in Typia.

The type is the schema, the validator, and the prompt.


3. The Case for Function Calling

So far, we've seen how function calling works through AutoBe and Typia. Now let's talk about why function calling is an effective methodology for domains that demand precision and correctness.

3.1. Natural Language vs. Types

Natural language is, well, natural. It evolved organically over millennia of human society, and ambiguity is a feature, not a bug. Metaphor, nuance, politeness, humor — all of it runs on ambiguity. "Just make it look nice" works as an instruction between humans.

Programming languages are designed. Someone intentionally built them to eliminate room for interpretation. "Just make it look nice" doesn't compile. Ambiguity is a bug.

When people communicate in natural language, they misunderstand each other and argue. When they communicate in types and schemas, there's no misunderstanding.

Let's contrast an LLM prompt with a type schema.

Expressing constraints via prompt:

"The age field must be a positive integer greater than 18. Don't use string types for numeric fields. All required fields must be present…"

Several problems are visible. Does "greater than 18" mean >18 or ≥18? There's no way to verify whether the LLM followed these rules without inspecting the output. And as the schema grows more complex, rules like these multiply endlessly.

Expressing constraints via types:

```typescript
interface IMember {
  /** Only adults aged 19 or older can register */
  age: number & Type<"uint32"> & ExclusiveMinimum<18>;
}
```

ExclusiveMinimum<18> means >18. It's an integer. It's required. Unambiguous and mechanically verifiable.
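That verifiability is literal: the constraint collapses to a boolean function. A sketch (a hypothetical helper, not a Typia API):

```typescript
// The age constraint as one mechanical check: uint32 (a non-negative
// integer) plus ExclusiveMinimum<18> (strictly greater than 18).
const isValidAge = (age: number): boolean =>
  Number.isInteger(age) && age >= 0 && age > 18;

console.log(isValidAge(19)); // true  — "19 or older" with no ambiguity
console.log(isValidAge(18)); // false — ExclusiveMinimum means > 18, not >= 18
```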

In domains that demand precise results, defining a schema and annotating each field is far clearer, easier, and more verifiable than writing a natural language prompt.

3.2. The Pink Elephant Problem

If you've ever built a prompt-based AI agent, you've written prohibition rules:

  • "Do not create utility functions"
  • "Do not use the any type"
  • "Do not create circular dependencies"

When someone says "don't think of a pink elephant," a pink elephant is the first thing that comes to mind. When you tell an LLM "don't do X," X is placed at the center of its attention. To avoid a forbidden pattern, the model must first recall that pattern — which paradoxically increases the probability of generating it. This is inherent to the token prediction mechanism.

Even knowing this problem, you can't avoid prohibition rules in prompts. "Don't do X" is the only tool natural language has for expressing constraints. I've never seen a prompt-based AI agent that doesn't use prohibition rules.

In schemas, this problem doesn't exist.

You don't need to say "don't use the any type" — if any isn't in the schema, the LLM physically can't produce it. You don't need to say "don't create utility functions" — if there's no slot for utility functions in the schema, that's the end of it. If the field type is limited to "boolean" | "int" | "double" | "string" | "uri" | "uuid" | "datetime" — seven options — there's no path for the LLM to write "varchar".

Not prohibition, but absence. Prompts try to forbid what you don't want; schemas only permit what you do. This is why function calling is particularly effective in domains that demand precise output.
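In TypeScript terms, that absence looks like this (the seven-option union from the paragraph above; the runtime guard is a hypothetical helper):

```typescript
// Constraint through absence: the union simply has no "varchar" member,
// so the invalid value is unrepresentable — nothing needs forbidding.
const FIELD_TYPES = [
  "boolean", "int", "double", "string", "uri", "uuid", "datetime",
] as const;
type FieldType = (typeof FIELD_TYPES)[number];

// const bad: FieldType = "varchar"; // ← compile error: not in the union

// The same absence at runtime, for the validation-feedback side:
const isFieldType = (x: string): x is FieldType =>
  (FIELD_TYPES as readonly string[]).includes(x);

console.log(isFieldType("uuid"));    // true
console.log(isFieldType("varchar")); // false — rejected, never "prohibited"
```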

3.3. Model Neutrality

Prompt engineering is inherently model-dependent. A prompt optimized for GPT behaves differently on Claude, and differently again on Qwen. When a new model comes out or you want to experiment with a different one, it's not uncommon to rewrite prompts from scratch.

Function calling schemas are model-neutral. JSON Schema is JSON Schema. It means the same thing regardless of which model reads it, and the validation feedback loop absorbs any performance differences between models. A strong model gets it right in 1–2 attempts; a weaker model takes 3–4; both converge to 100%.

AutoBe running Qwen, GLM, DeepSeek, and OpenAI models on the same schemas, the same pipeline, achieving 100% compilation across the board, is proof of this neutrality. We've never done model-specific prompt tuning.

This changes the nature of model selection. It goes from "can this model do this task?" — a capability question — to "which model is the most cost-effective?" — a cost optimization problem: average retries × tokens per attempt × price per token.
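With made-up numbers (hypothetical models and prices, purely illustrative), that cost comparison is one line of arithmetic:

```typescript
// Once every model reaches 100% via the feedback loop, selection is just:
//   expected cost = average retries * tokens per attempt * price per token
const costPerTask = (
  avgRetries: number,
  tokensPerAttempt: number,
  usdPerMillionTokens: number,
): number => avgRetries * tokensPerAttempt * (usdPerMillionTokens / 1_000_000);

const strong = costPerTask(1.5, 20_000, 3.0); // fewer retries, pricier tokens
const small = costPerTask(3.5, 20_000, 0.4);  // more retries, cheaper tokens

console.log(strong.toFixed(3)); // "0.090"
console.log(small.toFixed(3));  // "0.028" — the small model wins on cost here
```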

3.4. The Core: Verifiability and the Feedback Loop

One thread runs through everything we’ve discussed.

The most powerful advantage of function calling is that it brings LLM output into the domain of software engineering.

If you let an LLM generate freeform text, determining whether that output is correct becomes yet another AI problem. Parsing is fuzzy. Validation is fuzzy. Correction is fuzzy. Everything is uncertain.

With function calling, the output is structured data. From that moment on, you can use the tools of software engineering:

  1. Verification is deterministic — JSON Schema validation yields a clear pass/fail
  2. Feedback is precise — "field X should be type Y, but you gave Z" can be identified exactly
  3. Correction converges — precise feedback enables the model to fix only the affected parts

These three form a deterministic chain. The model is still probabilistic and still makes mistakes, but the loop outside the model is deterministic, so the process converges to 100%.

Typed Schema + Deterministic Validator + Structured Error Feedback = Reliable LLM Output

If prompt engineering is about tinkering with the inside of the model, function calling is about making the outside of the model rock-solid. In domains that demand precision, the effectiveness of the latter approach is proven by results: 6.75% → 100%.

The LLM doesn't need to be accurate. It just needs to be correctable. And correctability is not a property of the model — it's a property of the validation infrastructure.

3.5. Application Spectrum: How Far Can This Go?

So is this pattern — function calling + validation feedback — limited to coding? No. It forms a spectrum based on verifiability.

3.5.1. Domains Where All Output Is Verifiable

AutoBe's Database, Interface, Test, and Realize phases fall here. The compiler serves as the validator, guaranteeing 100% correctness.

This isn't unique to software. Any field where "correct or incorrect" can be mechanically determined supports the same structure, with a natural hierarchy based on verification cost:

| Domain | Fast (ms) | Medium (sec) | Deep (min+) |
|---|---|---|---|
| Software | Type check | Compilation | Test execution |
| Semiconductors | DRC | LVS | SPICE simulation |
| Chemical Process | Mass balance | Energy balance | Process simulation |
| Interior Design | Dimension / clearance | Code compliance, clash detection | Lighting / HVAC simulation |
| Control Systems | Transfer function validity | Stability / margin analysis | Time-domain simulation |

The feedback loop naturally exploits this hierarchy: run the cheapest validator first, fix the errors, then move to the next level.
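The cheapest-first strategy can itself be sketched in a few lines (hypothetical validator shapes, not AutoBe's interfaces):

```typescript
// Cost-ordered validation: run validators cheapest-first and stop at the
// first failure — that error goes back to the LLM before any expensive
// tier is ever paid for.
interface IValidator {
  name: string;
  cost: "fast" | "medium" | "deep";
  check: (artifact: string) => string | null; // null = pass
}

function runTiers(artifact: string, validators: IValidator[]): string[] {
  const rank = { fast: 0, medium: 1, deep: 2 } as const;
  const report: string[] = [];
  for (const v of [...validators].sort((a, b) => rank[a.cost] - rank[b.cost])) {
    const error = v.check(artifact);
    report.push(`${v.name}: ${error ?? "pass"}`);
    if (error !== null) break; // the feedback loop re-enters here
  }
  return report;
}

// Hypothetical software-domain tiers from the table above:
const report = runTiers(`const x: number = "oops";`, [
  { name: "test execution", cost: "deep", check: () => null },
  { name: "type check", cost: "fast", check: (a) => (a.includes(`"oops"`) ? "type error" : null) },
  { name: "compilation", cost: "medium", check: () => null },
]);
console.log(report); // [ 'type check: type error' ] — deeper tiers never run
```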

3.5.2. The Pattern in Practice

The table above shows the hierarchy in overview. Here's what the pattern looks like as concrete types — each from an engineering field with validators refined over decades.

Semiconductors — Physical rules in chip design are non-negotiable:

```typescript
interface IChipLayout {
  technology_node: "5nm" | "7nm" | "14nm" | "28nm";
  blocks: IBlock[];
  connections: IConnection[];
}
interface IBlock {
  type: "logic" | "memory" | "io" | "analog" | "pll";
  position: IPoint2D;
  dimensions: IDimension;
  sub_blocks: IBlock[]; // Recursive hierarchy
}
```

DRC (Design Rule Check, fast), LVS (Layout vs. Schematic, medium), SPICE simulation (slow). Costs vary by tier, but all are deterministic validations. The feedback loop starts with the cheapest — DRC.

Chemical Process β€” Conservation laws are absolute validators:

    interface IProcessStream {
      temperature: number & Minimum<0>; // Kelvin
      pressure: number & Minimum<0>;    // Pa
      composition: IComponent[];        // sum must equal 1.0
      phase: "liquid" | "vapor" | "solid" | "two_phase";
      flow_rate: number & Minimum<0>;   // kg/s
    }
    interface IUnitOperation {
      type:
        | "reactor" | "distillation" | "heat_exchanger"
        | "compressor" | "pump" | "mixer" | "splitter";
      inlet_streams: IProcessStream[];
      outlet_streams: IProcessStream[]; // mass balance: Σin = Σout
      energy_duty: number;              // energy balance
    }

Mass conservation (Ξ£ inlet = Ξ£ outlet), energy balance, thermodynamic consistency β€” these are the laws of physics, not opinions. Tools like ASPEN and HYSYS have provided deterministic validation for over 40 years. Mass balance check (fast) β†’ energy balance (medium) β†’ full process simulation (deep).
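The "fast" tier here is plain arithmetic. A minimal self-contained sketch — field names follow the interfaces above, with the `Minimum<0>` tags replaced by explicit runtime checks so the snippet runs on its own:

```typescript
// Fast-tier chemical process checks: no simulator needed, just arithmetic.
interface IStream {
  flow_rate: number;     // kg/s
  composition: number[]; // fractions, must sum to 1.0
}

const EPS = 1e-6; // numerical tolerance

function checkStream(s: IStream): string[] {
  const errors: string[] = [];
  if (s.flow_rate < 0) errors.push("flow_rate must be >= 0");
  const sum = s.composition.reduce((a, b) => a + b, 0);
  if (Math.abs(sum - 1.0) > EPS)
    errors.push(`composition sums to ${sum}, expected 1.0`);
  return errors;
}

function checkMassBalance(inlets: IStream[], outlets: IStream[]): string[] {
  const errors = [...inlets, ...outlets].flatMap(checkStream);
  const totalIn = inlets.reduce((a, s) => a + s.flow_rate, 0);
  const totalOut = outlets.reduce((a, s) => a + s.flow_rate, 0);
  if (Math.abs(totalIn - totalOut) > EPS)
    errors.push(`mass imbalance: Σin=${totalIn} kg/s, Σout=${totalOut} kg/s`);
  return errors;
}
```

Each error string is already in the shape the feedback loop needs: it names the field and states what was expected.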

Interior Design β€” Beneath the aesthetics, hard constraints define the space:

    interface IRoom {
      type: "bedroom" | "living" | "kitchen" | "bathroom" | "office" | "hallway" | "storage";
      dimensions: IDimension3D;
      openings: IOpening[];
      fixtures: IFixture[];
    }
    interface IOpening {
      type: "door" | "window" | "sliding_door" | "arch";
      width: number & Minimum<0>;  // door ≥ 900mm (accessibility)
      height: number & Minimum<0>;
      position: IPoint3D;
      swing_direction?: "inward" | "outward" | "sliding";
    }
    interface IFixture {
      type: "cabinet" | "counter" | "appliance" | "furniture" | "lighting" | "plumbing";
      position: IPoint3D;
      dimensions: IDimension3D;
      clearance_required: number; // min clear space (mm)
    }

People think of interior design as purely aesthetic, but it’s built on hard constraints: minimum passage width (800mm), door width for accessibility (β‰₯900mm), fire compartment regulations, emergency egress distances. BIM tools like Revit have provided clash detection for decades. Dimension and clearance checks (fast) β†’ building code compliance and collision detection (medium) β†’ lighting (lux) and HVAC simulation (deep).

Control Systems β€” Stability is mathematically provable:

    interface IControlLoop {
      type: "PID" | "MPC" | "LQR" | "feedforward" | "cascade";
      plant_model: ITransferFunction;
      setpoint: number;
      sampling_rate: number & Minimum<0>; // Hz
      constraints: IConstraint[];
    }
    interface ITransferFunction {
      numerator: number[];        // polynomial coefficients
      denominator: number[];      // degree ≥ numerator
      delay: number & Minimum<0>; // transport delay (sec)
    }

A control system is either stable or it isn’t β€” and this can be proven mathematically. Bode plots, Nyquist diagrams, pole placement: over 60 years of established analysis tools. Transfer function validity (fast) β†’ stability and gain/phase margin analysis (medium) β†’ full time-domain simulation (deep).
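The fast tier here is likewise pure structure. A hedged sketch of transfer function validity — properness and a non-negative delay — independent of any analysis toolchain:

```typescript
// Fast-tier control check: structural validity of a transfer function,
// before any stability or margin analysis. Mirrors ITransferFunction above.
interface ITf {
  numerator: number[];   // highest-order coefficient first
  denominator: number[];
  delay: number;         // transport delay (sec)
}

function checkTransferFunction(tf: ITf): string[] {
  const errors: string[] = [];
  const degree = (p: number[]) => p.length - 1;
  if (tf.denominator.length === 0 || tf.denominator[0] === 0)
    errors.push("denominator leading coefficient must be nonzero");
  if (degree(tf.numerator) > degree(tf.denominator))
    errors.push("improper: numerator degree exceeds denominator degree");
  if (tf.delay < 0)
    errors.push("transport delay must be >= 0");
  return errors;
}
```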

Look at these types again. Every one has a type field with enumerated variants β€” "logic" | "memory" | ..., "reactor" | "distillation" | ..., "bedroom" | "living" | ..., "PID" | "MPC" | .... Several nest recursively. These are the same union + tree structures as AutoBe’s IJsonSchema and IExpression. This isn’t coincidence β€” it’s the nature of engineering data. Appendix A.3 explains why.

Note: These domain examples are AI-recommended β€” all are engineering fields where deterministic validators have been established for decades, so the pattern fits in principle. That said, I’m a developer, not a domain expert β€” take the specifics with a grain of salt.


3.5.3. Where This Doesn’t Apply

Conversely, this pattern doesn’t fit domains where deterministic validators can’t be built. Creative writing, emotional intelligence, strategic decision-making. There’s no validator for β€œa good novel” or β€œa wise business decision.” I’ll acknowledge that honestly.


4. Why Qwen

4.1. Function Calling Performance: Best in Class for Small/Medium Models

Let me start with the most direct answer to β€œwhy Qwen?”

AutoBe’s entire pipeline is function calling. Whether a model writes good prose or carries on a smooth conversation doesn’t matter. The only criterion is how accurately it fills complex JSON Schemas.

Qwen is not the only open-weight model that does function calling well. GLM, Kimi, and others deliver strong function calling performance at large model scales. But at the small and medium scale, Qwen was the only one that could handle function calling of this complexity.

Even a compact 3B-active-parameter MoE model supports tool choice and processes complex schemas containing 10+ variant recursive unions. For AutoBe, this small/medium-scale performance was decisive — the next sections explain why.

4.2. R&D Cost: Users vs. Developers

For customers using AutoBe, model cost isn’t an issue. Even the most expensive model is cheaper than actually hiring a backend developer.

But for us developing AutoBe, it’s different. Every time we design a new type or add a new validation rule, we need to run the entire pipeline end to end. Thousands of generate-compile-feedback cycles. Using commercial models every time would be financially ruinous.

Local models make this R&D cycle possible. We can experiment without limit, without worrying about cost. The journey from 6.75% to 100% required hundreds of experiment cycles, and that was only possible because the models were local.

4.3. Small Models Make the Best QA Engineers

Large models make fewer mistakes. That’s an advantage β€” and simultaneously a disadvantage.

Even when our validation has blind spots we haven’t thought of, large models rarely trigger those failures. They β€œguess correctly” through ambiguous parts of the schema and get it right. Our mistakes stay hidden.

Switch to a small model, and the story changes. These are separate models from the four in Section 1.6 β€” smaller or non-coding-optimized variants we used specifically for QA:

| Model | Active / Total | Success rate | What it found |
|---|---|---|---|
| qwen3-30b-a3b-thinking | 3B / 30B | ~10% | Fundamental schema ambiguities, missing required fields |
| qwen3-next-80b-a3b-instruct | 3B / 80B | ~20% | Subtle type mismatches in complex nested relationships |

The 10% success rate was the most valuable result. Every failure pointed to a gap in our system, and each fix strengthened the pipeline not just for weak models, but for all models.

AI is probabilistic. Large models make mistakes less often, not never. Counterexamples that surface with small models will eventually occur with large models too β€” just rarely. In production, β€œrarely” is an outage.

When the schema is precise enough that even a 35B model can’t misinterpret it, the probability of a strong model getting it wrong converges to effectively zero.

4.4. No Vendor Lock-In

Price changes, model deprecation, and rate limits for commercial APIs are entirely at the vendor’s discretion. The model you use today could disappear tomorrow.

AutoBe’s function calling schemas are designed to be model-neutral. We don’t use model-specific prompt tricks. JSON Schema and type-based validation are industry standards β€” the code stays the same even when the model changes.

4.5. Open Source + Open Weights: A Virtuous Cycle

AutoBe is open source (AGPL 3.0), and Qwen is open-weight. Both are part of the open ecosystem.

This combination is what made thousands of experiments possible, what made edge case discovery possible, and what made system hardening possible. With commercial models, experimentation at this scale would have been financially impossible.

The open ecosystem creates a virtuous cycle of mutual reinforcement:

  • AutoBe hardens its system using Qwen
  • The hardened system proves Qwen’s production-level viability
  • Improvements to Qwen raise AutoBe’s overall performance
  • AutoBe’s discoveries (like the double-stringify issue) can contribute to Qwen’s improvement

5. Closing

AutoBe achieved 100% compilation success across all four Qwen models through an all-in function calling strategy.

What made it possible was neither smarter prompts nor more sophisticated orchestration. It was the type-based infrastructure Typia provides β€” automatic schema generation, lenient parsing, type coercion, validation feedback β€” deterministically overcoming the model’s probabilistic limitations.

When you communicate in types, there’s no misunderstanding. When you constrain with schemas, there’s no pink elephant. When you have a deterministic validation loop, even 6.75% becomes 100%.

This pattern isn’t limited to coding. It’s transferable to any engineering field where deterministic validators exist.

And what made all of this experimentation and validation possible was Qwen β€” an open-weight model.

The LLM doesn’t need to be accurate. It just needs to be correctable.


About AutoBe: AutoBeΒ  is an open-source AI agent developed by Wrtn TechnologiesΒ . It generates production-ready backend applications from natural language.

About Typia: TypiaΒ  is a compiler library that automatically generates runtime validators, JSON Schema, and function calling schemas from TypeScript types.


Appendix: Technical Deep Dives

Union types appear throughout this talk, from start to finish. The 10 variants of IJsonSchema (Section 1.3), the 30+ variants of IExpression (Section 1.3), Qwen 3.5’s double-stringify issue (Section 1.5), type coercion (Section 2.3), validation feedback (Section 2.4). Sections A.1–A.4 dive deep into why union types are the make-or-break challenge for function calling infrastructure. Section A.5 explores a separate capability that Typia’s schema-driven parsing makes possible.

A.1. What Is a Discriminated Union?

A union type represents β€œone of several kinds.” For example, if a payment method can be either a card or a bank transfer:

    type Payment =
      | { type: "card"; cardNumber: string; cvc: string }
      | { type: "bank_transfer"; bankCode: string; accountNumber: string }

A discriminated union is a union that has a discriminator field β€” a single field whose value determines which variant the data belongs to. In the example above, type is the discriminator. If type is "card", the data has cardNumber and cvc; if it’s "bank_transfer", it has bankCode and accountNumber. A single discriminator value determines the rest of the structure.

Why does this matter? When an LLM generates data for a union type and makes a mistake, correcting it requires knowing β€œwhich variant was this data intended to be?” first. A discriminator makes this identification straightforward β€” check one field, know the variant. Without one, intent must be inferred from the data’s shape, which is harder but still possible with the right infrastructure.
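The same discriminator also drives TypeScript's own type narrowing — a switch on the `type` field is enough for the compiler to know exactly which fields exist in each branch:

```typescript
// The discriminator field narrows the union: inside each case, the
// compiler knows precisely which variant `p` is.
type Payment =
  | { type: "card"; cardNumber: string; cvc: string }
  | { type: "bank_transfer"; bankCode: string; accountNumber: string };

function describe(p: Payment): string {
  switch (p.type) {
    case "card":
      return `card ending ${p.cardNumber.slice(-4)}`; // p: card variant
    case "bank_transfer":
      return `transfer via bank ${p.bankCode}`;       // p: bank_transfer variant
  }
}
```

What the compiler does statically here is exactly what a validator must do at runtime on LLM output: identify the variant first, then check its fields.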

AutoBe’s IJsonSchema (10 variants) and IExpression (30+ variants) are all discriminated unions, and Typia’s ability to structurally identify variants and generate precise per-field feedback is the core mechanism behind 6.75% β†’ 100%.

A.2. Typia’s x-discriminator β€” Adding Intelligence to anyOf

The JSON Schema standard offers two ways to represent union types: anyOf (matches any) and oneOf (matches exactly one). But neither carries β€œwhich field distinguishes the variants” β€” they just say β€œmatch one of these schemas.”

OpenAPI 3.x has a discriminator, but it’s exclusive to oneOf, and most LLMs don’t handle oneOf reliably.

Typia solves this with a plugin property called x-discriminator. It uses anyOf β€” which LLMs broadly support β€” while attaching discriminator metadata:

    // Schema generated by Typia (simplified)
    {
      "anyOf": [
        {
          "type": "object",
          "properties": {
            "type": { "const": "card" },
            "cardNumber": { ... }
          }
        },
        {
          "type": "object",
          "properties": {
            "type": { "const": "bank_transfer" },
            "bankCode": { ... }
          }
        }
      ],
      "x-discriminator": {
        "propertyName": "type",
        "mapping": {
          "card": "#/$defs/CardPayment",
          "bank_transfer": "#/$defs/BankTransferPayment"
        }
      }
    }

This serves a distinct purpose from Typia’s internal processing. Typia’s coerce() and validate() identify the correct variant through structural analysis of the data itself β€” matching property names, types, and shapes against each variant’s schema. This works with or without a discriminator.

x-discriminator is LLM-facing. It tells the model β€œuse the type field to select a variant,” reducing the chance of the LLM generating structurally ambiguous data in the first place. Better input from the LLM means fewer corrections needed downstream.

The two work in tandem:

  1. x-discriminator reduces errors at the source β€” the LLM reads the hint and generates data that more clearly belongs to one variant
  2. Typia’s structural analysis handles the rest β€” coerce() identifies the variant and applies variant-specific coercion (including double-stringify unwrapping for Qwen 3.5). validate() identifies the variant and produces precise per-field errors β€” not β€œdoesn’t match any of 10 variants,” but β€œcard variant’s cardNumber should be string, but you gave number”

x-discriminator makes the LLM smarter; Typia’s structural engine makes the infrastructure robust. This is why the type coercion from Section 2.3 and validation feedback from Section 2.4 work reliably on union types.
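To make "structural identification" concrete, here is a deliberately simplified illustration — not Typia's actual algorithm, which is schema-driven and far more thorough. It scores each variant by how many of its declared keys the data contains, then reports errors against the best match only:

```typescript
// Simplified sketch of structural variant identification: pick the variant
// whose declared keys best overlap the data, then report what's missing
// against that single variant (instead of "matches none of N variants").
interface IVariant {
  name: string;
  required: string[];
}

function identifyVariant(
  data: Record<string, unknown>,
  variants: IVariant[],
): { variant: string; missing: string[] } {
  const keys = Object.keys(data);
  let best = variants[0];
  let bestScore = -1;
  for (const v of variants) {
    const score = v.required.filter((k) => keys.includes(k)).length;
    if (score > bestScore) {
      best = v;
      bestScore = score;
    }
  }
  return {
    variant: best.name,
    missing: best.required.filter((k) => !keys.includes(k)),
  };
}
```

Even this toy version shows why the feedback improves: the error is scoped to one variant, so the model learns "you intended `card` but forgot `cvc`" rather than "nothing matched."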

A.3. The World Is Made of Recursive Unions

Engineering manages complexity through hierarchical decomposition β€” break a big system into smaller parts, break those into smaller parts still. A chip is blocks; blocks are sub-blocks. A plant is sections; sections are units. A building is floors; floors are rooms. This decomposition is a tree. And at each level, the parts come in different kinds β€” a block can be logic, memory, or IO; a unit can be a reactor, distillation column, or heat exchanger. The moment a tree’s nodes have kinds, it becomes a recursive union type.

The engineering domains from Section 3 are no exception:

  • Semiconductors: IBlock β†’ sub_blocks: IBlock[] (chip β†’ block β†’ sub-block)
  • Chemical Process: plants β†’ sections β†’ units β†’ sub-units (recursive process hierarchy)
  • Interior Design: buildings β†’ floors β†’ rooms β†’ zones (recursive spatial decomposition)
  • Control Systems: cascade control β€” outer loop’s output becomes inner loop’s setpoint (recursive nesting)

These are structurally identical to AutoBe’s IJsonSchema (10 variants) and IExpression (30+ variants). They’re all ASTs β€” abstract syntax trees. This isn’t a coincidence specific to these four fields. Hierarchical decomposition is how engineers manage complexity, and hierarchical decomposition produces recursive union types. Any deterministic engineering domain that structures its data β€” which is virtually all of them β€” will share this structure.

In Section 3, we said β€œif a domain’s output is verifiable, the function calling + validation feedback pattern is transferable.” But if the data structures of those domains are all recursive unions, then conquering union types is the prerequisite for that transfer.

If coercion doesn’t work on union types, Qwen 3.5’s double-stringify problem will surface in chip design too. If validation feedback doesn’t work on union types, β€œdoesn’t match any of 30 variants” won’t get the feedback loop to converge. If you can’t identify which variant the data was intended to be, correction itself is impossible.

Typia’s structural variant identification, schema-based coercion, and precise per-field validation are the solution for this universal structure. AutoBe’s 6.75% β†’ 100% is not just an achievement in code generation. It’s the establishment of 100% reliability on the universal structure of recursive unions β€” an achievement transferable to every structured domain that shares this structure.

A.4. Why Not Zod?

Zod is the most popular runtime validation library in the TypeScript ecosystem. β€œWhy don’t you use Zod?” is a question we get often.

Let’s see what happens when you try to define AutoBe-scale 30+ variant recursive discriminated unions in Zod:

    const ExpressionSchema: z.ZodType<IExpression> = z.lazy(() =>
      z.discriminatedUnion("type", [
        z.object({ type: z.literal("booleanLiteral"), value: z.boolean() }),
        z.object({
          type: z.literal("callExpression"),
          expression: ExpressionSchema,         // circular reference
          arguments: z.array(ExpressionSchema), // circular reference
        }),
        // ... 28 more
      ])
    );

Three problems.

First, you must define TypeScript types and Zod schemas separately.

Zod’s official documentation states this explicitly: β€œyou can define a recursive schema in Zod, but because of a limitation of TypeScript, their type can’t be statically inferred.” When you use z.lazy(), z.infer doesn’t work, so you must define a TypeScript interface separately and pass it manually via z.ZodType<T>:

    // 1. Define the TypeScript type first
    type IExpression =
      | { type: "booleanLiteral"; value: boolean }
      | { type: "callExpression"; expression: IExpression; arguments: IExpression[] }
      | { type: "binaryExpression"; left: IExpression; operator: string; right: IExpression }
    // ... 27 more

    // 2. Define the Zod schema separately, manually linking the type hint
    const ExpressionSchema: z.ZodType<IExpression> = z.lazy(() =>
      z.discriminatedUnion("type", [
        z.object({ type: z.literal("booleanLiteral"), value: z.boolean() }),
        z.object({
          type: z.literal("callExpression"),
          expression: ExpressionSchema,
          arguments: z.array(ExpressionSchema),
        }),
        z.object({
          type: z.literal("binaryExpression"),
          left: ExpressionSchema,
          operator: z.string(),
          right: ExpressionSchema,
        }),
        // ... 27 more
      ])
    );

For a 30+ variant recursive union, this dual definition runs to hundreds of lines. Over time the two drift apart, and there’s nothing to catch the mismatch.

Second, even with dual definitions, it won’t compile.

As the depth of recursive unions increases, you hit TypeScript’s generic instantiation limit:

TS2589: Type instantiation is excessively deep and possibly infinite.

Why does this happen with Zod but not with native TypeScript types? The difference is how the type checker resolves recursive references.

When TypeScript encounters IExpression in a native type alias, the recursive reference is a name lookup β€” a pointer back to the same definition. 30 variants referencing IExpression in their fields? 30 pointer lookups. O(N) β€” linear.

In Zod, z.discriminatedUnion is a deeply nested generic. TypeScript must structurally expand each variant’s output type through Zod’s internal conditional types. When it hits ExpressionSchema inside a variant, z.lazy() forces re-entry into the full union β€” N variants Γ— K recursive fields, each triggering another full expansion. Depth 0: N. Depth 1: NΒ·K. Depth 2: (NΒ·K)Β². For N=30, K=2, depth 3 alone is 216,000 type resolutions. O((NΒ·K)^d) β€” exponential.

This is the most recurrently reported error in Zod’s issue tracker. #577Β , #5064Β , #5256Β  β€” all recursive schemas, all TS2589, all unresolved in Zod v4. Discussion #1459Β  even shows the same error with complex discriminated unions that aren’t recursive at all β€” the generic expansion is expensive enough on its own.

The practical consequence goes beyond compilation. The TypeScript language server runs the same type checker for IDE features β€” autocompletion, hover types, error highlighting. With a 30+ variant recursive Zod schema, the language server enters the same exponential expansion, memory spikes to gigabytes, and the IDE freezes β€” not just in the file where the schema is defined, but in every file that imports it. The development environment becomes unusable.

Third, even after enduring all of that, validation feedback is fundamentally impossible.

This is the most critical problem.

When validation fails on a union type, Zod can’t determine β€œwhich variant was this value intended to be.” In a 10-variant union, errors either flood out for all variants at once (#792Β ), or β€” if the discriminator doesn’t match β€” errors for other fields are silently hidden (#2202Β ). In Zod v4, this actually regressed: on discriminator mismatch, it returns an empty error array and β€œNo matching discriminator” (#4909Β , #5670Β ).

Think about it from the LLM’s perspective. If it intended a callExpression variant but got the arguments field’s type wrong, it needs feedback like β€œarguments should be an IExpression array, but you gave a string.” But Zod says β€œdoesn’t match any of 10 variants.” Feedback that doesn’t tell you what to fix isn’t feedback at all.

Typia structurally identifies the intended variant by analyzing the data’s shape, then generates precise per-field errors against that variant’s schema. This is the prerequisite for the validation feedback loop to converge, and Zod lacks this mechanism entirely.

In summary: with Zod, you get dual definitions, compilation failure, and β€” even then β€” no feedback loop. The very engine behind AutoBe’s 6.75% β†’ 100% simply cannot exist on top of Zod.

With Typia, a single TypeScript interface is all you need:

const result = typia.validate<AutoBeTest.IExpression>(llmOutput);

It operates at the compiler level, so it handles types of any complexity. No separate schema definitions, no generic depth limits, no incomplete error messages.

A.5. Beyond the Token Limit: Incremental Structured Output

Function calling has an unspoken constraint: the entire JSON must fit in a single response. If the model’s max output is 32K tokens and the target JSON is 100K tokens, the output gets truncated mid-JSON. With JSON.parse(), a truncated JSON is a failed JSON. The entire generation is wasted.

This is the structured output equivalent of non-incremental compilation β€” if any part fails, you throw everything away and start from scratch.

Typia’s schema-driven lenient parsing changes this equation. Because parse() auto-closes unclosed brackets, completes incomplete values, and applies type coercion recursively β€” a truncated JSON isn’t a failure. It’s a DeepPartial<T>: a typed object where completed fields are valid and missing fields are identifiable by the schema.
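To make the contrast concrete, here is a deliberately minimal recovery sketch. It only auto-closes unterminated strings and brackets and strips a dangling comma — a tiny fraction of what Typia's `parse()` handles (no value completion, no coercion, no schema awareness) — but it shows why truncation need not mean total loss:

```typescript
// Minimal lenient recovery sketch: JSON.parse rejects truncated output
// outright; tracking the open brackets lets us close them and salvage
// every completed field. Dangling keys (e.g. `{"a":`) are NOT handled.
function lenientParse(text: string): unknown {
  const stack: string[] = []; // pending closing brackets
  let inString = false;
  let escaped = false;
  for (const ch of text) {
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
    } else if (ch === '"') inString = true;
    else if (ch === "{") stack.push("}");
    else if (ch === "[") stack.push("]");
    else if (ch === "}" || ch === "]") stack.pop();
  }
  let repaired = inString ? text + '"' : text; // close an open string
  repaired = repaired.replace(/,\s*$/, "");    // drop a dangling comma
  while (stack.length > 0) repaired += stack.pop(); // close open brackets
  return JSON.parse(repaired);
}
```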

    Turn 1: LLM generates 32K tokens → truncated mid-JSON
            → Typia parse() → DeepPartial<T>
            → Schema diff: "these fields are still missing"
    Turn 2: "Fill in the remaining fields" + previous DeepPartial<T>
            → LLM generates next chunk → Typia parse()
            → DeepPartial<T> updated, validate() on completed subtrees
    Turn N: → All fields present → validate() passes → T

At each turn, parse() recovers the truncated output, coerce() ensures correct types on what exists, and validate() can run on completed subtrees before the whole object is finished. Errors surface incrementally, not at the end.

This is incremental compilation applied to structured output. A traditional compiler recompiles everything on each run; an incremental compiler reuses previous results and only processes what changed. Similarly, traditional function calling discards truncated output and retries from scratch; Typia’s approach reuses every valid field and only asks the LLM to fill what’s missing.
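The accumulation step at each turn amounts to a deep merge of the new chunk into the running partial. A hedged sketch — Typia's actual merging is schema-driven; this simplified version treats arrays and leaf values as last-write-wins:

```typescript
// Sketch of per-turn accumulation: merge each turn's partial chunk into
// the running result so no completed field has to be regenerated.
type DeepPartial<T> = {
  [K in keyof T]?: T[K] extends object ? DeepPartial<T[K]> : T[K];
};

function mergeChunks<T>(base: DeepPartial<T>, chunk: DeepPartial<T>): DeepPartial<T> {
  const out: any = { ...base };
  for (const [key, value] of Object.entries(chunk as object)) {
    const prev = out[key];
    out[key] =
      prev && value &&
      typeof prev === "object" && typeof value === "object" &&
      !Array.isArray(prev) && !Array.isArray(value)
        ? mergeChunks(prev, value) // recurse into nested objects
        : value;                   // leaves and arrays: later turn wins
  }
  return out;
}
```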

The implication is that function calling’s output size is no longer bounded by max_output_tokens. A 200K-token JSON β€” far beyond any single model response β€” can be built incrementally across multiple turns, with type safety maintained at every step. The schema tells you what you have and what you need; the lenient parser ensures nothing is wasted.
