![[AutoBe] Hardcore function calling benchmark in backend coding agent](https://media2.dev.to/dynamic/image/width=1080,height=1080,fit=cover,gravity=auto,format=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhozni1jenah5fce3azxg.png)
[AutoBe] Hardcore function calling benchmark in backend coding agent
Hardcore Benchmark
AutoBE is an open-source initiative that generates backend applications through extensive function calling. Rather than relying on plain text generation, the project drives every phase with LLM function calling, including compiler AST structures of unbounded depth. This approach makes it one of the most extreme function calling benchmarks to date.
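As a rough sketch of what "function calling rather than plain text" means in practice (illustrative only, not AutoBE's actual code): each phase can expose a function whose parameters are a strict JSON schema, so the model's output arrives as a validated, structured value instead of prose. The tool shape below follows the common OpenAI-style `tools` format; `createTable` and its fields are hypothetical names.

```typescript
// Hypothetical tool definition: the LLM must answer by "calling" this
// function, so its output is constrained to the declared JSON schema.
const createTableTool = {
  type: "function",
  function: {
    name: "createTable", // hypothetical phase function
    description: "Emit one table definition for the DB compiler's AST",
    parameters: {
      type: "object",
      properties: {
        name: { type: "string" },
        columns: {
          type: "array",
          items: {
            type: "object",
            properties: {
              name: { type: "string" },
              dataType: { type: "string", enum: ["int", "text", "bool"] },
              nullable: { type: "boolean" },
            },
            required: ["name", "dataType"],
          },
        },
      },
      required: ["name", "columns"],
    },
  },
} as const;

// The model's tool-call arguments come back as JSON text and parse
// directly into a structured value -- no scraping prose for answers.
const args = JSON.parse(
  '{"name":"users","columns":[{"name":"id","dataType":"int"}]}',
);
console.log(args.columns.length);
```

The payoff is that malformed output is a schema violation the runtime can reject and retry, rather than free-form text that must be parsed heuristically.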
The project includes three primary AST structures:
- DB Compiler's AST
- API specification's AST
- Test function's AST
These structures handle deeply nested JSON schemas across multiple type categories, including constants, booleans, integers, strings, arrays, objects, and references.
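A minimal sketch of such a recursive AST, assuming a discriminated union over the type categories listed above (the names are illustrative, not AutoBE's actual types):

```typescript
// Hypothetical recursive AST for JSON-schema-like types. Arrays and
// objects nest further schemas, so depth is unbounded -- which is what
// makes emitting these via structured function calling so demanding.
type TypeSchema =
  | { kind: "constant"; value: string | number | boolean }
  | { kind: "boolean" }
  | { kind: "integer" }
  | { kind: "string" }
  | { kind: "array"; items: TypeSchema }
  | { kind: "object"; properties: Record<string, TypeSchema> }
  | { kind: "reference"; $ref: string };

// Nesting depth of a schema: containers add one level per layer.
function depth(schema: TypeSchema): number {
  switch (schema.kind) {
    case "array":
      return 1 + depth(schema.items);
    case "object":
      return 1 + Math.max(0, ...Object.values(schema.properties).map(depth));
    default:
      return 1;
  }
}

const example: TypeSchema = {
  kind: "object",
  properties: {
    id: { kind: "string" },
    tags: { kind: "array", items: { kind: "string" } },
  },
};
console.log(depth(example)); // 3
```

Because the union is recursive, a model producing one of these values via function calling must keep the whole nested structure consistent at every level, not just emit flat fields.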
Limitations
Different models produce significantly varying outputs for identical tasks. For instance, while anthropic/claude-sonnet-4.5 and openai/gpt-5.1 generate 630 and 2,000 test functions respectively for the same topic, qwen/qwen3-next-80b-a3b creates only 360.
The current benchmark is uncontrolled; it indicates only whether each AI model can properly construct extremely complex types.
AutoBE remains incomplete: though generated applications guarantee 100% compilation success, they don't guarantee 100% runtime success.
Promise
We have previously achieved 100% build success with local models. We are committed to expanding benchmark coverage across various local and commercial LLMs while refining controlled variables in future evaluations.