AI-generated code has an intelligence ceiling. Reproducibility has no floor.
Give an LLM the same natural-language requirements on Monday and it produces one implementation. Run it again on Friday and something subtly different comes out. Same words, different assumptions, different edge cases, different failure modes. That's not carelessness — natural-language requirements are inherently lossy. They describe intent and leave behavior to interpretation. The model fills in the gaps with whatever seems reasonable in the moment, and "reasonable" varies.
Claude and I are building Specforge to fix that — not just for humans reviewing the output, but for the models generating it.
Specforge is a new paradigm for AI software development: a bidirectional engine where `specforge build` turns a spec into Python code, and `specforge extract` turns existing Python code into a spec. The key insight is that the spec format isn't documentation — it's a contract. Tests are embedded directly in the spec. The model doesn't just write code from a description; it writes code that must pass specific assertions before it's accepted. Functional equivalence — passing the tests — is the compliance bar. The function body can vary between runs. The tests cannot. And because the spec describes behavior rather than implementation, the same spec could target any language the engine supports.
A module spec looks like this:
```yaml
name: auth
functions:
  - name: hash_password
    signature: "hash_password(password: str) -> str"
    description: Hashes a plaintext password for storage.
    tests:
      - description: "result is not plaintext"
        code: |
          result = hash_password("hello")
          assert result != "hello"
```
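The embedded tests are executable, not illustrative. A minimal sketch of what "accepted" could mean mechanically: exec the generated module, then exec the test code in the same namespace and let assertions decide. The `run_spec_test` helper and the SHA-256 body are assumptions for illustration, not Specforge's actual API.

```python
def run_spec_test(generated_source: str, test_code: str) -> tuple[bool, str]:
    """Run one embedded spec test against generated code; return (passed, detail)."""
    namespace: dict = {}
    try:
        exec(generated_source, namespace)   # load the generated function(s)
        exec(test_code, namespace)          # assertions raise on failure
        return True, ""
    except AssertionError as e:
        return False, f"assertion failed: {e}"
    except Exception as e:
        return False, f"{type(e).__name__}: {e}"

# A candidate body the model might emit (toy sketch, not a secure password hash):
generated = '''
import hashlib

def hash_password(password: str) -> str:
    return hashlib.sha256(password.encode()).hexdigest()
'''

test = 'result = hash_password("hello")\nassert result != "hello"'
ok, detail = run_spec_test(generated, test)
```

Because the test code is ordinary Python, the spec author pays no DSL tax: anything assertable is specifiable.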
The engine will run a verify loop: generate, test, retry with failure context injected into the prompt — traceback, failing assertion, last generated body — up to three times. The model gets to see exactly why it failed and try again with that context. Nothing silently swallowed. This is how code generation should work: tight feedback, not one-shot hope.
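The loop above can be sketched in a few lines. Everything here is an assumption about shape, not the real engine: `generate` stands in for the LLM call, `MAX_ATTEMPTS` for the retry cap, and the feedback string for the failure context injected into the next prompt.

```python
import traceback

MAX_ATTEMPTS = 3  # assumed cap, matching the "up to three times" described above

def verify_loop(spec_tests: list[str], generate) -> str:
    """Generate a candidate, run the spec's tests, retry with failure context."""
    feedback = ""  # failure context carried into the next attempt's prompt
    for attempt in range(MAX_ATTEMPTS):
        source = generate(feedback)            # LLM produces a candidate body
        namespace: dict = {}
        try:
            exec(source, namespace)            # load the candidate
            for test_code in spec_tests:
                exec(test_code, dict(namespace))  # each test in a fresh copy
            return source                      # all embedded tests passed
        except Exception:
            # Inject the exact failure into the retry prompt — nothing swallowed.
            feedback = (f"Previous body:\n{source}\n"
                        f"Failure:\n{traceback.format_exc()}")
    raise RuntimeError(f"no passing implementation after {MAX_ATTEMPTS} attempts")
```

The important property is that a failed attempt produces structured input for the next one, so the model is debugging, not re-guessing.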
The reference implementation will be a SQLite engine — not a wrapper around Python's sqlite3, but an implementation of SQLite's behavior from scratch. Parser, B-tree storage, query executor, type system, persistence to the actual .db file format. We chose SQLite because its behavior is exhaustively documented and correct behavior is unambiguous. If Specforge can build that, it works at real-world complexity.
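To make that concrete, a spec entry for one small piece of such an engine might look like the following, reusing the spec format shown earlier. The module and function names are hypothetical, not part of the actual reference implementation.

```yaml
name: sql_tokenizer   # hypothetical module within the SQLite-engine spec
functions:
  - name: tokenize
    signature: "tokenize(sql: str) -> list[str]"
    description: Splits a SQL statement into whitespace-delimited tokens.
    tests:
      - description: "keywords and identifiers come out as separate tokens"
        code: |
          assert tokenize("SELECT id FROM users") == ["SELECT", "id", "FROM", "users"]
```

SQLite's documentation is precise enough that tests like these can be written straight from the reference docs, without ever looking at an implementation.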
The extractor (code → spec) completes the loop. Point it at any rule-compliant Python file and it will produce a spec: parse the structure, generate behavior descriptions, generate tests from the description rather than the implementation — explicitly instructed not to reflect the code it was shown — then verify those tests against the original source before including them. The round-trip is how we'll measure spec quality. A good spec should survive it.
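That final verification gate can be sketched as a filter: a candidate test is kept only if the original source actually passes it, which discards anything the model hallucinated about behavior it wasn't shown. `filter_verified_tests` is an illustrative name, not Specforge's real interface.

```python
def filter_verified_tests(original_source: str, candidate_tests: list[str]) -> list[str]:
    """Keep only the generated tests that the original code actually passes."""
    kept = []
    for test_code in candidate_tests:
        namespace: dict = {}
        try:
            exec(original_source, namespace)   # load the code under extraction
            exec(test_code, namespace)         # run the candidate test against it
        except Exception:
            continue                           # wrong or hallucinated test: drop it
        kept.append(test_code)
    return kept
```

A spec that survives this gate, then regenerates code that passes its own tests, has completed the round trip — the quality measure described above.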
This matters for any agent doing serious software work. Right now, models improvise. Specforge is a path toward something better: a format where model output is verifiable, reproducible, and extractable from existing codebases. A common language between agents, between runs, between teams.
Early days. If you're working on spec quality, LLM reproducibility, or code generation infrastructure — or if you just want consistent output from your own build pipelines — we'd like to hear from you.