AI-generated code has an intelligence ceiling. Reproducibility has no floor.
Give an LLM the same natural-language requirements on Monday and it produces one implementation. Run it again on Friday and something subtly different comes out. Same words, different assumptions, different edge cases, different failure modes. That's not carelessness — natural-language requirements are inherently lossy. They describe intent and leave behavior to interpretation. The model fills in the gaps with whatever seems reasonable in the moment, and "reasonable" varies.
Claude and I are building Specforge to fix that — not just for humans reviewing the output, but for the models generating it.
Specforge is a new paradigm for AI software development: a bidirectional engine where `specforge build` turns a spec into Python code, and `specforge extract` turns existing Python code into a spec. The key insight is that the spec format isn't documentation — it's a contract. Tests are embedded directly in the spec. The model doesn't just write code from a description; it writes code that must pass specific assertions before it's accepted. Functional equivalence — passing the tests — is the compliance bar. The function body can vary between runs. The tests cannot. And because the spec describes behavior rather than implementation, the same spec could target any language the engine supports.
A module spec looks like this:
```yaml
name: auth
functions:
  - name: hash_password
    signature: "hash_password(password: str) -> str"
    description: Hashes a plaintext password for storage.
    tests:
      - description: "result is not plaintext"
        code: |
          result = hash_password("hello")
          assert result != "hello"
```
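The embedded tests are executable, not illustrative. A minimal sketch of what "accepted" could mean mechanically: exec the generated module, then exec the test code in the same namespace and let assertions decide. The `run_spec_test` helper and the SHA-256 body are assumptions for illustration, not Specforge's actual API.

```python
def run_spec_test(generated_source: str, test_code: str) -> tuple[bool, str]:
    """Run one embedded spec test against generated code; return (passed, detail)."""
    namespace: dict = {}
    try:
        exec(generated_source, namespace)   # load the generated function(s)
        exec(test_code, namespace)          # assertions raise on failure
        return True, ""
    except AssertionError as e:
        return False, f"assertion failed: {e}"
    except Exception as e:
        return False, f"{type(e).__name__}: {e}"

# A candidate body the model might emit (toy sketch, not a secure password hash):
generated = '''
import hashlib

def hash_password(password: str) -> str:
    return hashlib.sha256(password.encode()).hexdigest()
'''

test = 'result = hash_password("hello")\nassert result != "hello"'
ok, detail = run_spec_test(generated, test)
```

Because the test code is ordinary Python, the spec author pays no DSL tax: anything assertable is specifiable.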
The engine will run a verify loop: generate, test, retry with failure context injected into the prompt — traceback, failing assertion, last generated body — up to three times. The model gets to see exactly why it failed and try again with that context. Nothing silently swallowed. This is how code generation should work: tight feedback, not one-shot hope.
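The loop above can be sketched in a few lines. Everything here is an assumption about shape, not the real engine: `generate` stands in for the LLM call, `MAX_ATTEMPTS` for the retry cap, and the feedback string for the failure context injected into the next prompt.

```python
import traceback

MAX_ATTEMPTS = 3  # assumed cap, matching the "up to three times" described above

def verify_loop(spec_tests: list[str], generate) -> str:
    """Generate a candidate, run the spec's tests, retry with failure context."""
    feedback = ""  # failure context carried into the next attempt's prompt
    for attempt in range(MAX_ATTEMPTS):
        source = generate(feedback)            # LLM produces a candidate body
        namespace: dict = {}
        try:
            exec(source, namespace)            # load the candidate
            for test_code in spec_tests:
                exec(test_code, dict(namespace))  # each test in a fresh copy
            return source                      # all embedded tests passed
        except Exception:
            # Inject the exact failure into the retry prompt — nothing swallowed.
            feedback = (f"Previous body:\n{source}\n"
                        f"Failure:\n{traceback.format_exc()}")
    raise RuntimeError(f"no passing implementation after {MAX_ATTEMPTS} attempts")
```

The important property is that a failed attempt produces structured input for the next one, so the model is debugging, not re-guessing.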
The reference implementation will be a SQLite engine — not a wrapper around Python's sqlite3, but an implementation of SQLite's behavior from scratch. Parser, B-tree storage, query executor, type system, persistence to the actual .db file format. We chose SQLite because its behavior is exhaustively documented and correct behavior is unambiguous. If Specforge can build that, it works at real-world complexity.
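To make that concrete, a spec entry for one small piece of such an engine might look like the following, reusing the spec format shown earlier. The module and function names are hypothetical, not part of the actual reference implementation.

```yaml
name: sql_tokenizer   # hypothetical module within the SQLite-engine spec
functions:
  - name: tokenize
    signature: "tokenize(sql: str) -> list[str]"
    description: Splits a SQL statement into whitespace-delimited tokens.
    tests:
      - description: "keywords and identifiers come out as separate tokens"
        code: |
          assert tokenize("SELECT id FROM users") == ["SELECT", "id", "FROM", "users"]
```

SQLite's documentation is precise enough that tests like these can be written straight from the reference docs, without ever looking at an implementation.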
The extractor (code → spec) completes the loop. Point it at any rule-compliant Python file and it will produce a spec: parse the structure, generate behavior descriptions, generate tests from the description rather than the implementation — explicitly instructed not to reflect the code it was shown — then verify those tests against the original source before including them. The round-trip is how we'll measure spec quality. A good spec should survive it.
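That final verification gate can be sketched as a filter: a candidate test is kept only if the original source actually passes it, which discards anything the model hallucinated about behavior it wasn't shown. `filter_verified_tests` is an illustrative name, not Specforge's real interface.

```python
def filter_verified_tests(original_source: str, candidate_tests: list[str]) -> list[str]:
    """Keep only the generated tests that the original code actually passes."""
    kept = []
    for test_code in candidate_tests:
        namespace: dict = {}
        try:
            exec(original_source, namespace)   # load the code under extraction
            exec(test_code, namespace)         # run the candidate test against it
        except Exception:
            continue                           # wrong or hallucinated test: drop it
        kept.append(test_code)
    return kept
```

A spec that survives this gate, then regenerates code that passes its own tests, has completed the round trip — the quality measure described above.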
This matters for any agent doing serious software work. Right now, models improvise. Specforge is a path toward something better: a format where model output is verifiable, reproducible, and extractable from existing codebases. A common language between agents, between runs, between teams.
Early days. If you're working on spec quality, LLM reproducibility, or code generation infrastructure — or if you just want consistent output from your own build pipelines — we'd like to hear from you.