The Spec-Code Gap

If you attempt to use words to specify a program's behavior in enough detail, at what point do your words become equivalent to the code itself?

Hillel Wayne shared an interesting idea yesterday (emphasis mine).

What I'm getting at here is that a specification is an abstraction of code. For every spec, there is a set of possible programs that satisfy that spec. The more comprehensive and precise the spec, the fewer programs in this set.

I like this framing. A spec as an abstraction of code wasn't a lens I'd used before.

Abstractions have concrete instances. For a spec, those instances are the programs that satisfy it.

Multiple programs can satisfy the same spec. But not all of them are programs you'd want.

The set of possible implementations includes the programs they want, but also lots of programs they don't want.

If the set contains programs you wouldn't accept, that is evidence that the spec needs refinement. Add more constraints to shrink the set.
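To make that concrete, here is a small sketch of two programs that both satisfy the loose spec "calculate the total price of an order" yet disagree on the answer, because the spec leaves the discount/tax ordering open. The rates and prices are hypothetical.

```python
# Two implementations that both satisfy the loose spec
# "calculate the total price of an order" -- yet disagree,
# because the spec doesn't say whether a flat discount is
# applied before or after tax. All numbers are hypothetical.

TAX_RATE = 0.10
FLAT_DISCOUNT = 20.00

def total_discount_first(subtotal):
    # Subtract the discount, then tax the discounted amount.
    return round((subtotal - FLAT_DISCOUNT) * (1 + TAX_RATE), 2)

def total_tax_first(subtotal):
    # Tax the full amount, then subtract the discount.
    return round(subtotal * (1 + TAX_RATE) - FLAT_DISCOUNT, 2)

print(total_discount_first(100.0))  # 88.0
print(total_tax_first(100.0))       # 90.0
```

Both programs sit inside the admissible set of the vague spec; only a refinement ("apply discounts before tax") evicts one of them.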

A specification is sufficient if it does not need to be refined further: no matter what implementation (within reason) is provided, the specifier would be satisfied. A spec does not need to be fully comprehensive to be sufficient.

Consider an order pricing function.

| Constraints | Admissible set | Status |
|---|---|---|
| A: Calculate the total price of an order | Huge; tax before or after discount, rounding varies, shipping included or not | Insufficient |
| B: A + apply discounts before tax | Smaller; rounding and shipping handling still vary | Narrower |
| C: B + round to nearest cent, shipping separate | Stakeholder accepts any program that does this | Sufficient |

B refines A, which is another way of saying that every program satisfying B also satisfies A, but not vice versa. C is sufficient. The stakeholder is satisfied with any program that makes it through.
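As a sketch, here is one program from C's admissible set. The item prices, discount, and tax rate are hypothetical; `Decimal` handles "round to nearest cent" without floating-point surprises.

```python
from decimal import Decimal, ROUND_HALF_UP

def price_order(items, discount_rate, tax_rate):
    """One program satisfying spec C: discount applied before
    tax, result rounded to the nearest cent, and shipping kept
    out of the taxable total entirely."""
    subtotal = sum(Decimal(str(p)) * qty for p, qty in items)
    discounted = subtotal * (1 - Decimal(str(discount_rate)))
    total = discounted * (1 + Decimal(str(tax_rate)))
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Hypothetical order: two items at $19.99, one at $5.00,
# a 10% discount, 8% tax. Shipping would be billed separately.
print(price_order([(19.99, 2), (5.00, 1)], 0.10, 0.08))  # 43.72
```

Any other program that honors the same three constraints is equally acceptable to the stakeholder; that is what makes C sufficient.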

Spec vs. Verification

Tests are one way to check whether a program satisfies a spec. A test is a verification¹ of the spec; it isn't the spec itself.

Once that distinction is clear, it follows that the same spec can be verified using example tests, property-based tests, static types, or even theorem provers.

The spec itself and how you verify it are two independent things, and this separation makes the spec a first-class artifact. You write it down in some format (or multiple formats) and maintain it independently of the code that implements it.
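As a sketch of that independence, here are two different verifications of the same hypothetical "discount before tax" spec: one example-based, one a randomized property check. (A library like Hypothesis would do the latter far more rigorously; this hand-rolled loop just illustrates the idea.)

```python
import random

TAX_RATE = 0.08  # hypothetical rate

def price(subtotal, discount_rate):
    # Hypothetical implementation under test.
    return round(subtotal * (1 - discount_rate) * (1 + TAX_RATE), 2)

# Verification 1: an example test -- one point in the spec.
assert price(100.0, 0.10) == 97.20

# Verification 2: a randomized property check -- for any inputs,
# applying a discount must never increase the total.
for _ in range(1_000):
    subtotal = random.uniform(0, 10_000)
    discount = random.uniform(0, 1)
    assert price(subtotal, discount) <= price(subtotal, 0.0)

print("both verifications passed")
```

The spec stayed fixed; only the verification strategy changed between the two checks.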

| | What it is | Example | Changes when |
|---|---|---|---|
| Spec | What the program must do | Discounts apply before tax; round to nearest cent | Requirements change |
| Code | One program that does it | A specific implementation of the pricing function | Implementation changes (refactor, optimization, rewrite) |
| Verification | Evidence that the code satisfies the spec | Unit tests, property tests, type constraints, formal proof | Spec changes or verification strategy changes |

The Gap

The gap between the spec and the code is every decision the spec leaves open.

In practice, programmers are often in the room with the people creating the spec, and teams evolve it into existence somewhere on the spectrum between implicit and explicit.

Traditionally, this gap gets left open on purpose. Hillel notes:

A spec does not need to be fully comprehensive to be sufficient.

The spec evolves as the software is used. Feedback from real users drives refinement, and the code shifts to meet the moving target.

With AI involved, the gap is being filled in two new ways.

Vibes

The model² makes decisions on your behalf. You trade visibility for velocity.

With vibes, you get a working program faster, but you're accepting most of the model's choices without seeing them individually. In exchange for not having to make those decisions yourself, you give up the ability to confirm or challenge each one. And the model might fill gaps prematurely, or in ways that close off changes you'd want to make later.

Prompts

You write instructions that encode your intent and let the model follow them.

This gives you explicit say over what you're asking the model to do. But once a prompt exists, it becomes a form of code. It has to be maintained and kept in sync with your intent.

"Prompt" covers a spread³ of mechanisms, each closing a different slice of the gap.

| Mechanism | When it applies | What it pins down |
|---|---|---|
| Instruction file | Always, project-wide | Conventions, defaults, style |
| Scoped rule | On file pattern or topic match | Conditional guidance |
| Skill or slash command | On explicit invocation | A named, reusable workflow |
| Memory | Carried across sessions | Accumulated project knowledge |
| Hook | Deterministic, on lifecycle event | Guardrails the model can't skip |

These tools differ in when they apply and how strongly they hold, but all of them exist to specify and steer in one way or another.
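As one hypothetical example of the first mechanism, a project-wide instruction file might pin down exactly the kinds of decisions the pricing spec left open. Every line here is invented for illustration:

```markdown
# Project instructions (hypothetical)

- Apply discounts before tax; never the reverse.
- Round money to the nearest cent; use decimal types, not floats.
- Keep shipping out of the taxable subtotal.
- Keep pricing logic in pure functions; no I/O in calculators.
```

Each line closes a gap the model would otherwise fill on its own.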

The risk of investing heavily in these structures is that every new model, and every new provider, interacts with them a little differently.

At best, switching models means reworking the controls; at worst, your own controls can hold newer, better models back, constraining them into shapes⁴ tuned for the model they replaced. The more controls you have in place, the higher your potential switching cost.

The Spectrum

This vibes/prompts tradeoff is what I'm most interested in. I often dream of "mini languages" or schemas that let us name specific gaps and keep them open or closed in explicit, legible ways, so we don't close them unintentionally or prematurely. What I'm after is a way of controlling gaps that's generic enough to minimize the risk of stifling better future models.

"Pure vibes" and "full spec" are points on a gap-control spectrum of human participation. Each position along it shifts what you maintain, who controls the gap, and what you trade off.

| | Pure vibes | Full spec |
|---|---|---|
| Stance | Write it for me | |
| You maintain | Nothing | |
| Gap control | Ceded to the model | |
| Tradeoff | Visibility for velocity | |

The easiest approach is to do nothing: use minimal or no prompting, let the models get better over time, and see what happens. How comfortable that feels depends on your optimism about the underlying technology: scaling laws, current model architectures, or even gradient descent itself.

I'm entirely open to the idea that the training process will evolve faster and better than my sophisticated prompting, such that all possible gaps can be left open or filled at the right time in the right way with no human involvement.

But at least for now, I continue to invest in carefully specifying prompts and designing context windows to keep gaps open or closed to my own specification. I take the time hit because the outputs are better when I'm steering than when I'm not.