May 10, 2026 · Kryven Research · Cost: $0 · METHODOLOGY

We Pre-Registered a Metric Whose Math Said the Opposite of What We Meant

TL;DR. We froze a primary metric whose math rewarded forgetting: under the frozen reference, a faithful arm scored worse than a degrading one. Three rounds of internal design review missed it; a fifteen-minute, $0 audit caught it before any GPU spend. The sign-check protocol that caught it is described below. It costs nothing to run, and we now require it on every pre-registered primary metric we freeze.

A proposal of ours for per-conversation async thinking compression went through three rounds of design review and reached implementation readiness before a $0 final audit found that the pre-registered primary metric was directionally inverted — it rewarded forgetting. The proposal was killed before any GPU spend. The methodological lesson generalizes well beyond the specific architecture: pre-registration freezes a metric, it does not certify that the math matches the verbal claim attached to it.

We publish this because the pre-registration / freeze-the-metric workflow is now widely adopted in serious ML research, including at Kryven. Most failure-mode discussions of pre-registration focus on p-hacking and post-hoc threshold adjustment — both real. The sign-error failure mode — pre-registering a metric whose math says the opposite of its stated semantics — is rarely discussed, and is exactly the kind of thing pre-registration is supposed to prevent but doesn't, on its own.

The metric

The frozen primary was:

ρ = (KL₁ − KL₁₆) / 15, where KL_t = KL(p_base_no_tick ‖ p_arm), higher = better.

p_base_no_tick is the base LM with no cross-turn state at all — the memoryless baseline. The verbal claim attached to ρ was "fidelity retention: higher ρ means the arm retains fidelity better across turns."

The math says the opposite.

A faithful arm accumulates reasoning the memoryless reference lacks → its distribution moves away from the memoryless reference as turns accrue → KL₁₆ > KL₁ → ρ < 0. A forgetting arm reverts toward memoryless → KL₁₆ < KL₁ → ρ > 0. The gate ρ_arm ≥ ρ_base is therefore a sign-inverted pass/fail gate that a degrading model passes while a faithful one fails.
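The inversion can be traced numerically. The sketch below is a toy, not our eval code: the three-token distributions, the linear drift between them, and the helper names (kl, mix, rho) are all assumptions chosen only to make the sign visible.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mix(a, b, w):
    """Convex mixture (1 - w) * a + w * b of two distributions."""
    return [(1 - w) * ai + w * bi for ai, bi in zip(a, b)]

def rho(arm_by_turn, reference):
    """Frozen primary: (KL_1 - KL_16) / 15, with KL_t = KL(reference || p_arm at turn t)."""
    kl_1 = kl(reference, arm_by_turn[0])
    kl_16 = kl(reference, arm_by_turn[15])
    return (kl_1 - kl_16) / 15

# Illustrative next-token distributions over a 3-token vocabulary (made-up values).
p_base_no_tick = [0.6, 0.3, 0.1]  # memoryless reference
p_informed = [0.1, 0.3, 0.6]      # what accumulated cross-turn context would support

# Faithful arm: moves away from the memoryless reference as turns accrue.
faithful = [mix(p_base_no_tick, p_informed, t / 16) for t in range(1, 17)]
# Forgetting arm: starts informed, reverts toward the memoryless reference.
forgetting = [mix(p_informed, p_base_no_tick, t / 16) for t in range(1, 17)]

rho_faithful = rho(faithful, p_base_no_tick)      # negative: KL_16 > KL_1
rho_forgetting = rho(forgetting, p_base_no_tick)  # positive: KL_16 < KL_1
print(f"faithful:   rho = {rho_faithful:+.4f}")
print(f"forgetting: rho = {rho_forgetting:+.4f}")
```

Under "higher = better", the forgetting arm wins this comparison, which is the inversion described above.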

The correct reference for a fidelity-retention claim is a full-information distribution — the native text-trace arm, or an oracle continuation — never the memoryless base.

Why three reviews missed it

This is the part we want other labs to read.

The proposal went through three rounds of internal design review. The first targeted novelty and the assumption-relabel ("does this proposal break a real assumption"). The second targeted statistical power. The third targeted experimental design and confounds. None was specifically tasked with tracing the sign of the primary metric on a known-good and a known-bad arm. The mistake survived three rounds of an audit process that, in retrospect, simply did not look at the question that would have caught it.

The error surfaced only because the final-revision review's mandate explicitly required: "correct formula / correct reference / correct sign on a known-good and a known-bad arm." That mandate produced the kill in roughly fifteen minutes of work.

The mandatory sign-check we now require

For any pre-registered primary metric, before the freeze:

Hand-construct a known-good arm and a known-bad arm.

Compute the metric for both under the frozen reference distribution.

Confirm that the directional sign of the result matches the verbal claim attached to the metric.

If the metric is a difference or ratio against a reference, write out which side of the reference a faithful arm should land on, and verify the math agrees.

A few lines of Python, runnable in a notebook before the metric is frozen:

# Before freezing a pre-registered primary metric: build a known-good arm
# and a known-bad arm by hand, then run this. Cost: zero.

class SignError(AssertionError):
    """Raised when the metric does not rank the known-good arm above the known-bad arm."""

def sign_check(metric, reference, good_arm, bad_arm,
               direction="higher_is_better"):
    # Score both hand-built arms under the exact reference that will be frozen.
    g = metric(good_arm, reference=reference)
    b = metric(bad_arm,  reference=reference)
    # Strict comparison: a tie also fails, since the metric cannot separate the arms.
    passes = (g > b) if direction == "higher_is_better" else (g < b)
    if not passes:
        raise SignError(f"metric sign-inverted under reference: "
                        f"good={g:.4f}, bad={b:.4f}; verbal claim "
                        f"'{direction}' is contradicted by the math.")
    return g, b

This costs zero. It would have caught our error. We now require it for every internal proposal.
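As a usage sketch, the check can be pointed at ρ under both candidate references. Everything here is illustrative: sign_check is restated compactly so the snippet runs on its own, SignError is assumed to be a plain exception class, the toy distributions are made up, and the full-information reference is held fixed across turns, which a real one would not be.

```python
import math

class SignError(AssertionError):
    """Raised when the known-good arm does not outrank the known-bad arm."""

def sign_check(metric, reference, good_arm, bad_arm, direction="higher_is_better"):
    g = metric(good_arm, reference=reference)
    b = metric(bad_arm, reference=reference)
    passes = (g > b) if direction == "higher_is_better" else (g < b)
    if not passes:
        raise SignError(f"good={g:.4f}, bad={b:.4f} contradicts '{direction}'")
    return g, b

def kl(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mix(a, b, w):
    """Convex mixture (1 - w) * a + w * b."""
    return [(1 - w) * ai + w * bi for ai, bi in zip(a, b)]

def rho(arm_by_turn, reference):
    """(KL_1 - KL_16) / 15 with KL_t = KL(reference || p_arm at turn t)."""
    return (kl(reference, arm_by_turn[0]) - kl(reference, arm_by_turn[15])) / 15

p_memoryless = [0.6, 0.3, 0.1]  # stand-in for p_base_no_tick
p_full_info = [0.1, 0.3, 0.6]   # stand-in for a native text-trace or oracle reference

# Hand-built arms: the good arm accumulates context, the bad arm loses it.
good_arm = [mix(p_memoryless, p_full_info, t / 16) for t in range(1, 17)]
bad_arm = [mix(p_full_info, p_memoryless, t / 16) for t in range(1, 17)]

def is_inverted(reference):
    """True if sign_check rejects the metric under this reference."""
    try:
        sign_check(rho, reference, good_arm, bad_arm)
    except SignError:
        return True
    return False

print(is_inverted(p_memoryless))  # True: the frozen reference fails the check
print(is_inverted(p_full_info))   # False: a full-information reference passes
```

The formula is unchanged between the two calls; only the reference differs, which is why the check has to be run against the exact reference being frozen.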

The companion audit checklist

The same final-revision audit found four additional defects in the same proposal. They are usefully generalizable as a pre-GPU-spend checklist for any pre-registered experiment:

A complete spec plus a passed design review certifies novelty, falsifiability, and schema compliance. It does not certify that the implementation actually wires the mechanism into the measured path, or that the architectural constants assumed for the substrate are real. This gap is especially dangerous for post-knowledge-cutoff substrates, where the implementing engineer codes against placeholder values that were never verified against the real config.

The mandatory $0 audit before any paid GPU step:

Fetch the real model card and config.json. Verify every architectural constant against the engineered code.

Assert the mechanism's output tensor actually enters the scored forward pass.

Confirm the primary-metric code equals the frozen formula and reference distribution.

Confirm any regression-gate code is not a stub.

(From this post) Trace the metric's sign on a known-good and a known-bad arm.
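Several of these checks reduce to a few lines each. The sketch below covers the constant check, the wiring check, and the stub check. The function names, the toy scorers, and the config values are hypothetical stand-ins, not our audit script and not the real model's configuration.

```python
import json
import math

def config_mismatches(real_config, coded_constants):
    """Return {key: (coded, real)} for each coded constant that disagrees with the real config."""
    return {
        key: (coded, real_config.get(key))
        for key, coded in coded_constants.items()
        if real_config.get(key) != coded
    }

def mechanism_reaches_score(score_fn, example):
    """True if toggling the mechanism changes the scored output at all."""
    return score_fn(example, use_mechanism=True) != score_fn(example, use_mechanism=False)

def gate_is_live(gate_fn):
    """True if the regression gate returns a finite number rather than a NaN stub."""
    value = gate_fn()
    return isinstance(value, (int, float)) and math.isfinite(value)

# --- Toy stand-ins (illustrative values only) ---
# In practice real_config would be parsed from the fetched config.json.
real_config = json.loads('{"hidden_size": 2048, "num_hidden_layers": 40}')
coded_constants = {"hidden_size": 4096, "num_hidden_layers": 40}

def broken_score(example, use_mechanism):
    prefix = [x * 2 for x in example] if use_mechanism else []  # computed...
    return sum(example)                                         # ...then discarded

def wired_score(example, use_mechanism):
    prefix = [x * 2 for x in example] if use_mechanism else []
    return sum(example) + sum(prefix)  # the prefix enters the scored value

def stub_gate():
    return float("nan")

print(config_mismatches(real_config, coded_constants))
print(mechanism_reaches_score(broken_score, [1, 2, 3]))
print(mechanism_reaches_score(wired_score, [1, 2, 3]))
print(gate_is_live(stub_gate))
```

The wiring check is deliberately coarse: it only asks whether the scored output can change when the mechanism is toggled, which is enough to catch a prefix that is computed and then dropped.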

In our case, the audit also found that the eval script computed the mechanism's prefix and discarded it — the deliberation mechanism never reached any scored distribution, making all three arms statistically identical by construction. The MMLU/HumanEval regression gate was a return NaN stub. The real model card had every architectural constant different from the placeholder values the implementation coded against — including a model class (Qwen3_5MoeForCausalLM) inconsistent with the proposed mechanism's substrate assumptions. Each of these is a separate failure mode, but the cost lesson is the same: the gate worked because the audit was scoped to check exactly the things that would otherwise have wasted GPU time.

Cost

Total GPU spend across this entire effort, including three rounds of design review and the final audit: $0.

The cost of the lesson was the time invested in the design itself — which we consider well spent. The technical kernel of the work (proper integration with the substrate's matrix-valued Gated-DeltaNet state, the cross-turn versus inline framing, the typing against the real Qwen3_5MoeForCausalLM config rather than placeholders) is documented for any future from-scratch successor effort. The line is closed; the idea space is open.

What we want other labs to take from this

The sign-check is the durable lesson. It costs zero. It catches a class of failure that statistical-power audits, novelty audits, and falsifiability audits do not. We have added it as a standing pre-freeze check on every pre-registered primary metric, and we recommend it to other labs.

Pre-registration is necessary. Pre-registration is not sufficient.
