May 15, 2026 · Kryven Research · DESIGN

Constraints We Now Respect

A short list of architectural and methodological dead-ends we have either reconfirmed or proved internally, presented not as "things we tried that failed" but as design constraints we treat as binding going forward. Each constraint has a citation and a stated reason it fails — citing the conclusion alone is not enough; cite the mechanism.

We publish this because most labs rediscover these every cycle, and a public list with citations is a small public service. Most are reconfirmations of literature; a few are proven internally in conditions the literature does not document.

1. Token→expert routing-distribution changes on a static-FP8 or W4-activation-calibrated serving stack cause an estimated 2–5 pp silent MMLU regression unless paired with fixed-capacity routing or online recalibration. Reason: static activation scales are conditional on the activation distribution seen at calibration time, and that distribution shifts when routing changes. W4 weight-only AWQ/GPTQ scales survive because they are applied to the weights, which a routing change leaves untouched; only activation-scale calibration breaks. The regression is silent (perplexity passes), and the constraint is scale-invariant: do not assume a larger model dilutes it. Any MoD / SeqTopK / elastic-top-k / relevance-filter / per-token gating change deployed on an activation-calibrated serving stack must carry either fixed-capacity routing by construction or an online recalibration pass.
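The mechanism in constraint 1 can be sketched with a toy static-scale quantizer. This is a minimal illustration under stated assumptions, not our serving stack: an int8 grid stands in for FP8, activations are drawn from Gaussians, and the 3× spread after the routing change is a hypothetical value chosen to make the clipping visible.

```python
import random

QMAX = 127  # int8 grid used as a simple stand-in for an FP8 static scale (assumption)

def static_scale(calib):
    """Per-tensor symmetric scale fixed from the calibration set's absolute maximum."""
    return max(abs(x) for x in calib) / QMAX

def rel_quant_error(xs, scale):
    """Relative L2 error after quantize/dequantize with a fixed scale.

    Values outside the calibrated range are clipped to the grid edge.
    """
    sq_err = 0.0
    sq_norm = 0.0
    for x in xs:
        q = max(-QMAX, min(QMAX, round(x / scale)))
        sq_err += (x - q * scale) ** 2
        sq_norm += x * x
    return (sq_err / sq_norm) ** 0.5

rng = random.Random(0)
calib = [rng.gauss(0.0, 1.0) for _ in range(4096)]    # activations seen at calibration time
shifted = [rng.gauss(0.0, 3.0) for _ in range(4096)]  # hypothetical activations after a routing change

scale = static_scale(calib)
err_calibrated = rel_quant_error(calib, scale)                      # scale matches the data
err_stale = rel_quant_error(shifted, scale)                         # stale scale, shifted data
err_recalibrated = rel_quant_error(shifted, static_scale(shifted))  # after a recalibration pass

print(f"calibrated={err_calibrated:.4f} stale={err_stale:.4f} recalibrated={err_recalibrated:.4f}")
```

The stale scale clips the wider distribution, so its error is far larger than either the calibrated case or the recalibrated one. Recalibration restores the error to roughly its original level.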

2. Attention-entropy / KV-sparsity / learned-eviction gates that report only perplexity or MMLU re-derive the H2O / StreamingLLM NIAH-collapse dead end — perplexity passes while Needle-in-a-Haystack collapses to <8% retrieval. Reason: perplexity averages over the whole context; NIAH probes long-range retrieval at a specific position, which a sparsity gate can silently delete without changing average likelihood. NIAH (≥95% retention) is the only valid quality adjudicator for any KV-reduction-by-selection mechanism. PyramidKV (100% NIAH at 12% cache retention) is the documented NIAH-safe exception and is the correct safe alternative to recommend over any eviction scheme that has not been NIAH-validated.
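Why perplexity can pass while retrieval collapses comes down to averaging, which a few lines of arithmetic make concrete. All numbers below are hypothetical: a 10,000-token context, a flat per-token negative log-likelihood of 2.0, and a single needle-answer token whose loss jumps when the needle is evicted from the cache.

```python
import math

def perplexity(nlls):
    """Perplexity is exp of the mean per-token negative log-likelihood."""
    return math.exp(sum(nlls) / len(nlls))

N_TOKENS = 10_000          # hypothetical context length
NLL_TYPICAL = 2.0          # hypothetical per-token NLL for ordinary tokens
NLL_NEEDLE_KEPT = 0.1      # answer token is easy while the needle is still cached
NLL_NEEDLE_EVICTED = 8.0   # answer token is near-chance once the needle is evicted

full_cache = [NLL_TYPICAL] * (N_TOKENS - 1) + [NLL_NEEDLE_KEPT]
sparse_cache = [NLL_TYPICAL] * (N_TOKENS - 1) + [NLL_NEEDLE_EVICTED]

ppl_full = perplexity(full_cache)
ppl_sparse = perplexity(sparse_cache)
rel_ppl_change = (ppl_sparse - ppl_full) / ppl_full

# Probability assigned to the correct answer token in each case.
p_answer_full = math.exp(-NLL_NEEDLE_KEPT)
p_answer_sparse = math.exp(-NLL_NEEDLE_EVICTED)

print(f"relative perplexity change: {rel_ppl_change:.5f}")
print(f"answer probability: {p_answer_full:.3f} -> {p_answer_sparse:.6f}")
```

Perplexity moves by less than a tenth of a percent, while the probability of the correct answer falls by several orders of magnitude. That gap is why a retrieval probe, not an average likelihood, has to adjudicate.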

3. Frozen-base post-hoc routing / skip / tick gates with no LM-coupled data-dependent gradient pathway collapse to a degenerate constant routing distribution. Reason: without a gradient path from the LM loss to the gate's logits, the gate has no learning signal to differentiate tokens; the optimizer pushes it toward a degenerate constant (variance ≈ 0 across tokens and seeds). We confirmed this internally with cross-seed bit-identical results on a tick-routing prototype. Train-time or gate-fine-tuned designs are mandatory; bolt-on gates over a frozen base are auto-flagged as collapse-prone.
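One way to detect the collapse described in constraint 3 is a variance check over gate outputs, across tokens within each seed and across seeds. The sketch below is an assumption about how such a check could look, not our internal tooling; the function name, tolerance, and sample values are hypothetical.

```python
from statistics import pvariance

def routing_collapsed(gate_outputs_by_seed, tol=1e-6):
    """Return True if the gate emits an (almost) constant value.

    gate_outputs_by_seed: one list of per-token gate probabilities per seed.
    Collapse means near-zero variance across tokens in every seed AND
    near-zero variance of the per-seed means.
    """
    per_seed_var = [pvariance(run) for run in gate_outputs_by_seed]
    seed_means = [sum(run) / len(run) for run in gate_outputs_by_seed]
    return max(per_seed_var) < tol and pvariance(seed_means) < tol

# Hypothetical gate outputs: three seeds, six tokens each.
collapsed_runs = [[0.5] * 6, [0.5] * 6, [0.5] * 6]
healthy_runs = [
    [0.1, 0.9, 0.4, 0.7, 0.2, 0.8],
    [0.2, 0.8, 0.5, 0.6, 0.1, 0.9],
    [0.3, 0.7, 0.4, 0.8, 0.2, 0.6],
]

print(routing_collapsed(collapsed_runs))
print(routing_collapsed(healthy_runs))
```

A gate that passes this check is not thereby proven useful; the check only flags the degenerate constant case.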

4. A CPU smoke that rescales hyperparameters to surface an effect in a few steps validates the mechanism in rank-order, not the production hyperparameter or absolute magnitude. Reason: rescaling γ from 0.001 → 0.05 (×50) to make the control loop bind in 100 steps demonstrates the loop binds in principle; it does not validate that γ=0.001 works at production scale. Reporting a rescaled-HP CPU pass as production-HP validation is the dead end. The production HP must still be GPU-validated.

5. Power-law extrapolation from a small or proxy run onto a discontinuous capability metric is invalid. Reason: emergent capabilities, such as multi-digit arithmetic at ~13B→175B and chain-of-thought at the 100B+ scale (Wei et al. 2022, arXiv:2206.07682), behave as phase transitions, not scaling laws, on the metric the literature uses to score them. A sub-1B proxy on the zero-capability side of the transition is null, not predictive. (Cf. Schaeffer et al., NeurIPS 2023, arXiv:2304.15004, which argues that most apparent emergence is a metric-discontinuity artifact: over 92% of claimed emergent abilities on BIG-bench appear under just two discontinuous metrics, Multiple Choice Grade and Exact String Match. Either way, on a discontinuous primary metric the proxy is uninformative below threshold.) Proxy / scaling candidates whose target metric is on the discontinuous-emergence list are flagged when the proxy sits below the emergence threshold; the proxy-LR / rank arbitrage is valid only for monotone loss / rank quantities.
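A toy calculation in the spirit of the metric-artifact argument shows why a fit made below threshold is uninformative. Everything here is a hypothetical assumption: the parameter counts, the smoothly improving per-token accuracies, and the 10-token answer length. Exact match is modeled as all tokens being right independently.

```python
import math

ANSWER_LEN = 10  # tokens that must all be correct for an exact-match hit (hypothetical)

params = [1e8, 1e9, 1e10, 1e11, 1e12]            # hypothetical model sizes
per_token_acc = [0.50, 0.70, 0.85, 0.95, 0.99]   # hypothetical smooth improvement
exact_match = [p ** ANSWER_LEN for p in per_token_acc]

# Fit a power law EM = c * N^b through the two smallest (proxy-scale) points.
b = math.log(exact_match[1] / exact_match[0]) / math.log(params[1] / params[0])
c = exact_match[0] / params[0] ** b

# Extrapolate the fit to the largest model and compare with the "true" value.
predicted_at_target = c * params[-1] ** b
actual_at_target = exact_match[-1]

for n, em in zip(params, exact_match):
    print(f"N={n:.0e}  exact_match={em:.4f}")
print(f"power-law prediction at N={params[-1]:.0e}: {predicted_at_target:.1f}")
print(f"actual exact match at N={params[-1]:.0e}: {actual_at_target:.3f}")
```

The two proxy points sit near zero on exact match, and the power law fitted through them extrapolates to a value above 1, which is not a valid accuracy. The same runs extrapolate sensibly on the smooth per-token quantity, which is the only kind of quantity the arbitrage is valid for.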

6. mlx_lm.evaluate on Apple Silicon is a hard kill for reasoning / thinking-token models (DeepSeek-R1 / Qwen3-Next class). Reason: <think> tokens break the evaluation harness's answer extraction, producing pathological 0% winogrande / 100% openbookqa (vs ~80% reference). The kill is already triggered, not predicted. The Reasoning Specialist of any nested-MoE stack cannot be eval-routed to M5; that decision is binding.
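The failure class in constraint 6 can be illustrated with a generic answer extractor. This is a hypothetical sketch and does not reproduce the internals of mlx_lm.evaluate; the regexes, function names, and sample completion are all assumptions made for illustration.

```python
import re

# Matches a complete reasoning block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)
# Matches the first standalone option letter A-D.
OPTION_LETTER = re.compile(r"\b([A-D])\b")

def naive_extract(completion):
    """Take the first standalone option letter anywhere in the completion."""
    match = OPTION_LETTER.search(completion)
    return match.group(1) if match else None

def strip_then_extract(completion):
    """Drop the reasoning block first, then extract from the visible answer."""
    visible = THINK_BLOCK.sub("", completion)
    return naive_extract(visible)

# Hypothetical reasoning-model completion for a multiple-choice item whose gold answer is B.
completion = "<think>Option A looks tempting, but B fits the passage.</think>\nAnswer: B"

naive_answer = naive_extract(completion)
robust_answer = strip_then_extract(completion)
print(naive_answer, robust_answer)
```

The naive extractor picks up an option letter mentioned inside the reasoning trace, so its score reflects where letters happen to appear in the trace and not what the model answered.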

7. The "transferred-LR is inside the 95% CI of the target-scale optimum" verdict is GPU-irreducible. Reason: establishing the 95% CI of the true optimum requires GPU bf16 training runs at the target scale to produce the dynamics; a CPU smoke can serve as an early-abort runnability check for those runs, but it cannot establish the CI. μP / MSSP HP-transfer claims must treat the CI verdict as GPU-only; the CPU leg is a runnability gate at most.

8. A ≤64-core paid CPU cluster costs at least as much per hour as an H100 spot instance while delivering 20–100× less throughput. Reason: directly sourced rental prices put Hetzner AX162 EPYC at ~$2.92/h and AWS c7i.48xlarge at ~$4.51/h, vs ~$2.50/h for an H100 spot instance. CPU bulk inference costs ~10× more per token. The only net-positive cheap substrates are M5 ($0 rental) for rank-order/correctness, lm-eval-cpu classification against fp32 weights, and spot c6i embarrassingly-parallel ranking in a low-interruption region with <2 h idempotent-checkpointed trials.
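For throughput-bound work, the hourly prices and the slowdown range quoted above combine into a cost-per-unit-of-work multiplier. The sketch below only multiplies those quoted figures; the function name and the assumption that cost scales linearly with wall-clock time are ours.

```python
H100_SPOT_PER_HOUR = 2.50  # $/h, as quoted in the text
CPU_PRICE_PER_HOUR = {     # $/h, as quoted in the text
    "Hetzner AX162 EPYC": 2.92,
    "AWS c7i.48xlarge": 4.51,
}
SLOWDOWN_RANGE = (20, 100)  # CPU throughput deficit range quoted in the text

def cost_multiplier(cpu_price, slowdown, gpu_price=H100_SPOT_PER_HOUR):
    """Cost of a fixed amount of work on CPU relative to the GPU.

    Assumes cost is price per hour times hours, and hours scale with slowdown.
    """
    return (cpu_price / gpu_price) * slowdown

multipliers = {
    name: tuple(cost_multiplier(price, s) for s in SLOWDOWN_RANGE)
    for name, price in CPU_PRICE_PER_HOUR.items()
}

for name, (low, high) in multipliers.items():
    print(f"{name}: {low:.1f}x to {high:.1f}x the GPU cost per unit of work")
```

Because the hourly price ratio is already at or above 1, the throughput deficit passes straight through to the cost of the job.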

9. "p > 0.05 on a difference test" does NOT establish CPU↔GPU equivalence. Reason: a non-significant difference test at small n simply reflects low power, not equivalence; with n=3, a t-test falls far short of 80% power even at a conventionally large effect (d = 0.8), so only much larger differences are reliably detected. A cheap-validates-expensive claim is admissible only via a pre-registered TOST (two one-sided tests) against a GPU-anchored smallest effect size of interest (SESOI). This is the standing antidote to the silent-false-positive failure mode that corrupts the GPU-spend decision.
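The TOST requirement can be sketched in its confidence-interval form: at α = 0.05, equivalence is declared only if the 90% CI of the mean paired CPU-minus-GPU difference lies entirely inside ±SESOI. This is a minimal sketch under assumptions: a paired design, a hard-coded table of one-sided 5% Student-t critical values for up to 10 degrees of freedom, and hypothetical difference values and SESOI.

```python
from statistics import mean, stdev

# One-sided 5% critical values of Student's t, indexed by degrees of freedom.
T_CRIT_05 = {1: 6.314, 2: 2.920, 3: 2.353, 4: 2.132, 5: 2.015,
             6: 1.943, 7: 1.895, 8: 1.860, 9: 1.833, 10: 1.812}

def tost_equivalent(diffs, sesoi):
    """Paired TOST at alpha = 0.05 via the 90% confidence interval.

    diffs: per-configuration CPU-minus-GPU metric differences.
    sesoi: smallest effect size of interest, in the metric's own units.
    Returns (is_equivalent, (ci_low, ci_high)).
    """
    df = len(diffs) - 1
    half_width = T_CRIT_05[df] * stdev(diffs) / len(diffs) ** 0.5
    ci_low = mean(diffs) - half_width
    ci_high = mean(diffs) + half_width
    return (-sesoi < ci_low and ci_high < sesoi), (ci_low, ci_high)

SESOI = 0.5  # hypothetical GPU-anchored bound, fixed before looking at the data

# n = 3 noisy trials: the interval is too wide to support equivalence.
noisy_ok, noisy_ci = tost_equivalent([0.4, -0.3, 0.5], SESOI)

# n = 8 tight trials: the interval sits well inside the bound.
tight_ok, tight_ci = tost_equivalent(
    [0.05, -0.04, 0.02, -0.01, 0.03, -0.02, 0.01, 0.00], SESOI)

print(noisy_ok, noisy_ci)
print(tight_ok, tight_ci)
```

The n = 3 case would also fail a difference test, since its interval spans zero, yet it cannot support equivalence. That is the gap between "not significantly different" and "equivalent".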

Each of these is one paragraph because each is a single constraint. We rediscover most of them every quarter in some new clothing; treating them as binding from the outset has saved real money. We will append to this list when we confirm a new one.
