Total empirical spend across all 6 posts on this blog: $33.69.
-
Training Per-Token MLA Latent Gating Networks: A Hyperparameter Recipe and Failure-Mode Taxonomy
$15 · EMPIRICAL · May 22, 2026
A 1.3M-parameter gating network on the rank-512 MLA latent of DeepSeek-V2-Lite trains only under a corrected hyperparameter recipe. The DeepSeek-V3 control-gain default is ~30× too weak when transposed from expert routing to a per-token gating problem, and it produces a previously uncharacterized R-collapse-to-one failure mode. Full empirical spend: ~$15.
-
Making PackLLM-Style Logit Fusion Work Inside a Nested MoE: An Engineering Note
$18 · ENGINEERING · May 19, 2026
Three implementation bugs silently corrupt training-free synchronous logit fusion inside a nested-MoE conductor. After the fixes, plain fusion beats the best single specialist but loses to text-synthesis; the hybrid arm ties text-synthesis 6/6 at 1.5× latency. Text-synthesis remains the default. Engineering, not novelty.
-
Constraints We Now Respect
DESIGN · May 15, 2026
Nine architectural and methodological dead-ends we treat as binding design constraints — silent MMLU regressions on activation-calibrated stacks, the NIAH-collapse trap, frozen-base gate collapse, and more. Each with a stated mechanism, not just a conclusion.
-
We Pre-Registered a Metric Whose Math Said the Opposite of What We Meant
$0 · METHODOLOGY · May 10, 2026
A pre-registered primary metric survived three rounds of design review with a directionally inverted sign: it rewarded forgetting. A fifteen-minute $0 audit caught it before any GPU spend. The mandatory sign-check that emerged catches a class of failure that novelty, power, and falsifiability audits do not.
-
On the Limits of CPU/M5 as a GPU Validation Proxy
$0.69 · METHODOLOGY · May 6, 2026
A $0.69 calibration found that the naive fp32-CPU eval path is operationally non-viable for 7B+ models. A typology of what does and does not transfer cheaply — with the bf16 generation gap, MoE top-k routing instability, NIAH-blindness, and MLX non-determinism each spelled out.
-
Decode Is HBM-Bandwidth-Bound at Our Serving Batch Sizes
OPERATIONAL · May 2, 2026
A short, dense operational note. At batch 1–32 on 3–4 H100s, decode is HBM-bandwidth-bound: a technique that cuts FLOPs but not bytes-moved-per-token buys ~0 decode wall-clock. Four implications follow, for unstructured pruning, 2:4 sparsity, FP8 decode, and layer-skip on MoE.
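The bandwidth-bound claim can be illustrated with a back-of-envelope roofline estimate. The sketch below is not taken from the post: the 7B dense model, bf16 weights, single-device setup, and hardware figures (roughly the published H100 SXM numbers) are all assumptions, and KV-cache traffic and tensor parallelism are ignored.

```python
def decode_step_seconds(bytes_moved, flops, hbm_bytes_per_s, peak_flops_per_s):
    """Roofline estimate: a step takes as long as the slower of memory traffic and compute."""
    memory_s = bytes_moved / hbm_bytes_per_s
    compute_s = flops / peak_flops_per_s
    return max(memory_s, compute_s), memory_s, compute_s

# Illustrative assumptions, not measurements from the post.
PARAMS = 7e9            # hypothetical dense model size
BYTES_PER_PARAM = 2     # bf16 weights
HBM_BW = 3.35e12        # assumed HBM bandwidth, bytes/s
PEAK_FLOPS = 9.9e14     # assumed dense bf16 throughput, FLOP/s

def step(batch, flop_scale=1.0, byte_scale=1.0):
    # Weights are read once per decode step regardless of batch size.
    bytes_moved = PARAMS * BYTES_PER_PARAM * byte_scale
    # Roughly 2 FLOPs per parameter per generated token.
    flops = 2 * PARAMS * batch * flop_scale
    return decode_step_seconds(bytes_moved, flops, HBM_BW, PEAK_FLOPS)

for batch in (1, 8, 32):
    base, mem_s, comp_s = step(batch)
    half_flops, _, _ = step(batch, flop_scale=0.5)
    half_bytes, _, _ = step(batch, byte_scale=0.5)
    print(f"batch={batch:2d} memory={mem_s*1e3:.2f}ms compute={comp_s*1e3:.3f}ms "
          f"halve-FLOPs={half_flops*1e3:.2f}ms halve-bytes={half_bytes*1e3:.2f}ms")
```

Under these assumptions the memory term dominates at every batch size shown, so halving FLOPs leaves the estimated step time unchanged while halving bytes moved halves it.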