May 2, 2026 · Kryven Research · OPERATIONAL

Decode Is HBM-Bandwidth-Bound at Our Serving Batch Sizes

A short, dense note on the operational reality that gates every compute-reduction technique we evaluate.

At Kryven serving batch sizes (batch 1–32 on 3–4 H100s), decode is HBM-bandwidth-bound; a technique that cuts FLOPs but not bytes moved per token buys roughly zero decode wall-clock improvement.
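A minimal roofline sketch of why this holds, assuming nominal H100 SXM datasheet figures (about 989 TFLOP/s dense BF16 and 3.35 TB/s HBM3) and a hypothetical 8192×8192 weight matrix. The function name and sizes are illustrative, and KV-cache traffic is ignored, which only pushes the real intensity lower.

```python
# Assumed nominal H100 SXM datasheet figures; substitute measured values.
PEAK_BF16_FLOPS = 989e12        # dense BF16 tensor-core FLOP/s (assumption)
PEAK_HBM_BYTES_PER_S = 3.35e12  # HBM3 bandwidth in bytes/s (assumption)


def decode_intensity(batch, d_in, d_out, bytes_per_weight=2, bytes_per_act=2):
    """Arithmetic intensity (FLOP/byte) of one [batch, d_in] @ [d_in, d_out] matmul."""
    flops = 2 * batch * d_in * d_out                    # multiply + add per weight per row
    weight_bytes = d_in * d_out * bytes_per_weight      # read once, shared by the batch
    act_bytes = batch * (d_in + d_out) * bytes_per_act  # input read + output write
    return flops / (weight_bytes + act_bytes)


# Ridge point of the roofline: below this intensity the kernel is bandwidth-bound.
RIDGE = PEAK_BF16_FLOPS / PEAK_HBM_BYTES_PER_S

for batch in (1, 8, 32):
    ai = decode_intensity(batch, 8192, 8192)  # hypothetical layer shape
    regime = "bandwidth-bound" if ai < RIDGE else "compute-bound"
    print(f"batch={batch:2d}  intensity={ai:6.1f} FLOP/B  ridge={RIDGE:5.1f}  {regime}")
```

Under these assumptions the intensity grows roughly as the batch size in FLOP/byte for BF16 weights, so batch 32 still sits about an order of magnitude below the ridge point.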

This sentence is the master filter for any decode-time compute-reduction claim we evaluate. Four implications follow, each one a single bullet of operational knowledge:

Unstructured pruning: no decode speedup at our batch sizes. Killed for decode-time deployment. The matmul still reads the full dense weight matrix from HBM; zeros stored in a dense layout do not save bytes. Pruning can help training memory and storage; it does not help decode wall-clock unless the sparsity is structured so that the kernel can skip loading the pruned weights.
2:4 sparsity: 1.25× decode, not the marketed 2×. The 2× number is from compute-bound regimes; under our bandwidth-bound serving regime, the realized speedup tracks the bytes-saved-per-token figure, not the FLOPs-saved figure.
FP8-W8A8 at decode: <25% speedup. The FP8 win is in training, prefill, and footprint (model fits in less HBM) — not decode wall-clock at batch=1–32. The bytes-per-token are halved relative to BF16, but the kernel overhead and the residual high-precision ops blunt the realized speedup.
Token-level layer-skip / Mixture-of-Depths on MoE: a FLOP mirage at decode whenever the serving kernel pre-stages the skipped layer's experts. If the routing-aware kernel speculatively loads the experts before the skip decision (which most do), the byte-elision win evaporates and only the FLOP saving remains, which the master filter predicts is worth roughly zero decode wall-clock. The premise is binary and must be adjudicated by an NCU (Nsight Compute) bytes/token measurement before any decode claim is published.
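The byte arithmetic behind these four bullets can be sketched as a bandwidth-bound ceiling. This is an illustration under stated assumptions, not our measurement code: it assumes the common 2:4 storage layout of two kept values plus a 2-bit index per kept value for every group of four, and `weight_share` is a hypothetical parameter for the fraction of per-token bytes that are weights (the rest being KV cache and activations).

```python
def bytes_ratio_2to4(value_bits=16, index_bits=2):
    """Compressed/dense size for 2:4 sparsity (assumed layout: 2 values + 2 indices per 4)."""
    dense_bits = 4 * value_bits
    sparse_bits = 2 * value_bits + 2 * index_bits
    return sparse_bits / dense_bits


def speedup_ceiling(weight_byte_ratio, weight_share=1.0):
    """Amdahl-style ceiling on decode speedup when wall-clock tracks bytes moved."""
    remaining = weight_share * weight_byte_ratio + (1.0 - weight_share)
    return 1.0 / remaining


cases = {
    "unstructured pruning (dense layout)": 1.0,   # zeros are still loaded
    "2:4 sparsity, 16-bit values": bytes_ratio_2to4(),
    "FP8 weights vs BF16": 0.5,
    "layer-skip with experts pre-staged": 1.0,    # bytes are moved regardless
}
for name, ratio in cases.items():
    print(f"{name:38s} byte ratio={ratio:.4f}  ceiling={speedup_ceiling(ratio):.2f}x")
```

These are ceilings only. Realized figures such as the ones listed above land below them once kernel overhead, metadata handling, and residual high-precision ops are counted.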

We use this filter destructively. A proposal whose FLOPs-reduction story does not also translate into a reduction in bytes moved per token is auto-flagged as a decode-wall-clock null result, regardless of how clever the mechanism is. Most published decode-speedup claims that fail to reproduce on our serving stack fail because they were measured at a compute-bound batch size and silently ignore this constraint.
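As a rough sketch, the auto-flag rule could look like the following. The `DecodeProfile` type, the `decode_filter` name, and the 5% `min_byte_saving` threshold are all hypothetical choices for illustration, not a description of our tooling.

```python
from dataclasses import dataclass


@dataclass
class DecodeProfile:
    flops_per_token: float
    bytes_per_token: float  # DRAM bytes moved per decoded token, from a profiler


def decode_filter(baseline, candidate, min_byte_saving=0.05):
    """Flag proposals that save FLOPs without saving bytes moved per token."""
    flop_saving = 1.0 - candidate.flops_per_token / baseline.flops_per_token
    byte_saving = 1.0 - candidate.bytes_per_token / baseline.bytes_per_token
    # Null result: a FLOP story exists, but the byte saving is below the threshold.
    null_result = flop_saving > 0.0 and byte_saving < min_byte_saving
    return {
        "flop_saving": flop_saving,
        "byte_saving": byte_saving,
        "null_result": null_result,
    }


baseline = DecodeProfile(flops_per_token=2.0e9, bytes_per_token=1.0e9)
pruned = DecodeProfile(flops_per_token=1.0e9, bytes_per_token=1.0e9)  # FLOPs only
packed = DecodeProfile(flops_per_token=1.0e9, bytes_per_token=0.6e9)  # FLOPs and bytes

print(decode_filter(baseline, pruned))
print(decode_filter(baseline, packed))
```

The inputs are meant to be measured values at the target batch size, since the same technique can pass at a compute-bound batch and fail here.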

We are not the first to point this out. Williams et al. on the roofline model is the canonical reference; the DeepSeek-V3 technical report keeps gating in BF16/FP32 even at FP8 scale partly for exactly the bandwidth/numerics reason. What we want to add to the public record is the specific Kryven-side serving regime where the filter binds: 1–32 batch on 3–4 H100. Outside this regime — single-GPU latency-floor inference, or huge-batch throughput-bound inference at the other end — the filter relaxes. Inside it, it is the operational floor of every decode-time technique we evaluate.

The takeaway is operational: before evaluating any decode-time compute reduction, measure bytes moved per token under your target batch and hardware. If the technique does not cut that number, it does not buy decode wall-clock under our regime and, we suspect, under most serving regimes outside the embarrassingly compute-bound corner.
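A small sketch of turning profiler counters into the number that matters, assuming you already have total DRAM bytes for a profiled decode window (for example by summing Nsight Compute's `dram__bytes_read.sum` and `dram__bytes_write.sum` across kernels, if that is what your profiler exposes). The helper names, the 3.35 TB/s per-GPU bandwidth, and the example figures are assumptions for illustration.

```python
def bytes_per_token(dram_bytes_total, decode_steps, batch):
    """DRAM bytes moved per generated token over a profiled decode window."""
    return dram_bytes_total / (decode_steps * batch)


def step_time_floor_s(bytes_per_step, n_gpus, hbm_bytes_per_s=3.35e12):
    """Lower bound on one decode step if HBM traffic were perfectly spread and overlapped."""
    return bytes_per_step / (n_gpus * hbm_bytes_per_s)


# Hypothetical window: 100 decode steps at batch 8, 140 GB moved per step.
steps, batch, per_step = 100, 8, 140e9
bpt = bytes_per_token(per_step * steps, steps, batch)
floor = step_time_floor_s(per_step, n_gpus=4)

print(f"bytes/token = {bpt / 1e9:.2f} GB")
print(f"bandwidth floor per step on 4 GPUs = {floor * 1e3:.2f} ms")
```

Comparing the measured step time against this floor, before and after applying a technique, shows whether the technique moved the bandwidth term or only the FLOP term.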
