May 6, 2026 · Kryven Research · Cost: $0.69 · METHODOLOGY

On the Limits of CPU/M5 as a GPU Validation Proxy

A short note documenting why "cheap validates expensive" has a hard ceiling for evaluations of 7B+ LLMs.

We ran the small in-house calibration that the entire idea was gated on: a vast V100 GPU bf16 reference against single-box CPU fp32, on Qwen2.5-7B-Instruct (mmlu + hellaswag + arc, limit 500, 0-shot, seed 1234). The GPU reference completed in ~50 minutes. The CPU arm produced zero completed tasks: lm-eval ran at ~10 s/iter over 52,689 loglikelihood requests, which works out to ~146 CPU-hours for a single configuration. We self-capped at 3 hours and completed less than 1% of the work. Total spend on the calibration: $0.69. The result is undetermined: we cannot claim CPU fp32 ≈ GPU bf16 on classification evals at 7B scale, because the CPU arm did not finish a single task.
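The ~146 CPU-hour figure follows directly from the two numbers quoted above. A minimal check, assuming the ~10 s/iter rate holds for every request (the variable names are ours, not lm-eval's):

```python
# Figures quoted in the note; the constant per-request rate is an assumption.
requests = 52_689          # loglikelihood requests issued by lm-eval
seconds_per_iter = 10.0    # approximate observed CPU throughput
gpu_minutes = 50.0         # approximate GPU reference wall time

total_hours = requests * seconds_per_iter / 3600
slowdown = total_hours * 60 / gpu_minutes

print(f"projected CPU time: {total_hours:.1f} h")    # projected CPU time: 146.4 h
print(f"slowdown vs GPU reference: {slowdown:.0f}x")  # slowdown vs GPU reference: 176x
```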

The naive fp32-CPU path is operationally non-viable at this scale. That is not a small qualitative finding; it is a hard ceiling on the entire class of "cheap CPU validates a 7B+ GPU production checkpoint" arguments. A viable proxy has to be quantized (Q8_0 / llama.cpp), run on AMX-int8, or spread across a cluster, and each of those has its own transfer-fidelity question.

What does transfer cheaply

lm-eval --device cpu fp32 against fp32 weights — the harness's own canonical oracle. Deterministic same-arithmetic. Use it freely for fp32-weight checkpoints. Does not certify a bf16 or quantized production checkpoint.
A PASS-AS-FUNCTIONAL pre-GPU correctness gate for argmax-only mechanisms. Cheap, on M5 or any CPU. Catches "did the mechanism run at all" before paid GPU time.
A reusable transfer-fidelity TOST protocol when you do need to assert equivalence: pre-register the smallest effect size of interest, run the GPU anchor, run the CPU proxy, accept-or-reject equivalence at α=0.05. Do not collapse "the difference test was non-significant" into "they are equivalent." Absence of a detected difference at small n is low power, not equivalence.
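The TOST step can be sketched as follows. This is an illustration under stated assumptions, not the protocol's actual implementation: the function name `tost_paired`, the parameter `sesoi`, the normal approximation, and the synthetic per-item differences are all ours. Each difference is the CPU proxy's correctness minus the GPU anchor's on the same item (+1, 0, or -1).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def tost_paired(diffs, sesoi, alpha=0.05):
    """Two one-sided tests on paired differences (proxy minus anchor).

    Returns (equivalent, p_lower, p_upper). Uses a normal approximation,
    which is an assumption; a t-distribution is safer at small n.
    """
    n = len(diffs)
    se = stdev(diffs) / sqrt(n)
    if se == 0:
        raise ValueError("zero variance: the test statistic is undefined")
    d = mean(diffs)
    z = NormalDist()
    # H0: d <= -sesoi, rejected when d sits clearly above the lower bound.
    p_lower = 1 - z.cdf((d + sesoi) / se)
    # H0: d >= +sesoi, rejected when d sits clearly below the upper bound.
    p_upper = z.cdf((d - sesoi) / se)
    # Equivalence is claimed only if BOTH one-sided nulls are rejected.
    return max(p_lower, p_upper) < alpha, p_lower, p_upper

# Synthetic data: both samples have mean difference exactly 0.
large = [1] * 10 + [-1] * 10 + [0] * 480   # n = 500
small = [1] + [-1] + [0] * 18              # n = 20

print(tost_paired(large, sesoi=0.02)[0])   # True
print(tost_paired(small, sesoi=0.02)[0])   # False
```

Both synthetic samples show no difference at all on average, yet only the larger one supports an equivalence claim within ±2 accuracy points. The small sample simply lacks the power to rule out a difference of that size.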

What does not transfer

bf16-GPU generation diverges >90% at the token level from fp32-CPU. The generate-until decoding path compounds per-step bf16 logit perturbation autoregressively: once one token differs, every later token is conditioned on a different prefix. fp32-CPU is near-deterministic, so >90% of sampled tokens differ. Generated-text quantities (CoT, GSM8K, AIME, LiveCode, any free-form) are GPU-only. CPU is structurally silent on these.
MoE top-k routing under fp32-CPU does not predict bf16/fp8-GPU routing. Top-k expert selection is discrete; small precision perturbations cross the decision boundary and flip experts, changing per-expert load. This is why DeepSeek-V3 keeps gating in BF16/FP32 even at FP8 scale (arXiv:2412.19437). An "MoE expert utilization validated on CPU-fp32" claim is a false signal unless paired with a routing-decision-agreement metric.
CPU perplexity / MMLU / LongBench-avg systematically miss the Needle-in-a-Haystack (NIAH) retrieval collapse that aggressive KV eviction causes. Perplexity still passes when NIAH retrieval is below 8%. NIAH is the only valid quality adjudicator for KV-reduction-by-selection mechanisms, and NIAH itself requires long-context GPU inference.
MLX bf16 on M5 is internally non-deterministic (Metal non-associative reductions). Apple Silicon GPU shows 142–1771 absolute error per matmul op, amplifying catastrophically over depth (~1e15 after ~40 sequential ops) — independent of any CPU↔GPU gap. A bf16 MLX run cannot be its own correctness oracle; M5 numerical-fidelity claims must be fp32 or use deterministic integer ops.
bf16-flash-attn training loss-explosion is CPU-fp32-invisible. The instability is a δ = rowsum(dO∘O) accumulation bias that only manifests under bf16 flash-attn after thousands of steps. A PASS-AS-FUNCTIONAL CPU smoke gives a false pass for a mechanism that collapses on GPU. The hard scope ceiling on the entire correctness-gate class: label "correctness only, NOT stability" whenever the GPU run uses bf16 flash-attn.
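One possible shape for the routing-decision-agreement metric mentioned in the MoE item above is sketched here. It assumes you can dump per-token gate logits from both runs; the function names and the toy logits are hypothetical, and the note does not prescribe a specific metric.

```python
def top_k_experts(gate_logits, k):
    """Indices of the k largest gate logits, as an order-insensitive set."""
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    return frozenset(ranked[:k])

def routing_agreement(logits_a, logits_b, k):
    """Compare per-token top-k expert sets from two runs.

    Returns (exact_match_rate, mean_jaccard) over all tokens.
    """
    if len(logits_a) != len(logits_b):
        raise ValueError("both runs must cover the same tokens")
    exact, jaccard = 0, 0.0
    for row_a, row_b in zip(logits_a, logits_b):
        a, b = top_k_experts(row_a, k), top_k_experts(row_b, k)
        exact += a == b                       # True counts as 1
        jaccard += len(a & b) / len(a | b)    # overlap of the two expert sets
    n = len(logits_a)
    return exact / n, jaccard / n

# Toy gate logits: 3 tokens, 4 experts, k = 2.
# Token 1 has a near-tie between experts 1 and 2 that a tiny perturbation flips.
cpu_fp32 = [[2.0, 1.0, 0.5, 0.1],
            [0.9, 0.3, 0.2999, 0.1],
            [0.1, 0.2, 1.5, 1.4]]
gpu_bf16 = [[2.0, 1.0, 0.5, 0.1],
            [0.9, 0.2999, 0.3, 0.1],
            [0.1, 0.2, 1.5, 1.4]]

exact, jac = routing_agreement(cpu_fp32, gpu_bf16, k=2)
print(f"exact top-k match: {exact:.3f}, mean Jaccard: {jac:.3f}")
```

A logit change of 1e-4 on one token is enough to swap an expert, which is the discreteness problem the item describes. Reporting this agreement rate next to any CPU-side utilization number makes the size of the gap visible.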

What it cost us to find this out the right way: $0.69 and a clean teardown. We banked the GPU bf16 reference (Qwen2.5-7B-Instruct, 500 items, seed 1234) as a reusable half-anchor — any future viable-path CPU run pairs against it with zero GPU re-spend.
