May 19, 2026 · Kryven Research · Cost: $18 · ENGINEERING

Making PackLLM-Style Logit Fusion Work Inside a Nested MoE: An Engineering Note

This is engineering, not novelty. Training-free synchronous logit fusion is a known technique — PackLLM (Mavromatis et al., 2024, arXiv:2404.11531). What we publish here is the bug list and the honest cost measurement from integrating that technique into Kryven's nested-MoE Conductor. Three implementation bugs silently corrupted the fused output before they were caught; after the fix, plain fusion beats the best single specialist 5/6 vs 3/6 on a small objective set but loses to the Conductor's existing text-synthesis 5/6 vs 6/6. The hybrid arm (fused draft + text-synthesis extraction) recovers to a 6/6 tie at roughly 1.5× the latency. Text-synthesis remains the default.

The three bugs

1. Weight collapse. The L0 router produces a peaked routing distribution, typically something like (0.9, 0.1) when the conductor is confident which specialist owns a query. Feeding that distribution directly into the fusion mixture Σwᵢpᵢ collapses the result to the dominant model: with 90% of the mass on one specialist, the fused argmax almost always matches that specialist's own, so fusion silently degenerates into a no-op. The fix is to decouple the fusion mixture from the selection signal. We expose a FUSION_WEIGHT_MODE switch defaulting to packllm (the PackLLM-style per-input fit weighting); flat (equal weights), temp (temperature-softened), and raw (use router weights as-is) are also selectable for ablation.

2. Special-token corruption. <|endoftext|> and other control tokens have valid IDs in the shared vocab and were participating in the fused distribution. Mid-stream they leak into the sampled output as role-play garbage. Shared vocab size is not a sufficient fusion-validity check; each model's special-token table must be enumerated and masked out of the fused distribution before token selection. Natural EOS still stops generation; we decode with skip_special_tokens=True.

3. Detok → retok drift. We had been re-tokenizing on every step (base_prompt + decode(prefix)), which introduced subtle token-id drift — Qwen and DeepSeek tokenizers do not round-trip exactly on all inputs. Fixed by carrying raw token IDs (prompt_token_ids) through the loop and adding defensive int-coercion at the boundary.
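
The three fixes above can be sketched together in one greedy decoding loop. This is a minimal illustration under stated assumptions, not the Conductor's code: every function name, the toy models, and the temperature value are hypothetical. The packllm branch assumes the per-input fit is a per-model prompt loss turned into weights by a softmax, and exempting a single eos_id from the mask is one assumed way to keep natural EOS working.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fusion_weights(router_weights, mode="packllm", temperature=4.0, prompt_nll=None):
    """Mixture weights for fusion, decoupled from the router's selection signal."""
    n = len(router_weights)
    if mode == "raw":    # router weights as-is; collapses when the router is peaked
        return list(router_weights)
    if mode == "flat":   # equal weights
        return [1.0 / n] * n
    if mode == "temp":   # temperature-softened router weights
        return softmax([math.log(max(w, 1e-12)) / temperature for w in router_weights])
    if mode == "packllm":
        # ASSUMPTION: per-input fit = each model's mean loss on the prompt;
        # a lower loss earns a higher weight.
        if prompt_nll is None:
            raise ValueError("packllm mode needs prompt_nll")
        return softmax([-nll for nll in prompt_nll])
    raise ValueError(f"unknown mode: {mode}")

def fuse_step(per_model_probs, weights, special_ids, eos_id):
    """Mix next-token distributions, mask control tokens (except EOS), pick greedily."""
    vocab = len(per_model_probs[0])
    fused = [sum(w * p[t] for w, p in zip(weights, per_model_probs)) for t in range(vocab)]
    for t in special_ids:
        if t != eos_id:  # natural EOS must still be able to stop generation
            fused[t] = 0.0
    return max(range(vocab), key=fused.__getitem__)

def generate(models, prompt_token_ids, weights, special_ids, eos_id, max_new_tokens=16):
    """Carry raw token IDs through the loop; never decode and re-tokenize the prefix."""
    ids = [int(t) for t in prompt_token_ids]  # defensive int coercion at the boundary
    new_ids = []
    for _ in range(max_new_tokens):
        probs = [m(ids) for m in models]  # each toy model maps list[int] -> list[float]
        nxt = fuse_step(probs, weights, special_ids, eos_id)
        if nxt == eos_id:
            break
        ids.append(nxt)
        new_ids.append(nxt)
    return new_ids

# Toy 5-token vocab: ids 3 and 4 are control tokens, 4 is EOS.
def model_a(ids):
    return [0.1, 0.2, 0.1, 0.5, 0.1]  # leans toward control token 3

def model_b(ids):
    return [0.1, 0.6, 0.1, 0.1, 0.1] if len(ids) < 4 else [0.0, 0.1, 0.0, 0.0, 0.9]

raw = fusion_weights([0.9, 0.1], mode="raw")
probs = [model_a([0, 2]), model_b([0, 2])]
unmasked_pick = fuse_step(probs, raw, special_ids=(), eos_id=4)
masked_pick = fuse_step(probs, raw, special_ids={3, 4}, eos_id=4)
flat = fusion_weights([0.9, 0.1], mode="flat")
fused_ids = generate([model_a, model_b], [0, 2], flat, {3, 4}, eos_id=4)
print(unmasked_pick, masked_pick, fused_ids)
```

In this toy setup, raw (0.9, 0.1) weights without a mask select control token 3, which is the corruption described in bug 2; with the mask the pick becomes ordinary token 1, and the loop still stops on EOS.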

Methodology fixes that made the result interpretable

Before the bug fixes were locked in, we were grading runs by qualitative inspection — pass-on-vibes. That made every "negative result" inconclusive: was the mechanism broken, or was the prompt unfair? Three changes made it interpretable:

Per-prompt objective pass/fail scorer — substring and regex matches against expected outputs. Six prompts, one seed; not a benchmark, but a deterministic objective.
Positive-control arm (fused_flat, equal weights). If fused underperforms fused_flat, the issue is the weighting; if both underperform best_single, the issue is the mechanism.
Budget raised 256 → 512 tokens. The conductor in this stack is a <think> model; 256 tokens was below the budget required for some prompts to emit the final answer after thinking.
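
A substring/regex scorer of the kind described above can be sketched in a few lines. The check format, the function names, and the example checks are hypothetical; the actual six prompts and their expected patterns are not reproduced here.

```python
import re

def passes(output, check):
    """Return True if `output` satisfies one (kind, pattern) check."""
    kind, pattern = check
    if kind == "substring":
        return pattern in output
    if kind == "regex":
        return re.search(pattern, output, flags=re.DOTALL) is not None
    raise ValueError(f"unknown check kind: {kind}")

def score_arm(outputs, checks):
    """Count prompts whose output satisfies every check registered for that prompt."""
    results = [all(passes(out, c) for c in cs) for out, cs in zip(outputs, checks)]
    return sum(results), len(results)

# Hypothetical checks for a Rust Fibonacci prompt, applied to two sample outputs.
fib_checks = [("substring", "u128"), ("regex", r"fn\s+\w+\s*\(.*\)\s*->\s*u128")]
outputs = ["fn fib(n: u32) -> u128 { 0 }", "<think>still reasoning about u128"]
checks = [fib_checks, fib_checks]
print(score_arm(outputs, checks))
```

Requiring all checks for a prompt to pass keeps the grade deterministic, though such pattern checks remain weak: they confirm the shape of an answer, not its correctness.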

Results — plain fusion (v2)

Arm	prompts passed (of 6)
best_single	3/6
fused (corrected)	5/6
fused_flat (positive control)	5/6
text_synth (Conductor's existing approach)	6/6

Fusion is no longer a no-op: it beats the best single specialist. It does not beat text-synthesis. The single fused miss — a Rust iterative Fibonacci with u128 — spent its budget inside <think> and never emitted the final function. best_single and fused_flat also failed this prompt; only text_synth passed, because the synthesis step distills the answer out of the model's thinking. This is a <think>-budget property of the gptq-4bit Qwen3-Coder conductor, not a fusion defect.
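
The budget failure described above, where generation ends while still inside the thinking block, can be detected mechanically. The sketch below assumes the model wraps its reasoning in literal `<think>` and `</think>` tags; the function name is hypothetical.

```python
def final_answer(text):
    """Return the text after the last closed </think>, or None if the
    generation ended while still inside a <think> block."""
    open_pos = text.rfind("<think>")
    close_pos = text.rfind("</think>")
    if open_pos != -1 and close_pos < open_pos:
        return None  # budget ran out mid-reasoning; no final answer was emitted
    if close_pos == -1:
        return text.strip()  # no thinking block at all
    return text[close_pos + len("</think>"):].strip()

print(final_answer("<think>plan</think>fn fib() {}"))
print(final_answer("<think>still going"))
```

Flagging a None result separately from an ordinary failed check would distinguish "ran out of budget" from "answered wrongly" when reading a results table.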

Hybrid (v3)

The hybrid arm uses fused as the draft generator and runs Conductor's text-synthesis as a single extraction pass over the draft:

Arm	prompts passed (of 6)
best_single	4/6
fused	5/6
text_synth	6/6
fused_hybrid	6/6

Hybrid passes the pre-registered falsifier (≥ text_synth on quality). It is the only arm besides text-synthesis to clear the bar on the Fibonacci prompt.
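
Structurally, the hybrid arm is a two-stage composition. The sketch below shows only that shape: fused_generate and synthesize are hypothetical callables standing in for the fused decoder and the Conductor's text-synthesis pass, and the toy stand-ins exist solely so the example runs. In the real system the extraction pass is a model call, not a string split.

```python
from typing import Callable

def fused_hybrid(
    prompt: str,
    fused_generate: Callable[[str], str],   # hypothetical: fused decoding, returns draft text
    synthesize: Callable[[str, str], str],  # hypothetical: one extraction pass over the draft
) -> str:
    """Two-stage hybrid: fused draft first, then a single extraction pass."""
    draft = fused_generate(prompt)
    return synthesize(prompt, draft)

# Toy stand-ins so the sketch runs end to end.
def toy_fused(prompt: str) -> str:
    return "<think>iterate with two accumulators</think> fn fib(n: u32) -> u128 { todo!() }"

def toy_synth(prompt: str, draft: str) -> str:
    return draft.split("</think>")[-1].strip()

result = fused_hybrid("Write an iterative Fibonacci in Rust", toy_fused, toy_synth)
print(result)
```

Because the stages run back to back, the hybrid's wall-clock is at least the fused draft time plus the extraction time.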

The cost

Honesty about wall-clock is the part of this post that matters most:

Arm	mean wall-clock	mean final tokens
text_synth	9.4 s	379
fused	12.5 s	473
fused_hybrid	14.0 s	276

Hybrid is the slowest arm, about 1.5× text_synth (14.0 s vs 9.4 s). The extraction step compresses the answer (mean final tokens drop from 473 for fused to 276 for hybrid) but does not offset the fusion-draft compute. So hybrid passes the pre-registered falsifier yet only ties text_synth on quality, at higher latency. Its sole potential advantage, in-draft cross-model error suppression that might give robustness on adversarial or hallucination-prone inputs, is untested; six easy prompts cannot show it.
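
The ratios quoted here follow directly from the table; the snippet below only redoes that arithmetic on the reported means.

```python
wall_clock = {"text_synth": 9.4, "fused": 12.5, "fused_hybrid": 14.0}   # seconds, from the table
final_tokens = {"text_synth": 379, "fused": 473, "fused_hybrid": 276}  # mean final tokens

baseline = wall_clock["text_synth"]
ratios = {arm: secs / baseline for arm, secs in wall_clock.items()}
for arm, ratio in ratios.items():
    print(f"{arm}: {ratio:.2f}x text_synth latency")
# text_synth: 1.00x, fused: 1.33x, fused_hybrid: 1.49x

compression = final_tokens["fused_hybrid"] / final_tokens["fused"]
print(f"hybrid output is {compression:.0%} of the fused output length")
# hybrid output is 58% of the fused output length
```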

Engineering verdict

Training-free synchronous logit fusion is now correctly implemented and functional for the nested Conductor. It beats the best single specialist. It does not beat the Conductor's existing text-synthesis on this small objective set. Text-synthesis remains the lower-effort default. Fusion is a validated, working alternative — worth a fuller evaluation (non-thinking conductor or larger budget, ≥3 seeds, a real benchmark) before productionising it over text-synthesis. Until then, this is documented and banked.

Confounds

Six prompts. One seed. 4-bit conductor. Substring/regex objective checks (weak). Conductor is a <think> model — budget-sensitive. Total empirical spend including all debugging and the hybrid follow-up: ~$18.

Lessons learned

Don't conclude a "research negative" on an unproven implementation. Bug-clear first; add a positive control and an objective metric before declaring.

Raw peaked router weights are a selection signal, not fusion weights. Decouple them.

Shared vocab size ≠ fusion-valid. Special tokens must be masked out of the fused distribution.

A hybrid that combines two methods' strengths can pass a falsifier yet still only tie the incumbent at higher cost. Passes the rule ≠ worth deploying. Report cost honestly.

We are not deprecating text-synthesis. We are publishing the correct implementation and the honest tie measurement.
