May 19, 2026 · Kryven Research · Cost: $18 · ENGINEERING
Making PackLLM-Style Logit Fusion Work Inside a Nested MoE: An Engineering Note
This is engineering, not novelty. Training-free synchronous logit fusion is a known technique — PackLLM (Mavromatis et al., 2024, arXiv:2404.11531). What we publish here is the bug list and the honest cost measurement from integrating that technique into Kryven's nested-MoE Conductor. Three implementation bugs silently corrupted the fused output before they were caught; after the fix, plain fusion beats the best single specialist 5/6 vs 3/6 on a small objective set but loses to the Conductor's existing text-synthesis 5/6 vs 6/6. The hybrid arm (fused draft + text-synthesis extraction) recovers to a 6/6 tie at roughly 1.5× the latency. Text-synthesis remains the default.
The three bugs
1. Weight collapse. The L0 router produces a peaked routing distribution — typically something like (0.9, 0.1) when the conductor is confident which specialist owns a query. Feeding that distribution directly into the fusion mixture Σwᵢpᵢ collapses the result to the dominant model: fusion silently degenerates into a no-op. The fix is to decouple the fusion mixture from the selection signal. We expose a FUSION_WEIGHT_MODE switch defaulting to packllm (the PackLLM-style per-input fit weighting); flat (equal weights), temp (temperature-softened), and raw (use router weights as-is) are also selectable for ablation.
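A minimal sketch of the decoupling, assuming the fusion step mixes per-model next-token probabilities as Σwᵢpᵢ. The function names, the `prompt_nll` input, and the temperature value are illustrative assumptions, not Kryven's code. The `packllm` branch here is a simplified perplexity-based stand-in (lower prompt NLL gets a higher weight), not the paper's exact procedure.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def fusion_weights(router_weights, mode="packllm", prompt_nll=None, temperature=2.0):
    """Pick mixture weights for fusion, separately from the routing signal.

    router_weights: peaked selection distribution from the router, e.g. (0.9, 0.1).
    prompt_nll: hypothetical per-model mean negative log-likelihood on the prompt,
                only used by the simplified "packllm" branch.
    """
    n = len(router_weights)
    if mode == "raw":    # router weights as-is (the collapsing behaviour)
        return list(router_weights)
    if mode == "flat":   # equal weights
        return [1.0 / n] * n
    if mode == "temp":   # temperature-softened router weights
        logs = [math.log(max(w, 1e-12)) for w in router_weights]
        return softmax(logs, temperature)
    if mode == "packllm":  # assumption: weight by fit to the input
        if prompt_nll is None:
            raise ValueError("packllm mode needs per-model prompt NLL")
        return softmax([-nll for nll in prompt_nll])
    raise ValueError(f"unknown mode: {mode}")

def fuse(weights, dists):
    """Mixture sum_i w_i * p_i over a shared vocabulary."""
    vocab = len(dists[0])
    return [sum(w * d[t] for w, d in zip(weights, dists)) for t in range(vocab)]
```

With router weights (0.9, 0.1) and toy distributions `[[0.6, 0.4], [0.1, 0.9]]`, the `raw` mode keeps the dominant model's top token, while `flat` lets the second model change the fused argmax.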
2. Special-token corruption. <|endoftext|> and other control tokens have valid IDs in the shared vocab and were participating in the fused distribution. Mid-stream they leak into the sampled output as role-play garbage. Shared vocab size is not a sufficient fusion-validity check; each model's special-token table must be enumerated and masked out of the fused distribution before token selection. Natural EOS still stops generation; we decode with skip_special_tokens=True.
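A minimal sketch of the masking step, assuming the fused distribution is a plain list of probabilities indexed by token ID. The function name and the `keep_ids` parameter (a way to exempt a stop token from the mask) are assumptions for illustration; the post does not say how the natural EOS is exempted.

```python
def mask_special_tokens(probs, special_ids_per_model, keep_ids=frozenset()):
    """Zero out every model's special tokens in the fused distribution, then renormalize.

    probs: fused next-token probabilities, indexed by token ID.
    special_ids_per_model: one iterable of special-token IDs per fused model.
    keep_ids: hypothetical exemption set, e.g. the EOS ID that should still stop generation.
    """
    banned = set().union(*special_ids_per_model) - set(keep_ids)
    masked = [0.0 if i in banned else p for i, p in enumerate(probs)]
    total = sum(masked)
    if total <= 0.0:
        raise ValueError("all probability mass was on masked tokens")
    return [p / total for p in masked]
```

Taking the union across models matters: a token that is ordinary text for one tokenizer configuration can be a control token for another, so checking a single model's table is not enough.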
3. Detok → retok drift. We had been re-tokenizing on every step (base_prompt + decode(prefix)), which introduced subtle token-id drift — Qwen and DeepSeek tokenizers do not round-trip exactly on all inputs. Fixed by carrying raw token IDs (prompt_token_ids) through the loop and adding defensive int-coercion at the boundary.
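A minimal sketch of an ID-carrying decode loop, assuming a hypothetical `next_token` callable that maps the current ID sequence to the next fused token ID. Only `prompt_token_ids` comes from the post; every other name is illustrative.

```python
from typing import Callable, List, Sequence

def generate_ids(
    prompt_token_ids: Sequence[int],
    next_token: Callable[[List[int]], int],
    eos_id: int,
    max_new_tokens: int,
) -> List[int]:
    """Grow the sequence in token-ID space; never decode and re-encode mid-loop."""
    # Defensive int-coercion at the boundary (inputs may arrive as numpy or tensor scalars).
    ids = [int(t) for t in prompt_token_ids]
    prompt_len = len(ids)
    for _ in range(max_new_tokens):
        tok = int(next_token(ids))
        if tok == eos_id:  # natural EOS stops generation
            break
        ids.append(tok)
    # Return only the newly generated IDs; decode once, after the loop.
    return ids[prompt_len:]
```

Decoding happens once at the end, so a tokenizer that fails to round-trip some string can no longer change the prefix the models condition on.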
Methodology fixes that made the result interpretable
Before the bug fixes were locked in, we were grading runs by qualitative inspection — pass-on-vibes. That made every "negative result" inconclusive: was the mechanism broken, or was the prompt unfair? Three changes made it interpretable:
- Per-prompt objective pass/fail scorer — substring and regex matches against expected outputs. Six prompts, one seed; not a benchmark, but a deterministic objective.
- Positive-control arm (fused_flat, equal weights). If fused underperforms fused_flat, the issue is the weighting; if both underperform, the issue is the mechanism.
- Budget raised 256 → 512 tokens. The conductor in this stack is a <think> model; 256 tokens was below the budget required for some prompts to emit the final answer after thinking.
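The objective scorer described above can be sketched as follows. This is an assumed shape, not the actual harness: the `Check` structure and function names are hypothetical, and the real expected-output rules are not given in the post.

```python
import re
from dataclasses import dataclass, field
from typing import List

@dataclass
class Check:
    """Hypothetical per-prompt expectation: required substrings and regex patterns."""
    substrings: List[str] = field(default_factory=list)
    patterns: List[str] = field(default_factory=list)

def passes(output: str, check: Check) -> bool:
    """Deterministic pass/fail: every substring and every regex must match."""
    has_substrings = all(s in output for s in check.substrings)
    has_patterns = all(re.search(p, output) is not None for p in check.patterns)
    return has_substrings and has_patterns

def score(outputs: List[str], checks: List[Check]) -> int:
    """Number of prompts whose output passes its check."""
    return sum(passes(o, c) for o, c in zip(outputs, checks))
```

Because the checks are pure string matching, rerunning the same outputs always gives the same score, which is what lets a positive-control arm be compared against the fused arm at all.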
Results — plain fusion (v2)
| Arm | pass@6 |
| --- | --- |
| best_single | 3/6 |
| fused (corrected) | 5/6 |
| fused_flat (positive control) | 5/6 |
| text_synth (Conductor's existing approach) | 6/6 |
Fusion is no longer a no-op: it beats the best single specialist. It does not beat text-synthesis. The single fused miss — a Rust iterative Fibonacci with u128 — spent its budget inside <think> and never emitted the final function. best_single and fused_flat also failed this prompt; only text_synth passed, because the synthesis step distills the answer out of the model's thinking. This is a <think>-budget property of the gptq-4bit Qwen3-Coder conductor, not a fusion defect.
Hybrid (v3)
The hybrid arm uses fused as the draft generator and runs Conductor's text-synthesis as a single extraction pass over the draft:
| Arm | pass@6 |
| --- | --- |
| best_single | 4/6 |
| fused | 5/6 |
| text_synth | 6/6 |
| fused_hybrid | 6/6 |
Hybrid passes the pre-registered falsifier (≥ text_synth on quality). It is the only arm besides text-synthesis to clear the bar on the Fibonacci prompt.
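As a rough sketch, the hybrid arm is a two-stage composition plus a comparison against the incumbent. Both callables and all names below are hypothetical placeholders for the fused generator and the Conductor's text-synthesis pass, whose real interfaces are not shown in this post.

```python
from typing import Callable

def fused_hybrid(
    prompt: str,
    fused_draft: Callable[[str], str],
    synth_extract: Callable[[str, str], str],
) -> str:
    """Stage 1: fused draft. Stage 2: one extraction pass over that draft."""
    draft = fused_draft(prompt)
    return synth_extract(prompt, draft)

def passes_falsifier(hybrid_passes: int, text_synth_passes: int) -> bool:
    """Pre-registered rule from the post: hybrid must be >= text_synth on quality."""
    return hybrid_passes >= text_synth_passes
```

Note that a 6/6 vs 6/6 tie satisfies this rule, which is why passing it says nothing about cost.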
The cost
Honesty about wall-clock is the part of this post that matters most:
| Arm | mean wall-clock | mean final tokens |
| --- | --- | --- |
| text_synth | 9.4 s | 379 |
| fused | 12.5 s | 473 |
| fused_hybrid | 14.0 s | 276 |
Hybrid is the slowest, at about 1.5× text_synth (14.0 s vs 9.4 s). The extraction step compresses the answer (final tokens drop from 473 to 276) but does not offset the fusion-draft compute. So hybrid passes the falsifier yet only ties text_synth on quality, and does so at higher latency. Its sole potential advantage, in-draft cross-model error suppression giving robustness on adversarial or hallucination-prone inputs, is untested; six easy prompts cannot show it.
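The ratios quoted here follow directly from the table; a short check using only the reported means (variable names are illustrative):

```python
# (mean wall-clock seconds, mean final tokens) as reported in the cost table
runs = {
    "text_synth": (9.4, 379),
    "fused": (12.5, 473),
    "fused_hybrid": (14.0, 276),
}

base_seconds = runs["text_synth"][0]

# Latency of each arm relative to text_synth
slowdown = {arm: seconds / base_seconds for arm, (seconds, _) in runs.items()}

# How much the extraction pass shrinks the fused draft's final output
compression = runs["fused_hybrid"][1] / runs["fused"][1]

for arm, ratio in slowdown.items():
    print(f"{arm}: {ratio:.2f}x text_synth latency")
print(f"hybrid final tokens / fused final tokens: {compression:.2f}")
```

Hybrid comes out at roughly 1.49× and plain fusion at roughly 1.33× the text_synth latency, while the hybrid's final output is about 58% the length of the fused draft.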
Engineering verdict
Training-free synchronous logit fusion is now correctly implemented and functional for the nested Conductor. It beats the best single specialist. It does not beat the Conductor's existing text-synthesis on this small objective set. Text-synthesis remains the lower-effort default. Fusion is a validated, working alternative — worth a fuller evaluation (non-thinking conductor or larger budget, ≥3 seeds, a real benchmark) before productionising it over text-synthesis. Until then, this is documented and banked.
Confounds
Six prompts. One seed. 4-bit conductor. Substring/regex objective checks (weak). Conductor is a <think> model — budget-sensitive. Total empirical spend including all debugging and the hybrid follow-up: ~$18.
Lessons learned
- Don't conclude a "research negative" on an unproven implementation. Bug-clear first; add a positive control and an objective metric before declaring.
- Raw peaked router weights are a selection signal, not fusion weights. Decouple them.
- Shared vocab size ≠ fusion-valid. Special tokens must be masked out of the fused distribution.
- A hybrid that combines two methods' strengths can pass a falsifier yet still only tie the incumbent at higher cost. Passes the rule ≠ worth deploying. Report cost honestly.
We are not deprecating text-synthesis. We are publishing the correct implementation and the honest tie measurement.