KRYVEN CONDUCTOR Docs

A nested mixture of mixtures.

One AI model, itself a Mixture-of-Experts, directing five specialist models: four that are also Mixture-of-Experts and one dense fast path. The lead model decides which specialists should answer each query, weighs its confidence in each, and synthesises their outputs into a single coherent response. One MoE on top, five specialists underneath, one answer at the bottom.

Mechanically: a Qwen3-class sparse MoE Conductor with 128 internal experts dispatches each query across a pool of five specialists. Each MoE specialist then runs its own internal token-level gating across its own experts; the dense fast path has no internal gate. Their outputs are combined, weighted, and synthesised by the Conductor. A mixture of experts, directing a mixture of mixtures of experts.

Version: 0.3 · Production
Orchestration Tiers: L0 Conductor + L1 Specialists
Specialist Models: 4 MoE + 1 dense
Infrastructure: Self-hosted · controlled
The Conductor reads every query and emits a weighted distribution over specialist clusters. The selected specialists fire in parallel, each routing tokens through its own internal expert gate. A parallel fast-path always fires alongside, establishing a latency floor. The Conductor is re-invoked as the synthesiser, merging expert outputs weighted by its original routing confidence. Three stacked softmaxes per request — cluster, expert, token — and thousands of distinct routing paths.

Two orchestration tiers of gating, one inherited.

The framework's job is coarse-grained routing across four specialist clusters (L0) plus an always-on fast path. Every selected MoE model then performs its own learned token-level gating (L2) — inherited automatically from choosing sparse-activation experts. This is hierarchical MoE lifted from the layer level to the system level.

FIG.01 · Orchestration Flow
Read the diagram top-to-bottom as nested gating. The L0 Conductor (MoE) at the top routes your query to one or more of the L1 Specialists in the middle. Each chosen MoE specialist then runs its own L2 token-level gating across its internal experts. One MoE directing four MoE specialists and a dense fast path, with up to three nested layers of selection per request.
[Diagram: orchestration flow. Stages: Input → L0 Conductor → L0→L1 Routing → L1 Specialists → L2 Internal → Merge → Synthesis. Timeline markers: t=0, ~50ms, ~200ms, ~400ms (parallel), ~1.2s. Example routing weights: code 0.05, reasoning 0.72, vision 0.18, long-ctx 0.05, dispatched with capacity factor 2. The dense fast path bypasses the router and L2. Synthesis is skipped when k=1; on k>1 the fast-path output is attached as context.]
L0 · Conductor (cluster routing + synthesis) L1 · Specialists (parallel, capacity-bounded) L2 · Internal MoE (token gating, inherited)
Fig. 01 — The Conductor emits a softmax over clusters; top-k specialists fire in parallel bounded by capacity factor. The fast-path (gold) runs alongside on every request, establishing a latency floor. The Conductor is re-invoked as synthesiser to merge outputs weighted by their original routing probabilities.
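The flow in Fig. 01 can be sketched as a small asyncio program. This is a minimal illustration under stated assumptions, not Kryven's implementation: `call_model`, `route`, and `handle` are hypothetical stand-ins, the inference calls are stubbed, and the example weights are the ones shown in the figure.

```python
import asyncio

CAPACITY = 2  # max routable specialists per request (capacity factor)

async def call_model(name: str, prompt: str) -> str:
    """Hypothetical stand-in for a real inference call."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"[{name}] answer to: {prompt}"

async def route(query: str) -> dict[str, float]:
    """Hypothetical stand-in for the L0 routing call; returns fixed weights."""
    await asyncio.sleep(0)
    return {"code": 0.05, "reasoning": 0.72, "vision": 0.18, "long_ctx": 0.05}

async def handle(query: str) -> dict:
    # The fast path starts immediately and runs alongside L0 routing.
    fast_task = asyncio.create_task(call_model("fast", query))
    weights = await route(query)
    # Keep the highest-weighted clusters, bounded by the capacity factor.
    chosen = sorted(weights, key=weights.get, reverse=True)[:CAPACITY]
    outputs = await asyncio.gather(*(call_model(c, query) for c in chosen))
    fast_output = await fast_task
    if len(chosen) == 1:
        # Synthesis is skipped when only one specialist fired.
        return {"chosen": chosen, "answer": outputs[0], "synthesised": False}
    # Annotate each output with its routing weight and attach the
    # fast-path output as extra context for the synthesiser.
    annotated = [f"(w={weights[c]:.2f}) {o}" for c, o in zip(chosen, outputs)]
    prompt = "\n".join(annotated + [f"(fast-path context) {fast_output}"])
    answer = await call_model("conductor-synthesis", prompt)
    return {"chosen": chosen, "answer": answer, "synthesised": True}

result = asyncio.run(handle("prove that sqrt(2) is irrational"))
print(result["chosen"])
```

With the figure's weights, the reasoning and vision clusters are selected and the Conductor stub receives both annotated outputs plus the fast-path output.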

The top-level gate.

The Conductor is the only model that reads every query. It is itself a sparse Mixture-of-Experts with 128 internal experts and top-8 routing, so even the routing decision is computed by an internal expert gate, not a dense forward pass. It was chosen specifically because its ~3.3B active parameters make it cheap enough to call on every request without dominating latency. It emits a structured softmax distribution across the specialist clusters, never a hard pick.

Tier L0 · Conductor
Conductor
Qwen3-class · 30.5B total · 3.3B active · 128 experts · top-8 routing · 32K native context (131K via YaRN)
Reads the query, classifies intent, decomposes into per-cluster subqueries, and emits a probability distribution across the specialist clusters. When more than one cluster fires, the same Conductor is re-invoked as the weighted synthesiser — each specialist output annotated with its cluster probability so the merge is probability-weighted, not a flat concatenation. Two roles, one model, one warm pool.
Architecture Class
Sparse MoE · 128 experts · 8 active per token
Structured Output
Guided decoding (outlines / xgrammar)
Invoked At
L0 Routing  ·  Weighted Synthesis
Prompt Caching
System prompt cached · high hit-rate target
Fallback Policy
Retry once on malformed JSON · then fast-path
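The fallback policy on the card above (retry once on malformed JSON, then fast-path) might look like the following sketch. The JSON schema, the cluster keys in `ROUTABLE`, the weight normalisation, and the function names are all assumptions for illustration; the source only shows a `weights` object in the routing decision.

```python
import json

# Hypothetical cluster keys; the source names the clusters but not a schema.
ROUTABLE = ("code", "reasoning", "vision", "long_ctx")

def parse_decision(raw: str) -> dict[str, float]:
    """Return normalised cluster weights, or raise ValueError if malformed."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    weights = data.get("weights") if isinstance(data, dict) else None
    if not isinstance(weights, dict) or not weights:
        raise ValueError("missing or empty 'weights'")
    for name, w in weights.items():
        if name not in ROUTABLE or not isinstance(w, (int, float)) or w < 0:
            raise ValueError(f"bad weight entry: {name!r}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights sum to zero")
    # Normalising is an assumption made here so weights read as probabilities.
    return {name: w / total for name, w in weights.items()}

def route_with_fallback(ask_conductor, query: str):
    """Call the Conductor, retry once on malformed output, then fall back."""
    for _attempt in range(2):  # initial call + one retry
        try:
            return "routed", parse_decision(ask_conductor(query))
        except ValueError:
            continue
    return "fast_path", {}

# Usage with a stub that fails once, then returns valid JSON.
replies = iter(['{"weights": ', '{"weights": {"reasoning": 3, "code": 1}}'])
mode, weights = route_with_fallback(lambda q: next(replies), "example query")
print(mode, weights)
```

Here the first reply is truncated JSON, so the retry is used. A stub that fails twice would return `("fast_path", {})`.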

Five specialists, all open-weight.

Every specialist except the fast-path is itself a Mixture-of-Experts. The pool is drawn from the state of the open-weight ecosystem (Qwen3 family in the current rotation) and rotated as better options ship. What is stable is the shape of the pool: role, architecture class, and the behaviour each role must guarantee — specific checkpoints are an implementation detail that evolves with the frontier.

Code · Agentic · Vision 01
Code Specialist
Qwen3-Coder-class · 30.5B total · 3B active · 128 experts · 256K native ctx (1M via YaRN)

Top-tier open-weight model in its active-parameter band for agentic code and repo-scale reasoning. Native function-calling and tool-use, optimised for instruction-following without thinking mode. Pairs with the Vision Specialist when a query mixes screenshots of code with text. Leads its size class on SWE-bench Verified.

ROLE · agentic coding & repo-scale reasoning
Reasoning · Thinking Mode 02
Reasoning Specialist
Qwen3-Next-class · 80B total · 3B active · 512+1 shared experts · 262K ctx

Hybrid Gated DeltaNet + Gated Attention MoE in a 3:1 layer ratio, tuned for chain-of-thought reasoning. Despite 80B total parameters only ~3B activate per token, making it cheaper per request than its size implies. Runs an explicit thinking pass before committing to an answer — the reasoning trace stays internal unless a user asks to see it.

ROLE · deep reasoning · math · logic · planning
Vision · Multimodal 03
Vision Specialist
Qwen3-VL-class · 30B total · 3B active · VL-MoE · vision pathway at full precision

Dedicated vision-language specialist with the vision encoder kept intact at full precision. Handles images, OCR, charts, screenshots, and visual Q&A natively — no external OCR pipeline sits between the image and the response. A dedicated role means vision queries never compete with text queries for the same weights.

ROLE · native multimodal understanding
Long Context · RAG 04
Long-Context Specialist
Qwen3-class · 30.5B total · 3.3B active · 256K native context (1M via YaRN)

Built for RAG payloads, full codebases, long transcripts, multi-document summaries. Native 256K context extended to 1M via YaRN scaling. Shares its weight topology with the Conductor but serves a completely different role — retrieval and long-doc synthesis, not classification.

ROLE · long-document synthesis & retrieval
Fast · Shadow
Latency Floor
Qwen3-class · 4B dense · no internal gating · sub-200ms TTFT

Deliberately dense, not MoE: a small dense model carries no gating overhead and keeps first-token latency low, which matters for the parallel shadow role. Fires on every request alongside L0. For trivial queries its output is returned directly and synthesis is skipped. For complex queries its output is attached to the synthesis prompt as additional context, never used as a blind fallback.

ROLE · latency floor & always-on shadow
Why this lineup works. The four MoE specialists average ~3B active parameters per token, so each specialist's per-token compute is comparable to running a single small model, even when several fire in parallel. The 80B reasoning specialist is the highest-capacity component, but because only ~3B activate per token, it does not dominate latency or cost. The dense fast-path stays under 5B and carries no gating overhead.
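As a back-of-envelope check, the active-parameter figures quoted in the model cards can be summed directly. The sketch below assumes a capacity factor of 2 and counts the Conductor once, the two largest routable specialists, and the fast path. It ignores the second Conductor invocation at synthesis and says nothing about memory footprint or measured latency.

```python
# Active parameters per token, in billions, as quoted in the model cards.
ACTIVE_B = {"code": 3.0, "reasoning": 3.0, "vision": 3.0, "long_ctx": 3.3}
CONDUCTOR_B = 3.3  # Conductor, active per token
FAST_B = 4.0       # dense fast path: every parameter is active
CAPACITY = 2       # at most two routable specialists per request

# Mean active parameters across the four MoE specialists.
mean_active = sum(ACTIVE_B.values()) / len(ACTIVE_B)

# Upper bound on active parameters across models serving one request.
top_two = sorted(ACTIVE_B.values(), reverse=True)[:CAPACITY]
worst_case = CONDUCTOR_B + FAST_B + sum(top_two)

print(f"mean specialist active params: {mean_active:.3f}B")
print(f"worst-case active params per request: {worst_case:.1f}B")
```

The mean works out to roughly 3.1B per specialist, and the worst-case sum across all concurrently active models is roughly 13.6B.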

Four routable clusters, one always-on.

The five specialists are grouped into five clusters. Four are L0-routable (code, reasoning, vision, long-ctx). The fifth — fast — is never routed to; it simply fires on every request, outside the capacity budget. This keeps cost accounting clean: capacity is bounded on the routable side, and the shadow path is a constant.

CODE (MoE)
Agentic coding, debugging, tool-use, shell, SQL, function-calling, repository-scale reasoning. Single-specialist cluster — the Code Specialist covers the full band with multimodal input.
01  Code Specialist
REASONING (MoE)
Math, logic, multi-step analysis, planning. Single-specialist cluster using thinking-mode chain-of-thought. Heavyweight but sparse — 80B total, only ~3B active per token.
02  Reasoning Specialist
VISION (MoE)
Images, charts, diagrams, OCR, screenshots, visual Q&A. Single-specialist cluster with the vision pathway preserved at full precision.
03  Vision Specialist
LONG-CTX (MoE)
RAG payloads, long documents, full codebases, transcripts. 1M-token context via YaRN when needed; 256K native context is the default.
04  Long-Context Specialist
FAST · SHADOW
Never routed to. Always fires parallel with L0 to establish a latency floor. Outside the capacity budget — its cost is constant, not variable.
 Latency Floor
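The split between routable clusters and the always-on shadow can be expressed as a small registry. This is an illustrative sketch: the `Cluster` dataclass, the cluster keys, and `models_fired` are hypothetical names, and only the grouping itself comes from the text above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cluster:
    name: str
    specialist: str
    routable: bool  # False means always-on, outside the capacity budget

CLUSTERS = (
    Cluster("code", "Code Specialist", True),
    Cluster("reasoning", "Reasoning Specialist", True),
    Cluster("vision", "Vision Specialist", True),
    Cluster("long_ctx", "Long-Context Specialist", True),
    Cluster("fast", "Latency Floor", False),
)

def models_fired(weights: dict[str, float], capacity: int = 2) -> list[str]:
    """Clusters that run for one request: top-`capacity` routable + shadow."""
    routable = {c.name for c in CLUSTERS if c.routable}
    # Rank only routable clusters; any weight on the shadow path is ignored.
    ranked = sorted(
        (name for name in weights if name in routable),
        key=weights.get,
        reverse=True,
    )
    always_on = [c.name for c in CLUSTERS if not c.routable]
    return ranked[:capacity] + always_on

print(models_fired({"reasoning": 0.5, "vision": 0.3, "code": 0.2}))
```

The variable part of the cost is the first `capacity` entries; the shadow entry is appended on every request, which is what keeps its cost constant.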

Six MoE techniques, lifted to orchestration.

Six well-known techniques from the Mixture-of-Experts serving literature, adapted honestly to system-level orchestration. Each maps a genuine parallel from the training / inference literature onto a concrete system behaviour.

01
Soft routing with weights
L0 emits a full softmax over the routable clusters, not a hard pick. Top-k clusters fire, each weighted by its probability. The synthesiser uses those weights as merge priors.
↳ MoE analog · top-k gating
02
Capacity factor
At most 2 concurrent L1 specialists per request, separate from the always-on shadow. If the softmax spreads mass across three clusters, only the top two fire. Bounds cost and latency without complicating accounting.
↳ MoE analog · expert capacity limit
03
Load-balance surveillance
Every expert invocation is logged. If any cluster fires for <5% or >60% of routable traffic over a rolling 24-hour window, an alert fires and the router prompt is reviewed.
↳ MoE analog · auxiliary load-balancing loss
04
Shadow path as latency floor
The Latency Floor fires on every request, outside the capacity budget. Its output is attached as context to the synthesis prompt, or returned directly for trivial queries. Not a blind fallback — a deliberate parallel-path answer.
↳ MoE analog · shared expert with guaranteed activation
05
Probability-weighted synthesis
When k > 1 specialists fire, the synthesis prompt includes each output annotated with its cluster probability. The synthesiser is primed to trust higher-weighted outputs more. This happens in the forward pass only; no gradients are involved.
↳ MoE analog · gated expert combination (forward-pass)
06
Routing-collapse detection
A real-time monitor flags pathological cluster distributions (one cluster >80% or any cluster <2%). A flag triggers automatic fallback to uniform routing plus a manual prompt review.
↳ MoE analog · dead-expert detection
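The two monitoring techniques above (03 and 06) reduce to threshold checks on per-cluster traffic shares. The sketch below uses the thresholds quoted in the text. Treating a cluster's share as its fraction of logged invocations is an assumption, as are the function names; windowing and alert delivery are omitted.

```python
from collections import Counter

CLUSTERS = ("code", "reasoning", "vision", "long_ctx")
LOAD_BALANCE = (0.05, 0.60)  # technique 03: rolling 24-hour window
COLLAPSE = (0.02, 0.80)      # technique 06: real-time monitor

def shares(invocations: list[str]) -> dict[str, float]:
    """Fraction of logged invocations that went to each routable cluster."""
    counts = Counter(invocations)
    total = sum(counts[c] for c in CLUSTERS)
    return {c: (counts[c] / total if total else 0.0) for c in CLUSTERS}

def out_of_band(share_by_cluster: dict[str, float], low: float, high: float) -> list[str]:
    """Clusters whose share falls below `low` or above `high`, sorted by name."""
    return sorted(c for c, s in share_by_cluster.items() if s < low or s > high)

# Synthetic log: 100 invocations heavily skewed towards reasoning.
log = ["reasoning"] * 85 + ["code"] * 10 + ["vision"] * 4 + ["long_ctx"] * 1
current = shares(log)

load_alerts = out_of_band(current, *LOAD_BALANCE)
collapse_alerts = out_of_band(current, *COLLAPSE)
print(load_alerts, collapse_alerts)
```

On this synthetic log, the load-balance check flags three clusters, while the stricter collapse check flags only the two most extreme ones.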

Private. Open-weight. User-funded.

Kryven is funded directly by its users. Subscription revenue pays for the three things that make the product better: stronger models, tighter latency, and a cleaner product experience. No ads, no data sales, no upstream vendor deciding what you can or can't ask.

WHERE SUBSCRIPTION REVENUE GOES
Subscription revenue funds model upgrades, infrastructure improvements, and product development. We don't sell user data, and we don't run ads. When you pay for Kryven, that money is what keeps the platform running and getting better.
01
Better models
We swap in newer, stronger open-weight specialists the moment they prove out on our evaluations. The specialist pool is never frozen — the framework is model-agnostic by design, and the roles are stable even as the checkpoints behind them evolve.
02
Tighter latency
Warmer pools during peak hours, faster hardware as it becomes viable, better caching, better streaming. Every fraction of a second shaved off is a direct measurable win for the person waiting on a response.
03
Better user experience
Cleaner chat, richer conversation history, saner exports, more responsive mobile, better prompt tooling. The UI is where you actually live — we spend aggressively on making the whole product feel obvious and fast.
04 · PRIVATE
Self-hosted on infrastructure we operate
Every specialist runs on inference infrastructure Kryven operates directly — not routed through consumer-facing AI APIs. Conversations are not logged for training. Delete your history and we purge it from active storage immediately and from backups within 30 days.
05 · OPEN
Built on open weights
Every specialist in the Conductor pool is an open-weight model published under a permissive licence. No proprietary black boxes, no vendor lock-in, no sudden API deprecations. The framework layer itself is proprietary; the model layer is auditable by anyone in the open-source community.