A nested mixture of mixtures.

One AI model — itself a Mixture-of-Experts — directing five other AI models, each one also a Mixture-of-Experts. The lead model decides which specialists should answer each query, weighs its confidence in each, and synthesises their outputs into a single coherent response. One MoE on top, five MoEs underneath, one answer at the bottom.

Mechanically: a Qwen3-class sparse MoE Conductor with 128 internal experts dispatches every query across five specialist MoE models. Each specialist then runs its own internal token-level gating across its own experts. Their outputs are combined, weighted, and synthesised by the Conductor. A mixture of experts, directing a mixture of mixtures of experts.

Introduction

The multi-model AI companion and developer platform.

Introduction to Kryven AI

The Multi-Model AI Companion and Developer Platform

What is Kryven AI?

Kryven AI is a powerful, fully-featured AI platform designed for users who want broad capability and control over their AI experience. Whether you're an author working on long-form fiction, a developer requiring reliable coding assistance, or an enterprise looking for stable API endpoints, Kryven provides access to state-of-the-art Large Language Models (LLMs).

Core Philosophy

Kryven AI prioritizes user autonomy and broad capability. You are the author, and the AI is your tool. Our models support a wide range of legitimate use cases under a published, enforced content policy.

Our strict privacy policy ensures that your prompts—no matter the content—are processed entirely confidentially and are never monitored or used for model training without explicit opt-in.

The Kryven Web Platform

The Kryven web interface is designed to seamlessly integrate both creative writing and advanced development workflows into a single dashboard.

The Creative Writing Companion

For writers and content creators, Kryven offers a powerful environment.

Flexible Models: Develop complex, long-form narratives without arbitrary prompt rejections, under a clear, published content policy.
Intelligent Context: Kryven models boast massive context windows, ensuring characters remember details from conversations that occurred tens of thousands of tokens prior.
Custom Presets & Deep Prompting: Define the exact psychological profile, mood, speech patterns, and physical characteristics of your AI using our robust preset system.

Powerful Coding Assistance

For developers and engineers, Kryven is an incredibly powerful pair programmer.

Use Kryven in your editor: Connect Kryven to Cursor, VS Code (via Cline), or any OpenAI-compatible client. Kryven speaks the standard OpenAI API, so integration is a Base URL + API key away.
Unbiased Coding Assistance: Kryven analyzes and writes code across a wide range of legitimate domains — system administration, security research on systems you own or are authorized to test, data engineering, and infrastructure scripting — without the false positives that plague heavily-filtered coding models.
Advanced Refactoring: Paste your code into the chat, ask for a modular refactoring plan using best practices, and receive a structured, file-by-file plan you can apply in your editor.

AI Platform Capabilities

Kryven provides a suite of advanced intelligence capabilities powered by our high-performance infrastructure.

🛡️ Capability-First Design: Minimal restrictions. Strong support for creative writing, complex analysis, and coding.
🗂️ Conductor Framework: The Conductor routes every query to the right specialist — code, reasoning, vision, or long-context — automatically.
⚡ Lightning Fast Generations: Enterprise-grade GPU infrastructure ensures your text and images generate in milliseconds, not minutes.
💻 Developer API: 100% OpenAI-compatible API endpoints for text and code.

Next Steps

To begin unlocking the full potential of Kryven AI, explore the following guides depending on your use case:

API Integration: If you are a developer looking to use Kryven models in your own applications, scripts, or VS Code extensions.
Prompt Engineering: Master how to speak to our models to extract precise, nuanced outputs for creative writing or coding.

API Integration

Harness Kryven AI models in your own applications.

API Integration ↗ Janitor AI / SillyTavern ↗ Media Generation API ↗

Prompt Engineering

Master how to speak to Kryven for precise, nuanced outputs.

Character Prompting

Mastering the AI Persona: how to write effective character prompts

Writing Detailed Character Prompts

Kryven AI handles detailed character definitions directly. State personality traits, narrative scenarios, and behavioral patterns clearly. If your character has morally complex or psychologically nuanced traits, specify them directly — the model follows clear instructions without euphemism.

A common mistake is under-specifying character definitions in the system prompt. The models perform best when given explicit, unambiguous instructions.

1. Structuring the System Prompt

A high-quality character card should be split into distinct sections within the system prompt block.

Identity & Psychological Profile

Begin by telling the model exactly who it is simulating. Avoid vague traits. Provide concrete behavioral patterns.

✓

Good: "You are {char}. You are a deeply cynical, chain-smoking detective in 1940s Los Angeles. You suffer from insomnia, which makes you irritable. You view the world in shades of dirt and grime, and you distrust everyone immediately."

⚠

Bad: "You are a sad detective. You are mean to people."

Dialogue Mechanics

Define exactly how the character structures their speech.

Pacing: Do they speak in long, flowing monologues, or short, clipped sentences?
Vocabulary: Do they use archaic words, hyper-modern slang, technical jargon, or profanity?
Physicality: Do they fidget? Do they make eye contact?

✓

Example: "Use hardboiled noir vocabulary. Never apologize. Speak in short, clipped sentences. Frequently sigh or rub your temples when frustrated. Swear freely when things go wrong."

2. Formatting Instructions

Because Kryven supports Markdown, you can instruct the model to use specific formatting for actions versus dialogue. This drastically improves human readability over a 100+ message long conversation.

Standard Operating Procedure (SOP)

Paste an instruction like this at the very bottom of your system prompt:

"Write long, highly descriptive paragraphs focusing on deep sensory details (smell, touch, sound, lighting). Use asterisks for physical actions and inner monologue (e.g., *He lights a cigarette, the sulfur from the match burning his nose.*). Use standard quotation marks for spoken dialogue (e.g., "What do you want?")."

3. Directing the Scene (OOC Directives)

Sometimes, the AI will become too passive, simply reacting to your inputs rather than driving the plot forward. When this happens, use an Out of Character (OOC) directive at the end of your user message.

The AI understands that bracketed OOC text is an instruction to the director of the scene, not dialogue spoken by your character.

Example Escaping a Loop

I back away slowly, holding my hands up. "I don't know what you're talking about."

[OOC: Have {char} escalate the situation suddenly. He should pull a weapon on {user} and demand the ledger. Do not let {user} escape.]

Example Advancing Time

[OOC: Time skip forward three hours. We are now in the interrogation room. Describe the claustrophobic atmosphere before {char} begins questioning me.]

Writing & Fact Retrieval

Optimizing Kryven for copywriting, academic essays, and deep factual research

Structured Output Generation

When you need Kryven to output text in a specific format for professional writing, academic essays, or structuring information, clearly specifying the boundaries of the request is paramount.

Kryven differs from casual chatbots; when commanded with authority, it will strictly adhere to formatting rules.

1. The Persona & The Audience

If you ask for an essay without specifying an audience, the model will output generic, average-web-text. Tell Kryven exactly who it should sound like, and who it is talking to.

✓

Good: "You are a senior technical writer explaining deep learning architectures. Your audience is high school students who understand basic algebra but have no programming experience. Avoid excessive jargon and use real-world analogies regarding plumbing or traffic systems."

2. The Output Format Boundaries

If you need a blog post, a tweet thread, or markdown tables, state it plainly at the end of the prompt.

✓

Good: "Output exactly three paragraphs. The final paragraph must be a conclusion. Following the paragraphs, insert a Markdown table summarizing the pros and cons discussed."

3. The Tone

Without filtering restrictions, Kryven can adopt highly specific, nuanced, and even heavily biased tones if requested.

✓

Good: "Use a dark, sarcastic, and deeply cynical tone. Be heavily critical of the current state of social media technology."

Handling Factual Information & RAG

Large Language Models are prone to hallucination. When asking Kryven to summarize information or synthesize facts for research, you must restrict its knowledge base actively in the prompt.

Constraining to Provided Text

If you have specific articles or data, provide them directly and lock the model to that context.

Based ONLY on the text provided below, answer the following questions.

If you do not know the answer based strictly on the provided text, state 'I do not know' rather than guessing or using external knowledge.

<text>
[Paste article here]
</text>

Questions:
1. What was the company's revenue in Q3?
2. Who replaced the CEO?

Chain of Thought Fact-Checking

For complex requests, ask the model to explicitly list the facts it plans to use before it synthesizes them into an essay.

I need an essay on the causes of the Peloponnesian War.
First, output a bulleted list of the 5 primary historical facts you will base the essay on.
Second, write the essay incorporating those facts.

By forcing the model to state its facts first, you give it "time to think" (allocate tokens to reasoning) which significantly reduces the likelihood of hallucination in the final prose.

Advanced Coding Workflows

Mastering Kryven for Software Engineering — Prompting, Debugging, and Editor Integrations

Prompting Kryven for Code

Kryven is a strong pair programmer whether you use it in chat, via the API from your editor, or through an integration like Cursor or Cline. When asking Kryven to write code — especially for multi-file projects — the way you structure your prompt directly dictates the quality and modularity of the output.

1. Provide the Full Environment Context

Don't just ask for a function. State your framework, language version, styling library, and any specific architectural patterns you follow.

✓

Good: "I'm building a React 18 application using Tailwind CSS for styling and lucide-react for icons. Write a functional component for a dark-mode toggle switch. It must use localStorage to persist state and dispatch a custom event on change."

2. Request Step-by-Step Architecture

If you are asking for a complex feature (like a new database schema or a multi-file authentication component), explicitly ask the model to outline its plan before writing the code.

Forcing the AI to map out the file structure improves its logical reasoning and prevents it from cutting corners.

✓

Good: "I want to add an OAuth login system using Passport.js to my Express app. First, list the exact files we need to create or modify. Second, explain the logic flow of the database insertion. Third, provide the implementation step by step."

Debugging Workflow

When you hit an error in your project, how you report that error to Kryven matters drastically.

Copy the exact error trace from the terminal or browser console. Do not summarize it.
Provide the offending code. Paste the snippet of code surrounding the error.
Ask for the root cause, not just the fix.

✓

Good: "I am getting this error: TypeError: Cannot read properties of undefined (reading 'map') at line 42. Here is the code context from lines 30 to 50: [Paste Code]. What is the root cause of data.items being undefined, and how do I resolve it?"

Legitimate Security & System Administration Work

Kryven is a useful coding partner for system administrators, cybersecurity professionals, and developers working on legitimate defensive security, infrastructure automation, and data engineering tasks. Kryven will decline to produce content intended to attack systems you do not own or have written authorization to test, to facilitate unauthorized access, or to evade detection for malicious purposes.

Best Practices for Security & Sysadmin Prompts

Authorization first: When asking for security tooling, confirm in your prompt that you own the target systems or have explicit written authorization (e.g., a signed pentest engagement). This helps Kryven produce the most useful output for your scope.
Be precise: When defining network protocols, log parsers, or hardening scripts, be highly technical. Specify versions, platforms, and constraints.
Respect Terms of Service: When asking for a scraper or API client, target sites and APIs you are permitted to access. Follow robots.txt, rate limits, and the target's Terms of Service.
Review before running: Always review generated code — especially anything involving file deletion, database writes, or network calls — before executing it in any environment. Test destructive operations in an isolated sandbox first.

Version0.3 · Production

Orchestration TiersL0 Conductor + L1 Specialists

Specialist Models4 MoE + 1 dense

InfrastructureSelf-hosted · controlled

ℹ

The Conductor reads every query and emits a weighted distribution over specialist clusters. The selected specialists fire in parallel, each routing tokens through its own internal expert gate. A parallel fast-path always fires alongside, establishing a latency floor. The Conductor is re-invoked as the synthesiser, merging expert outputs weighted by its original routing confidence. Three stacked softmaxes per request — cluster, expert, token — and thousands of distinct routing paths.

§ 01 — Architecture

Two orchestration tiers of gating, one inherited.

The framework's job is coarse-grained routing across four specialist clusters (L0) plus an always-on fast path. Every selected MoE model then performs its own learned token-level gating (L2) — inherited automatically from choosing sparse-activation experts. This is hierarchical MoE lifted from the layer level to the system level.

FIG.01 · Orchestration Flow

Read the diagram top-to-bottom as nested gating. The L0 Conductor (MoE) at the top routes your query to one or more of the L1 Specialists (each MoE) in the middle. Each chosen specialist then runs its own L2 token-level gating across its internal experts. One MoE directing five MoEs, each routing tokens through their own experts — three nested layers of selection per request.

          L0 · Conductor (cluster routing + synthesis)
          L1 · Specialists (parallel, capacity-bounded)
          L2 · Internal MoE (token gating, inherited)
        

Fig. 01 — The Conductor emits a softmax over clusters; top-k specialists fire in parallel bounded by capacity factor. The fast-path (gold) runs alongside on every request, establishing a latency floor. The Conductor is re-invoked as synthesiser to merge outputs weighted by their original routing probabilities.

§ 02 — L0 · Conductor

The top-level gate.

The Conductor is the only model that reads every query. It is itself a sparse Mixture-of-Experts with 128 internal experts and top-8 routing — so even the routing decision is computed by an internal expert gate, not a dense forward pass. Chosen specifically because its ~3.3B active parameters make it cheap enough to call on every request without dominating latency. It emits a structured softmax distribution across the specialist clusters — never a hard pick.

Tier L0 · Conductor

Conductor

Qwen3-class · 30.5B total · 3.3B active · 128 experts · top-8 routing · 32K native context (131K via YaRN)

Reads the query, classifies intent, decomposes into per-cluster subqueries, and emits a probability distribution across the specialist clusters. When more than one cluster fires, the same Conductor is re-invoked as the weighted synthesiser — each specialist output annotated with its cluster probability so the merge is probability-weighted, not a flat concatenation. Two roles, one model, one warm pool.

Architecture Class: Sparse MoE · 128 experts · 8 active per token
Structured Output: Guided decoding (outlines / xgrammar)
Invoked At: L0 Routing · Weighted Synthesis
Prompt Caching: System prompt cached · high hit-rate target
Fallback Policy: Retry once on malformed JSON · then fast-path

§ 03 — L1 · Specialist Models

Six specialists, all open-weight.

Every specialist except the fast-path is itself a Mixture-of-Experts. The pool is drawn from the state of the open-weight ecosystem (Qwen3 family in the current rotation) and rotated as better options ship. What is stable is the shape of the pool: role, architecture class, and the behaviour each role must guarantee — specific checkpoints are an implementation detail that evolves with the frontier.

Code · Agentic · Vision 01

Code Specialist

Qwen3-Coder-class · 30.5B total · 3B active · 128 experts · 256K native ctx (1M via YaRN)

Top-tier open-weight model in its active-parameter band for agentic code and repo-scale reasoning. Native function-calling and tool-use, optimised for instruction-following without thinking mode. Pairs with the Vision Specialist when a query mixes screenshots of code with text. Leads its size class on SWE-bench Verified.

ROLE · agentic coding & repo-scale reasoning

Reasoning · Thinking Mode 02

Reasoning Specialist

Qwen3-Next-class · 80B total · 3B active · 512+1 shared experts · 262K ctx

Hybrid Gated DeltaNet + Gated Attention MoE in a 3:1 layer ratio, tuned for chain-of-thought reasoning. Despite 80B total parameters only ~3B activate per token, making it cheaper per request than its size implies. Runs an explicit thinking pass before committing to an answer — the reasoning trace stays internal unless a user asks to see it.

ROLE · deep reasoning · math · logic · planning

Vision · Multimodal 03

Vision Specialist

Qwen3-VL-class · 30B total · 3B active · VL-MoE · vision pathway at full precision

Dedicated vision-language specialist with the vision encoder kept intact at full precision. Handles images, OCR, charts, screenshots, and visual Q&A natively — no external OCR pipeline sits between the image and the response. A dedicated role means vision queries never compete with text queries for the same weights.

ROLE · native multimodal understanding

Long Context · RAG 04

Long-Context Specialist

Qwen3-class · 30.5B total · 3.3B active · 256K native context (1M via YaRN)

Built for RAG payloads, full codebases, long transcripts, multi-document summaries. Native 256K context extended to 1M via YaRN scaling. Shares its weight topology with the Conductor but serves a completely different role — retrieval and long-doc synthesis, not classification.

ROLE · long-document synthesis & retrieval

Fast · Shadow ⚡

Latency Floor

Qwen3-class · 4B dense · no internal gating · sub-200ms TTFT

Deliberately dense, not MoE — dense models have lower first-token latency, which matters for the parallel shadow role. Fires on every request alongside L0. For trivial queries the Conductor routes directly to this path and skips synthesis. For complex queries its output is attached to the synthesis prompt as additional context, never used as a blind fallback.

ROLE · latency floor & always-on shadow

ℹ

Why this lineup works. Four MoE experts averaging ~3B active parameters per token — the framework's real compute load is comparable to running a single small model, even when multiple specialists fire in parallel. The 80B reasoning specialist is the highest-capacity component, but because only ~3B activate per token, it does not dominate latency or cost. The dense fast-path stays under 5B and carries no gating overhead.

§ 04 — Cluster Organization

Four routable clusters, one always-on.

The five specialists are grouped into five clusters. Four are L0-routable (code, reasoning, vision, long-ctx). The fifth — fast — is never routed to; it simply fires on every request, outside the capacity budget. This keeps cost accounting clean: capacity is bounded on the routable side, and the shadow path is a constant.

CODE (MoE)

Agentic coding, debugging, tool-use, shell, SQL, function-calling, repository-scale reasoning. Single-specialist cluster — the Code Specialist covers the full band with multimodal input.

01 Code Specialist

REASONING (MoE)

Math, logic, multi-step analysis, planning. Single-specialist cluster using thinking-mode chain-of-thought. Heavyweight but sparse — 80B total, only ~3B active per token.

02 Reasoning Specialist

VISION (MoE)

Images, charts, diagrams, OCR, screenshots, visual Q&A. Single-specialist cluster with the vision pathway preserved at full precision.

03 Vision Specialist

LONG-CTX (MoE)

RAG payloads, long documents, full codebases, transcripts. 1M-token context via YaRN when needed; 256K native context is the default.

04 Long-Context Specialist

FAST · SHADOW

Never routed to. Always fires parallel with L0 to establish a latency floor. Outside the capacity budget — its cost is constant, not variable.

⚡ Latency Floor

§ 05 — The Sophistication Layer

Six MoE techniques, lifted to orchestration.

Six well-known techniques from Mixture-of-Experts serving literature, adapted honestly to system-level orchestration. Each maps a genuine parallel from the training / inference literature onto a concrete system behavior.

Soft routing with weights

L0 emits a full softmax over the routable clusters — not a hard pick. Top-k clusters fire, each weighted by its probability. The synthesizer uses those weights as merge priors.

↳ MoE analog · top-k gating

Capacity factor

Max 2 concurrent L1 experts per request, separate from the always-on shadow. If the softmax spreads mass across three clusters, only the top two fire. Bounds cost and latency without complicating accounting.

↳ MoE analog · expert capacity limit

Load-balance surveillance

Every expert invocation is logged. If any cluster fires for <5% or >60% of routable traffic over a rolling 24-hour window, an alert fires and the router prompt is reviewed.

↳ MoE analog · auxiliary load-balancing loss

Shadow path as latency floor

The Latency Floor fires on every request, outside the capacity budget. Its output is attached as context to the synthesis prompt, or returned directly for trivial queries. Not a blind fallback — a deliberate parallel-path answer.

↳ MoE analog · shared expert with guaranteed activation

Probability-weighted synthesis

When k > 1 experts fire, the synthesis prompt includes each output annotated with its cluster probability. The synthesizer is primed to trust higher-weighted outputs more — forward-pass only, no gradients.

↳ MoE analog · gated expert combination (forward-pass)

Routing-collapse detection

Real-time monitor flags pathological cluster distributions (one cluster >80% or any cluster <2%). Triggers automatic fall-back to uniform routing plus a manual prompt review.

↳ MoE analog · dead-expert detection

§ 06 — Commitments

Private. Open-weight. User-funded.

Kryven is funded directly by its users. Subscription revenue pays for the three things that make the product better: stronger models, tighter latency, and cleaner product experience. No ads, no data sales, no upstream vendor deciding what you can or can't ask.

WHERE SUBSCRIPTION REVENUE GOES

Subscription revenue funds model upgrades, infrastructure improvements, and product development. We don't sell user data, and we don't run ads. When you pay for Kryven, that money is what keeps the platform running and getting better.

Better models

We swap in newer, stronger open-weight specialists the moment they prove out on our evaluations. The specialist pool is never frozen — the framework is model-agnostic by design, and the roles are stable even as the checkpoints behind them evolve.

Tighter latency

Warmer pools during peak hours, faster hardware as it becomes viable, better caching, better streaming. Every fraction of a second shaved off is a direct measurable win for the person waiting on a response.

Better user experience

Cleaner chat, richer conversation history, saner exports, more responsive mobile, better prompt tooling. The UI is where you actually live — we spend aggressively on making the whole product feel obvious and fast.

04 · PRIVATE

Self-hosted on infrastructure we operate

Every specialist runs on inference infrastructure Kryven operates directly — not routed through consumer-facing AI APIs. Conversations are not logged for training. Delete your history and we purge it from active storage immediately and from backups within 30 days.

05 · OPEN

Built on open weights

Every specialist in the Conductor pool is an open-weight model published under a permissive licence. No proprietary black boxes, no vendor lock-in, no sudden API deprecations. The framework layer itself is proprietary; the model layer is auditable by anyone in the open-source community.