moonshotai/Kimi-K2-Thinking

Kimi K2 Thinking is the first open-weights model to achieve state-of-the-art performance against leading closed-source models (GPT-5, Claude Sonnet 4.5) across major benchmarks, including HLE (44.9%), BrowseComp (60.2%), and SWE-Bench Verified (71.3%). Built on a 1T-parameter MoE architecture with 32B active parameters per token and native INT4 quantization via QAT, it maintains stable tool use across 200–300 sequential calls within a 256K context window.

Moonshot AI Chat 256K Tokens
Free trial: $1.00 credit, no credit card required

api_example.sh

curl -X POST "https://platform.qubrid.com/v1/chat/completions" \
  -H "Authorization: Bearer $QUBRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "moonshotai/Kimi-K2-Thinking",
  "messages": [
    {
      "role": "user",
      "content": "Explain quantum computing in simple terms"
    }
  ],
  "temperature": 1,
  "max_tokens": 16384,
  "stream": true,
  "top_p": 0.95
}'
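For application code, the same request can be issued programmatically. The sketch below mirrors the curl payload in Python; the third-party `requests` dependency and the `ask` helper are illustrative choices, not a platform SDK.

```python
import json

# Payload mirroring the curl example above; model name and parameter
# values come from this page's defaults.
payload = {
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    "temperature": 1,
    "max_tokens": 16384,
    "stream": True,
    "top_p": 0.95,
}

def ask(api_key, url="https://platform.qubrid.com/v1/chat/completions"):
    """POST the payload and yield raw streamed lines.

    Uses the third-party `requests` library; the helper name and
    error handling here are illustrative, not part of the platform.
    """
    import requests
    with requests.post(
        url,
        json=payload,
        stream=True,
        headers={"Authorization": f"Bearer {api_key}"},
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:
                yield line.decode("utf-8")
```

Iterate over `ask(api_key)` to process the stream line by line as chunks arrive.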

Technical Specifications

Model Architecture & Performance

Variant Thinking
Model Size 1T params (32B active)
Context Length 256K Tokens
Quantization INT4 (QAT)
Tokens/Second 50
Architecture Sparse MoE Transformer — 1T total / 32B active, 61 layers (1 dense), 384 experts (8 selected per token), MLA attention, SwiGLU
Precision INT4 (QAT)
License Modified MIT License
Release Date November 2025
Developers Moonshot AI
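To make the "8 experts selected per token" figure concrete, here is a minimal sketch of top-k expert routing as used in sparse MoE layers. It illustrates the general technique only, not Moonshot's actual router.

```python
import math
import random

NUM_EXPERTS, TOP_K = 384, 8  # figures from the spec table above

def route(logits):
    """Select the TOP_K highest-scoring experts and renormalize
    their softmax weights over just the selected set."""
    top = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:TOP_K]
    m = max(logits)  # subtract the max for numerical stability
    exp = [math.exp(logits[i] - m) for i in top]
    total = sum(exp)
    return [(i, w / total) for i, w in zip(top, exp)]

random.seed(0)
selected = route([random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)])
# Each token's activations are sent only to these 8 experts, which is
# why just 32B of the 1T parameters are active per token.
```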

Pricing

Pay-per-use, no commitments

Input Tokens $0.0006/1K Tokens
Output Tokens $0.0025/1K Tokens
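At these rates, per-request cost is a simple linear function of token counts. A small helper, with the rates copied from the table above:

```python
# Rates from the pricing table above (USD per 1K tokens).
INPUT_PER_1K = 0.0006
OUTPUT_PER_1K = 0.0025

def request_cost(input_tokens, output_tokens):
    """Estimated cost in USD for a single request."""
    return (input_tokens / 1000) * INPUT_PER_1K + (output_tokens / 1000) * OUTPUT_PER_1K

# e.g. a 200K-token prompt with a 16K-token reply costs about $0.16
```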

API Reference

Complete parameter documentation

Parameter Type Default Description
stream boolean true Enable streaming responses for real-time output.
temperature number 1 Recommended temperature is 1.0 for Kimi-K2-Thinking.
max_tokens number 16384 Maximum number of tokens to generate.
top_p number 0.95 Controls nucleus sampling.

Explore the full request and response schema in our external API documentation
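When `stream` is enabled, responses typically arrive as server-sent events. The snippet below assumes the widely used OpenAI-style `data: {...}` chunk format; the exact Qubrid response schema should be confirmed against the external API documentation, and the sample lines here are an assumption for illustration.

```python
import json

# Sample streamed lines in the common OpenAI-style SSE chunk format
# (an assumption for illustration; see the external API docs).
sample = [
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " world"}}]}',
    "data: [DONE]",
]

def collect_stream(lines):
    """Concatenate the content deltas from a stream of SSE lines."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank lines
        body = line[len("data: "):]
        if body == "[DONE]":  # sentinel marking the end of the stream
            break
        delta = json.loads(body)["choices"][0]["delta"]
        parts.append(delta.get("content", ""))
    return "".join(parts)

print(collect_stream(sample))  # -> Hello world
```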

Performance

Strengths & considerations

Strengths
First open-weights model to surpass closed frontier models on HLE, BrowseComp, and SWE-Bench Verified
1T-parameter MoE with only 32B active per token
Native INT4 via QAT, roughly 2x inference speed vs FP8
Interleaved chain-of-thought with dynamic tool calling
Stable across 200–300 sequential tool calls
256K context window

Considerations
Requires 512GB+ RAM for full deployment
~600GB model size (substantial infrastructure required)
Thinking mode adds latency relative to non-reasoning models
Temperature should be set to 1.0 for recommended performance
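As a rough illustration of what INT4 storage means, here is a symmetric 4-bit quantize/dequantize sketch; QAT trains the model so its weights tolerate this rounding. This shows the storage format only, not Moonshot's actual quantization code.

```python
# Symmetric 4-bit quantization sketch: signed INT4 holds integers in
# [-8, 7], and a per-group float scale maps them back to real values.
def quantize_int4(weights):
    scale = max(abs(w) for w in weights) / 7 or 1.0  # guard all-zero input
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]
```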

Use cases

Recommended applications for this model

Complex agentic research workflows
Long-horizon coding and debugging
Advanced mathematical reasoning
Multi-step tool orchestration
Autonomous writing and analysis
Scientific reasoning tasks
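Several of these use cases reduce to the same loop: the model proposes a tool call, the host executes it, and the result is fed back until the model produces a final answer. A minimal sketch of that pattern follows; the message shape, `call_model`, and the tool registry are hypothetical stand-ins, not a Qubrid or Moonshot API.

```python
def agent_loop(call_model, tools, messages, max_steps=300):
    """Run a tool-use loop until the model returns a final answer.

    `call_model(messages)` is assumed to return either
    {"tool_call": {"name": ..., "arguments": {...}}} or a final
    {"content": ...}; this reply shape is hypothetical.
    """
    for _ in range(max_steps):
        reply = call_model(messages)
        messages.append(reply)
        call = reply.get("tool_call")
        if call is None:  # model produced its final answer
            return reply.get("content")
        result = tools[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": str(result)})
    return None  # step budget exhausted
```

The 200–300 sequential-call figure above is about how long such a loop can run before the model's tool use degrades.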

Enterprise
Platform Integration

Docker

Docker Support

Official Docker images for containerized deployments

Kubernetes

Kubernetes Ready

Production-grade Kubernetes manifests and Helm charts

SDK

SDK Libraries

Official SDKs for Python, JavaScript, Go, and Java

Don't let your AI control you. Control your AI the Qubrid way!

Have questions? Want to partner with us? Looking for larger deployments or custom fine-tuning? Let's collaborate on the right setup for your workloads.

"Qubrid AI reduced our document processing time by over 60% and significantly improved retrieval accuracy across our RAG workflows."

Enterprise AI Team

Document Intelligence Platform