Qwen/Qwen3-Next-80B-A3B-Thinking

Qwen3-Next-80B-A3B-Thinking is a next-generation foundation model from Alibaba's Qwen team featuring a revolutionary Hybrid Attention mechanism (Gated DeltaNet + Gated Attention) with High-Sparsity MoE architecture. With 80B total parameters and only 3.9B active per token, it delivers 10x higher throughput than Qwen3-32B on long contexts while outperforming Gemini-2.5-Flash-Thinking on multiple benchmarks.
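The "3.9B active per token" figure comes from the high-sparsity MoE design: a router scores all experts but only a small top-k subset actually runs for each token. A minimal sketch of top-k expert routing (illustrative only; the expert count and k value are hypothetical, not Qwen's actual configuration):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts for one token and
    renormalize their gate weights so they sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]

# One token's router scores over 8 hypothetical experts;
# only k=2 experts are activated, the rest are skipped entirely.
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]
active = route_top_k(logits, k=2)  # experts 4 and 1 fire for this token
```

The unrouted experts contribute no compute at all, which is why per-token cost tracks the active-parameter count rather than the full 80B.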


api_example.sh

curl -X POST "https://platform.qubrid.com/v1/chat/completions" \
  -H "Authorization: Bearer $QUBRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Next-80B-A3B-Thinking",
    "messages": [
      {
        "role": "user",
        "content": "Explain quantum computing in simple terms"
      }
    ],
    "temperature": 0.6,
    "max_tokens": 8192,
    "stream": true,
    "top_p": 0.95
  }'
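The same request can be assembled from Python. The sketch below only builds the headers and JSON body shown in the curl example; actually sending it is left to `requests.post` or any HTTP client:

```python
import json

API_URL = "https://platform.qubrid.com/v1/chat/completions"
MODEL = "Qwen/Qwen3-Next-80B-A3B-Thinking"

def build_chat_request(prompt, api_key, stream=True):
    """Assemble headers and body matching the curl example above."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "max_tokens": 8192,
        "stream": stream,
        "top_p": 0.95,
    }
    return headers, json.dumps(body)

headers, payload = build_chat_request(
    "Explain quantum computing in simple terms", "QUBRID_API_KEY"
)
# To send: requests.post(API_URL, headers=headers, data=payload, stream=True)
```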

Technical Specifications

Model Architecture & Performance

Variant: Thinking
Model Size: 80B params (3.9B active)
Context Length: 256K tokens
Quantization: bf16
Throughput: 200 tokens/second
Architecture: Hybrid Transformer-Mamba (Gated DeltaNet + Gated Attention) with High-Sparsity MoE and Multi-Token Prediction (MTP)
Precision: bf16
License: Apache 2.0
Release Date: September 2025
Developers: Alibaba Cloud (QwenLM)
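A quick back-of-the-envelope check on the bf16 figures above (2 bytes per parameter; rough numbers that ignore KV cache and activation memory):

```python
BYTES_PER_PARAM_BF16 = 2

total_params = 80e9     # 80B total parameters
active_params = 3.9e9   # ~3.9B activated per token

# All weights must stay resident in memory even though few are active.
weight_memory_gb = total_params * BYTES_PER_PARAM_BF16 / 1e9  # 160 GB
sparsity = active_params / total_params  # ~4.9% of params fire per token
```

The ~160 GB weight footprint is why MoE sparsity pays off at inference time: compute scales with the ~5% of parameters that are active, not with what sits in memory.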

Pricing

Pay-per-use, no commitments

Input Tokens: $0.0002 per 1K tokens
Output Tokens: $0.0018 per 1K tokens
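At these rates, per-request cost is easy to estimate (a small helper using the per-1K-token prices listed above):

```python
INPUT_PRICE_PER_1K = 0.0002   # USD per 1K input tokens
OUTPUT_PRICE_PER_1K = 0.0018  # USD per 1K output tokens

def estimate_cost(input_tokens, output_tokens):
    """Return the USD cost of one request at the listed rates."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

# e.g. a 2,000-token prompt with a full 8,192-token thinking response
cost = estimate_cost(2000, 8192)  # about $0.015
```

Note that thinking models emit long reasoning traces, so output tokens usually dominate the bill.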

API Reference

Complete parameter documentation

Parameter    Type     Default  Description
stream       boolean  true     Enable streaming responses for real-time output.
temperature  number   0.6      Controls randomness; lower values are recommended for reasoning tasks.
max_tokens   number   8192     Maximum number of tokens to generate.
top_p        number   0.95     Nucleus sampling parameter.
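With `stream` set to true, responses arrive as server-sent events. Assuming the common OpenAI-style `data: {...}` line format (an assumption here; confirm against the API documentation), each line can be parsed like this:

```python
import json

def parse_sse_line(line):
    """Extract the content delta from one streaming line, if any.
    Returns None for keep-alives and the terminal [DONE] marker."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return None
    chunk = json.loads(data)
    return chunk["choices"][0]["delta"].get("content")

sample = 'data: {"choices":[{"delta":{"content":"Hello"}}]}'
# parse_sse_line(sample) yields "Hello"; "data: [DONE]" yields None
```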

Explore the full request and response schema in our external API documentation

Performance

Strengths & considerations

Strengths:
- Hybrid Attention (Gated DeltaNet + Gated Attention)
- 10x throughput vs Qwen3-32B on 32K+ contexts
- Only 3.9B active parameters out of 80B total
- Native 256K context window
- Thinking-only mode for deep reasoning
- Outperforms Gemini-2.5-Flash-Thinking on multiple benchmarks

Considerations:
- Thinking mode only; no fast non-thinking mode
- Longer thinking traces increase latency
- New architecture with limited community tooling

Use cases

Recommended applications for this model

Complex multi-step reasoning
Mathematical proofs
Code synthesis
Logical analysis
Agentic planning
Long-context document analysis

Enterprise
Platform Integration

Docker

Docker Support

Official Docker images for containerized deployments

Kubernetes

Kubernetes Ready

Production-grade Kubernetes manifests and Helm charts

SDK

SDK Libraries

Official SDKs for Python, JavaScript, Go, and Java


Have questions? Want to partner with us? Looking for larger deployments or custom fine-tuning? Let's collaborate on the right setup for your workloads.

"Qubrid's medical OCR and research parsing cut our document extraction time in half. We now have traceable pipelines and reproducible outputs that meet our compliance requirements."

Clinical AI Team

Research & Clinical Intelligence