nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

Nemotron 3 Nano 30B-A3B is NVIDIA's flagship open reasoning model, built on a hybrid Mamba-Transformer Mixture-of-Experts (MoE) architecture. With 31.6B total parameters but only 3.2B active per forward pass, it delivers up to 3.3× higher throughput than comparable models while achieving state-of-the-art accuracy on reasoning, coding, and agentic benchmarks. The model supports up to 1M tokens of context and offers configurable reasoning depth via thinking budget control.

NVIDIA Chat 262k Tokens
Free trial credit: $1.00 (no credit card required)

api_example.sh

curl -X POST "https://platform.qubrid.com/v1/chat/completions" \
  -H "Authorization: Bearer $QUBRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
  "messages": [
    {
      "role": "user",
      "content": "Explain quantum computing in simple terms"
    }
  ],
  "temperature": 0.3,
  "max_tokens": 8192,
  "stream": true,
  "top_p": 1
}'

Technical Specifications

Model Architecture & Performance

Variant FP8
Model Size 31.6B Total / 3.2B Active
Context Length 262k Tokens
Max Context Length 1M Tokens
Quantization FP8
Tokens/Second 220
Architecture Hybrid Mamba-2 + Transformer MoE
Precision FP8, BF16 supported
License NVIDIA Open Model License
Release Date December 15, 2025
Developers NVIDIA
Supported Languages English, Spanish, French, German, Japanese, Italian, Chinese, Korean, Arabic, and 11 more
Knowledge Cutoff June 25, 2025
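
As a sanity check on the VRAM figures noted under Considerations, weight memory can be estimated directly from the parameter count. A rough sketch (decimal GB, weights only; excludes KV cache and activations):

```shell
# Back-of-envelope weight memory for 31.6B parameters:
# BF16 = 2 bytes/param, FP8 = 1 byte/param.
total_params=31600000000
bf16_gb=$(awk -v p="$total_params" 'BEGIN { printf "%.1f", p * 2 / 1e9 }')
fp8_gb=$(awk -v p="$total_params" 'BEGIN { printf "%.1f", p * 1 / 1e9 }')
echo "BF16 weights: ${bf16_gb} GB, FP8 weights: ${fp8_gb} GB"
```

That gives about 63.2 GB for BF16 and 31.6 GB for FP8, consistent with the 60GB+/32GB+ VRAM guidance below.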

Pricing

Pay-per-use, no commitments

Input Tokens $0.00004/1K Tokens
Output Tokens $0.00022/1K Tokens
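
At these rates, per-request cost is straightforward to estimate. A minimal sketch (the token counts are hypothetical):

```shell
# Estimate the cost of one request at the listed per-1K-token rates.
input_tokens=12000   # hypothetical prompt size
output_tokens=4000   # hypothetical completion size
cost=$(awk -v i="$input_tokens" -v o="$output_tokens" \
  'BEGIN { printf "%.5f", (i / 1000) * 0.00004 + (o / 1000) * 0.00022 }')
echo "Estimated cost: \$${cost}"
```

A 12K-token prompt with a 4K-token completion comes to about $0.00136.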

API Reference

Complete parameter documentation

Parameter Type Default Description
stream boolean true Enable streaming responses for real-time output.
temperature number 0.3 Controls randomness. Higher values mean more creative but less predictable output.
max_tokens number 8192 Maximum number of tokens to generate in the response.
top_p number 1 Nucleus sampling: considers tokens with top_p probability mass.
enable_thinking boolean true Enable chain-of-thought reasoning traces before final response.
thinking_budget number 16384 Maximum tokens for reasoning traces. Controls inference cost and reasoning depth.
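
The reasoning controls above can be combined in a single request body. A sketch, assuming enable_thinking and thinking_budget are accepted as top-level request fields as the parameter table suggests (verify against the external API documentation):

```shell
# Build a request body that enables reasoning traces but caps them
# at 4096 tokens. Top-level enable_thinking / thinking_budget fields
# are an assumption based on the parameter table above.
payload='{
  "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
  "messages": [
    {"role": "user", "content": "Prove that the sum of two even numbers is even."}
  ],
  "enable_thinking": true,
  "thinking_budget": 4096,
  "max_tokens": 8192
}'
printf '%s\n' "$payload"
```

Send it with the same curl command as the example above, passing `-d "$payload"`.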

Explore the full request and response schema in our external API documentation

Performance

Strengths & considerations

Strengths

Hybrid Mamba-2 + Transformer MoE architecture for optimal efficiency
3.3× faster inference than Qwen3-30B-A3B with better accuracy
Only 3.2B active parameters out of 31.6B total (~10% activation)
1M token context window for long-horizon tasks
Configurable reasoning ON/OFF modes
Thinking budget control for predictable inference costs
Native tool calling and function execution
FP8 quantization for reduced memory and faster inference
State-of-the-art on SWE-Bench, GPQA Diamond, AIME 2025
Fully open: weights, datasets, and training recipes available

Considerations

Requires 32GB+ VRAM for FP8, 60GB+ for BF16
Hybrid architecture less tested in production than pure transformers
May underperform on vanilla MMLU compared to harder benchmark variants
FlashInfer backend requires CUDA toolkit for JIT compilation
New architecture may have limited community tooling support
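
Since the model supports native tool calling, a request can declare callable functions. A minimal sketch (the OpenAI-style tools schema and the get_weather function are illustrative assumptions, not confirmed platform specifics):

```shell
# Build a request body declaring one callable function.
# The "tools" schema shown here is an assumed OpenAI-compatible shape.
payload='{
  "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
  "messages": [
    {"role": "user", "content": "What is the weather in Paris?"}
  ],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Look up current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}'
printf '%s\n' "$payload"
```

If the model decides to call the function, the response will contain a tool-call message rather than plain text; your application executes the function and returns the result in a follow-up message.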

Use cases

Recommended applications for this model

Agentic AI systems and multi-agent orchestration
Complex reasoning and problem-solving tasks
Code generation, debugging, and optimization
Function calling and tool integration
Long-document analysis and RAG applications
Mathematical reasoning and STEM tasks
Instruction following and task automation
Enterprise chatbots with reasoning capabilities
Financial analysis and decision support
Software development assistants

Enterprise
Platform Integration

Docker

Docker Support

Official Docker images for containerized deployments

Kubernetes

Kubernetes Ready

Production-grade K8s manifests and Helm charts

SDK

SDK Libraries

Official SDKs for Python, JavaScript, Go, and Java

Don't let your AI control you. Control your AI the Qubrid way!

Have questions? Want to Partner with us? Looking for larger deployments or custom fine-tuning? Let's collaborate on the right setup for your workloads.

"Qubrid helped us turn a collection of AI scripts into structured production workflows. We now have better reliability, visibility, and control over every run."

AI Infrastructure Team

Automation & Orchestration