nvidia/NVIDIA-Nemotron-3-Super-120B-A12B

NVIDIA Nemotron-3-Super-120B-A12B is an open-weight LLM built for agentic reasoning and high-volume workloads. Using a hybrid LatentMoE architecture (Mamba-2 + MoE + Attention) with Multi-Token Prediction (MTP) and native NVFP4 pretraining on 25T tokens, it delivers up to 2.2x higher throughput than GPT-OSS-120B and 7.5x higher than Qwen3.5-122B. With a native 1M-token context window and configurable thinking mode, it is purpose-built for collaborative agents, long-context reasoning, and IT automation across 7 languages.


api_example.sh

curl -X POST "https://platform.qubrid.com/v1/chat/completions" \
  -H "Authorization: Bearer QUBRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
  "messages": [
    {
      "role": "user",
      "content": "Explain quantum computing in simple terms"
    }
  ],
  "temperature": 1,
  "max_tokens": 16000,
  "stream": true,
  "top_p": 0.95
}'
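The same request can be issued from Python. A minimal sketch, assuming the endpoint is OpenAI-compatible as the curl example suggests and that `QUBRID_API_KEY` is set in the environment; the payload mirrors the call above exactly.

```python
import json
import os
import urllib.request

API_URL = "https://platform.qubrid.com/v1/chat/completions"

def build_payload(prompt: str) -> dict:
    """Mirror the curl example: same model, sampling settings, and streaming flag."""
    return {
        "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1,
        "max_tokens": 16000,
        "stream": True,
        "top_p": 0.95,
    }

def build_request(prompt: str) -> urllib.request.Request:
    """Assemble the HTTP request; pass the result to urllib.request.urlopen() to send it."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ.get('QUBRID_API_KEY', '')}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

if __name__ == "__main__":
    req = build_request("Explain quantum computing in simple terms")
    print(req.full_url)
```

With `stream: true`, the response arrives as server-sent events; a production client would read `urlopen(req)` line by line rather than waiting for the full body.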

Technical Specifications

Model Architecture & Performance

Variant Instruct
Model Size 120B params (12B active)
Context Length 256K Tokens (up to 1M)
Quantization FP8
Tokens/Second 451
Architecture LatentMoE — Mamba-2 + MoE + Attention hybrid with Multi-Token Prediction (MTP); 512 experts, 22 active per token; 120B total / 12B active parameters
Precision FP8
License NVIDIA Nemotron Open Model License
Release Date March 11, 2026
Developers NVIDIA Corporation
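The active-parameter figures above imply a small compute fraction per token. A quick check of the arithmetic, using only the spec numbers from the table:

```python
# Figures from the architecture row above
total_params_b = 120   # total parameters, billions
active_params_b = 12   # active per token, billions
experts_total = 512
experts_active = 22

active_fraction = active_params_b / total_params_b  # fraction of weights used per token
expert_fraction = experts_active / experts_total    # fraction of experts routed per token

print(f"{active_fraction:.0%} of parameters active per token")   # 10%
print(f"{expert_fraction:.1%} of experts routed per token")       # 4.3%
```

This 10% active fraction is what lets a 120B-parameter model run with roughly the per-token compute of a 12B dense model.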

Pricing

Pay-per-use, no commitments

Input Tokens $0.10/1M Tokens
Output Tokens $0.50/1M Tokens
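At these rates, per-request cost is easy to estimate. A minimal sketch, with the rates copied from the table above and illustrative token counts:

```python
INPUT_RATE = 0.10 / 1_000_000    # USD per input token
OUTPUT_RATE = 0.50 / 1_000_000   # USD per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the listed pay-per-use rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: a long-context request with 200K input tokens and 4K generated tokens
print(f"${request_cost(200_000, 4_000):.4f}")  # $0.0220
```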

API Reference

Complete parameter documentation

Parameter Type Default Description
stream boolean true Enable streaming responses for real-time output.
temperature number 1 Controls randomness in output. Recommended: 1.0 for all tasks.
max_tokens number 16000 Maximum tokens to generate.
top_p number 0.95 Controls nucleus sampling. Recommended: 0.95 for all tasks.
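To make the `top_p` parameter concrete: nucleus sampling keeps only the smallest set of highest-probability tokens whose cumulative probability reaches `top_p`, then renormalizes before sampling. A generic illustration of the technique, not this model's internal implementation:

```python
def nucleus_filter(probs: dict, top_p: float = 0.95) -> dict:
    """Keep the smallest set of tokens whose cumulative probability >= top_p,
    then renormalize so the kept probabilities sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

dist = {"the": 0.50, "a": 0.30, "an": 0.15, "Xy": 0.05}
print(nucleus_filter(dist, top_p=0.95))  # the 0.05 tail token is dropped
```

With `top_p=0.95`, low-probability tail tokens are excluded, which is why this setting trims degenerate outputs while leaving most of the distribution intact.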

Performance

Strengths & considerations

Strengths

LatentMoE: 512 experts / 22 active per token at the same compute cost as a standard MoE
2.2x throughput vs GPT-OSS-120B; 7.5x vs Qwen3.5-122B
60.47% SWE-Bench Verified (OpenHands); 83.73% MMLU-Pro; 79.23% GPQA
Native 1M-token context: 91.75% on RULER @ 1M vs 22.30% for GPT-OSS-120B
MTP speculative decoding: 3.45 average acceptance length (up to 3x wall-clock speedup)
Configurable reasoning mode via enable_thinking=True/False in the chat template

Considerations

Requires a minimum of 2× H100-80GB GPUs for local deployment
Thinking mode adds latency overhead; low-effort mode recommended for simple queries
Not optimized for vision or multimodal inputs
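Configurable reasoning is exposed through the chat template's `enable_thinking` flag. A hedged sketch of how a request body might toggle it; the `chat_template_kwargs` wrapper follows common vLLM-style serving conventions and is an assumption here, so check the platform's API reference for the exact field name.

```python
def build_chat_request(prompt: str, thinking: bool) -> dict:
    """Assemble a chat request, passing enable_thinking via chat_template_kwargs.
    (The chat_template_kwargs field name is an assumption, not confirmed for this API.)"""
    return {
        "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": thinking},
        # Disabling thinking for simple queries avoids the latency overhead
        # noted under Considerations.
        "max_tokens": 16000,
    }

fast = build_chat_request("What port does HTTPS use?", thinking=False)
deep = build_chat_request("Plan a multi-step database migration.", thinking=True)
print(fast["chat_template_kwargs"], deep["chat_template_kwargs"])
```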

Use cases

Recommended applications for this model

Agentic workflows & multi-agent collaboration
Long-context reasoning (up to 1M tokens)
IT ticket automation & high-volume enterprise workloads
Complex tool use & multi-step function calling
RAG (Retrieval-Augmented Generation)
Software engineering & cybersecurity triaging
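For the tool-use and function-calling workloads above, a request typically attaches a tool schema the model can call into. A minimal sketch assuming OpenAI-style `tools` definitions are accepted; the `get_ticket_status` tool is hypothetical, included only to show the schema shape for an IT-automation use case.

```python
def build_tool_request(prompt: str) -> dict:
    """Chat request with one illustrative tool definition.
    (get_ticket_status is a hypothetical example tool, not a real API.)"""
    return {
        "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_ticket_status",
                    "description": "Look up the status of an IT ticket by ID.",
                    "parameters": {
                        "type": "object",
                        "properties": {"ticket_id": {"type": "string"}},
                        "required": ["ticket_id"],
                    },
                },
            }
        ],
    }

req = build_tool_request("Is ticket INC-1234 resolved?")
print(req["tools"][0]["function"]["name"])
```

When the model decides a tool is needed, the response carries a structured tool call instead of plain text; the agent executes it and feeds the result back as a follow-up message.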

Enterprise
Platform Integration

Docker

Docker Support

Official Docker images for containerized deployments

Kubernetes

Kubernetes Ready

Production-grade Kubernetes manifests and Helm charts

SDK

SDK Libraries

Official SDKs for Python, JavaScript, Go, and Java

Don't let your AI control you. Control your AI the Qubrid way!

Have questions? Want to partner with us? Looking for larger deployments or custom fine-tuning? Let's collaborate on the right setup for your workloads.
