nvidia/NVIDIA-Nemotron-3-Super-120B-A12B

NVIDIA Nemotron-3-Super-120B-A12B is an open-weight LLM built for agentic reasoning and high-volume workloads. Using a hybrid LatentMoE architecture (Mamba-2 + MoE + Attention) with Multi-Token Prediction (MTP) and native NVFP4 pretraining on 25T tokens, it delivers up to 2.2x higher throughput than GPT-OSS-120B and 7.5x higher than Qwen3.5-122B. With a native 1M-token context window and configurable thinking mode, it is purpose-built for collaborative agents, long-context reasoning, and IT automation across 7 languages.


api_example.sh

curl -X POST "https://platform.qubrid.com/v1/chat/completions" \
  -H "Authorization: Bearer QUBRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
  "messages": [
    {
      "role": "user",
      "content": "Explain quantum computing in simple terms"
    }
  ],
  "temperature": 1,
  "max_tokens": 16000,
  "stream": true,
  "top_p": 0.95
}'
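The same request can be issued from Python. A minimal sketch, assuming the endpoint is OpenAI-compatible as the curl example suggests and that `QUBRID_API_KEY` is set in the environment; the payload mirrors the call above exactly.

```python
import json
import os
import urllib.request

API_URL = "https://platform.qubrid.com/v1/chat/completions"

def build_payload(prompt: str) -> dict:
    """Mirror the curl example: same model, sampling settings, and streaming flag."""
    return {
        "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1,
        "max_tokens": 16000,
        "stream": True,
        "top_p": 0.95,
    }

def build_request(prompt: str) -> urllib.request.Request:
    """Assemble the HTTP request; pass the result to urllib.request.urlopen() to send it."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ.get('QUBRID_API_KEY', '')}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

if __name__ == "__main__":
    req = build_request("Explain quantum computing in simple terms")
    print(req.full_url)
```

With `stream: true`, the response arrives as server-sent events; a production client would read `urlopen(req)` line by line rather than waiting for the full body.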

Technical Specifications

Model Architecture & Performance

Variant Instruct
Model Size 120B params (12B active)
Context Length 256K Tokens (up to 1M)
Quantization FP8
Tokens/Second 451
Architecture LatentMoE — Mamba-2 + MoE + Attention hybrid with Multi-Token Prediction (MTP); 512 experts, 22 active per token; 120B total / 12B active parameters
Precision FP8
License NVIDIA Nemotron Open Model License
Release Date March 11, 2026
Developers NVIDIA Corporation
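The active-parameter figures above imply a small compute fraction per token. A quick check of the arithmetic, using only the spec numbers from the table:

```python
# Figures from the architecture row above
total_params_b = 120   # total parameters, billions
active_params_b = 12   # active per token, billions
experts_total = 512
experts_active = 22

active_fraction = active_params_b / total_params_b  # fraction of weights used per token
expert_fraction = experts_active / experts_total    # fraction of experts routed per token

print(f"{active_fraction:.0%} of parameters active per token")   # 10%
print(f"{expert_fraction:.1%} of experts routed per token")       # 4.3%
```

This 10% active fraction is what lets a 120B-parameter model run with roughly the per-token compute of a 12B dense model.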

Pricing

Pay-per-use, no commitments

Input Tokens $0.10/1M Tokens
Output Tokens $0.50/1M Tokens
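At these rates, per-request cost is easy to estimate. A minimal sketch, with the rates copied from the table above and illustrative token counts:

```python
INPUT_RATE = 0.10 / 1_000_000    # USD per input token
OUTPUT_RATE = 0.50 / 1_000_000   # USD per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the listed pay-per-use rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: a long-context request with 200K input tokens and 4K generated tokens
print(f"${request_cost(200_000, 4_000):.4f}")  # $0.0220
```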

API Reference

Complete parameter documentation

Parameter Type Default Description
stream boolean true Enable streaming responses for real-time output.
temperature number 1 Controls randomness in output. Recommended: 1.0 for all tasks.
max_tokens number 16000 Maximum tokens to generate.
top_p number 0.95 Controls nucleus sampling. Recommended: 0.95 for all tasks.
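To make the `top_p` parameter concrete: nucleus sampling keeps only the smallest set of highest-probability tokens whose cumulative probability reaches `top_p`, then renormalizes before sampling. A generic illustration of the technique, not this model's internal implementation:

```python
def nucleus_filter(probs: dict, top_p: float = 0.95) -> dict:
    """Keep the smallest set of tokens whose cumulative probability >= top_p,
    then renormalize so the kept probabilities sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

dist = {"the": 0.50, "a": 0.30, "an": 0.15, "Xy": 0.05}
print(nucleus_filter(dist, top_p=0.95))  # the 0.05 tail token is dropped
```

With `top_p=0.95`, low-probability tail tokens are excluded, which is why this setting trims degenerate outputs while leaving most of the distribution intact.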

Performance

Strengths & considerations

Strengths

LatentMoE: 512 experts / 22 active per token at the same compute cost as a standard MoE
2.2x throughput vs GPT-OSS-120B; 7.5x vs Qwen3.5-122B
60.47% SWE-Bench Verified (OpenHands); 83.73% MMLU-Pro; 79.23% GPQA
Native 1M-token context: 91.75% on RULER @ 1M vs 22.30% for GPT-OSS-120B
MTP speculative decoding: 3.45 average acceptance length (up to 3x wall-clock speedup)
Configurable reasoning mode via enable_thinking=True/False in the chat template

Considerations

Requires a minimum of 2× H100-80GB GPUs for local deployment
Thinking mode adds latency overhead; low-effort mode recommended for simple queries
Not optimized for vision or multimodal inputs
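Configurable reasoning is exposed through the chat template's `enable_thinking` flag. A hedged sketch of how a request body might toggle it; the `chat_template_kwargs` wrapper follows common vLLM-style serving conventions and is an assumption here, so check the platform's API reference for the exact field name.

```python
def build_chat_request(prompt: str, thinking: bool) -> dict:
    """Assemble a chat request, passing enable_thinking via chat_template_kwargs.
    (The chat_template_kwargs field name is an assumption, not confirmed for this API.)"""
    return {
        "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": thinking},
        # Disabling thinking for simple queries avoids the latency overhead
        # noted under Considerations.
        "max_tokens": 16000,
    }

fast = build_chat_request("What port does HTTPS use?", thinking=False)
deep = build_chat_request("Plan a multi-step database migration.", thinking=True)
print(fast["chat_template_kwargs"], deep["chat_template_kwargs"])
```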

Use cases

Recommended applications for this model

Agentic workflows & multi-agent collaboration
Long-context reasoning (up to 1M tokens)
IT ticket automation & high-volume enterprise workloads
Complex tool use & multi-step function calling
RAG (Retrieval-Augmented Generation)
Software engineering & cybersecurity triaging
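For the tool-use and function-calling workloads above, a request typically attaches a tool schema the model can call into. A minimal sketch assuming OpenAI-style `tools` definitions are accepted; the `get_ticket_status` tool is hypothetical, included only to show the schema shape for an IT-automation use case.

```python
def build_tool_request(prompt: str) -> dict:
    """Chat request with one illustrative tool definition.
    (get_ticket_status is a hypothetical example tool, not a real API.)"""
    return {
        "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_ticket_status",
                    "description": "Look up the status of an IT ticket by ID.",
                    "parameters": {
                        "type": "object",
                        "properties": {"ticket_id": {"type": "string"}},
                        "required": ["ticket_id"],
                    },
                },
            }
        ],
    }

req = build_tool_request("Is ticket INC-1234 resolved?")
print(req["tools"][0]["function"]["name"])
```

When the model decides a tool is needed, the response carries a structured tool call instead of plain text; the agent executes it and feeds the result back as a follow-up message.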

Enterprise
Platform Integration

Docker

Docker Support

Official Docker images for containerized deployments

Kubernetes

Kubernetes Ready

Production-grade Kubernetes manifests and Helm charts

SDK

SDK Libraries

Official SDKs for Python, JavaScript, Go, and Java

Don't let your AI control you. Control your AI the Qubrid way!

Have questions? Want to partner with us? Looking for larger deployments or custom fine-tuning? Let's collaborate on the right setup for your workloads.
