NVIDIA Nemotron-3 Super for the Next Generation of Agentic AI, Available on Qubrid AI

Nemotron-3 Super is a 120-billion-parameter model with 12 billion active parameters, built specifically for modern AI workloads that require planning, reasoning, and interaction with tools. The model is designed to handle the growing demands of multi-agent systems where multiple AI components collaborate to complete workflows.

This release highlights the growing shift toward agentic AI systems that can reason, plan, and execute complex workflows beyond traditional chatbots. Developers building these next-generation applications need access to powerful models without dealing with complicated infrastructure.

Through Qubrid AI, developers can instantly experiment with NVIDIA Nemotron-3 Super 120B A12B and build AI agents, reasoning systems, and large-scale automation workflows directly from the platform. Qubrid removes the need to manage GPUs or deployment pipelines, allowing teams to focus on building real AI applications.

You can try NVIDIA Nemotron-3 Super 120B A12B on Qubrid AI here:
👉 https://platform.qubrid.com/model/nvidia-nemotron-3-super-120b-a12b

The Challenges of Building Agentic AI

As companies transition from traditional chatbots to multi-agent AI systems, new challenges emerge. Two of the most important challenges are context explosion and the thinking tax.

Context Explosion

Agent workflows generate significantly more tokens than standard chat applications. Each step in a workflow requires sending the entire interaction history, including tool outputs and intermediate reasoning.

This means multi-agent systems can generate up to 15× more tokens than typical conversations, increasing compute costs and sometimes causing agents to drift away from the original goal over long workflows.

Nemotron-3 Super addresses this problem with an extremely large context window of up to one million tokens, allowing agents to retain full workflow state without repeatedly recomputing context.
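
To make the growth concrete, here is a minimal sketch of the arithmetic behind resending the full history at every step. The step count and per-step token figures are hypothetical and chosen only for illustration; they are not taken from NVIDIA's measurements.

```python
def cumulative_prompt_tokens(step_tokens):
    """Total tokens sent when every step resends the full history so far."""
    total_sent = 0
    history = 0
    for new_tokens in step_tokens:
        history += new_tokens   # this step's output joins the history
        total_sent += history   # the whole history is sent again
    return total_sent


# Hypothetical workflow: 10 steps, each adding 2,000 tokens
# of tool output and intermediate reasoning.
steps = [2_000] * 10
unique_tokens = sum(steps)                     # 20,000 tokens of new content
sent_tokens = cumulative_prompt_tokens(steps)  # 2,000 * (1 + 2 + ... + 10)
print(unique_tokens, sent_tokens, sent_tokens / unique_tokens)
# prints: 20000 110000 5.5
```

Even in this small example the tokens sent are 5.5 times the new content produced, and the ratio keeps growing with the number of steps.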

The Thinking Tax

Another challenge is the computational cost of reasoning. Complex AI systems often require reasoning at every step, but using large models continuously can make systems slow and expensive. Nemotron-3 Super is designed to reduce this cost by improving reasoning efficiency and throughput.

Benchmark Performance

For more information, see the NVIDIA Nemotron 3 Super Technical Report.

The benchmark results highlight how Nemotron-3 Super performs across multiple reasoning and agent-focused tasks. The model demonstrates strong performance in instruction following (IFBench) and mathematical reasoning (HMMT Feb25) while also delivering competitive results in coding benchmarks such as SWE-Bench.

It also performs well in scientific reasoning tasks like HLE and tool-use benchmarks such as Tau Bench, which measure how effectively a model can interact with external tools during workflows. In long-context tasks like RULER, Nemotron-3 Super maintains high accuracy even at 1 million token contexts, showing its ability to manage extremely large inputs.

Another important aspect is throughput. Compared with other large models, Nemotron-3 Super achieves significantly higher inference efficiency. This is largely due to its Latent Mixture-of-Experts architecture, where only 12B of the 120B parameters are activated during inference, allowing the model to generate tokens faster while maintaining strong reasoning capabilities.

Benchmark	Nemotron-3-Super-120B-A12B	Qwen3.5-122B-A10B	GPT-OSS-120B
Terminal Bench (Hard)	25.78	26.80	24.00
Terminal Bench Core 2.0	31.00	37.50	18.70
SWE-Bench (OpenHands)	60.47	66.40	41.90
SWE-Bench (OpenCode)	59.20	67.40	—
SWE-Bench Multilingual	45.78	—	30.80
TauBench (Average)	61.15	74.53	61.00
IFBench (Instruction Following)	72.56	73.77	68.32
Scale AI Multi-Challenge	55.23	61.50	58.29
Arena-Hard-V2	73.88	75.15	90.26
AA-LCR (Long-Context Reasoning)	58.31	66.90	51.00
RULER @256K Context	96.30	96.74	52.30
RULER @512K Context	95.67	95.95	46.70
RULER @1M Context	91.75	91.33	22.30
MMLU-ProX (Multilingual)	79.36	85.06	76.59
WMT24++ Translation	86.67	87.84	88.89

In particular, the model holds its accuracy on the RULER long-context benchmark out to 1M tokens (91.75), a range where many transformer-only models degrade significantly; GPT-OSS-120B, for example, falls to 22.30. Qwen3.5-122B leads on most of the coding, agentic, and reasoning benchmarks in the table, while Nemotron-3 Super is optimized for higher throughput and agent-based workflows, activating only 12B of its 120B parameters during inference.

A New Hybrid Architecture

Nemotron-3 Super uses a hybrid Mixture-of-Experts architecture that combines several innovations to improve both speed and accuracy.

The model integrates three major components:

Mamba Layers: These layers provide improved memory efficiency and allow the model to process long sequences more effectively.
Transformer Layers: Transformer components enable advanced reasoning capabilities and language understanding.
Mixture-of-Experts (MoE): Only 12 billion of the model’s 120 billion parameters are activated during inference, significantly improving efficiency.

The architecture also includes a technique called Latent MoE, which improves accuracy by activating multiple expert specialists while keeping computational cost low.
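
The sparse-activation idea can be sketched with a generic top-k router. This is a toy illustration of standard Mixture-of-Experts routing, not NVIDIA's Latent MoE implementation; the expert count, the value of k, and the "experts" themselves are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_forward(x, experts, router_scores, k):
    """Weighted sum of the outputs of only the selected experts."""
    weights = route_top_k(router_scores, k)
    output = sum(w * experts[i](x) for i, w in weights.items())
    return output, sorted(weights)


# Toy setup: 8 "experts" that just scale the input; 2 are evaluated per token.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
random.seed(0)
scores = [random.gauss(0, 1) for _ in experts]
output, active = moe_forward(1.0, experts, scores, k=2)
print(f"active experts: {active} ({len(active)}/{len(experts)} evaluated)")
```

Only the selected experts run for a given token, which is why a model can hold 120 billion parameters while using roughly 10% of them (12 billion) per token.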

Another key innovation is multi-token prediction, which allows the model to generate multiple tokens simultaneously, enabling up to three times faster inference speeds.

When running on the NVIDIA Blackwell platform, the model uses NVFP4 precision, reducing memory requirements and enabling inference speeds up to four times faster compared to FP8 on NVIDIA Hopper GPUs.
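
As a rough back-of-the-envelope check on the memory claim, the sketch below estimates weight storage at different precisions. It deliberately ignores things a real deployment must account for (NVFP4 scaling factors, any layers kept at higher precision, the KV cache, and activations), so treat the numbers as approximate lower bounds rather than measured figures.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 120e9  # total parameter count quoted in this article

for name, bits in [("BF16", 16), ("FP8", 8), ("NVFP4 (4-bit values only)", 4)]:
    print(f"{name}: ~{weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB")
# prints:
# BF16: ~240 GB
# FP8: ~120 GB
# NVFP4 (4-bit values only): ~60 GB
```

Halving the bits per weight halves the bytes that must be stored and moved, which is one reason lower-precision formats can improve inference speed.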

Built for Real Agent Workflows

Nemotron-3 Super is designed to operate as part of a multi-agent system, where different agents collaborate to complete tasks.

For example:

Software development agents: A development agent can load an entire codebase into memory and generate fixes without breaking the project into smaller pieces.
Financial analysis agents: AI systems can analyze thousands of pages of financial reports simultaneously.
Security automation systems: Agents can coordinate across multiple tools to perform cybersecurity analysis and automated responses.
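
Before handing a whole codebase to an agent, it can help to estimate whether it is likely to fit in the context window. The sketch below uses a rough four-characters-per-token heuristic, which is an assumption and not the model's actual tokenizer; the file extensions and the output reserve are likewise hypothetical defaults.

```python
import os

CONTEXT_WINDOW = 1_000_000  # context size cited in this article
CHARS_PER_TOKEN = 4         # rough heuristic, NOT the model's real tokenizer


def estimate_repo_tokens(root, extensions=(".py", ".md", ".toml")):
    """Roughly estimate how many tokens the text files under `root` would use."""
    total_chars = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as f:
                    total_chars += len(f.read())
    return total_chars // CHARS_PER_TOKEN


def fits_in_context(root, reserve_for_output=50_000):
    """True if the estimate leaves `reserve_for_output` tokens of headroom."""
    return estimate_repo_tokens(root) + reserve_for_output <= CONTEXT_WINDOW
```

Calling `fits_in_context("path/to/repo")` gives a quick yes/no answer; for an accurate count, the estimate should be replaced with the tokenizer that ships with the model.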

Nemotron-3 Super also includes high-accuracy tool calling, allowing agents to navigate large tool libraries reliably without generating incorrect function calls.
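
On the application side, tool calling usually means describing each tool in a schema and then safely executing whatever call the model returns. The sketch below shows that second half using the OpenAI function-calling schema. The `get_stock_price` tool, its stub data, and the registry are hypothetical, and whether a given Qubrid endpoint accepts a `tools` parameter is an assumption to verify against the platform documentation.

```python
import json

# Tool description in the OpenAI function-calling schema.
# `get_stock_price` is a hypothetical tool used only for illustration.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Return the latest price for a ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]


def get_stock_price(ticker):
    # Stub data; a real tool would query a market-data service.
    return {"ticker": ticker, "price": 123.45}


REGISTRY = {"get_stock_price": get_stock_price}


def dispatch_tool_call(name, arguments_json):
    """Validate and run one tool call; return a JSON string for the model."""
    if name not in REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json)
        return json.dumps(REGISTRY[name](**args))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or wrong argument names are reported, not raised.
        return json.dumps({"error": str(exc)})


print(dispatch_tool_call("get_stock_price", '{"ticker": "NVDA"}'))
# prints: {"ticker": "NVDA", "price": 123.45}
```

With the `openai` Python client, `TOOLS` would be passed as `tools=TOOLS` to `client.chat.completions.create`, and each entry in `response.choices[0].message.tool_calls` supplies the `function.name` and `function.arguments` for `dispatch_tool_call`. Returning errors as JSON lets the agent see and correct a bad call instead of crashing the workflow.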

Open Weights and Training Data

NVIDIA is releasing Nemotron-3 Super with open weights and a permissive license, allowing developers to deploy and customize the model across different environments. The model was trained on more than 10 trillion tokens of data, including synthetic data generated using advanced reasoning models.

NVIDIA is also publishing the full training methodology, evaluation recipes, and reinforcement learning environments used during development. Researchers can further fine-tune the model using the NVIDIA NeMo platform to build custom AI applications.

Running Nemotron Models with Qubrid AI

Running large AI models typically requires significant GPU infrastructure and complex deployment setups. Platforms like Qubrid AI simplify this process by giving developers access to advanced models through serverless APIs and an interactive playground, allowing teams to experiment without managing hardware or model infrastructure.

Qubrid AI is designed for developers who want quick results, affordable pricing, and minimal setup.

Step 1: Create a Qubrid AI Account

Start by signing up on the Qubrid AI platform: 👉 https://platform.qubrid.com

Once your account is created, you can access the model playground and API dashboard.

Step 2: Add Credits to Your Account

Top up your account with $5, and you will receive $1 worth of tokens free to explore the platform and run real workloads.

This allows developers to test models and build prototypes without committing to large infrastructure costs.

Step 3: Open the Nemotron Model Playground

You can access the Nemotron model directly from the playground:
👉 https://platform.qubrid.com/model/nvidia-nemotron-3-super-120b-a12b

From the playground, you can enter a prompt, adjust parameters if needed, and run the model to see results immediately and test its reasoning and long-context capabilities.

For example: "Write a short story about a robot learning to paint"

Step 4: Integrate the Model Using the API (Optional)

Qubrid provides OpenAI-compatible APIs, making integration into existing applications straightforward. Below is a simple Python example showing how to call the model.

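import os

from openai import OpenAI

# Initialize the OpenAI client with the Qubrid base URL.
# The API key is read from the QUBRID_API_KEY environment variable
# rather than being hard-coded in the source.
client = OpenAI(
    base_url="https://platform.qubrid.com/v1",
    api_key=os.environ["QUBRID_API_KEY"],
)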

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
    messages=[
      {
        "role": "user",
        "content": "Write a short story about a robot learning to paint"
      }
    ],
    max_tokens=500,
    temperature=0.7
)

print(response.choices[0].message.content)
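
For long generations it is often nicer to show tokens as they arrive. The helper below is a sketch that assembles text from a streamed chat completion as exposed by the `openai` Python client (`stream=True`); that the Qubrid endpoint supports streaming is an assumption to confirm in the platform documentation. The `fake_chunk` objects are stand-ins so the helper can be tried offline.

```python
from types import SimpleNamespace


def collect_stream(chunks, on_token=None):
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks carry no choices (e.g. usage only)
            continue
        delta = chunk.choices[0].delta.content
        if delta:              # content is None on role and finish chunks
            parts.append(delta)
            if on_token:
                on_token(delta)
    return "".join(parts)


# Offline stand-in for streamed chunk objects (hypothetical test data).
def fake_chunk(text):
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk("Hello"), fake_chunk(None),
        SimpleNamespace(choices=[]), fake_chunk(", world")]
print(collect_stream(demo))
# prints: Hello, world
```

With the client from the example above, the same helper could be called as `collect_stream(client.chat.completions.create(..., stream=True), on_token=lambda t: print(t, end="", flush=True))`.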

Our Thoughts

The launch of NVIDIA Nemotron-3 Super highlights the growing shift toward agentic AI systems capable of reasoning, planning, and executing complex tasks autonomously.

With its hybrid architecture, long-context reasoning capabilities, and improved efficiency, Nemotron-3 Super sets a new benchmark for models designed specifically for multi-agent workflows.

For developers exploring this new generation of AI systems, models like Nemotron-3 Super, available on Qubrid AI, provide an accessible starting point for building advanced AI applications without managing infrastructure.

You can explore all models on our platform here: 👉 https://platform.qubrid.com/models