Qwen/Qwen3.5-35B-A3B

Qwen3.5-35B-A3B is the breakout model of the Qwen3.5 Medium Series and arguably the biggest efficiency breakthrough in recent open-source AI. Despite having only 3B active parameters per token (8.6% of total), it outperforms the previous generation's 235B model on most benchmarks, as well as GPT-5 mini and Claude Sonnet 4.5 on knowledge (MMMLU) and visual reasoning (MMMU-Pro). It runs on an 8GB GPU and supports 256K context natively.


api_example.sh

curl -X POST "https://platform.qubrid.com/v1/chat/completions" \
  -H "Authorization: Bearer QUBRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "Qwen/Qwen3.5-35B-A3B",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What is in this image? Describe the main elements."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
          }
        }
      ]
    }
  ],
  "max_tokens": 8192,
  "temperature": 0.6,
  "stream": true,
  "top_p": 0.95
}'
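With `stream: true`, the reply arrives incrementally as server-sent events. Assuming Qubrid follows the common OpenAI-style convention (`data: {json}` chunks terminated by `data: [DONE]`, an assumption this page does not spell out), a minimal Python sketch for extracting the text deltas from a stream:

```python
import json

def parse_sse_stream(lines):
    """Yield content deltas from OpenAI-style SSE lines.
    Assumes the `data: {json}` / `data: [DONE]` convention."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta

# Mocked stream lines (shape assumed, not captured from the live API):
sample = [
    'data: {"choices": [{"delta": {"content": "The image shows "}}]}',
    'data: {"choices": [{"delta": {"content": "the Statue of Liberty."}}]}',
    'data: [DONE]',
]
print("".join(parse_sse_stream(sample)))
```

In a real client you would iterate over the HTTP response's lines instead of a mocked list.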

Technical Specifications

Model Architecture & Performance

Variant: Instruct
Model Size: 35B params (3B active)
Context Length: 256K tokens (up to 1M)
Quantization: bf16 / 4-bit
Tokens/Second: 350
Architecture: Hybrid Gated DeltaNet + sparse MoE Transformer; 60 layers; 256 experts (8 routed + 1 shared per token); 3:1 linear-to-full attention ratio; early fusion vision encoder; Multi-Token Prediction (MTP)
Precision: bf16 (4-bit quantization available via GGUF)
License: Apache 2.0
Release Date: February 24, 2026
Developers: Alibaba Cloud (QwenLM)
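The routing numbers above can be sanity-checked with a toy sketch: for each token, a gate scores all 256 experts and the top 8 are activated alongside 1 always-on shared expert, which is why only about 8.6% of the weights (3B of 35B) run per token. This is an illustrative top-k selection in plain Python, not the model's actual gating code:

```python
import random

NUM_EXPERTS = 256
TOP_K = 8    # routed experts selected per token
SHARED = 1   # always-on shared expert

def route_token(gate_logits):
    """Pick the top-k routed experts for one token from its gate logits."""
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    return ranked[:TOP_K]

random.seed(0)
logits = [random.random() for _ in range(NUM_EXPERTS)]
active = route_token(logits)
print(len(active) + SHARED)       # 9 experts active per token
print(round(3 / 35 * 100, 1))     # 8.6 (% of total weights active)
```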

Pricing

Pay-per-use, no commitments

Input Tokens: $0.00025 / 1K tokens
Output Tokens: $0.002 / 1K tokens
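A quick sketch for estimating request cost from these rates; the per-1K prices are taken from the table above, and the token counts are inputs you supply:

```python
INPUT_PER_1K = 0.00025   # $ per 1K input tokens
OUTPUT_PER_1K = 0.002    # $ per 1K output tokens

def estimate_cost(input_tokens, output_tokens):
    """Estimated charge in dollars for one request."""
    return (input_tokens / 1000 * INPUT_PER_1K
            + output_tokens / 1000 * OUTPUT_PER_1K)

# e.g. a 10K-token prompt with a 2K-token reply:
print(f"${estimate_cost(10_000, 2_000):.4f}")  # $0.0065
```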

API Reference

Complete parameter documentation

Parameter Type Default Description
stream boolean true Enable streaming responses for real-time output.
temperature number 0.6 Use 0.6 for non-thinking mode, 1.0 for thinking/reasoning mode.
max_tokens number 8192 Maximum number of tokens to generate.
top_p number 0.95 Nucleus sampling parameter.
top_k number 20 Limits token sampling to top-k candidates.
enable_thinking boolean false Toggle chain-of-thought reasoning. Set temperature=1.0 when enabled.
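The defaults above can be bundled into a small helper that also applies the documented temperature rule for thinking mode (0.6 off, 1.0 on). The function name and structure are illustrative, not part of any official SDK:

```python
def build_request(messages, enable_thinking=False, **overrides):
    """Assemble a chat-completions payload using the defaults
    documented in the parameter table; overrides win."""
    params = {
        "model": "Qwen/Qwen3.5-35B-A3B",
        "messages": messages,
        "stream": True,
        "temperature": 1.0 if enable_thinking else 0.6,
        "max_tokens": 8192,
        "top_p": 0.95,
        "top_k": 20,
        "enable_thinking": enable_thinking,
    }
    params.update(overrides)
    return params

req = build_request(
    [{"role": "user", "content": "Prove sqrt(2) is irrational."}],
    enable_thinking=True,
)
print(req["temperature"])  # 1.0
```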

Explore the full request and response schema in our external API documentation

Performance

Strengths & considerations

Strengths

Beats Qwen3-235B-A22B with only 3B active params, a historic efficiency gain
Outperforms GPT-5 mini and Claude Sonnet 4.5 on MMMLU and MMMU-Pro
35B total / 3B active: 256 experts, 8 routed + 1 shared per token
Runs on an 8GB GPU (4-bit quantization) or a 22GB Mac M-series
256K context natively, extensible to 1M tokens
Near-lossless 4-bit quantization
Apache 2.0 license, fully open source

Considerations

MoE routing overhead vs. dense models on short contexts
4-bit quantization needed for edge/consumer deployment
Thinking mode generates verbose traces that increase latency
Requires framework support for hybrid DeltaNet attention

Use cases

Recommended applications for this model

Consumer and edge device deployment (8GB GPU)
Agentic coding and tool-calling workflows
Multimodal chat (text, image, video via early fusion)
Cost-efficient enterprise inference at scale
Long-context document analysis
Complex reasoning with thinking mode

Enterprise
Platform Integration

Docker

Docker Support

Official Docker images for containerized deployments

Kubernetes

Kubernetes Ready

Production-grade K8s manifests and Helm charts

SDK

SDK Libraries

Official SDKs for Python, JavaScript, Go, and Java

Don't let your AI control you. Control your AI the Qubrid way!

Have questions? Want to partner with us? Looking for larger deployments or custom fine-tuning? Let's collaborate on the right setup for your workloads.

"Qubrid enabled us to deploy production AI agents with reliable tool-calling and step tracing. We now ship agents faster with full visibility into every decision and API call."

AI Agents Team

Agent Systems & Orchestration