Qwen/Qwen-Image

Qwen-Image is a 20B-parameter MMDiT (Multimodal Diffusion Transformer) image generation foundation model; all 20B parameters are activated (it is not an MoE). It was developed by Alibaba's Tongyi Qwen team and released on August 4, 2025 under the Apache 2.0 license. Its text encoder is Qwen2.5-VL-7B (qwen_2.5_vl_7b_fp8_scaled.safetensors), a full vision-language model with a SwiGLU ViT, mRoPE, and window attention, giving it far deeper multilingual semantic understanding than standard T5/CLIP encoders. The model's VAE (qwen_image_vae.safetensors) is custom-built for high-fidelity spatial encoding.

Qwen-Image supports text-to-image generation, image-to-image editing (appearance and semantic), LoRA fine-tuning, and image understanding tasks (object detection, segmentation, depth/edge estimation, novel view synthesis, super-resolution). ControlNet support (Canny, Depth, Inpaint) is available via DiffSynth-Studio patches.

Provider: Alibaba Cloud · Modality: Image · Context: N/A
Free Trial Credit: $1.00 on first top-up of minimum $5

api_example.sh

curl -X POST "https://platform.qubrid.com/v1/images/generations" \
  -H "Authorization: Bearer $QUBRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "qwen-image",
  "prompt": "cinematic shot of a lone astronaut standing on a desolate alien planet, glowing orange sunset sky, dust storms swirling, dramatic lighting, ultra-wide lens composition, movie still aesthetic, realistic space suit details, volumetric atmosphere, 8k sci-fi film scene",
  "enhance_prompt": false,
  "go_fast": true,
  "guidance": 3,
  "negative_prompt": "",
  "num_inference_steps": 30,
  "seed": null,
  "disable_safety_checker": false,
  "image": null,
  "strength": 0.9,
  "lora_weights": null,
  "lora_scale": 1,
  "extra_lora_weights": null,
  "extra_lora_scale": null,
  "aspect_ratio": "16:9",
  "image_size": "optimize_for_quality",
  "output_format": "webp",
  "output_quality": 80
}'
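The same request can be issued from Python using only the standard library. This is a minimal sketch assuming the endpoint and fields shown in api_example.sh above, with the API key read from a QUBRID_API_KEY environment variable:

```python
import json
import os
import urllib.request

API_URL = "https://platform.qubrid.com/v1/images/generations"

def build_request(prompt: str, **overrides) -> urllib.request.Request:
    """Build the POST request from api_example.sh; extra keyword
    arguments override or extend the default payload fields."""
    payload = {
        "model": "qwen-image",
        "prompt": prompt,
        "guidance": 3,
        "num_inference_steps": 30,
        "aspect_ratio": "16:9",
        "output_format": "webp",
        "output_quality": 80,
    }
    payload.update(overrides)
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ.get('QUBRID_API_KEY', '')}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_request("a watercolor fox in a snowy forest", seed=42)
# urllib.request.urlopen(req)  # uncomment to actually send the request
```

Constructing the `Request` object performs no network I/O, so payloads can be built and inspected before spending credits.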

Technical Specifications

Model Architecture & Performance

Variant Base T2I (20B fully activated, non-MoE)
Model Size 20B parameters (~21.8 GB at Q8, ~11.9 GB at Q4)
Quantization None by default (official fp8 release; NF4 and GGUF community variants available)
Architecture MMDiT (Multimodal Diffusion Transformer) — 20B fully-activated parameters, non-MoE, diffusion backbone with Qwen2.5-VL-7B cross-modal text conditioning
Precision bfloat16 / fp8
License Apache 2.0
Release Date August 4, 2025
Developers Alibaba Tongyi Qwen Team

Pricing

Pay-per-use, no commitments

Per Image $0.025/Image
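At a flat per-image rate, estimating batch cost is simple arithmetic; a quick sketch:

```python
PRICE_PER_IMAGE = 0.025  # USD per image, per the rate above

def batch_cost(num_images: int) -> float:
    """Estimated cost in USD for a batch of generations, rounded to cents."""
    return round(num_images * PRICE_PER_IMAGE, 2)

batch_cost(1000)  # 1,000 images -> $25.00
```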

API Reference

Complete parameter documentation

Parameter Type Default Description
enhance_prompt boolean false Automatically enhance the prompt with positive magic for better composition, detail, and visual quality.
go_fast boolean true Run faster predictions with additional inference optimizations. Slight quality trade-off for significantly faster generation.
guidance number 3 Controls how closely the model follows the prompt. Higher values produce more literal interpretation; lower values allow more creative freedom.
negative_prompt string "" Describe what to avoid in the generated image (e.g. 'blurry, low quality, distorted, watermark').
num_inference_steps number 30 Number of denoising steps. More steps produce higher quality but slower generation.
seed number null Random seed for reproducible generation. Leave unset (null) for random results.
disable_safety_checker boolean false Disable the built-in safety checker for generated images.
image string null URL of an input image to use for image-to-image generation. When provided, the model edits or transforms the input image based on the prompt.
strength number 0.9 Controls how much the output deviates from the input image in img2img mode. 1.0 = fully regenerated, 0.0 = unchanged.
lora_weights string null URL to a LoRA weights file (.safetensors, .tar, or .zip) to apply custom styles or concepts to the generation.
lora_scale number 1 Strength of the primary LoRA application. Higher values apply the LoRA style more strongly.
extra_lora_weights array null Array of additional LoRA weight URLs to stack on top of the primary LoRA. Supports .safetensors, .tar, or .zip formats.
extra_lora_scale array null Array of scale values for the extra LoRAs. Must exactly match the number of entries in extra_lora_weights.
aspect_ratio string 16:9 Aspect ratio of the output image. Options: 1:1, 16:9, 9:16, 4:3, 3:4.
image_size string optimize_for_quality Resolution optimization mode. 'optimize_for_quality' uses higher resolution for better detail; 'optimize_for_speed' uses lower resolution for faster generation.
output_format string webp Format of the generated image. Options: webp, jpg, png.
output_quality number 80 Compression quality for jpg/webp output (0–100). Higher values retain more detail.
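Several of the constraints in the table above (the extra_lora_scale length rule, and the documented value ranges) can be checked client-side before a request is sent. A minimal sketch; the parameter names come from the table, and the range checks follow its descriptions:

```python
VALID_ASPECT_RATIOS = {"1:1", "16:9", "9:16", "4:3", "3:4"}

def validate_params(params: dict) -> None:
    """Raise ValueError for payloads the parameter table above would reject."""
    extra_weights = params.get("extra_lora_weights") or []
    extra_scales = params.get("extra_lora_scale") or []
    if len(extra_weights) != len(extra_scales):
        raise ValueError(
            f"extra_lora_scale has {len(extra_scales)} entries but "
            f"extra_lora_weights has {len(extra_weights)}; counts must match"
        )
    strength = params.get("strength", 0.9)
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    quality = params.get("output_quality", 80)
    if not 0 <= quality <= 100:
        raise ValueError("output_quality must be between 0 and 100")
    if params.get("aspect_ratio", "16:9") not in VALID_ASPECT_RATIOS:
        raise ValueError("unsupported aspect_ratio")

# Passes: one extra LoRA URL paired with exactly one scale value.
validate_params({
    "strength": 0.9,
    "extra_lora_weights": ["https://example.com/style.safetensors"],
    "extra_lora_scale": [0.7],
})
```

Catching the length mismatch locally avoids a round trip that would fail server-side validation.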

Explore the full request and response schema in our external API documentation

Performance

Strengths & considerations

Strengths

20B fully-activated MMDiT — state-of-the-art across image generation and editing benchmarks
Industry-leading multilingual text rendering: English, Chinese, Korean, Japanese with layout coherence
Qwen2.5-VL-7B text encoder for rich, deep cross-modal semantic conditioning
Native img2img pipeline with tunable strength
Native LoRA support — stack multiple LoRAs with independent scales
Apache 2.0 license — fully open for commercial use
DiffSynth layer-by-layer offload enables inference on as little as 4GB of VRAM
ControlNet support via DiffSynth (Canny, Depth, Inpaint)
Distilled variants available (15-step at cfg 1.0; Lightning LoRA at 8 steps)

Considerations

20B parameters require ~24GB+ VRAM at bfloat16; NF4/fp8 quantization introduces some artifacts
Distilled variants (non-official) may degrade quality on complex scenes
enhance_prompt is opinionated and may alter the intended artistic style
extra_lora_scale must exactly match the length of extra_lora_weights or validation fails
May reflect biases present in training data
Not designed to produce factually accurate or grounded outputs

Use cases

Recommended applications for this model

Text-to-image generation with complex multilingual text rendering (EN, ZH, KO, JA)
Image-to-image editing — style transfer, object insertion/removal, text editing, pose manipulation
Poster, infographic, and advertising creative generation with embedded text
LoRA fine-tuning for custom styles and branded concepts
Image understanding — detection, segmentation, depth/edge estimation, super-resolution
Photorealistic, anime, impressionist, and minimalist style generation

Enterprise
Platform Integration

Docker

Docker Support

Official Docker images for containerized deployments

Kubernetes

Kubernetes Ready

Production-grade Kubernetes manifests and Helm charts

SDK

SDK Libraries

Official SDKs for Python, JavaScript, Go, and Java

Don't let your AI control you. Control your AI the Qubrid way!

Have questions? Want to Partner with us? Looking for larger deployments or custom fine-tuning? Let's collaborate on the right setup for your workloads.

"Qubrid enabled us to deploy production AI agents with reliable tool-calling and step tracing. We now ship agents faster with full visibility into every decision and API call."

AI Agents Team

Agent Systems & Orchestration