Qwen/Qwen-Image
Qwen-Image is a 20B-parameter MMDiT (Multimodal Diffusion Transformer) image-generation foundation model; all 20B parameters are activated (it is not a mixture-of-experts). It was developed by Alibaba's Tongyi Qwen team and released on August 4, 2025 under the Apache 2.0 license.

The model uses Qwen2.5-VL-7B as its text encoder (qwen_2.5_vl_7b_fp8_scaled.safetensors), a full vision-language model with a SwiGLU ViT, mRoPE, and window attention, enabling deep multilingual semantic understanding well beyond standard T5/CLIP encoders. Its VAE (qwen_image_vae.safetensors) is custom-built for high-fidelity spatial encoding.

Qwen-Image supports text-to-image generation, image-to-image editing (appearance and semantic), LoRA fine-tuning, and image-understanding tasks (object detection, segmentation, depth/edge estimation, novel view synthesis, super-resolution). ControlNet support (Canny, Depth, Inpaint) is available via DiffSynth-Studio patches.
Technical Specifications
Model Architecture & Performance
Pricing
Pay-per-use, no commitments
API Reference
Complete parameter documentation
| Parameter | Type | Default | Description |
|---|---|---|---|
| enhance_prompt | boolean | false | Automatically enhance the prompt with positive magic for better composition, detail, and visual quality. |
| go_fast | boolean | true | Run faster predictions with additional inference optimizations. Slight quality trade-off for significantly faster generation. |
| guidance | number | 3 | Controls how closely the model follows the prompt. Higher values produce more literal interpretation; lower values allow more creative freedom. |
| negative_prompt | string | null | Describe what to avoid in the generated image (e.g. 'blurry, low quality, distorted, watermark'). |
| num_inference_steps | number | 30 | Number of denoising steps. More steps produce higher quality but slower generation. |
| seed | number | null | Random seed for reproducible generation. Leave unset (null) for random results. |
| disable_safety_checker | boolean | false | Disable the built-in safety checker for generated images. |
| image | string | null | URL of an input image to use for image-to-image generation. When provided, the model edits or transforms the input image based on the prompt. |
| strength | number | 0.9 | Controls how much the output deviates from the input image in img2img mode. 1.0 = fully regenerated, 0.0 = unchanged. |
| lora_weights | string | null | URL to a LoRA weights file (.safetensors, .tar, or .zip) to apply custom styles or concepts to the generation. |
| lora_scale | number | 1 | Strength of the primary LoRA application. Higher values apply the LoRA style more strongly. |
| extra_lora_weights | array | null | Array of additional LoRA weight URLs to stack on top of the primary LoRA. Supports .safetensors, .tar, or .zip formats. |
| extra_lora_scale | array | null | Array of scale values for the extra LoRAs. Must exactly match the number of entries in extra_lora_weights. |
| aspect_ratio | string | 16:9 | Aspect ratio of the output image. Options: 1:1, 16:9, 9:16, 4:3, 3:4. |
| image_size | string | optimize_for_quality | Resolution optimization mode. 'optimize_for_quality' uses higher resolution for better detail; 'optimize_for_speed' uses lower resolution for faster generation. |
| output_format | string | webp | Format of the generated image. Options: webp, jpg, png. |
| output_quality | number | 80 | Compression quality for jpg/webp output (0–100). Higher values retain more detail. |
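As an illustration, the defaults above can be assembled into a request payload client-side, including the length check that the API applies to stacked LoRAs. Only the parameter names and default values come from the table; the `build_request` helper and the payload shape are a hypothetical sketch, not this API's SDK:

```python
def build_request(prompt: str, **overrides) -> dict:
    """Assemble a generation request using the documented defaults.

    Hypothetical helper for illustration: the payload dict shape is an
    assumption; parameter names and defaults match the table above.
    """
    payload = {
        "prompt": prompt,
        "enhance_prompt": False,
        "go_fast": True,
        "guidance": 3,
        "num_inference_steps": 30,
        "strength": 0.9,
        "lora_scale": 1,
        "aspect_ratio": "16:9",
        "image_size": "optimize_for_quality",
        "output_format": "webp",
        "output_quality": 80,
    }
    payload.update(overrides)

    # Stacked LoRAs: each URL in extra_lora_weights needs exactly one
    # matching scale in extra_lora_scale, or server-side validation fails.
    weights = payload.get("extra_lora_weights") or []
    scales = payload.get("extra_lora_scale") or []
    if len(weights) != len(scales):
        raise ValueError(
            "extra_lora_scale must have the same length as extra_lora_weights"
        )
    return payload
```

Checking the LoRA arrays before sending saves a round trip, since a mismatch is rejected by the API anyway.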
Explore the full request and response schema in our external API documentation
Performance
Strengths & considerations
| Strengths | Considerations |
|---|---|
| 20B fully-activated MMDiT, state-of-the-art across image generation and editing benchmarks | 20B parameters require ~24GB+ VRAM at bfloat16; NF4/fp8 quantization introduces some artifacts |
| Industry-leading multilingual text rendering (English, Chinese, Korean, Japanese) with layout coherence | Distilled variant (non-official) may degrade quality on complex scenes |
| Qwen2.5-VL-7B text encoder for rich, deep cross-modal semantic conditioning | enhance_prompt is opinionated and may alter intended artistic style |
| Native img2img pipeline with tunable strength | extra_lora_scale must exactly match the length of extra_lora_weights or validation fails |
| Native LoRA support: stack multiple LoRAs with independent scales | May reflect biases present in training data |
| Apache 2.0 license, fully open for commercial use | Not designed to produce factually accurate or grounded outputs |
| DiffSynth layer-by-layer offload enables inference in as little as 4GB VRAM | |
| ControlNet support via DiffSynth (Canny, Depth, Inpaint) | |
| Distilled variants available (15-step at cfg 1.0; Lightning LoRA at 8 steps) | |
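The img2img strength trade-off noted above follows the usual latent-diffusion convention: strength sets what fraction of the noise schedule is actually traversed, so it also scales the denoising work performed. A rough sketch of that mapping (diffusers-style behavior, assumed here rather than documented for this specific API):

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Estimate the denoising steps actually run in img2img mode.

    In typical latent-diffusion img2img implementations, only the last
    `strength` fraction of the schedule is executed: strength 1.0
    regenerates from (near) pure noise, strength 0.0 leaves the input
    image essentially unchanged.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    return min(int(num_inference_steps * strength), num_inference_steps)
```

With the defaults (num_inference_steps=30, strength=0.9), roughly 27 of the 30 scheduled steps apply, which is why high-strength edits cost nearly as much as full text-to-image generation.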
Use cases
Recommended applications for this model
Enterprise
Platform Integration
Docker Support
Official Docker images for containerized deployments
Kubernetes Ready
Production-grade Kubernetes manifests and Helm charts
SDK Libraries
Official SDKs for Python, JavaScript, Go, and Java
Don't let your AI control you. Control your AI the Qubrid way!
Have questions? Want to partner with us? Looking for larger deployments or custom fine-tuning? Let's collaborate on the right setup for your workloads.
"Qubrid enabled us to deploy production AI agents with reliable tool-calling and step tracing. We now ship agents faster with full visibility into every decision and API call."
AI Agents Team
Agent Systems & Orchestration
