Qwen/Qwen-Image
Qwen-Image is a 20B-parameter MMDiT (Multimodal Diffusion Transformer) image-generation foundation model; all 20B parameters are activated (it is not a mixture-of-experts). It was developed by Alibaba's Tongyi Qwen team and released on August 4, 2025 under the Apache 2.0 license.

The model uses Qwen2.5-VL-7B as its text encoder (qwen_2.5_vl_7b_fp8_scaled.safetensors), a full vision-language model with a SwiGLU ViT, mRoPE, and window attention, enabling deep multilingual semantic understanding well beyond standard T5/CLIP encoders. Its VAE (qwen_image_vae.safetensors) is custom-built for high-fidelity spatial encoding.

Qwen-Image supports text-to-image generation, image-to-image editing (appearance and semantic), LoRA fine-tuning, and image-understanding tasks (object detection, segmentation, depth/edge estimation, novel view synthesis, super-resolution). ControlNet support (Canny, Depth, Inpaint) is available via DiffSynth-Studio patches.
Technical Specifications
Model Architecture & Performance
Pricing
Pay-per-use, no commitments
API Reference
Complete parameter documentation
| Parameter | Type | Default | Description |
|---|---|---|---|
| enhance_prompt | boolean | false | Automatically enhance the prompt with positive magic for better composition, detail, and visual quality. |
| go_fast | boolean | true | Run faster predictions with additional inference optimizations. Slight quality trade-off for significantly faster generation. |
| guidance | number | 3 | Controls how closely the model follows the prompt. Higher values produce more literal interpretation; lower values allow more creative freedom. |
| negative_prompt | string | null | Describe what to avoid in the generated image (e.g. 'blurry, low quality, distorted, watermark'). |
| num_inference_steps | number | 30 | Number of denoising steps. More steps produce higher quality but slower generation. |
| seed | number | null | Random seed for reproducible generation. Leave unset (null) for random results. |
| disable_safety_checker | boolean | false | Disable the built-in safety checker for generated images. |
| image | string | null | URL of an input image to use for image-to-image generation. When provided, the model edits or transforms the input image based on the prompt. |
| strength | number | 0.9 | Controls how much the output deviates from the input image in img2img mode. 1.0 = fully regenerated, 0.0 = unchanged. |
| lora_weights | string | null | URL to a LoRA weights file (.safetensors, .tar, or .zip) to apply custom styles or concepts to the generation. |
| lora_scale | number | 1 | Strength of the primary LoRA application. Higher values apply the LoRA style more strongly. |
| extra_lora_weights | array | null | Array of additional LoRA weight URLs to stack on top of the primary LoRA. Supports .safetensors, .tar, or .zip formats. |
| extra_lora_scale | array | null | Array of scale values for the extra LoRAs. Must exactly match the number of entries in extra_lora_weights. |
| aspect_ratio | string | 16:9 | Aspect ratio of the output image. Options: 1:1, 16:9, 9:16, 4:3, 3:4. |
| image_size | string | optimize_for_quality | Resolution optimization mode. 'optimize_for_quality' uses higher resolution for better detail; 'optimize_for_speed' uses lower resolution for faster generation. |
| output_format | string | webp | Format of the generated image. Options: webp, jpg, png. |
| output_quality | number | 80 | Compression quality for jpg/webp output (0–100). Higher values retain more detail. |
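As an illustration, the defaults above can be assembled into a request payload client-side, including the length check that the API applies to stacked LoRAs. Only the parameter names and default values come from the table; the `build_request` helper and the payload shape are a hypothetical sketch, not this API's SDK:

```python
def build_request(prompt: str, **overrides) -> dict:
    """Assemble a generation request using the documented defaults.

    Hypothetical helper for illustration: the payload dict shape is an
    assumption; parameter names and defaults match the table above.
    """
    payload = {
        "prompt": prompt,
        "enhance_prompt": False,
        "go_fast": True,
        "guidance": 3,
        "num_inference_steps": 30,
        "strength": 0.9,
        "lora_scale": 1,
        "aspect_ratio": "16:9",
        "image_size": "optimize_for_quality",
        "output_format": "webp",
        "output_quality": 80,
    }
    payload.update(overrides)

    # Stacked LoRAs: each URL in extra_lora_weights needs exactly one
    # matching scale in extra_lora_scale, or server-side validation fails.
    weights = payload.get("extra_lora_weights") or []
    scales = payload.get("extra_lora_scale") or []
    if len(weights) != len(scales):
        raise ValueError(
            "extra_lora_scale must have the same length as extra_lora_weights"
        )
    return payload
```

Checking the LoRA arrays before sending saves a round trip, since a mismatch is rejected by the API anyway.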
Explore the full request and response schema in our external API documentation
Performance
Strengths & considerations
| Strengths | Considerations |
|---|---|
| 20B fully-activated MMDiT, state-of-the-art across image generation and editing benchmarks | 20B parameters require ~24GB+ VRAM at bfloat16; NF4/fp8 quantization introduces some artifacts |
| Industry-leading multilingual text rendering (English, Chinese, Korean, Japanese) with layout coherence | Distilled variant (non-official) may degrade quality on complex scenes |
| Qwen2.5-VL-7B text encoder for rich, deep cross-modal semantic conditioning | enhance_prompt is opinionated and may alter intended artistic style |
| Native img2img pipeline with tunable strength | extra_lora_scale must exactly match the length of extra_lora_weights or validation fails |
| Native LoRA support: stack multiple LoRAs with independent scales | May reflect biases present in training data |
| Apache 2.0 license, fully open for commercial use | Not designed to produce factually accurate or grounded outputs |
| DiffSynth layer-by-layer offload enables inference in as little as 4GB VRAM | |
| ControlNet support via DiffSynth (Canny, Depth, Inpaint) | |
| Distilled variants available (15-step at cfg 1.0; Lightning LoRA at 8 steps) | |
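The img2img strength trade-off noted above follows the usual latent-diffusion convention: strength sets what fraction of the noise schedule is actually traversed, so it also scales the denoising work performed. A rough sketch of that mapping (diffusers-style behavior, assumed here rather than documented for this specific API):

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Estimate the denoising steps actually run in img2img mode.

    In typical latent-diffusion img2img implementations, only the last
    `strength` fraction of the schedule is executed: strength 1.0
    regenerates from (near) pure noise, strength 0.0 leaves the input
    image essentially unchanged.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    return min(int(num_inference_steps * strength), num_inference_steps)
```

With the defaults (num_inference_steps=30, strength=0.9), roughly 27 of the 30 scheduled steps apply, which is why high-strength edits cost nearly as much as full text-to-image generation.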
Use cases
Recommended applications for this model
Enterprise
Platform Integration
Docker Support
Official Docker images for containerized deployments
Kubernetes Ready
Production-grade Kubernetes manifests and Helm charts
SDK Libraries
Official SDKs for Python, JavaScript, Go, and Java
Don't let your AI control you. Control your AI the Qubrid way!
Have questions? Want to partner with us? Looking for larger deployments or custom fine-tuning? Let's collaborate on the right setup for your workloads.
"Qubrid enabled us to deploy production AI agents with reliable tool-calling and step tracing. We now ship agents faster with full visibility into every decision and API call."
AI Agents Team
Agent Systems & Orchestration
