PIXART-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

ICLR 2024 Spotlight

¹Huawei Noah's Ark Lab, ²Dalian University of Technology, ³The University of Hong Kong, ⁴The Hong Kong University of Science and Technology
*Equal contribution. Work done during the internships of the four students at Huawei Noah's Ark Lab.
‡Corresponding author.

Abstract

The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering fundamental innovation in the AIGC community while increasing CO₂ emissions. This paper introduces PIXART-α, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. It also supports high-resolution image synthesis up to 1024px at low training cost, as shown in Figures 1 and 2. To achieve this, three core designs are proposed: (1) Training strategy decomposition: we devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: we incorporate cross-attention modules into the Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) Highly informative data: we emphasize the significance of concept density in text-image pairs and leverage a large vision-language model to auto-label dense pseudo-captions that assist text-image alignment learning. As a result, PIXART-α trains markedly faster than existing large-scale T2I models: it takes only 10.8% of Stable Diffusion v1.5's training time (~675 vs. ~6,250 A100 GPU days), saving nearly $300,000 ($26,000 vs. $320,000) and reducing CO₂ emissions by 90%. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-α excels in image quality, artistry, and semantic control. We hope PIXART-α will provide new insights to the AIGC community and startups, helping them build their own high-quality yet low-cost generative models from scratch.
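The second design in the abstract, injecting text conditions through cross-attention, can be illustrated with a minimal single-head sketch in NumPy. This is not the PIXART-α implementation: the function names, token counts, dimensions, and random weights are all assumptions chosen for illustration. It only shows the general mechanism, in which image tokens act as queries and text tokens supply the keys and values.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(image_tokens, text_tokens, w_q, w_k, w_v):
    """Single-head cross-attention with a residual connection.

    image_tokens: (n_img, d_img)  queries come from the image latents
    text_tokens:  (n_txt, d_txt)  keys and values come from the text encoder
    w_q: (d_img, d_attn), w_k: (d_txt, d_attn), w_v: (d_txt, d_img)
    All shapes are hypothetical; real models use multiple heads.
    """
    q = image_tokens @ w_q
    k = text_tokens @ w_k
    v = text_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n_img, n_txt)
    weights = softmax(scores, axis=-1)        # each image token attends over text
    return image_tokens + weights @ v, weights


def demo(seed=0):
    """Run the sketch on random data with illustrative sizes."""
    rng = np.random.default_rng(seed)
    n_img, n_txt, d_img, d_txt, d_attn = 16, 5, 8, 12, 4
    image_tokens = rng.normal(size=(n_img, d_img))
    text_tokens = rng.normal(size=(n_txt, d_txt))
    w_q = rng.normal(size=(d_img, d_attn))
    w_k = rng.normal(size=(d_txt, d_attn))
    w_v = rng.normal(size=(d_txt, d_img))
    return cross_attention(image_tokens, text_tokens, w_q, w_k, w_v)
```

Calling `demo()` returns the updated image tokens and the attention weights. Each row of the weights is a distribution over the text tokens, which is how the text condition reaches every image token.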

Training Efficiency

Comparisons of CO₂ emissions and training cost among T2I generators. PIXART-α achieves an exceptionally low training cost of $26,000. Compared to RAPHAEL, our CO₂ emissions and training costs are merely 1.1% and 0.85%, respectively.
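The headline comparison with Stable Diffusion v1.5 can be re-derived from the approximate figures quoted in the abstract (~675 vs. ~6,250 A100 GPU days, $26,000 vs. $320,000). The short sketch below only repeats that arithmetic; the variable and function names are illustrative, and the RAPHAEL ratios are not recomputed because its absolute figures are not given on this page.

```python
def ratio_percent(ours, baseline):
    """Return `ours` as a percentage of `baseline`."""
    return 100.0 * ours / baseline


# Approximate figures as quoted in the abstract.
pixart_gpu_days = 675
sd15_gpu_days = 6250
pixart_cost_usd = 26_000
sd15_cost_usd = 320_000

time_share = ratio_percent(pixart_gpu_days, sd15_gpu_days)
cost_saving_usd = sd15_cost_usd - pixart_cost_usd

print(f"Training time relative to SD v1.5: {time_share:.1f}%")  # 10.8%
print(f"Cost saving: ${cost_saving_usd:,}")                     # $294,000
```

The saving of $294,000 matches the abstract's "nearly $300,000".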

More Samples

8k uhd A man looks up at the starry sky, lonely and ethereal, Minimalism, Chaotic composition Op Art

A baby painter trying to draw very simple picture, white background

A dog that has been meditating all the time

A snowy mountain

A worker that looks like a mixture of cow and horse is working hard to type code

Half human, half robot, repaired human

knolling of a drawing tools for painter

Van Gogh painting of a teacup on the desk

Chinese painting of grapes

Stars, water, brilliantly, gorgeous large scale scene

A sureal parallel world where mankind avoid extinction

Pirate ship trapped in a cosmic maelstrom nebula

BibTeX

@misc{chen2023pixartalpha,
  title={PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis},
  author={Junsong Chen and Jincheng Yu and Chongjian Ge and Lewei Yao and Enze Xie and Yue Wu and Zhongdao Wang and James Kwok and Ping Luo and Huchuan Lu and Zhenguo Li},
  year={2023},
  eprint={2310.00426},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

ControlNet

ControlNet customization samples from PIXART-α. We use the reference images to generate the corresponding HED edge images and use them as the control signal for PIXART-α ControlNet.
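The first step of that pipeline, turning a reference image into an edge map that serves as the control signal, can be sketched as follows. HED is a learned edge detector and is not reproduced here. As a plainly labeled stand-in, the sketch uses a classical Sobel filter in NumPy, and the function name `sobel_edge_map` is hypothetical. It only illustrates the shape of the data handed to a ControlNet-style branch: a single-channel map in [0, 1] with the same height and width as the reference image.

```python
import numpy as np


def sobel_edge_map(gray):
    """Edge-magnitude map in [0, 1] for a 2-D grayscale array.

    Stand-in for HED: a classical Sobel filter, not a learned detector.
    """
    gray = np.asarray(gray, dtype=np.float64)
    height, width = gray.shape
    padded = np.pad(gray, 1, mode="edge")  # replicate borders to keep the size

    kernel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    kernel_y = kernel_x.T

    grad_x = np.zeros_like(gray)
    grad_y = np.zeros_like(gray)
    # Accumulate the 3x3 filter response using shifted views of the padded image.
    for i in range(3):
        for j in range(3):
            window = padded[i:i + height, j:j + width]
            grad_x += kernel_x[i, j] * window
            grad_y += kernel_y[i, j] * window

    magnitude = np.hypot(grad_x, grad_y)
    peak = magnitude.max()
    return magnitude / peak if peak > 0 else magnitude


# Toy reference image: dark left half, bright right half.
reference = np.zeros((8, 8))
reference[:, 4:] = 1.0
control_signal = sobel_edge_map(reference)  # responds only around the step
```

In an actual ControlNet setup the edge map would come from the HED model and be fed to the control branch alongside the text prompt. The Sobel version is only a quick way to inspect what such a control image looks like.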

Dreambooth
