PIXART-Σ

PIXART-Σ:
Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

¹Huawei Noah's Ark Lab, ²Dalian University of Technology, ³The University of Hong Kong,
*Equal contribution. Work done during the internships of the first two students at Huawei Noah's Ark Lab.
‡Corresponding author.

Abstract

In this paper, we introduce PixArt-Σ, a Diffusion Transformer (DiT) model capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency: leveraging the foundational pre-training of PixArt-α, it evolves from the ‘weaker’ baseline to a ‘stronger’ model by incorporating higher-quality data, a process we term “weak-to-strong training”. The advancements in PixArt-Σ are twofold: (1) High-Quality Training Data: PixArt-Σ incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Σ achieves superior image quality and prompt adherence with a significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Σ’s ability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming.
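
The token compression idea in point (2) can be illustrated with a minimal NumPy sketch. This is not the released implementation: the abstract only states that keys and values are compressed, so the choice of average pooling over small spatial windows, the compression ratio of 2, and the function names `compress_tokens` and `kv_compressed_attention` are all assumptions made for illustration.

```python
import numpy as np


def compress_tokens(x, height, width, ratio=2):
    """Average-pool a (height * width, dim) token grid over ratio x ratio windows.

    Assumption: pooling stands in for whatever compression operator the
    model actually uses; the abstract does not specify it.
    """
    assert height % ratio == 0 and width % ratio == 0
    dim = x.shape[-1]
    # Split each spatial axis into (block index, offset inside block).
    grid = x.reshape(height // ratio, ratio, width // ratio, ratio, dim)
    # Average over the two "offset" axes, then flatten back to a token list.
    return grid.mean(axis=(1, 3)).reshape(-1, dim)


def kv_compressed_attention(q, k, v, height, width, ratio=2):
    """Single-head attention where only keys and values are compressed."""
    k_small = compress_tokens(k, height, width, ratio)
    v_small = compress_tokens(v, height, width, ratio)
    scores = q @ k_small.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v_small, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = w = 4          # hypothetical 4 x 4 latent token grid
    dim = 8            # hypothetical channel width
    q, k, v = (rng.standard_normal((h * w, dim)) for _ in range(3))
    out, attn = kv_compressed_attention(q, k, v, h, w, ratio=2)
    print(out.shape, attn.shape)
```

Under these assumptions, N query tokens attend to N / ratio² compressed keys, so the attention matrix shrinks from N × N to N × N/4 at ratio 2 while the output keeps one row per query token. That reduction is what makes attention over very long token sequences, such as those from 4K images, more tractable.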

More Samples

Drone view of waves crashing against the rugged cliffs along Big Sur’s Garay Point beach. The crashing blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore.

A Ukiyoe-style painting, an astronaut riding a unicorn, In the background there is an ancient Japanese architecture.

A Japanese girl walking along a path, surrounded by blooming oriental cherries, pink petals slowly falling down to the ground.

Astronaut on Mars During sunset.

Color photo of a corgi made of transparent glass, standing on the riverside in Yosemite National Park.

Happy dreamy owl monster sitting on a tree branch, colorful glittering particles, forest background, detailed feathers.

Game-Art - An island with different geographical properties and multiple small cities floating in space

Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee.

A car made out of vegetables.

A serene lakeside during autumn with trees displaying a palette of fiery colors.

A realistic landscape shot of the Northern Lights dancing over a snowy mountain range in Iceland.

A deep forest clearing with a mirrored pond reflecting a galaxy-filled night sky.

BibTeX

@misc{chen2024pixartsigma,
  title={PixArt-\Sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation},
  author={Junsong Chen and Chongjian Ge and Enze Xie and Yue Wu and Lewei Yao and Xiaozhe Ren and Zhongdao Wang and Ping Luo and Huchuan Lu and Zhenguo Li},
  year={2024},
  eprint={2403.04692},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
