Abstract

In this paper, we introduce PixArt-Σ, a Diffusion Transformer (DiT) model capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α, it evolves from the ‘weaker’ baseline to a ‘stronger’ model by incorporating higher-quality data, a process we term “weak-to-strong training”. The advancements in PixArt-Σ are twofold: (1) High-Quality Training Data: PixArt-Σ incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Σ achieves superior image quality and prompt adherence with a significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Σ’s ability to generate 4K images supports the creation of high-resolution posters and wallpapers, bolstering the production of high-quality visual content in industries such as film and gaming.
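The idea of compressing keys and values before attention can be illustrated with a small sketch. The abstract does not say how the compression is performed, so the 2×2 average pooling used here is an assumed stand-in, and all function names (`pool_tokens`, `attention`, `kv_compressed_attention`) are hypothetical rather than taken from the PixArt-Σ code. The point it shows is only that queries stay at full resolution while the number of keys and values shrinks, which reduces the number of attention scores that must be computed.

```python
import math


def pool_tokens(tokens, height, width, stride):
    """Average-pool a row-major (height * width) list of token vectors
    over stride x stride windows. Assumed stand-in for the compression step."""
    dim = len(tokens[0])
    pooled = []
    for top in range(0, height, stride):
        for left in range(0, width, stride):
            window = [
                tokens[r * width + c]
                for r in range(top, min(top + stride, height))
                for c in range(left, min(left + stride, width))
            ]
            pooled.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Plain scaled dot-product attention on lists of vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        out.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        )
    return out


def kv_compressed_attention(queries, keys, values, height, width, stride=2):
    """Attention in which only keys and values are spatially compressed."""
    small_keys = pool_tokens(keys, height, width, stride)
    small_values = pool_tokens(values, height, width, stride)
    return attention(queries, small_keys, small_values)


# Toy 4 x 4 grid of 2-dimensional tokens (illustrative values only).
grid_h, grid_w = 4, 4
tokens = [[i / 16.0, (i % 3) / 3.0] for i in range(grid_h * grid_w)]
result = kv_compressed_attention(tokens, tokens, tokens, grid_h, grid_w, stride=2)
num_keys = len(pool_tokens(tokens, grid_h, grid_w, 2))
print(len(result), "queries attend to", num_keys, "compressed keys")
```

In this toy setting each of the 16 queries scores against 4 pooled keys instead of 16, so the score matrix is a quarter of its original size while the output keeps one vector per query.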

More Samples

Drone view of waves crashing against the rugged cliffs along Big Sur’s Garay Point beach. The crashing blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore.
A Ukiyoe-style painting, an astronaut riding a unicorn, In the background there is an ancient Japanese architecture.
A Japanese girl walking along a path, surrounded by blooming oriental cherries, pink petals slowly falling down to the ground.
Astronaut on Mars During sunset.
Color photo of a corgi made of transparent glass, standing on the riverside in Yosemite National Park.
Happy dreamy owl monster sitting on a tree branch, colorful glittering particles, forest background, detailed feathers.
Game-Art - An island with different geographical properties and multiple small cities floating in space
Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee.
A car made out of vegetables.
A serene lakeside during autumn with trees displaying a palette of fiery colors.
A realistic landscape shot of the Northern Lights dancing over a snowy mountain range in Iceland.
A deep forest clearing with a mirrored pond reflecting a galaxy-filled night sky.

BibTeX

@misc{chen2024pixartsigma,
      title={PixArt-$\Sigma$: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation},
      author={Junsong Chen and Chongjian Ge and Enze Xie and Yue Wu and Lewei Yao and Xiaozhe Ren and Zhongdao Wang and Ping Luo and Huchuan Lu and Zhenguo Li},
      year={2024},
      eprint={2403.04692},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}