Stability AI recently released Stable Diffusion 3 Medium, a more advanced and sophisticated model for generating images from text. The new model represents a significant step forward from previous versions, with better handling of complex prompts, higher image quality, and improved text rendering.
Stable Diffusion 3 uses an architecture called the Multimodal Diffusion Transformer (MMDiT), which leverages separate sets of weights for textual and visual representations, thereby improving text comprehension and spelling capabilities over previous versions. This new architecture is particularly effective in faithfully following complex prompts, outperforming competing models such as DALL-E 3 and Midjourney v6 in human ratings of aesthetic quality, prompt adherence, and typography.
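The core MMDiT idea described above can be illustrated with a toy sketch: each modality has its own projection weights, but attention runs over the concatenated token sequence so text and image tokens exchange information directly. This is a minimal, illustrative single-head version in numpy (dimensions, weight initialization, and function names are assumptions, not the real model's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16              # shared attention dimension (illustrative, far smaller than the real model)
n_txt, n_img = 4, 8  # toy token counts for each modality

def proj(n_in, n_out):
    return rng.normal(scale=0.1, size=(n_in, n_out))

# Separate QKV projection weights per modality: the defining MMDiT trait.
W_txt = {k: proj(d, d) for k in ("q", "k", "v")}
W_img = {k: proj(d, d) for k in ("q", "k", "v")}

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def joint_attention(txt, img):
    # Each modality is projected with its own weight set...
    q = np.concatenate([txt @ W_txt["q"], img @ W_img["q"]])
    k = np.concatenate([txt @ W_txt["k"], img @ W_img["k"]])
    v = np.concatenate([txt @ W_txt["v"], img @ W_img["v"]])
    # ...but self-attention runs over the joint token sequence,
    # letting text tokens attend to image tokens and vice versa.
    out = softmax(q @ k.T / np.sqrt(d)) @ v
    return out[:n_txt], out[n_txt:]

txt_out, img_out = joint_attention(rng.normal(size=(n_txt, d)),
                                   rng.normal(size=(n_img, d)))
print(txt_out.shape, img_out.shape)  # (4, 16) (8, 16)
```

Keeping the weight sets separate lets each stream specialize in its own representation space while the joint attention step still fuses the two modalities, which is what the architecture credits for better prompt comprehension.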
Stable Diffusion 3 also employs Rectified Flow, a formulation that learns straighter paths from noise to data, enabling more efficient sampling in fewer steps and improving final image quality. In addition, it uses three text encoders (CLIP L/14, OpenCLIP bigG/14, and T5-v1.1-XXL), which contribute to more accurate text understanding and better integration of text into images.
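Why straight paths help sampling can be shown with a toy numpy sketch. Rectified flow interpolates linearly between a noise sample x0 and a data sample x1, so the ideal velocity field along that path is constant (v = x1 - x0), and even a coarse Euler integration lands on the data. This is an illustrative oracle, not a trained model; all names and values here are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Rectified flow defines straight-line paths x_t = (1 - t) * x0 + t * x1
# between noise x0 and data x1, giving a constant ideal velocity x1 - x0.
x1 = np.array([3.0, -2.0])   # toy "data" sample
x0 = rng.normal(size=2)      # Gaussian noise starting point

def velocity(x, t):
    # Oracle velocity for this pair; in practice a network approximates it.
    return x1 - x0

def euler_sample(x0, n_steps):
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity(x, i * dt)
    return x

# Because the path is straight, very few steps suffice to reach the data.
print(euler_sample(x0, 2))   # close to [3.0, -2.0]
print(euler_sample(x0, 50))  # close to [3.0, -2.0]
```

With a curved probability-flow path (as in standard diffusion), a low step count accumulates discretization error; straight paths are what make the shorter, more direct inference trajectories mentioned above possible.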