Image generation with AI comes with an inherent trade-off: the delivered image either follows the prompt strictly or achieves higher quality, and the goal is to find the sweet spot between the two. Solvd’s research team tackles this challenge by giving image generation networks better guidance.
Generating images is one of the key skills of modern neural networks and one of the foundations of the generative AI explosion. Text-to-image models are typically built on Classifier-Free Guidance, where conditional and unconditional predictions are blended during generation, without the need for an external classifier.
What is Classifier-Free Guidance?
Classifier-Free Guidance (CFG) is a technique used in generative AI models, especially diffusion models, to control how strongly the model follows a given prompt. It helps balance creativity and accuracy in image or text generation. In practice, the model generates two predictions (one with the prompt and one without) and then combines them using a guidance scale.
The guidance scale sets this balance. A low guidance value produces more diverse and creative results. A high guidance value, in turn, makes the output more faithful to the given prompt.
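The blending step described above can be sketched in a few lines. This is a minimal illustration of the commonly used Classifier-Free Guidance formula, not code from any specific model: the function name `classifier_free_guidance` and the toy arrays standing in for the two model predictions are assumptions made for the example.

```python
import numpy as np


def classifier_free_guidance(pred_uncond, pred_cond, guidance_scale):
    """Blend the prompt-free and prompt-conditioned predictions.

    guidance_scale = 0 returns the unconditional prediction,
    guidance_scale = 1 returns the conditional prediction,
    larger values push the result further in the prompt's direction.
    """
    pred_uncond = np.asarray(pred_uncond, dtype=float)
    pred_cond = np.asarray(pred_cond, dtype=float)
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)


# Toy stand-ins for the two predictions a model would produce (hypothetical values).
pred_without_prompt = np.array([0.0, 0.0])
pred_with_prompt = np.array([1.0, 2.0])

low = classifier_free_guidance(pred_without_prompt, pred_with_prompt, 1.0)
high = classifier_free_guidance(pred_without_prompt, pred_with_prompt, 7.5)
```

With a scale of 1.0 the result is simply the conditional prediction. With a higher scale, the difference between the two predictions is amplified, which is what makes the output follow the prompt more closely.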
Example:
A user asks for an image of a cat wearing a hat.
The neural network makes two predictions: an unconditional one, which ignores the prompt and reflects what a plausible image looks like in general, and a conditional one, which accounts for the request (a cat, a hat, the colors, and the fact that the cat is wearing the hat, which is uncommon for a cat).
These models need to balance several aspects when delivering the image:
- Adherence to the prompt. The image needs to follow the user’s request.
- Quality. The image has to be as good as possible, in multiple aspects – aesthetics, precision, sometimes typography.
- Connection to the ground truth. Delivered images need to stick to reality, without distortions or modifications. The well-known struggle of earlier image generation models with hands and teeth is one of the best examples of this issue.
In a typical approach, there is a trade-off: one aspect has to be sacrificed for the sake of another. Following the example above:
There are few to no real-life examples of cats wearing hats, so the neural network needs to generate something new to follow the user’s request. When given too much freedom in generation, the “hat” may also affect the rest of the image, changing the background, colors or fur pattern. If forced to stick to the prompt, the image would be of lower quality because such scenes are underrepresented in the dataset. Last but not least, the image needs to show a cat, not a feline-like creature, and the cat needs to wear a hat, not a cap or a helmet, and not armor plates growing straight out of its skull.
This trade-off is an inherent limitation of text-to-image applications. But Solvd’s team came up with a way to tackle it.
Solvd’s approach – Classifier-free guidance with adaptive scaling
The team introduced β-adaptive scaling in Classifier-Free Guidance to solve the challenge above. The new mechanism stabilizes the effects of guidance using gradient-based adaptive normalization. As a result, the system can apply stricter guidance to fit the prompt where necessary, yet give the neural network creative freedom to deliver a better-quality image whenever strict adherence is not required.
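To make the idea of adaptive guidance more concrete, here is a rough sketch. It is not the formulation from the paper, whose exact equations may differ. It assumes, purely for illustration, two ingredients: the guidance term is divided by its own norm to keep its strength stable, and the guidance scale changes over the generation process following a Beta-density-shaped curve. All names (`beta_density`, `adaptive_guidance`, `base_scale`, `a`, `b`) and all numeric values are hypothetical.

```python
import math

import numpy as np


def beta_density(t, a, b):
    """Density of a Beta(a, b) distribution at t in [0, 1] (valid for a, b >= 1)."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * t ** (a - 1) * (1 - t) ** (b - 1)


def adaptive_guidance(pred_uncond, pred_cond, t, base_scale=7.5, a=2.0, b=2.0, tiny=1e-8):
    """Illustrative guidance step with a time-dependent scale (an assumption, not the paper's method).

    t is the normalized position in the generation process, between 0 and 1.
    """
    pred_uncond = np.asarray(pred_uncond, dtype=float)
    pred_cond = np.asarray(pred_cond, dtype=float)
    delta = pred_cond - pred_uncond
    # Normalize the guidance term so its magnitude does not vary wildly between steps.
    direction = delta / (np.linalg.norm(delta) + tiny)
    # The scale peaks in the middle of the process and vanishes at both ends for a = b = 2.
    scale = base_scale * beta_density(t, a, b)
    return pred_uncond + scale * direction


# Toy predictions (hypothetical values): strong guidance mid-process, none at the start.
mid_step = adaptive_guidance([0.0, 0.0], [3.0, 4.0], t=0.5)
first_step = adaptive_guidance([0.0, 0.0], [3.0, 4.0], t=0.0)
```

In this toy setup the guidance is strongest in the middle of the process and fades out at both ends, which illustrates how a single fixed guidance value can be replaced by one that adapts during generation.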
The outcome
Generative models using β-adaptive scaling in Classifier-Free Guidance achieved better benchmark results than previous models, while maintaining high fidelity to the prompt.
The research was conducted by Dawid Malarz, Artur Kasymov, Maciej Zięba, Jacek Tabor and Przemysław Spurek, representing Jagiellonian University and Wrocław University of Science and Technology. The research paper can be found on arXiv.
The paper will be presented at this year’s edition of the ECAI conference, which will take place in Bologna, Italy. The event starts on October 25 and ends on October 30. This conference is a great opportunity to share thoughts and remarks on modern machine learning, meet fellow researchers and get to know the most inspiring topics in modern Artificial Intelligence.