For image generation with diffusion models (DMs), a negative prompt n can be used to complement the text prompt p, helping define properties not desired in the synthesized image. While this improves prompt adherence and image quality, finding good negative prompts is challenging. We argue that this is due to a semantic gap between humans and DMs, which makes good negative prompts for DMs appear unintuitive to humans. To bridge this gap, we propose a new diffusion-negative prompting (DNP) strategy. DNP builds on a new procedure, diffusion-negative sampling (DNS), which samples the images least compliant with p under the distribution of the DM. Given p, one such image is sampled and then translated into natural language, by the user or by a captioning model, to produce the negative prompt n*. The pair (p, n*) is finally used to prompt the DM. DNS is straightforward to implement and requires no training. Experiments and human evaluations show that DNP performs well both quantitatively and qualitatively and can be easily combined with several DM variants.
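The sketch below illustrates one way DNS could be realized on top of Stable Diffusion, under the assumption that sampling an image least compliant with p amounts to negating the classifier-free guidance direction in a standard denoising loop (see the paper for the exact procedure). The model id, the helper names `embed` and `dns_sample`, and the hyperparameters are illustrative, not the authors' reference implementation:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")


def embed(pipe, text):
    """Encode `text` with the pipeline's CLIP text encoder."""
    tokens = pipe.tokenizer(
        text, padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True, return_tensors="pt",
    )
    return pipe.text_encoder(tokens.input_ids.to(pipe.device))[0]


@torch.no_grad()
def dns_sample(pipe, prompt, steps=50, w=7.5):
    """Sample a diffusion-negative image for `prompt`, assuming DNS can be
    approximated by flipping the classifier-free guidance direction."""
    cond, uncond = embed(pipe, prompt), embed(pipe, "")
    pipe.scheduler.set_timesteps(steps, device=pipe.device)
    latents = torch.randn(
        1, pipe.unet.config.in_channels, 64, 64,
        device=pipe.device, dtype=cond.dtype,
    ) * pipe.scheduler.init_noise_sigma
    for t in pipe.scheduler.timesteps:
        latent_in = pipe.scheduler.scale_model_input(latents, t)
        eps_c = pipe.unet(latent_in, t, encoder_hidden_states=cond).sample
        eps_u = pipe.unet(latent_in, t, encoder_hidden_states=uncond).sample
        # Standard CFG predicts eps_u + w * (eps_c - eps_u); here the
        # guidance term is negated so sampling moves *away* from the prompt.
        eps = eps_u - w * (eps_c - eps_u)
        latents = pipe.scheduler.step(eps, t, latents).prev_sample
    image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.cpu().permute(0, 2, 3, 1).float().numpy()
    return pipe.numpy_to_pil(image)[0]  # the diffusion-negative image Ĩ
```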
DNP improves the quality of images synthesized by SD for a prompt p (green, top-right). A diffusion-negative image Ĩ is sampled using DNS, enabling the user to visualize the negation of p under the DM's distribution. The user translates Ĩ into a negative prompt n* (red), a process denoted as DNP, and the DM is prompted with the pair (p, n*). This increases the compliance and quality of the synthesized image (bottom-right). Replacing the user with a captioning model is denoted as auto-DNP.
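For auto-DNP, the user's translation step is replaced by an off-the-shelf captioning model. A minimal sketch of this variant, reusing `dns_sample` from above and assuming BLIP as the captioner (an illustrative choice; the model ids and the example prompt are assumptions):

```python
from transformers import BlipProcessor, BlipForConditionalGeneration

# An off-the-shelf captioner stands in for the user (auto-DNP).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to("cuda")

prompt = "a photograph of an astronaut riding a horse"
neg_image = dns_sample(pipe, prompt)  # diffusion-negative image Ĩ

# Translate Ĩ into natural language to obtain the negative prompt n*.
inputs = processor(neg_image, return_tensors="pt").to("cuda")
out = captioner.generate(**inputs, max_new_tokens=30)
n_star = processor.decode(out[0], skip_special_tokens=True)

# Prompt the DM with the pair (p, n*).
final = pipe(prompt, negative_prompt=n_star).images[0]
```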
Our method outperforms SD on all tasks. Human evaluators also showed a strong preference for the images generated by our method, favoring them by significant margins. Please refer to our paper for more ablations.
We observe that both prompt adherence and image quality improve compared to the SD-generated images. It is also worth noting that, even though the negative prompts do not perfectly align with human interpretations of negation, they still contribute to enhancing the images.
This work was partially funded by NSF award IIS-2303153, a gift from Qualcomm, and NVIDIA GPU donations. We also acknowledge the use of the Nautilus platform for the experiments discussed above.