Chinese title: | Face Image Synthesis Based on Text Description |
Name: | |
Confidentiality level: | Public |
Thesis language: | chi |
Discipline code: | 080901 |
Discipline/major: | |
Student type: | Bachelor's |
Degree: | Bachelor of Science |
Degree year: | 2023 |
Campus: | |
School: | |
First supervisor name: | |
First supervisor affiliation: | |
Submission date: | 2023-06-16 |
Defense date: | 2023-05-15 |
English title: | Face Image Generation Based on Text Description |
Chinese keywords: | Generative adversarial network ; GAN inversion ; text-to-image face image generation ; face editing |
English keywords: | Generative adversarial network ; text-to-image face image generation ; GAN inversion ; face editing |
Chinese abstract: |
Generating face photos from text is an important research topic in image generation. Existing work falls short in generation quality, in the diversity of images produced for the same text input, and in text-based face image editing. TediGAN [10] consists of three components: a StyleGAN inversion module, visual-linguistic similarity learning, and instance-level optimization. The inversion module maps real images into the latent space of a well-trained StyleGAN. Visual-linguistic similarity learning learns text-image matching by mapping images and text into a common embedding space. Instance-level optimization is used to preserve identity-specific features of the face during manipulation. On this basis, a pre-trained CLIP text encoder can replace the visual-linguistic similarity learning component, which simplifies image generation, avoids the original tedious training process, and leaves the original code structure intact. Under the guidance of the pre-trained language model, the latent code in the latent space of the pre-trained GAN is optimized directly. The latent code can be sampled randomly from the latent space or obtained by inverting a given image, which supports diverse face image generation and manipulation with multimodal inputs. TediGAN produces diverse, high-quality images at a synthesis resolution of up to 1024×1024. With a style-based control mechanism, TediGAN supports image synthesis from multimodal inputs, such as sketches or semantic labels, with or without the guidance of an instance image. That work also introduced Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images with corresponding semantic segmentation maps, sketches, and textual descriptions. This thesis systematically reproduces the TediGAN model and conducts the following experiments on top of it: (1) synthesizing diverse faces from input text; (2) text-based face image editing; (3) face attribute editing guided by real face images, semantic images, and sketches; (4) sketch face image editing based on text descriptions. The experimental results verify the effectiveness and superiority of the method. |
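The CLIP-guided latent optimization described in the abstract can be illustrated with a minimal sketch. This is an illustration under stated assumptions, not TediGAN's actual code: the generator `G` (a pretrained StyleGAN mapping a W+ latent of shape (1, 18, 512) to an image in [-1, 1]), the function name `optimize_latent`, and the hyperparameters are hypothetical, and the identity-preservation and regularization terms used in practice are omitted. The only external API assumed is OpenAI's `clip` package.

```python
# Minimal sketch of CLIP-guided latent optimization (hypothetical names;
# the full method additionally uses identity and regularization losses).
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()  # fp32 for stable gradients through the image encoder

def optimize_latent(G, text, steps=200, lr=0.05):
    """Drive a W+ latent code toward a text description using a CLIP loss."""
    tokens = clip.tokenize([text]).to(device)
    with torch.no_grad():
        t = clip_model.encode_text(tokens)
        t = t / t.norm(dim=-1, keepdim=True)

    # Start from a random sample; a code inverted from a real photo
    # could be used here instead, to edit that photo.
    w = torch.randn(1, 18, 512, device=device, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)

    for _ in range(steps):
        img = G(w)                                 # assumed: (1, 3, 1024, 1024) in [-1, 1]
        img = F.interpolate(img, size=(224, 224),  # CLIP's expected input size
                            mode="bilinear", align_corners=False)
        v = clip_model.encode_image(img)
        v = v / v.norm(dim=-1, keepdim=True)
        loss = 1.0 - (v * t).sum()                 # cosine distance to the text
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```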
English abstract: |
Generating a face photo from a text description is an important task in image synthesis. Various methods have been proposed, but the synthesized images suffer from low quality and low diversity and are not well suited for editing. TediGAN consists of three components: a StyleGAN inversion module, visual-linguistic similarity learning, and instance-level optimization. The inversion module maps real images to the latent space of a well-trained StyleGAN. Visual-linguistic similarity learning learns text-image matching by mapping the image and text into a common embedding space. Instance-level optimization is used for identity preservation during manipulation. On this basis, a pre-trained CLIP text encoder can replace the visual-linguistic similarity learning component, which facilitates image generation, eliminates the tedious training process, and does not change the original code structure. Under the guidance of the pre-trained language model, the latent code in the latent space of the pre-trained GAN model is optimized directly. The latent code can be sampled randomly from the latent space or obtained by inverting a given image, which supports diverse image generation and manipulation with multimodal inputs. The model can produce diverse, high-quality images at a resolution of 1024×1024. Using a control mechanism based on style mixing, TediGAN inherently supports image synthesis with multi-modal inputs, such as sketches or semantic labels, with or without instance guidance. To facilitate text-guided multimodal synthesis, the TediGAN work also introduced Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images with corresponding semantic segmentation maps, sketches, and textual descriptions. We conduct experiments on: (1) diverse face image synthesis from text; (2) image attribute editing from text; (3) image attribute editing from real photos, semantic images, and sketches; (4) sketch face image editing from text descriptions. Extensive experiments on the introduced dataset demonstrate the superior performance of the method. |
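The style-mixing control mentioned in both abstracts can likewise be sketched. A common way to realize it, assuming an 18-layer StyleGAN W+ space, is to splice two latent codes at a layer boundary so that coarse (early) layers set pose and shape while fine (late) layers set appearance; the function name `style_mix` and the crossover point below are illustrative, not TediGAN's actual API.

```python
# Hypothetical style-mixing sketch: combine the coarse layers of one W+
# code with the fine layers of another, assuming an 18-layer W+ space.
import torch

def style_mix(w_content: torch.Tensor, w_style: torch.Tensor,
              crossover: int = 8) -> torch.Tensor:
    """Keep layers [0, crossover) from w_content, the rest from w_style."""
    w = w_content.clone()
    w[:, crossover:, :] = w_style[:, crossover:, :]
    return w

# Usage: mix a code inverted from a sketch or semantic map with a
# text-optimized code (e.g., from optimize_latent above).
w_a = torch.randn(1, 18, 512)   # e.g., inversion of a sketch/semantic input
w_b = torch.randn(1, 18, 512)   # e.g., CLIP-optimized latent
w_mixed = style_mix(w_a, w_b)   # feed to the generator: G(w_mixed)
```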
Total number of references: | 35 |
Call number: | 本080901/23042 |
Open access date: | 2024-06-15 |