Chinese title: | 基于CogView模型的中文文图生成 (Chinese Text-to-Image Generation Based on the CogView Model) |
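Name: | |
Confidentiality level: | Public |
Thesis language: | chi |
Discipline code: | 080901 |
Discipline and major: | |
Student type: | Bachelor's |
Degree: | Bachelor of Science |
Degree year: | 2023 |
Campus: | |
School: | |
Primary advisor name: | |
Primary advisor affiliation: | |
Submission date: | 2023-06-15 |
Defense date: | 2023-05-10 |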
Foreign-language title: | Chinese Text-to-image Generation based on CogView Model |
Chinese keywords: | |
Foreign-language keywords: | Chinese Text-to-image Generation ; CogView model ; natural language processing ; deep learning |
Chinese abstract: |
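Over the past two years, new models and theories have continually emerged in the field of text-to-image generation, and the strong generative capabilities of models such as Stable Diffusion and DALL-E 2 have sparked wide discussion of the task. However, most of the more mature text-to-image models are based on English text; research on models based on Chinese text started later and leaves much to explore. CogView is an open-source Chinese text-to-image generation model based on an autoregressive Transformer, with a structure similar to DALL-E and ERNIE-ViLG. This thesis therefore builds on CogView and examines Chinese text-to-image generation in three stages, with the aim of further improving the model's generation ability. In the training stage, the pre-trained model is fine-tuned on 13,335 self-collected image-text pairs, so that it performs better on downstream tasks in a specific domain. In the generation stage, to address the mismatch between user-supplied prompts and the knowledge the model has learned, prompts are expanded with synonyms, which generally improves the quality of the generated images. In the evaluation stage, to address the instability of the model's output, the generated images are re-ranked by the text similarity between a description generated back from each image and the synonym-expanded prompt, so that the model can help users filter out some poor images in advance, improving the user experience. |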
Foreign-language abstract: |
In the past two years, new models and theories have emerged in the field of text-to-image generation. The powerful generative capabilities of models such as Stable Diffusion and DALL-E 2 have prompted extensive discussion of the task. However, most mature text-to-image generation models are based on English text; research on models based on Chinese text started relatively late, and much remains to be explored. CogView is an open-source Chinese text-to-image generation model based on an autoregressive Transformer, and its structure is similar to that of DALL-E and ERNIE-ViLG. Building on this model, this paper therefore studies Chinese text-to-image generation in the following three stages to further improve the model's generation ability. In the training stage, this paper fine-tunes the pre-trained model on 13,335 self-collected image-text pairs so that it performs better on domain-specific downstream tasks. In the generation stage, to address the mismatch between the prompt entered by the user and the knowledge learned by the model, this paper expands the prompt with synonyms, which generally improves the quality of the generated images. In the evaluation stage, to address the instability of the model's output, this paper re-ranks the generated images by the text similarity between a text description generated back from each image and the synonym-expanded prompt, so that the model can help users filter out some poor images in advance, which improves the user experience. |
Total number of references: | 42 |
Collection number: | 本080901/23018 |
Open-access date: | 2024-06-15 |
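
The evaluation stage described in the abstract re-ranks generated images by the text similarity between a description generated back from each image and the synonym-expanded prompt. The record does not say which similarity measure or captioning model the thesis uses, so the following is only a minimal sketch under stated assumptions: similarity is taken to be Jaccard overlap of character bigrams, captions are assumed to be already available as strings, and the names `char_bigrams`, `jaccard_similarity`, `rerank`, and the image ids are hypothetical.

```python
from typing import List, Set, Tuple


def char_bigrams(text: str) -> Set[str]:
    """Return the set of character bigrams of `text`, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return {"".join(chars[i:i + 2]) for i in range(len(chars) - 1)}


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard overlap of character bigrams (an assumed stand-in measure).

    Returns 0.0 when neither string yields any bigram.
    """
    set_a, set_b = char_bigrams(a), char_bigrams(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def rerank(prompt: str, captions: List[Tuple[str, str]]) -> List[Tuple[str, float]]:
    """Score (image_id, caption) pairs against the prompt, best match first."""
    scored = [
        (image_id, jaccard_similarity(prompt, caption))
        for image_id, caption in captions
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical synonym-expanded prompt and hypothetical captions.
    expanded_prompt = "一只 小猫 猫咪 坐在 草地 草坪 上"
    candidates = [
        ("img_0", "一辆汽车停在路边"),
        ("img_1", "一只小猫坐在草地上"),
    ]
    for image_id, score in rerank(expanded_prompt, candidates):
        print(image_id, round(score, 3))
```

Character bigrams are used here because they need no word segmenter for Chinese text; an embedding-based similarity could be substituted by replacing `jaccard_similarity` alone.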