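Chinese title: | 基于BriVL模型的中文图像描述语句生成 |
Author name: | |
Confidentiality level: | Public |
Thesis language: | chi |
Discipline code: | 080901 |
Discipline and major: | |
Student type: | Bachelor's |
Degree: | Bachelor of Science |
Degree year: | 2023 |
Campus: | |
School: | |
First supervisor name: | |
First supervisor affiliation: | |
Submission date: | 2023-06-21 |
Defense date: | 2023-05-10 |
Foreign-language title: | Chinese Image Captioning based on BriVL |
Chinese keywords: | |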
Foreign-language keywords: | image captioning ; BriVL ; self-attention mechanism ; vision-and-language pre-training |
Chinese abstract: |
图像描述语句生成(Image Captioning)任务是计算机视觉和自然语言处理的跨领域课题,用于完成从图像信息到文本内容的跨模态转化。在深度学习技术的支持下,图像描述语句生成任务虽然已取得了令人瞩目的成果,但其中基于中文数据集的研究较少,且大多存在多模态融合不灵活、图像特征不准确等问题。因此,在视觉与语言预训练方法的支持下,本文提出了一种基于BriVL(Bridging Vision and Language)大规模多模态预训练模型的图像描述语句生成方法,使用BriVL视觉特征提取器提取图像编码特征,使用融合自注意力机制的长短期记忆网络(Long Short Term Memory,LSTM)作为解码器最终生成描述文本。基于AIC-ICC和Wukong两个公开的中文标注数据集的实验结果表明,本文所提出的模型在BLEU、CIDEr、ROUGE-L各项指标上均获得了明显的性能提升,且具有较高的准确度和良好的泛化性能。 |
Foreign-language abstract: |
Image captioning is a cross-disciplinary task spanning computer vision and natural language processing that performs the cross-modal conversion of image information into text. Deep learning has brought remarkable progress on this task, but little of the research is based on Chinese datasets, and most existing work suffers from problems such as inflexible multimodal fusion and inaccurate image features. Building on vision-and-language pre-training methods, this thesis therefore proposes an image captioning method based on BriVL (Bridging Vision and Language), a large-scale multimodal pre-trained model. The BriVL visual feature extractor extracts the encoded image features, and a Long Short-Term Memory (LSTM) network integrated with a self-attention mechanism serves as the decoder that generates the caption text. Experimental results on two public Chinese annotated datasets, AIC-ICC and Wukong, show that the proposed model achieves clear performance improvements on the BLEU, CIDEr, and ROUGE-L metrics, with high accuracy and good generalization. |
Total number of references: | 35 |
Call number: | 本080901/23001 |
Release date: | 2024-06-21 |