Chinese Title: | 基于CLIP的实例级跨模态检索方法研究 (Research on the Instance-Level Cross-Modal Retrieval Method Based on CLIP) |
Name: | |
Confidentiality Level: | Public |
Thesis Language: | chi (Chinese) |
Discipline Code: | 080717T |
Discipline/Major: | |
Student Type: | Bachelor |
Degree: | Bachelor of Engineering |
Degree Year: | 2024 |
Campus: | |
School: | |
Research Direction: | Cross-modal retrieval |
First Supervisor Name: | |
First Supervisor Affiliation: | |
Submission Date: | 2024-06-12 |
Defense Date: | 2024-05-24 |
Foreign Title: | Research on the Instance-Level Cross-Modal Retrieval Method Based on CLIP |
Chinese Keywords: | |
Foreign Keywords: | |
Chinese Abstract: |
Cross-modal retrieval is a retrieval paradigm oriented to cross-retrieval among data of multiple modalities. With the development of deep learning, large vision-language pre-trained models represented by CLIP have shown excellent performance in retrieval tasks owing to their powerful cross-modal semantic association capability. However, most current CLIP-based cross-modal retrieval research is limited to category-level retrieval or to strongly constrained retrieval with one-to-one modality correspondence (e.g., one text to one image). In practice, taking image-text retrieval as an example, one text description can usually correspond to multiple candidate images; this thesis treats a text and the multiple images sharing the same semantic information as one instance. The instance-level cross-modal retrieval task therefore requires returning all instance data corresponding to a given query. To address the lack of a multimodal instance dataset for outdoor natural scenes, the absence of an instance-level contrastive loss in CLIP, and CLIP's insufficient image-text semantic association, this thesis first uses a cascaded diffusion model to generate multiple instance images for each text and designs a voting mechanism, thereby constructing an outdoor natural scene multimodal instance dataset; second, it effectively extends the InfoNCE contrastive loss and the triplet contrastive loss to the instance level; finally, it proposes a ranking correction strategy based on a visual pre-trained model, which improves CLIP's instance-level cross-modal retrieval precision by optimizing the ranking of candidate images at the retrieval stage. Experimental results show that the outdoor natural scene multimodal instance dataset is well suited to the instance-level cross-modal retrieval task, and that the retrieval accuracy of the proposed method surpasses that of existing methods. |
Foreign Abstract: |
Cross-modal retrieval is a retrieval paradigm oriented to cross-retrieval among data of multiple modalities. With the development of deep learning, large vision-language pre-trained models, represented by CLIP, have shown excellent performance in retrieval tasks owing to their powerful cross-modal semantic association capability. However, most current CLIP-based cross-modal retrieval studies are limited to category-level retrieval or to strongly constrained retrieval with one-to-one modality correspondence (e.g., "one text to one image"). In fact, taking image-text retrieval as an example, a text description can usually correspond to multiple candidate images. In this paper, one text and multiple images with the same semantic information are regarded as the same instance. The instance-level cross-modal retrieval task therefore requires returning all instance data corresponding to a given query. To address the lack of a multimodal instance dataset for outdoor natural scenes, the absence of an instance-level contrastive loss in CLIP, and CLIP's insufficient image-text semantic association, this paper first uses a cascaded diffusion model to generate multiple instance images corresponding to each text and designs a voting mechanism to construct the Outdoor Natural Scene Multimodal Instance dataset. Secondly, the InfoNCE contrastive loss and the triplet contrastive loss are effectively extended to the instance level. Finally, a ranking correction strategy based on a visual pre-trained model is proposed to improve the instance-level cross-modal retrieval precision of CLIP by optimizing the ranking of candidate images at the retrieval stage. Experimental results show that the Outdoor Natural Scene Multimodal Instance dataset can be well applied to the instance-level cross-modal retrieval task, and that the retrieval accuracy of the proposed method surpasses that of existing methods. |
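The instance-level extension of the InfoNCE loss described in the abstract can be sketched as a multi-positive contrastive objective, in which every image sharing a text's instance label counts as a positive. The following NumPy sketch is an illustrative assumption (the function name, arguments, and the thesis's exact loss formulation are hypothetical):

```python
import numpy as np

def instance_level_infonce(sim, text_ids, image_ids, tau=0.07):
    """Multi-positive InfoNCE over a text-to-image similarity matrix.

    sim:       (T, I) similarities between T texts and I images.
    text_ids:  (T,) instance id of each text.
    image_ids: (I,) instance id of each image; all images sharing a
               text's instance id are treated as positives for it.
    """
    logits = sim / tau
    # Stable log-sum-exp over all candidate images (the denominator).
    m = logits.max(axis=1, keepdims=True)
    log_denom = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    # Log-sum-exp restricted to the positive images (the numerator).
    pos_mask = text_ids[:, None] == image_ids[None, :]
    pos_logits = np.where(pos_mask, logits, -np.inf)
    mp = pos_logits.max(axis=1, keepdims=True)
    log_num = mp[:, 0] + np.log(np.exp(pos_logits - mp).sum(axis=1))
    # Negative log of the probability mass assigned to the positives,
    # averaged over texts: low when all positives rank above negatives.
    return float(np.mean(log_denom - log_num))
```

When the positives of each text score far above its negatives, the loss approaches zero; swapping the sign of the similarities drives it up, which matches the intended contrastive behavior.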
Total References: | 59 |
Total Figures: | 19 |
Total Tables: | 6 |
Call Number: | 本080717T/24020 |
Open Access Date: | 2025-06-12 |