Title (Chinese): | Visual Question Answering System Based on Relation Networks and Attention Mechanisms |
Name: | |
Confidentiality level: | Public |
Thesis language: | Chinese |
Discipline code: | 081203 |
Discipline: | |
Student type: | Master's |
Degree: | Master of Engineering |
Degree type: | |
Degree year: | 2021 |
Campus: | |
School: | |
Research direction: | Visual question answering |
First supervisor's name: | |
First supervisor's affiliation: | |
Submission date: | 2021-06-07 |
Defense date: | 2021-06-07 |
Title (English): | Visual Question Answering Based on Relation Network and Attention Mechanisms |
Keywords (Chinese): | |
Keywords (English): | Deep Learning ; Computer Vision ; Natural Language Processing ; Visual Question Answering ; Relation Network ; Attention Mechanisms |
Abstract (Chinese): |
With the rapid development of information technology, the data generated on the Internet are increasingly abundant, and these data often involve multiple modalities: speech, text, images, and so on. Multi-modal learning has therefore drawn growing attention from researchers. An important branch of multi-modal learning is the image-based visual question answering task, which uses data from the two modalities of images and text to perform cross-modal question answering. Visual question answering lies at the intersection of Computer Vision (CV) and Natural Language Processing (NLP): given an image and a related question, a visual question answering model must answer the question based on the image. This thesis roughly divides visual question answering models into four categories: methods based on multi-modal fusion, on attention mechanisms, on composite models, and on pre-trained models. Apart from pre-trained models, most visual question answering models can be decomposed into an image feature extraction module, a text feature extraction module, an attention module, a multi-modal feature fusion module, and an answer generation module; improvements to these models mainly concentrate on better feature extraction, better attention mechanisms, and more diverse multi-modal feature fusion methods. Pre-trained models instead rely on large-scale cross-modal pre-training to obtain better image semantics, text semantics, and image-text alignment information. This thesis improves the visual question answering system mainly in the image feature extraction module and the attention module. The main research work includes the following. First, as top-down and bottom-up attention mechanisms have achieved great success in feature extraction for Visual Question Answering (VQA) tasks, visual attention has become an indispensable part of VQA models. However, these attention-based feature representations do not consider any spatial or content relationships between image regions, which are essential for a model's comprehensive understanding of the image. To better understand the image and capture the relationships between image regions, this thesis uses a relation network based on the self-attention mechanism to capture global relational information between image regions. The attention mechanism is also improved: multiple attention modules each capture the connection between the question and the image, and the information from all modules is then combined to achieve better visual question answering performance. Second, building on the self-attention-based relation network, this thesis proposes a Local Relation Network (LRN) that generates context-aware image features for each image region, yielding more fine-grained relational information. Meanwhile, to make better use of both the original features and the relation-aware features obtained by the model, this thesis proposes a multi-level semantic attention mechanism that combines the semantic information from the LRN and the original image regions, making the model's decisions more reasonable. Through these two improvements, we enhance the representation of image region features and obtain better attention effects and VQA performance.
|
Abstract (English): |
With the rapid development of information technology, the data generated on the Internet are becoming more and more abundant, and these data often involve multiple modalities: voice, text, pictures, etc. Multi-modal learning is therefore getting more and more attention from researchers. The image question answering task is an important branch of multi-modal learning; it uses data from the two modalities of image and text for cross-modal question answering. Visual question answering is a task at the intersection of Computer Vision (CV) and Natural Language Processing (NLP): given an image and a related question, the visual question answering model needs to answer the question based on the image. In this paper, we roughly divide visual question answering models into four categories: methods based on multi-modal fusion, attention mechanisms, composite models, and pre-training models. Apart from pre-training models, most visual question answering models can be divided into an image feature extraction module, a text feature extraction module, an attention module, a multi-modal feature fusion module, and an answer generation module; improvements to these models mainly focus on better feature extraction, better attention mechanisms, and more diverse multi-modal feature fusion methods. The pre-training models utilize large-scale cross-modal pre-training to obtain better image semantics, text semantics, and image-text alignment information. In this paper, we mainly focus on the image feature extraction module and the attention module to improve the visual question answering system. The main research work of this paper includes the following. First, as top-down and bottom-up attention mechanisms have achieved great success in feature extraction for Visual Question Answering (VQA) tasks, the visual attention mechanism has become an indispensable part of VQA models.
However, these attention-based feature representation methods do not consider any spatial or content relationship between image regions, which is essential for the model's comprehensive understanding of the image. To better understand the image and capture the relationships between image regions, we propose a relation network based on the self-attention mechanism to capture global relational information between image regions. Moreover, we improve the attention mechanism, using multiple attention modules to obtain the connection between the question and the image, and finally combine the information from all modules to achieve better visual question answering performance. Second, on the basis of the self-attention-based relation network, we propose a Local Relation Network (LRN) to generate context-aware image features for each image region, obtaining more refined relational information. To make better use of the original features and the relation-aware features obtained by the model, we propose a multi-level attention mechanism that combines the semantic information from the LRN and the original image regions, making the model's decision-making more reasonable. Through these two measures, we improve the representation of image region features and obtain better attention effects and VQA performance.
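The core operation described in the abstract — a self-attention-based relation network that turns per-region detector features into relation-aware features — can be sketched as a single attention head over region vectors. This is a minimal illustration under assumed shapes and randomly initialized weights; the function and variable names are our own, and the thesis's actual architecture (multi-head layers, learned projections, the LRN's locality restriction) is not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relation_attention(regions, Wq, Wk, Wv):
    """Single-head self-attention over image region features.

    regions: (n_regions, d) matrix, one row per detected region.
    Returns relation-aware features of the same shape: each output row
    is a weighted mix of all regions, so every region's representation
    reflects its global pairwise relations to the others.
    """
    Q, K, V = regions @ Wq, regions @ Wk, regions @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # pairwise region affinities
    attn = softmax(scores, axis=-1)          # each row sums to 1
    return attn @ V

# Toy example: 36 regions (a common bottom-up detector output count)
# with 8-dimensional features and random projection weights.
rng = np.random.default_rng(0)
n_regions, d = 36, 8
regions = rng.standard_normal((n_regions, d))
Wq, Wk, Wv = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
out = relation_attention(regions, Wq, Wk, Wv)
print(out.shape)  # (36, 8)
```

Because every region attends to every other region, the output features encode exactly the kind of global inter-region relationships that plain region-level attention ignores; the LRN variant described above would instead restrict which regions each query attends to.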
|
Total references: | 86 |
Library call number: | 硕081203/21008 |
Open access date: | 2022-06-07 |