Thesis information


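Chinese title:	面向新闻图像的具体化字幕生成研究
Author:	刘雪滢 (Liu Xueying)
Confidentiality level:	Public
Thesis language:	chi
Discipline code:	081203
Discipline:	Computer Application Technology
Student type:	Master's
Degree:	Master of Engineering
Degree type:	Academic degree
Degree year:	2022
University:	Beijing Normal University (北京师范大学)
Campus:	Beijing campus
School:	School of Artificial Intelligence
First supervisor:	吴昊 (Wu Hao)
First supervisor's affiliation:	School of Artificial Intelligence
Submission date:	2022-06-30
Defense date:	2022-06-01
Foreign-language title:	Research on the generation of unique captions for news images
Chinese keywords:	图像字幕 (image captioning) ; 具体化 (specificity) ; 注意力机制 (attention mechanism) ; 实例归一化 (instance normalization)
Foreign-language keywords:	Image caption ; Uniqueness ; Attention ; Instance normalization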
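Chinese abstract:	︿With the growth of storage capacity on electronic devices and the steady increase in network transmission speeds, a massive number of images now circulate on the Internet, spread across social media, news websites, shopping websites, and other platforms. Many of these images lack any accompanying description, and having people describe every image would consume a great deal of manpower. This gave rise to the task of using deep learning models to add caption descriptions to images automatically. Although image captioning has produced many research results, models for captioning news images still have shortcomings. For a news image, the description must not only describe the image content correctly but also include news elements and be newsworthy, for example the people involved in the news event and the place where it occurred. This requires the generated results to be specific: rather than describing the image only with generic words, they should contain proper nouns such as specific personal and place names. For this task, the core innovations of this thesis are as follows: (1) This thesis designs an encoder based on a geometric attention mechanism. The geometric attention mechanism is applied on top of the image features extracted by a convolutional neural network, which helps the model attend to the spatial relationships between objects in the image, such as their relative positions and sizes. Spatial position and size relationships between objects have already been used in object detection with good results. Because this is also an attention-based encoder, it helps the decoder understand the image content better and thus generate more specific captions. (2) This thesis designs an instance normalization method for training optimization, which uses instance normalization to fix the input distribution and further improves training. The decoding part of the model uses a Transformer as the decoder, whose self-attention layers are affected by internal covariate shift: the attention weights are obtained by feeding the query vectors into a fully connected layer, and when changes in the network parameters during training shift the distribution of the query vectors, subsequent layers must continually adapt to the new input distribution. Instance normalization helps fix the input distribution of the fully connected layer and yields better training results. (3) This thesis designs a specificity evaluation metric to measure how specific the generated results are. Most current evaluation metrics focus on the correctness of the generated captions and overlook the problem of generating near-identical captions for different images. Faced with different images, image captioning models still tend to generate simple, generic vocabulary, so their outputs score highly on metrics that mainly reward correctness. For news images, however, the generated captions need to describe specific people, places, and times, which requires evaluating the specificity of the generated results. This thesis combines the three designs and uses NYTimes800k, a dataset obtained through the public API of The New York Times. The evaluation and comparative analysis of a series of experiments on this dataset show that the proposed method does meet the requirements of specific caption generation for news images: the geometric attention mechanism captures the spatial and size relationships between objects, instance normalization improves training, and the specificity metric assesses whether the required specificity is achieved.﹀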
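The abstract says geometric attention relies on the relative positions and sizes of objects, as previously used in object detection. The record does not give the thesis's actual formulation, so the following is only a minimal sketch under stated assumptions: boxes are given as (center x, center y, width, height), the pairwise features use a log-ratio encoding common in object-detection work, and the function names `relative_geometry` and `geometry_biased_attention` are hypothetical.

```python
import numpy as np


def relative_geometry(boxes, eps=1e-3):
    """Pairwise geometry features for n boxes.

    boxes: (n, 4) array of (x_center, y_center, width, height).
    Returns an (n, n, 4) array; entry [m, n] describes box n relative to box m.
    The log-ratio encoding is an assumption, not taken from the thesis.
    """
    boxes = np.asarray(boxes, dtype=float)
    x, y, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    # Relative offsets, scaled by the reference box size and clipped away from 0.
    dx = np.log(np.maximum(np.abs(x[:, None] - x[None, :]) / w[:, None], eps))
    dy = np.log(np.maximum(np.abs(y[:, None] - y[None, :]) / h[:, None], eps))
    # Relative sizes.
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)


def geometry_biased_attention(scores, geo_weight, floor=1e-6):
    """Softmax over appearance scores biased by non-negative geometry weights.

    scores, geo_weight: (n, n) arrays. How geo_weight is derived from the
    features above (for example, a learned projection) is left open here.
    """
    logits = scores + np.log(np.maximum(geo_weight, floor))
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)
```

With all geometry weights equal to 1, `geometry_biased_attention` reduces to an ordinary softmax over the appearance scores, so the geometry term acts purely as a bias on top of standard attention.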
Foreign-language abstract:	︿With the expansion of storage space on electronic products and the improvement of network transmission speed, a large number of images now circulate on the Internet, distributed across social media, news websites, shopping websites, and other platforms. However, many images lack a relevant description, and describing each picture manually would consume a lot of manpower. The task of adding captions to images automatically using AI techniques arose as a result. There has been a lot of research on image captioning, but models for news image captioning still have shortcomings. For news images, the description must not only describe the content of the image correctly but also include news elements, such as the people involved in the news event and the place where it happened. This requires the words generated by the model to be unique: rather than using only general words to describe the image, they should include specific names of people, places, and other proper nouns. The core contributions of this thesis are: (1) An encoder based on a geometric attention mechanism. The geometric attention mechanism is added to the image features extracted by a convolutional neural network. It helps the model pay attention to the spatial relationships of the objects in the image, for example their relative positions and sizes. The spatial position and size relationships between objects have already been used in object detection and have achieved good results. This is also an attention-based encoder, which helps the decoder better understand the image content and generate more unique captions. (2) An instance normalization method for training optimization, which uses instance normalization to fix the input distribution and further improves the training of the model. The decoding part of the model uses a Transformer as the decoder, in which the self-attention layers are affected by the internal covariate shift problem. When the distribution of the query vectors changes, subsequent layers must constantly adapt to the new input distribution. Using instance normalization helps fix the input distribution of the fully connected layer for better training results. (3) A uniqueness evaluation metric to evaluate the specificity of the generated results. Many current evaluation metrics focus on the correctness of the generated captions but ignore the problem of generating near-identical captions for different images. Faced with different images, image captioning models tend to generate simple and general vocabulary, so their results obtain high scores on evaluation metrics that mainly reward correctness. For news images, however, the generated captions need to describe specific people, locations, and times, which requires an evaluation of the specificity of the generated results. This thesis combines the three designs and uses the dataset NYTimes800k, obtained through the public API of The New York Times. The evaluation and analysis of a series of experiments on this dataset show that the proposed method can indeed meet the task of generating captions for news images.﹀
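The abstract describes applying instance normalization to the query vectors so that the fully connected layer feeding the attention weights sees a stable input distribution. As a rough illustration only, assuming queries shaped (batch, sequence length, model dimension) and a hypothetical projection matrix `w_q` standing in for that fully connected layer, the idea might be sketched as:

```python
import numpy as np


def instance_norm(q, eps=1e-5):
    """Normalize each channel of each sample over the sequence axis.

    q: (batch, seq_len, d_model). No learned scale or shift is applied;
    whether the thesis uses affine parameters is not stated in this record.
    """
    mean = q.mean(axis=1, keepdims=True)
    var = q.var(axis=1, keepdims=True)
    return (q - mean) / np.sqrt(var + eps)


def attention_weights(q, k, w_q):
    """Scaled dot-product attention weights with normalized queries.

    w_q is a hypothetical (d_model, d_model) weight matrix for the fully
    connected layer that receives the normalized queries.
    """
    q_proj = instance_norm(q) @ w_q
    d = q_proj.shape[-1]
    scores = q_proj @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


# Usage: queries with a shifted, scaled distribution are re-centered
# before they reach the projection layer.
rng = np.random.default_rng(0)
queries = rng.normal(loc=5.0, scale=3.0, size=(2, 6, 8))
keys = rng.normal(size=(2, 6, 8))
weights = attention_weights(queries, keys, np.eye(8))
```

Whatever the mean and scale of the incoming queries, the projection layer then receives inputs with approximately zero mean and unit variance per sample and channel, which is the stabilizing effect the abstract attributes to instance normalization.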
Total number of references:	82
Call number:	硕081203/22018
Open-access date:	2024-03-14
