Thesis information

Chinese title:

 基于图神经网络的人-物交互检测模型研究 (Research on Human-Object Interaction Detection Models Based on Graph Neural Networks)

Name:

 Wu Tongtong (吴通通)

Confidentiality level:

 Public

Thesis language:

 Chinese

Discipline code:

 081002    

Discipline:

 Signal and Information Processing

Student type:

 Master's

Degree:

 Master of Engineering

Degree type:

 Academic degree

Degree year:

 2021    

Campus:

 Beijing campus

School:

 School of Artificial Intelligence

Research area:

 Computer vision

Primary advisor:

 Duan Fuqing (段福庆)

Primary advisor's affiliation:

 School of Artificial Intelligence, Beijing Normal University

Submission date:

 2021-06-09    

Defense date:

 2021-06-05    

English title:

 HUMAN-OBJECT INTERACTION DETECTION BASED ON GRAPH CONVOLUTION NETWORK    

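Chinese keywords (translated):

 Human-object interaction ; Object detection ; Interaction classification ; Graph neural network ; Deep learning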

English keywords:

 Human-object interaction ; Object detection ; Interaction classification ; Graph convolution network ; Deep learning    

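Chinese abstract (translated):


Human-object interaction (HOI) detection is a deep semantic task in computer vision. It aims to localize the people and objects in an image and to identify the interactions between them, that is, to detect the person-predicate-object triplets in the image. The task studies the interactions that take place around people in a visual scene and, by classifying the interactions between people and objects, enables a human-centered understanding of deep image semantics. It has broad application prospects in fields such as smart logistics and smart transportation.

Most existing HOI detection methods focus on the visual information of the human-object pair itself, designing various complex convolutional neural network architectures to extract fine-grained visual features and improve detection accuracy. However, HOI detection has characteristics of its own, two of which are typical: (1) HOI categories are strongly correlated with one another; (2) contextual information is closely tied to HOI categories. To address these two characteristics, this thesis builds on graph neural networks, combined with attention mechanisms, clustering, and other methods, to design two novel HOI network frameworks.

The first is a multi-branch HOI neural network model. To exploit the strong correlation between HOI categories, this thesis uses external knowledge to embed object categories as vectors from three aspects (object appearance, semantics, and function) and applies unsupervised clustering to the embeddings to measure the similarity of object categories. HOI combinations that share the same predicate and have similar object categories are regarded as semantically similar. Based on the clustering results, semantically similar HOI categories are then merged, which redefines the HOI categories. Redefining the categories not only reduces the classification space and increases the training data available for each category, but also avoids a rise in intra-class variance at minimal cost. To exploit the multi-level, fine-grained clustering information, this thesis designs a multi-branch HOI neural network based on graph neural networks that classifies HOI categories from coarse to fine. Standard and zero-shot experiments were conducted on the public HOI detection dataset HICO-DET. The results show that the proposed method makes full use of the strong correlation between HOI categories and effectively improves model performance.

The second is a context-based HOI neural network model. To exploit the contextual information in a visual scene, this thesis uses graph models to represent context from both a visual and a semantic perspective. Accordingly, two kinds of graph are built: a visual graph and a semantic graph. To facilitate message passing in the visual and semantic graphs, this thesis proposes a set of graph update modules, comprising a graph inner update module and a graph cross update module. In the graph inner update module, messages are passed among the nodes within the visual graph or within the semantic graph; in the graph cross update module, messages are passed between the visual graph and the semantic graph. Finally, contextual features are extracted from the updated visual and semantic graphs and fused with the visual features of the human-object pair to infer the interaction between the person and the object. Standard experiments on the public HOI detection datasets HICO-DET and V-COCO show that modeling the context of a visual scene with the proposed method extracts effective contextual features and improves model performance.

To analyze the proposed model structures comprehensively, this thesis carries out a detailed quantitative analysis and comparative discussion of the HOI detection experiments on the open dataset HICO-DET, revealing the strengths of the model components. It also qualitatively discusses the limitations and shortcomings of the approach, offering ideas that future work can draw on.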

English abstract:
 

Human-object interaction (HOI) detection is a human-centric, deep semantic task in the field of computer vision. Its purpose is to locate humans and objects in images and infer the interactions between them, that is, to detect the “human-predicate-object” triplets in images. The task studies the interactions surrounding humans in visual scenes. By classifying the interactions between humans and objects, it enables a human-centered understanding of deep image semantics. HOI detection therefore has broad application prospects in smart logistics, smart transportation, and other fields.

Existing HOI detection algorithms mostly focus on the visual information of human-object pairs, extracting fine-grained visual features by designing various complex neural network structures. However, the HOI detection task has unique characteristics of its own. Two typical ones are the strong correlation between different HOI categories and the strong relationship between context information and HOI categories. To exploit these two characteristics, we design two novel HOI network structures based on graph neural networks, combined with methods such as attention mechanisms and clustering.

We propose a multi-branch network for learning human-object interaction. To make good use of the strong correlation between different HOI categories, we use external knowledge to embed object categories into vectors from three aspects: visual appearance, semantics, and function. We then apply an unsupervised clustering method to measure the similarity of object categories based on these object embeddings. HOI categories with the same predicate and similar objects are regarded as semantically related. All semantically related HOI categories are merged into a super HOI category, so that all HOI categories are redefined. Redefining HOI categories not only reduces the classification space and increases the average amount of training data per category, but also avoids an excessive rise in intra-class variance. To make good use of multi-level, fine-grained clustering information, we design a multi-branch network based on graph convolution networks, which classifies HOI categories from coarse to fine. We conduct standard and zero-shot experiments on the public HOI dataset HICO-DET. Experimental results show that the proposed method makes full use of the correlation between different HOI categories and improves model performance.
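
The category-merging step described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name `merge_hoi_categories`, the example predicates and objects, and the cluster ids are all hypothetical, and the cluster assignments are hard-coded here, whereas the thesis obtains them by unsupervised clustering of object embeddings.

```python
from collections import defaultdict


def merge_hoi_categories(hoi_categories, object_cluster):
    """Group (predicate, object) pairs that share a predicate and an object cluster."""
    merged = defaultdict(list)
    for predicate, obj in hoi_categories:
        # The super category is keyed by the predicate and the object's cluster id.
        merged[(predicate, object_cluster[obj])].append((predicate, obj))
    return dict(merged)


# Hypothetical cluster ids; assumed to come from clustering object embeddings.
object_cluster = {"bicycle": 0, "motorcycle": 0, "horse": 1, "apple": 2, "orange": 2}

# Hypothetical fine-grained HOI categories.
hoi_categories = [
    ("ride", "bicycle"),
    ("ride", "motorcycle"),
    ("ride", "horse"),
    ("eat", "apple"),
    ("eat", "orange"),
    ("hold", "apple"),
]

super_categories = merge_hoi_categories(hoi_categories, object_cluster)
for key, members in super_categories.items():
    print(key, members)
```

In this toy setting six fine-grained categories collapse into four super categories, which is the sense in which merging shrinks the classification space while keeping objects in a merged category similar to one another.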

We propose a human-object interaction network based on context information. To make good use of the strong relationship between context information and HOI categories, we use graph neural networks to model context both visually and semantically by combining a Visual Graph and a Semantic Graph. To promote message passing within and between the graphs, we propose a group of graph update modules, including a Graph Inner Update module and a Graph Cross Update module. In the Graph Inner Update module, messages are passed within each graph, while in the Graph Cross Update module, messages are passed between the Visual Graph and the Semantic Graph. Finally, we fuse the contextual features from the Visual Graph and the Semantic Graph with the visual features of the human-object pairs to detect HOI categories. We evaluate the proposed model on two challenging datasets, HICO-DET and V-COCO. Experimental results show that our model is effective at modeling context and improves model performance.
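
The difference between the two kinds of update can be sketched as follows. This is a toy illustration under stated assumptions, not the thesis's modules: the functions `inner_update` and `cross_update`, the mixing weight `alpha`, and the use of plain mean aggregation are all hypothetical. It further assumes that node i of the visual graph and node i of the semantic graph describe the same detected instance and that both graphs use features of the same dimension.

```python
def inner_update(feats, edges, alpha=0.5):
    """Mix each node with the mean of its neighbours inside one graph."""
    neighbours = {i: [] for i in range(len(feats))}
    for i, j in edges:  # edges are treated as undirected
        neighbours[i].append(j)
        neighbours[j].append(i)
    updated = []
    for i, f in enumerate(feats):
        nbrs = neighbours[i]
        if not nbrs:  # an isolated node keeps its feature unchanged
            updated.append(list(f))
            continue
        mean = [sum(feats[j][d] for j in nbrs) / len(nbrs) for d in range(len(f))]
        updated.append([(1 - alpha) * x + alpha * m for x, m in zip(f, mean)])
    return updated


def cross_update(visual, semantic, alpha=0.5):
    """Exchange messages between matching nodes of the two graphs."""
    new_visual = [[(1 - alpha) * v + alpha * s for v, s in zip(vf, sf)]
                  for vf, sf in zip(visual, semantic)]
    new_semantic = [[(1 - alpha) * s + alpha * v for v, s in zip(vf, sf)]
                    for vf, sf in zip(visual, semantic)]
    return new_visual, new_semantic


# Toy scene: node 0 is a person, node 1 is an object, joined by one edge.
visual = [[1.0, 0.0], [0.0, 1.0]]
semantic = [[0.0, 0.0], [2.0, 2.0]]
edges = [(0, 1)]

visual = inner_update(visual, edges)        # messages stay inside the visual graph
semantic = inner_update(semantic, edges)    # messages stay inside the semantic graph
visual, semantic = cross_update(visual, semantic)  # messages cross between graphs
print(visual, semantic)
```

A learned model would replace the fixed averaging with trainable transformations and attention weights; the sketch only shows where the messages flow in each module.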

To analyze the proposed models thoroughly, we carry out quantitative analysis and comparative experiments on the public HICO-DET dataset, which reveal the effectiveness of the model components. Furthermore, we qualitatively discuss the limitations and shortcomings of the approach, which provides a reference for future work.

Total number of references:

 56    

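Author biography:

 Received a Bachelor of Engineering degree in Intelligent Science and Technology from the University of Science and Technology Beijing in 2018. During graduate study, focused on computer vision, working on the problem of human-object interaction detection.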

Call number:

 硕081002/21006    

Open access date:

 2022-06-09    
