Thesis Information

Chinese Title:

 Research on Zero-Shot Video Action Recognition Method Based on Embedding Space

Name:

 赵晓冲

Confidentiality Level:

 Public

Thesis Language:

 chi

Discipline Code:

 081002

Discipline:

 Signal and Information Processing

Student Type:

 Master's

Degree:

 Master of Engineering

Degree Type:

 Academic degree

Degree Year:

 2024    

Campus:

 Beijing campus

School:

 School of Artificial Intelligence

Research Direction:

 Computer Vision

First Supervisor:

 何珺

First Supervisor's Affiliation:

 School of Artificial Intelligence

Submission Date:

 2024-06-16    

Defense Date:

 2024-05-26    

English Title:

 RESEARCH ON ZERO-SHOT VIDEO ACTION RECOGNITION METHOD BASED ON EMBEDDING SPACE    

Chinese Keywords:

 Zero-Shot Action Recognition ; Semantic Expansion ; Motion Information ; Multi-Layer Video Guidance Prompt

English Keywords:

 Zero-Shot Action Recognition ; Semantic Expansion ; Motion Information ; Multi-Layer Video Guidance Prompt    

Chinese Abstract:

As a popular research direction in video understanding, action recognition aims to automatically identify human actions and activities in video sequences. Action recognition has broad application potential in fields such as intelligent surveillance, interactive media, and robotics, but traditional methods still face many challenges, such as the burden of continuously and precisely annotating massive amounts of data and limited generalization in open environments. In practice, new action types keep emerging, while traditional methods are trained and tested in closed-set settings and therefore struggle to generalize to unseen action categories. Zero-shot action recognition aims to let a model recognize actions that were never seen during training, offering an effective way to relieve the continuous annotation burden and to improve generalization in open environments. However, the field still faces two difficulties: aligning visual features with semantic labels, and distinguishing action categories that are visually similar but semantically different.

To address the shortcomings of existing models in semantic expression and visual feature capture, this study proposes three innovative methods; the three methods are interrelated and jointly enhance the model's overall performance:

Semantic expansion strategy: applied to the model's text branch, this strategy uses automated tools to crawl descriptive texts related to specific actions from internet resources, which are then reviewed and refined to ensure accuracy and relevance. After in-depth textual analysis, these action descriptions contain not only basic information but also rich contextual and variant information, effectively deepening the model's understanding of actions and its ability to distinguish them semantically.

Motion information module: in the visual branch, this module analyzes local motion information to strengthen the model's ability to capture subtle changes between video frames. Using a series of convolutional networks together with local correlation computation, the module precisely captures and analyzes small movements between adjacent frames, effectively improving the model's recognition of complex or similar actions.

Multi-layer video guidance prompt method: this method tackles the challenge of aligning visual and textual features through a multi-level feature fusion strategy, thereby improving the model's semantic alignment precision and action classification accuracy. The strategy not only strengthens the model's content understanding of global and local features but also effectively integrates visual and linguistic features, providing strong contextual understanding and thus better recognition of unseen action categories.

This study conducted a series of experiments on the UCF101 and HMDB51 datasets. The results show that the proposed methods raise the best accuracy to 68.1% on UCF101 and to 47.1% on HMDB51, demonstrating their effectiveness and generalizability.

English Abstract:

As a popular research direction in video understanding, action recognition aims to automatically identify human actions and activities in video sequences. It has extensive application potential in fields such as intelligent surveillance, interactive media, and robotics. However, traditional action recognition methods still face numerous challenges, such as the burden of continuously and precisely annotating massive amounts of data and limited generalization in open environments. In practical applications, new types of actions keep emerging, while traditional methods are trained and tested in closed-set environments, which makes it difficult for them to generalize to unseen action categories. Zero-shot action recognition methods are designed to enable models to recognize actions not seen during the training phase, offering a promising way to relieve the burden of continuous precise annotation and to improve generalization performance in open environments. However, this field still faces difficulties in aligning visual features with semantic labels and in distinguishing action categories that are visually similar but semantically different.
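
To make the embedding-space formulation concrete, the sketch below shows how a zero-shot classifier of this kind can score an unseen action class: sampled frames and class-name prompts are mapped into a shared embedding space by a CLIP-style image/text encoder pair and compared by cosine similarity. All names here (image_encoder, text_encoder, tokenizer, and the prompt template) are illustrative assumptions rather than the implementation described in the thesis.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def zero_shot_classify(frames, class_names, image_encoder, text_encoder, tokenizer):
        """frames: (T, 3, H, W) tensor of frames sampled from one video."""
        # Encode each frame, then average over time to obtain one video embedding.
        frame_emb = image_encoder(frames)                        # (T, D)
        video_emb = F.normalize(frame_emb.mean(dim=0), dim=-1)   # (D,)

        # Encode the (possibly unseen) class names into the same embedding space.
        tokens = tokenizer([f"a video of a person {c}" for c in class_names])
        text_emb = F.normalize(text_encoder(tokens), dim=-1)     # (C, D)

        # Cosine similarity between the video and every class embedding.
        logits = video_emb @ text_emb.t()                        # (C,)
        return class_names[int(logits.argmax())]

Because the class names are only consumed through the text encoder, the list passed in may contain categories that never appeared in training, which is what makes the setup zero-shot.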

To address the deficiencies of existing models in semantic expression and visual feature capture, this study proposes three innovative methods that are interconnected and collectively enhance the model's overall performance:

Semantic Expansion Strategy: Applied to the model's textual branch, this strategy utilizes automated tools to extract descriptive texts related to specific actions from internet resources, which are then audited and optimized to ensure accuracy and relevance. Through deep textual analysis, these action descriptions include not only basic information but also rich contextual and variant information, effectively enhancing the model's deep understanding of actions and its ability to distinguish semantically different actions.
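
As a rough illustration of how expanded descriptions can be consumed by the text branch, the hypothetical sketch below replaces each bare class label with several descriptive sentences and averages their text embeddings into one class prototype; the hard-coded sentences merely stand in for the web-crawled, audited descriptions referred to above.

    import torch
    import torch.nn.functional as F

    # Placeholder expansions; the thesis gathers and audits such texts automatically.
    EXPANDED_DESCRIPTIONS = {
        "archery": [
            "archery",
            "a person draws a bow and releases an arrow toward a distant target",
            "an athlete aims a recurve bow on an outdoor shooting range",
        ],
        # ... one entry per action class, seen or unseen
    }

    @torch.no_grad()
    def class_prototypes(text_encoder, tokenizer):
        """Average the embeddings of all descriptions of a class into one prototype."""
        protos = []
        for descriptions in EXPANDED_DESCRIPTIONS.values():
            emb = F.normalize(text_encoder(tokenizer(descriptions)), dim=-1)  # (K, D)
            protos.append(emb.mean(dim=0))                                    # (D,)
        return torch.stack(protos)                                            # (C, D)

The prototypes can then be used in place of single label embeddings wherever the text branch compares classes against video features.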

Motion Information Module: In the visual branch, this module enhances the model's ability to capture fine changes between video frames through local motion information analysis. Utilizing a series of convolutional networks combined with local correlation calculations, this module can precisely capture and analyze subtle movements between adjacent frames, thereby significantly improving the model's ability to recognize complex or similar actions.
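
One plausible realization of such a local-correlation motion cue is sketched below: each frame's feature map is correlated with a small spatial neighbourhood of the following frame, and the correlation scores are projected back into a motion feature map. The layer sizes, neighbourhood radius, and normalization are assumptions made for illustration and may differ from the module actually used.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LocalMotionModule(nn.Module):
        """Correlate each frame with a small neighbourhood of the next frame."""

        def __init__(self, channels, radius=3):
            super().__init__()
            self.radius = radius
            k = 2 * radius + 1
            # Map the (2r+1)^2 correlation scores back to a feature map.
            self.proj = nn.Conv2d(k * k, channels, kernel_size=1)

        def forward(self, feats):
            # feats: (B, T, C, H, W) per-frame feature maps from a 2D backbone.
            B, T, C, H, W = feats.shape
            cur = feats[:, :-1].reshape(B * (T - 1), C, H, W)   # frame t
            nxt = feats[:, 1:].reshape(B * (T - 1), C, H, W)    # frame t+1

            k = 2 * self.radius + 1
            # Gather every (2r+1) x (2r+1) neighbourhood of the next frame.
            patches = F.unfold(nxt, k, padding=self.radius)     # (B', C*k*k, H*W)
            patches = patches.view(B * (T - 1), C, k * k, H, W)

            # Dot product between the current pixel and each neighbour of frame t+1.
            corr = (cur.unsqueeze(2) * patches).sum(dim=1) / C ** 0.5  # (B', k*k, H, W)
            motion = self.proj(corr)                                    # (B', C, H, W)
            return motion.view(B, T - 1, C, H, W)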

Multi-layer Video Guidance Prompt Method: This method addresses the challenges of aligning visual and textual features through a multi-level feature fusion strategy, thereby enhancing the model's semantic alignment accuracy and action classification precision. This strategy not only strengthens the model's understanding of content at both global and local levels but also effectively integrates visual and textual features, providing powerful contextual understanding capabilities that better recognize unseen action categories.
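
The sketch below shows one way such multi-level fusion could be wired up: visual features taken from several encoder layers are projected to a common width and serve as keys and values in a cross-attention step that conditions the text (prompt) representation on both shallow and deep video cues. This is a hypothetical reading of the description above, not the thesis's exact architecture.

    import torch
    import torch.nn as nn

    class MultiLayerVideoGuidedPrompt(nn.Module):
        """Fuse visual features from several encoder layers into the text representation."""

        def __init__(self, visual_dims, text_dim, num_heads=8):
            super().__init__()
            # One projection per selected visual layer (e.g. shallow, middle, deep).
            self.projs = nn.ModuleList([nn.Linear(d, text_dim) for d in visual_dims])
            self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(text_dim)

        def forward(self, text_emb, visual_feats):
            # text_emb: (B, N_text, D); visual_feats: list of (B, N_i, D_i) tensors,
            # one per selected encoder layer.
            tokens = torch.cat(
                [proj(f) for proj, f in zip(self.projs, visual_feats)], dim=1
            )                                                   # (B, sum N_i, D)
            guided, _ = self.cross_attn(text_emb, tokens, tokens)
            return self.norm(text_emb + guided)                 # video-guided text features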

This research conducted a series of experiments on the UCF101 and HMDB51 datasets. The results show that the proposed methods achieved a maximum accuracy of 68.1% on the UCF101 dataset and 47.1% on the HMDB51 dataset, demonstrating the effectiveness and generalizability of these methods.

Total References:

 78    

About the Author:

 赵晓冲 is a student in the Signal and Information Processing program at the School of Artificial Intelligence, Beijing Normal University. The author's research direction is computer vision, with main work on image recognition and action recognition.

Call Number:

 硕081002/24006    

Open Access Date:

 2025-06-16    
