Thesis information

Chinese title:

 基于半监督的声音事件识别方法研究及其在课堂中的应用    

Name:

 饶鹏昊    

Confidentiality level:

 Public

Thesis language:

 Chinese

Discipline code:

 081002    

Discipline:

 Signal and Information Processing

Student type:

 Master's

Degree:

 Master of Engineering

Degree type:

 Academic degree

Degree year:

 2021    

Campus:

 Beijing campus

School:

 School of Artificial Intelligence

Research direction:

 Sound event recognition

Primary advisor name:

 何珺    

Primary advisor affiliation:

 School of Artificial Intelligence, Beijing Normal University

Submission date:

 2020-12-31    

Defense date:

 2020-12-19    

Foreign-language title:

 STUDY ON SOUND EVENT DETECTION METHOD WITH SEMI-SUPERVISED LEARNING AND ITS APPLICATION IN CLASSROOM    

Chinese keywords:

 Deep learning ; Semi-supervised learning ; Sound event recognition ; Convolutional neural network ; Recurrent neural network

Foreign-language keywords:

 Deep Learning ; Semi-supervised learning ; Sound Event Detection ; Convolutional Neural Network ; Recurrent Neural Network    

Chinese abstract:
 

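Sound is an important medium through which humans communicate and perceive their environment, and a sound event is a basic unit of content within sound, such as speech or footsteps. Compared with mature research fields such as speech recognition and music retrieval, research on sound event recognition started late. Early work concentrated on recognizing abnormal sound events with few samples and few categories. Since the rise of deep learning, the demands placed on sound event recognition have grown steadily: the numbers of samples and categories have increased markedly; audio is labeled only with its category, not with the position at which the sound event occurs; and accuracy must be improved even when some labels are inaccurate. Much remains to be explored before sound event recognition reaches practical application.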

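Forms of teaching keep changing and upgrading with the times. From the traditional classroom model, to new forms of the Internet era such as the flipped classroom, STEAM (Science, Technology, Engineering, Arts, Mathematics), and MOOCs, and on to the "New Generation Artificial Intelligence Development Plan" issued by the State Council in 2017, many scholars have been studying what artificial intelligence can newly bring to education. To simplify campus management and give timely feedback on teaching, classrooms are usually fitted with audio and video monitoring systems, which produce large amounts of audio and video data every day. Organizing this data by hand is time-consuming and laborious, and annotation is both heavy and difficult work, so labeled samples make up only a small part of the dataset. This thesis proposes a semi-supervised method for classroom sound event recognition that combines a small amount of finely labeled data with a large amount of unlabeled data to recognize key classroom sound events automatically. The specific research content is as follows.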

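First, multiple receptive fields, residual connections, and an attention mechanism are added to a Convolutional Recurrent Neural Network (CRNN), and the effects of different models and features are compared on the Chinese Natural Emotional Audio-Visual Database 2.0 (CHEAVD 2.0). Next, a CRNN-based co-training semi-supervised algorithm is proposed: unlabeled audio is analyzed both as a one-dimensional time-domain signal and as a two-dimensional spectrogram image, the input features are perturbed several times and fed to the classification model, and the resulting predictions are combined to obtain reliable pseudo-labels that guide model learning. A comparison of semi-supervised and supervised experiments on the Detection and Classification of Acoustic Scenes and Events Task2 (DCASE2019-Task2) task shows the necessity of semi-supervised learning. Then, a large amount of classroom sound data is collected in real environments and annotated and organized into a complete database. The proposed semi-supervised method is compared with the Transductive Support Vector Machine (TSVM) semi-supervised model, Mean Teacher, and other methods, which shows its effectiveness. Finally, ablation experiments verify the role of the different regularization terms in the semi-supervised algorithm.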

 

Foreign-language abstract:
 

Sound is an important medium through which humans communicate and perceive their environment. Sound events are the basic content contained in sound, such as speech and footsteps. Compared with mature research fields such as speech recognition and music retrieval, research on sound event recognition started late. Early sound event recognition research focused mainly on detecting abnormal sound events with a small number of samples and categories. Since the rise of deep learning, the requirements for sound event recognition have gradually increased: the numbers of samples and categories have grown significantly; audio is labeled only with its category, not with the location of the sound event; and prediction accuracy must be improved even when some labels are wrong. Sound event recognition still has a long way to go before it reaches practical application.

With the development of the times, the form of teaching is constantly changing and upgrading. From traditional classroom teaching, to flipped classrooms, STEAM (Science, Technology, Engineering, Arts, Mathematics), and MOOCs (massive open online courses) in the Internet age, and then to the "Development Plan of New Generation Artificial Intelligence" issued by the State Council in 2017, researchers have been exploring what artificial intelligence can newly bring to education. To facilitate campus management and provide feedback on teaching, audio and video surveillance systems are usually installed in classrooms, and they generate a large amount of audio and video data every day. However, sorting this data manually is time-consuming, laborious, and difficult, which results in a lack of labeled samples. This thesis proposes a semi-supervised classroom sound event recognition method that makes full use of a small amount of finely labeled data and a large amount of unlabeled data. The specific content of the thesis is as follows.

Firstly, we designed a Convolutional Recurrent Neural Network (CRNN) with multiple receptive fields, residual connections, and an attention mechanism. The effects of different models and features were compared on the Chinese Natural Emotional Audio-Visual Database 2.0 (CHEAVD 2.0). Next, a co-training semi-supervised algorithm based on the CRNN was proposed. Unlabeled audio signals were analyzed as one-dimensional time-domain signals and as two-dimensional spectrograms, and the input features were perturbed multiple times. By combining the network's predictions across these perturbations, “reliable” pseudo-labels were obtained to guide model learning. Experiments on the Detection and Classification of Acoustic Scenes and Events Task2 (DCASE2019-Task2) demonstrated the effectiveness of the semi-supervised algorithm. Then, a large amount of classroom sound data was collected in real environments. The semi-supervised learning method proposed in this thesis was compared with the Transductive Support Vector Machine (TSVM), Mean Teacher, and other methods on the classroom dataset to illustrate the effectiveness of our algorithm. Finally, ablation experiments were performed to verify the role of the different regularization terms in the proposed semi-supervised algorithm.
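
The pseudo-labeling step mentioned in the abstract (perturb the input several times, then combine the predictions) can be illustrated with a minimal sketch. The abstract does not say how the inputs are perturbed, how predictions are combined, or how reliability is decided, so everything here is an assumption for illustration: Gaussian noise as the perturbation, averaging as the combination rule, a confidence threshold as the reliability test, and the names `pseudo_labels`, `toy_predict`, `n_perturb`, `noise_std`, and `threshold`. A toy linear classifier stands in for the CRNN.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pseudo_labels(predict, features, n_perturb=8, noise_std=0.1,
                  threshold=0.9, seed=0):
    """Assign pseudo-labels to unlabeled samples (illustrative only).

    predict  : callable mapping an (n, d) array to (n, c) class probabilities
    features : (n, d) array of unlabeled input features
    Returns (labels, reliable): the predicted class per sample, and a boolean
    mask marking samples whose averaged confidence reaches `threshold`.
    """
    rng = np.random.default_rng(seed)
    # Assumed perturbation: additive Gaussian noise on the input features.
    probs = np.stack([
        predict(features + rng.normal(0.0, noise_std, size=features.shape))
        for _ in range(n_perturb)
    ])                                  # shape (n_perturb, n, c)
    mean_probs = probs.mean(axis=0)     # assumed combination rule: averaging
    labels = mean_probs.argmax(axis=1)
    reliable = mean_probs.max(axis=1) >= threshold
    return labels, reliable

# Toy stand-in for the classification model (hypothetical, not the CRNN).
TOY_WEIGHTS = np.array([[5.0, -5.0], [-5.0, 5.0]])

def toy_predict(x):
    return softmax(x @ TOY_WEIGHTS)

if __name__ == "__main__":
    unlabeled = np.array([[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]])
    labels, reliable = pseudo_labels(toy_predict, unlabeled, n_perturb=16)
    print(labels, reliable)
```

In this sketch only samples flagged as reliable would be added to the training set with their pseudo-labels; a sample sitting on the decision boundary, such as the third one above, gets inconsistent predictions under perturbation and is left out.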

Total number of references:

 50    

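About the author:

 Master of Engineering in Signal and Information Processing, School of Artificial Intelligence, Beijing Normal University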

Library call number:

 硕081002/21015    

Open-access date:

 2021-12-31    
