Thesis information

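Chinese title:	Building a Knowledge Graph and Question Answering System Based on Introduction to Statistics (基于《统计学导论》知识图谱与问答系统创建)
Name:	Zhang Yiwen (张议文)
Confidentiality level:	Public
Thesis language:	chi
Discipline code:	025200
Discipline:	Applied Statistics
Student type:	Master's
Degree:	Master of Applied Statistics
Degree type:	Professional degree
Degree year:	2024
Campus:	Beijing campus
School:	School of Statistics
Research direction:	Economic and financial statistics
First supervisor:	Jin Jiao (金蛟)
First supervisor's affiliation:	School of Statistics
Submission date:	2024-06-13
Defense date:	2024-05-14
Foreign-language title:	KNOWLEDGE GRAPH AND QUESTION ANSWERING SYSTEM BUILDING BASED ON INTRODUCTION TO STATISTICS
Chinese keywords:	knowledge graph (知识图谱) ; question answering system (问答系统) ; Introduction to Statistics (统计学导论) ; machine learning (机器学习)
Foreign-language keywords:	Knowledge graph ; Question answering system ; Introduction to statistics ; Machine learning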
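Chinese abstract:	This thesis builds a knowledge graph and a question answering system from the text of the undergraduate statistics textbook Introduction to Statistics. Following a top-down approach, it first defines three ontologies: important concepts and subsidiary concepts together with their attributes; the affiliation relations between important concepts and their corresponding subsidiary concepts; and the hypernym-hyponym relations among important concepts. The data layer is built according to these ontologies. Pattern matching combined with manual review identifies 93 main nodes, 42 secondary nodes, and 5 types of relation between main and secondary nodes (prerequisite explanation, extended learning, properties involved, methods involved, and learning by example), yielding a node information database of the form <node, attribute> and an affiliation relation database of the form <main node, affiliation relation, secondary node>. Hypernym-hyponym relations between main nodes are identified with supervised machine learning, treated as a multi-class classification problem. The features are the number of times the hypernym's name occurs in the definition of the hyponym and whether it is the first to occur. Experiments use four methods: logistic regression, support vector machine, decision tree, and random forest. After parameter tuning, random forest achieves the highest average accuracy, 0.891; the linear support vector machine follows with an average accuracy of 0.887 but is less sensitive to how the data are split into training and test sets. The machine learning results are compared and merged with the manual review results to form a hypernym-hyponym relation database of the form <main node A, hypernym-hyponym relation, main node B>. All node attributes and relation data are imported into the neo4j graph database to complete the knowledge graph. A question answering system based on the node and relation data and on pattern matching is then built and deployed locally; it answers questions about what a concept is and what related knowledge is involved in learning it.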
Foreign-language abstract:	This thesis builds a knowledge graph and a question answering system from the text of Introduction to Statistics, an undergraduate textbook for statistics majors. Using a top-down method, it defines three ontologies: important and subsidiary concepts with their attributes; the affiliation relations between the two kinds of concept; and the hyponymy relations among important concepts. The data layer is built according to these ontologies. Pattern matching and manual inspection identify 93 main nodes, 42 secondary nodes, and 5 types of relation between them: pre-explanation, extended learning, properties involved, methods involved, and examples. From these, two databases are formed, with the forms <node, attribute> and <main node, affiliation relation, secondary node>. Hyponymy between main nodes is identified with supervised machine learning as a multi-class classification problem. The selected features are the number of times the superordinate concept's name occurs in the subordinate concept's definition and whether it is the first concept to occur. Experiments with logistic regression, SVM, decision tree, and random forest show that, after parameter tuning, random forest has the highest average accuracy, 0.891, and linear SVM has the second highest, 0.887, with less sensitivity to the partition into training and test sets. After comparing and integrating the results of manual examination and machine learning, a hyponymy database of the form <main node A, hyponymy relation, main node B> is obtained. All data are imported into neo4j, a graph database, to complete the knowledge graph, and a question answering system is built on it. The system, deployed locally, can answer what a concept is and what related knowledge is involved in learning it.
Total references:	47
Call number:	硕025200/24025
Open access date:	2025-06-13
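
The two features named in the abstract (how often the superordinate concept's name occurs in the subordinate concept's definition, and whether it is the first concept to occur there) can be illustrated with a minimal Python sketch. The function name `pair_features`, the tie-breaking rule, and the three toy English definitions are assumptions made for illustration; they are not taken from the thesis or the textbook, and the thesis's actual preprocessing may differ.

```python
from typing import Dict, Tuple


def pair_features(upper: str, lower: str, definitions: Dict[str, str]) -> Tuple[int, int]:
    """Return (count, is_first) for one candidate (upper, lower) concept pair.

    count    : occurrences of the upper concept's name in the lower concept's definition
    is_first : 1 if the upper concept is the earliest-mentioned concept there, else 0
    """
    text = definitions[lower]
    count = text.count(upper)

    # Position of the first mention of every other known concept in the definition.
    first_pos = {
        name: text.find(name)
        for name in definitions
        if name != lower and name in text
    }
    if not first_pos:
        return count, 0

    # Earliest mention wins; on a tie, prefer the longer name (an assumed rule).
    earliest = min(first_pos, key=lambda n: (first_pos[n], -len(n)))
    return count, int(earliest == upper)


# Toy definitions, invented for this example only.
definitions = {
    "random variable": "a function that assigns a number to each outcome",
    "distribution": "a description of how likely each value of a random variable is",
    "expectation": "the average of a random variable weighted by its distribution; "
                   "it summarises where the random variable is centred",
}

features = {
    (upper, "expectation"): pair_features(upper, "expectation", definitions)
    for upper in definitions
    if upper != "expectation"
}
```

Feature pairs like these could then be passed to any of the four classifiers the abstract lists; the sketch stops at feature construction and makes no claim about the model settings used in the thesis.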
