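Chinese title:	随机森林阈值插补方法对认知诊断模型中缺失数据的处理及应用
Author:	游晓锋 (You Xiaofeng)
Confidentiality level:	Public
Thesis language:	Chinese
Discipline code:	04020005
Discipline / major:	05 Psychometrics (040200)
Student type:	Doctoral
Degree:	Doctor of Education
Degree type:	Academic degree
Degree year:	2020
Campus:	Beijing campus
School:	Faculty of Psychology
Research direction:	Educational measurement
First supervisor:	刘红云 (Liu Hongyun)
First supervisor's affiliation:	Faculty of Psychology, Beijing Normal University
Submission date:	2020-06-19
Defense date:	2020-06-19
English title:	Random Forest Threshold Imputation Method for Handling Missing Data in Cognitive Diagnosis Model and Its Application
Chinese keywords:	缺失数据处理 ; 认知诊断评价 ; 机器学习 ; 随机森林阈值插补
English keywords:	Missing Data Imputation ; Cognitive Diagnostic Assessment ; Machine Learning ; Random Forest Threshold Imputation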
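Chinese abstract (translated):	︿Cognitive diagnostic assessment, an emerging form of testing, is attracting growing attention from researchers in China and abroad. At the same time, missing responses on some items, arising from the test design, are a practical problem for cognitive diagnostic assessment. How to handle missing data in cognitive diagnostic tests better, so that students and teachers receive more accurate diagnostic feedback, is therefore very important. In recent years researchers have applied machine learning to the imputation of missing data. The random forest algorithm offers high classification and regression performance, runs quickly and efficiently, handles multi-class problems effectively, and has a clear advantage in coping with noise; it has been shown to be an advanced learner. The random forest imputation method, developed from it for missing data handling, requires no prior assumption about the missingness mechanism. Instead, it relies on the characteristics of the data themselves and makes full use of examinees' response information and response-pattern features to impute missing data. Drawing on the strengths of random forests in classification and prediction, and on the fact that random forest imputation does not depend on assumptions about the missingness mechanism, this thesis improves the existing random forest imputation method so that it is suitable for handling missing data in cognitive diagnosis models. Based on the DINA model (Deterministic Inputs, Noisy "And" Gate model), which is commonly used among cognitive diagnosis models, the thesis introduces a person-fit index, the Response Conformity Index (RCI), into the imputation process to determine the imputation threshold, and develops a new partial (non-complete) imputation method for handling missing data under the DINA model: the random forest threshold imputation method. To verify the effectiveness, influencing factors and practical application of the proposed method, and its advantages over traditional missing data handling methods, the thesis reports four studies: three simulation studies and one applied study.
Study 1 first describes in detail the theoretical basis and algorithmic implementation of the random forest threshold imputation (RFT) method. Two Monte Carlo simulations then verify the effectiveness of the new method from two angles: the imputation rate and accuracy for the missing data, and the accuracy of DINA model parameter estimation. The applicability of the new method is also examined under different missingness mechanisms (MIXED, MNAR, MAR, MCAR) and missingness proportions (10%, 20%, 30%, 40%, 50%). The results show that: (1) RFT is clearly more accurate than the traditional random forest imputation method, and under all conditions the proportion of data still missing after RFT processing is around 10%. (2) Compared with person mean imputation, two-way imputation, the EM algorithm, multiple imputation and random forest imputation, RFT yields the highest attribute pattern match ratio and attribute marginal match ratio under all conditions, but this advantage is affected by the missingness proportion and mechanism: it is more pronounced under the MIXED and MNAR mechanisms and when the missingness proportion is high (above 30%). However, RFT shows no advantage in estimating the item parameters of the DINA model.
Study 2 uses Monte Carlo simulation to examine how sample size, test length, number of attributes and attribute hierarchy affect the results of RFT, compares RFT with six existing methods for handling missing data, and further explores the conditions under which the new method is applicable. The results show that: (1) for the attribute pattern match ratio and the attribute marginal match ratio, the match ratios rise markedly as test length increases; they also tend to rise with sample size, but by relatively little; they fall as the number of attributes increases; and they are highest when the attribute hierarchy is linear and lowest when it is unstructured. (2) For item parameter estimation, the RFT estimates become more accurate as sample size grows but are little affected by test length; the effects of the number of attributes and the attribute hierarchy show no consistent trend. (3) Consistent with Study 1, RFT estimates examinees' attribute mastery patterns better than the other missing data handling methods under all conditions.
Study 3 is based on the complete empirical data from Tatsuoka's (2002) classic fraction subtraction test. It uses the Bootstrap simulation method to construct conditions with different missingness mechanisms (MIXED, MNAR, MAR, MCAR) and different missingness proportions (10%, 20%, 30%, 40%, 50%), further verifying the effectiveness of RFT. The results show that: (1) similar to Study 1, RFT performs well on imputation rate and accuracy with the empirical data; (2) the attribute pattern match ratio obtained with RFT is better than that of the other missing data handling methods under all conditions, especially under the MIXED and MNAR mechanisms; however, the item parameter estimates obtained with the MI method are more accurate than those of the other methods.
Study 4 draws on two application scenarios in educational assessment to examine how RFT performs in practice and to verify its practical feasibility and effectiveness. Based on a cognitive ability test for first-year junior high school students, two test designs and testing scenarios were set up, one with a random missingness pattern and one with a non-random missingness pattern, yielding missing data sets with different characteristics from actual testing. External criteria were used to compare the missing data handling methods on the attribute pattern match ratio. The results show that RFT outperforms the traditional missing data handling methods under both missingness patterns and, consistent with the simulation studies, that its advantage is more pronounced under the non-random pattern. In addition, the correlation between the number of attributes mastered, as estimated with RFT, and subject achievement is more consistent with the corresponding correlation obtained from the benchmark test (no missing data).
In sum, this research combines machine learning with a measurement model to propose a new method for handling missing data under the DINA model, and verifies, from both theoretical and practical perspectives, its advantage in estimating students' attribute mastery patterns. Theoretically, it offers a missing data handling method that rests on weak assumptions, enriching methodological research on missing data handling in cognitive diagnosis models. Practically, it explores a way to analyze students' knowledge states relatively accurately from their responses to relatively few test items, and offers a new approach to handling missing data that arise from test design or the needs of teaching practice.﹀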
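For orientation, the DINA model named in the abstract is usually written as P(X_ij = 1 | alpha_i) = (1 - s_j)^eta_ij * g_j^(1 - eta_ij), where eta_ij = 1 only if examinee i has mastered every attribute that item j requires. The following is a minimal sketch of that response function, not code from the thesis; the function names, the Q-matrix row and the slip and guess values are illustrative assumptions.

```python
def ideal_response(alpha, q_row):
    """Return eta: 1 if every attribute the item requires (q == 1) is mastered (alpha == 1)."""
    return int(all(a == 1 for a, q in zip(alpha, q_row) if q == 1))


def dina_prob_correct(alpha, q_row, slip, guess):
    """DINA probability of a correct response: 1 - slip when eta == 1, guess otherwise."""
    return 1.0 - slip if ideal_response(alpha, q_row) == 1 else guess


# Hypothetical item that requires attributes 1 and 2 out of 3 (assumed values).
q_row = [1, 1, 0]
p_master = dina_prob_correct([1, 1, 0], q_row, slip=0.1, guess=0.2)
p_nonmaster = dina_prob_correct([1, 0, 1], q_row, slip=0.1, guess=0.2)
```

An examinee who lacks even one required attribute falls back to the guessing probability, which is why the model is described as a noisy "and" gate.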
English abstract:	︿As a new form of testing, cognitive diagnostic assessment has attracted wide attention from researchers at home and abroad. At the same time, missing data caused by the test design is a common problem in cognitive diagnostic tests. It is therefore important to develop an effective way of handling missing data in cognitive diagnostic assessment, so that the diagnostic feedback provided to students and teachers is more accurate and reliable. Machine learning has been applied to the imputation of missing data in recent years. Among machine learning algorithms, the random forest has proved to be a state-of-the-art learner: it performs well and efficiently on classification and regression tasks, handles multi-class classification problems effectively, and has a distinct advantage in coping with noise. The random forest imputation method, which builds on the random forest algorithm to handle missing data, makes full use of the available response information and the characteristics of participants' response patterns to impute missing data, rather than assuming a missingness mechanism in advance. Combining the strengths of the random forest in classification and prediction with the fact that random forest imputation does not rely on assumptions about the missingness mechanism, this thesis improves the existing random forest imputation algorithm so that it can be applied to missing data in cognitive diagnostic assessment.
On the basis of the DINA (Deterministic Inputs, Noisy "And" Gate) model, which is widely used in cognitive diagnostic assessment, the thesis introduces the Response Conformity Index (RCI), a person-fit index, into missing data imputation to determine the imputation threshold, and thereby proposes a new partial (non-complete) imputation method for handling missing responses in the DINA model: the random forest threshold imputation (RFT) approach. Three simulation studies and one applied study were conducted to validate the effectiveness of RFT and to identify the factors that influence it and its practical applications. The advantages of the new method were also examined by comparing it with traditional techniques for handling missing data. In Study 1, the theoretical basis and algorithmic implementation of RFT were described in detail. Two Monte Carlo simulations were then used to validate the effectiveness of RFT in terms of imputation rate and imputation accuracy, as well as the accuracy of DINA model parameter estimation. The applicability of RFT was also investigated under different missingness mechanisms (MNAR, MIXED, MAR and MCAR) and different proportions of missing values (10%, 20%, 30%, 40% and 50%). The results showed that (1) the imputation accuracy of RFT was markedly higher than that of the traditional random forest imputation method, and the proportion of data still missing after RFT was around 10% under all conditions; (2) compared with person mean imputation, two-way imputation, the EM algorithm, multiple imputation and random forest imputation, RFT yielded the highest attribute pattern match ratio and attribute marginal match ratio under all conditions. This advantage, however, depended on the proportion and mechanism of missing data: it was more pronounced when the missingness mechanism was MNAR or MIXED and when the proportion of missing responses exceeded 30%.
However, the new algorithm showed no advantage in estimating the item parameters of the DINA model. Study 2 used Monte Carlo simulation to examine the influence of sample size, test length, number of attributes and attribute hierarchy on RFT. The conditions under which the new method is applicable were further investigated by comparing it with six existing approaches for handling missing data, leading to the following conclusions: (1) for the attribute pattern match ratio and the attribute marginal match ratio, the match ratios increased markedly with test length; they increased only slightly with sample size; they decreased as the number of attributes increased; and they were highest when the attribute hierarchy was linear and lowest when it was unstructured; (2) for item parameter estimation, accuracy improved with increasing sample size but depended little on test length, and no consistent trend was observed for the number of attributes or the attribute hierarchy; (3) consistent with Study 1, RFT estimated participants' attribute mastery patterns better than the other approaches for handling missing data under all conditions. In Study 3, the effectiveness of RFT was further examined with the Bootstrap simulation method, using the complete empirical data from Tatsuoka's (2002) classic fraction subtraction test to construct conditions with different missingness mechanisms (MNAR, MIXED, MAR and MCAR) and different proportions of missing data (10%, 20%, 30%, 40% and 50%).
The results showed that (1) as in Study 1, RFT performed well in imputation rate and accuracy on the empirical data; (2) the pattern match ratio obtained with RFT was better than that of the other approaches under all conditions, especially under the MIXED and MNAR mechanisms; however, the MI method produced more accurate item parameter estimates than the other missing data handling methods. Study 4 explored the feasibility and effectiveness of RFT in two scenarios that arise in educational assessment. Based on a self-designed cognitive ability test for first-year junior high school students, two test designs, one producing data missing at random (MAR) and one producing data missing not at random (MNAR), were set up to obtain missing data sets with different features. The missing data handling methods were then compared on the attribute mastery pattern match ratio using an external criterion. RFT outperformed the other missing data handling methods under both modes, in line with the simulation studies, and its advantage was more pronounced under MNAR. Moreover, the correlation between the number of mastered attributes estimated with RFT and students' subject achievement was more consistent with the corresponding correlation obtained from the benchmark test (no missing data). In conclusion, this study combines machine learning with a measurement model to introduce a new method, RFT, for handling incomplete data in the DINA model, and examines its advantages in estimating students' attribute mastery patterns from both theoretical and practical perspectives. Theoretically, the new method offers an approach to missing data that rests on weaker assumptions, enriching the methods available for handling incomplete responses in cognitive diagnostic assessment.
Practically, RFT not only enables assessors to understand students' knowledge states more accurately when students' responses are incomplete, but also offers a new approach to handling missing data that result from test design or the needs of teaching practice.﹀
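The idea of threshold-based partial imputation described in the abstract can be illustrated with a small sketch. This is not the thesis's algorithm: the thesis determines the threshold from the RCI, whereas the sketch below assumes a fixed, hypothetical threshold, and it assumes the predicted probabilities of a correct response come from some already fitted classifier (for example a random forest), which is not shown. All names are illustrative.

```python
def threshold_impute(prob_correct, threshold=0.8):
    """Impute 1 or 0 only when the prediction is decisive; otherwise keep the value missing.

    `threshold` is a hypothetical fixed cut-off (assumption); the thesis derives
    its threshold from the Response Conformity Index instead.
    """
    if not 0.5 < threshold <= 1.0:
        raise ValueError("threshold must lie in (0.5, 1.0]")
    if prob_correct >= threshold:
        return 1
    if prob_correct <= 1.0 - threshold:
        return 0
    return None  # prediction too uncertain: leave the response missing


def impute_responses(responses, predicted_probs, threshold=0.8):
    """Fill in missing responses (None) where the predicted probability passes the threshold."""
    return [
        threshold_impute(p, threshold) if x is None else x
        for x, p in zip(responses, predicted_probs)
    ]


# One examinee: observed responses with three missing entries (assumed data).
observed = [1, None, None, None, 0]
probs = [0.9, 0.95, 0.1, 0.6, 0.3]  # assumed classifier outputs, P(correct)
imputed = impute_responses(observed, probs)
```

Because uncertain predictions are left missing, some proportion of the data remains missing after imputation, which matches the abstract's description of the method as a partial imputation.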
Total references:	130
Call number:	博040200-05/20002
Open-access date:	2021-06-19
