Thesis information

Chinese title:

 因果效应半参估计与子群识别    

Name:

 吴鹏 (Wu Peng)

Confidentiality level:

 Public

Thesis language:

 Chinese

Discipline code:

 0714Z2    

Discipline:

 Applied Statistics

Student type:

 Doctoral

Degree:

 Doctor of Science

Degree type:

 Academic degree

Degree year:

 2020    

Campus:

 Beijing campus

School:

 School of Statistics / Institute of National Accounts

Research area:

 Causal inference

Primary advisor:

 童行伟 (Tong Xingwei)

Primary advisor's affiliation:

 School of Statistics, Beijing Normal University

Submission date:

 2020-06-21    

Defense date:

 2020-06-21    

English title:

 Semiparametric Estimation of Causal Effect and Subgroup Detection    

Chinese keywords:

 倾向性得分投影 ; 平均因果效应 ; 异质性因果效应 ; 机器学习方法 ; 子群识别    

English keywords:

 Projection based on propensity score ; Average causal effect ; Heterogeneous causal effect ; Machine learning methods ; Subgroup detection    

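Chinese abstract (translated):

Exploring causal relationships between things is one of the ultimate goals pursued by many disciplines, including philosophy, the natural sciences, and the social sciences. In recent years, with the development of machine learning and artificial intelligence, more and more researchers are no longer satisfied with discovering correlations and instead seek the causal relationships hidden in big data. Causal inference is now widely applied in artificial intelligence, econometrics, philosophy, cognitive science, epidemiology, health care, and policy evaluation. This thesis studies the estimation of the average causal effect and the heterogeneous (or conditional) causal effect with multidimensional covariates, together with the problem of subgroup detection.

First, building on the idea of projection, Chapter 2 combines the propensity score with a semiparametric regression model to propose an effective method for estimating the average causal effect. Rather than using the propensity score as a weight to remove confounding bias, the method includes it as a covariate in the regression model. The method does not depend on the specification of the regression model, avoids the curse of dimensionality, and effectively overcomes the problem of extreme weights (propensity scores approaching 0 or 1). In addition, we further improve the estimator by accounting for heteroscedasticity in the error term. On the theoretical side, using the properties of B-splines, the thesis proves the consistency and asymptotic normality of the proposed estimators and gives a formula for estimating the asymptotic variance. Simulations and empirical results show that the method outperforms comparable methods in finite samples.

Next, when individual causal effects are heterogeneous, researchers are usually more interested in the average causal effect within certain subpopulations. Based on the propensity score and projection theory, Chapter 3 proposes a new nonparametric estimator of the heterogeneous causal effect. The estimator retains all the advantages of the method in the previous chapter. Theoretical results show that it is consistent and asymptotically normal, and that its asymptotic variance can be estimated. Simulations show that it outperforms comparable methods in finite samples. In the empirical analysis, the thesis uses observational data on white mothers in Pennsylvania to examine how the causal effect of maternal smoking during pregnancy on infant birth weight varies with maternal age.

Chapter 4 then uses extensive simulations to examine how estimating the propensity score with different machine learning methods affects the matching estimator of the average causal effect. We find that ensemble learning methods, especially causal random forests, perform well. In the empirical analysis, we apply these methods to data on typhoons that have occurred in mainland China since 1949.

Finally, subgroup detection in heterogeneous samples is a key problem in precision medicine. Clinical studies show that patients with different genotypes and phenotypes can respond very differently to the same treatment. In practice, researchers want to classify individuals by causal effect to support decision-making; for example, a doctor can choose a suitable treatment plan according to the classification. By combining the K-means method with a concave pairwise fusion penalty, Chapter 5 proposes an effective algorithm that automatically detects subgroups and estimates causal effects in regression analysis. The algorithm ensures that individuals within each subgroup are similar not only in their covariates but also in their causal effects. In the empirical analysis, the thesis applies the algorithm to data from a US Medicaid assessment survey to study the heterogeneous causal effect of purchasing health insurance on the number of emergency room visits.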

English abstract:

Exploring causality is one of the ultimate goals pursued by many disciplines, such as philosophy, the natural sciences, and the social sciences. In recent years, with the development of machine learning and artificial intelligence technologies, a growing number of researchers are no longer satisfied with discovering correlations and instead seek the causal relationships hidden in big data. At present, causal inference is widely used in artificial intelligence, econometrics, philosophy, cognitive science, epidemiology, public health, and policy evaluation. In this thesis, we study the estimation of the average causal effect and the heterogeneous (or conditional) causal effect with multidimensional covariates, as well as subgroup detection.

First, based on the idea of projection, in Chapter 2 we propose combining the propensity score with a semiparametric regression model to obtain an estimator of the average causal effect. Instead of using the propensity score as a weight to eliminate confounding bias, we treat it as a covariate and include it directly in the regression model. The proposed method does not rely on the specification of the outcome model and avoids the "curse of dimensionality". It also effectively overcomes the harmful impact of extreme weights, which arise when propensity scores are close to 0 or 1. Furthermore, we modify the estimator to adjust for error heteroscedasticity, which improves its efficiency. Theoretically, based on the properties of B-splines, the proposed estimators are shown to be consistent and asymptotically normal, and an estimator of the asymptotic variance is given. Simulation studies and empirical results show that the proposed estimators perform better than competing methods in finite samples.

Then, given the heterogeneity of individual effects, researchers are often more interested in the average causal effect in particular subpopulations. In Chapter 3, we propose a new nonparametric estimation strategy for the heterogeneous causal effect based on the propensity score and projection theory. The proposed method retains the advantages mentioned above, and the proposed estimator is shown to be consistent and asymptotically normal, with a variance that can be estimated. The simulation studies indicate that the proposed estimator compares favorably with competing ones. In the application, we use a dataset of observations on white mothers in Pennsylvania, USA, to explore how the effect of maternal smoking on birth weight varies with maternal age.

Next, in Chapter 4, we combine machine learning and matching techniques to estimate the average causal effect. By comparing a variety of machine learning methods for estimating the propensity score under extensive simulation scenarios, we find that ensemble methods, especially causal random forests, compare favorably with the others. We apply all the methods to data on tropical storms that have occurred in mainland China since 1949.

Finally, one of the most important issues in personalized medicine is the regression analysis of heterogeneous samples with subgroup structures. Clinically, patients with different genotypes and phenotypes often show heterogeneous responses to the same treatment. In practice, researchers hope to identify subgroup structures that ensure a similar causal effect within each subgroup, so that doctors can choose an appropriate treatment according to the classification. By combining the K-means method with a concave pairwise fusion penalty, in Chapter 5 we propose an effective algorithm that automatically conducts subgroup detection and estimates causal effects in regression analysis. We then apply the algorithm to a dataset from the Ohio Medicaid Assessment Survey (2012) to study the heterogeneous causal effects of having health insurance on the number of emergency room visits.

Total number of references:

 152    

Holding location:

 Library thesis reading area (Main Library, South Wing, 3rd floor, Zone BC)

Call number:

 博0714Z2/20001    

Public release date:

 2021-06-21    
