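Chinese title: | Research on Financial Fraud Prediction of Listed Companies Incorporating MD&A Text |
Name: | |
Confidentiality level: | Public |
Thesis language: | chi |
Discipline code: | 025200 |
Discipline/major: | |
Student type: | Master's |
Degree: | Master of Applied Statistics |
Degree type: | |
Degree year: | 2024 |
Campus: | |
School: | |
Research direction: | Financial fraud identification; text analysis |
First supervisor name: | |
First supervisor affiliation: | |
Submission date: | 2024-06-18 |
Defense date: | 2024-05-18 |
Foreign-language title: | RESEARCH ON FINANCIAL FRAUD PREDICTION OF LISTED COMPANIES INCORPORATING MD&A TEXTS |
Chinese keywords: | |
Foreign-language keywords: | Financial Fraud Identification; Text Analysis; SHAP; XGBoost |
Chinese abstract: |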
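In today's uncertain market environment, the accuracy of corporate financial reports is essential to protecting investors' rights and maintaining market stability. Financial fraud seriously damages market fairness and transparency, and traditional financial indicators have limitations in exposing it. Prior research shows that adding non-financial indicators, such as corporate governance structure, can markedly improve the ability to predict financial fraud. With the development of big data and information technology, using company reports and other textual information has become an emerging approach to identifying financial fraud. This study takes A-share listed companies as its research object and selects as samples listed companies that, between 2010 and 2022, committed violations such as "fictitious profits", "major omissions", "false records (misleading statements)", "untrue disclosure", and "fictitious assets". Following the principles of similar total assets and same industry, a total of 2888 samples were drawn at a 1:1 ratio. Through in-depth analysis of the Management Discussion and Analysis (MD&A) section of listed companies' annual reports, a set of text indicators was constructed, including sentiment tone, text similarity, and the share of adversative words. In addition, financial distress indicators were incorporated and the time dimension of the financial indicators was extended, enriching the basic feature set. The distribution of each indicator and its differences between fraudulent and non-fraudulent companies were explored in a preliminary analysis. After model building and parameter tuning, the study selected the well-performing XGBoost model as the financial fraud identification model. Grouped experiments examined the effect of different indicator types on the model and validated its effectiveness. For model interpretation, XGBoost was combined with SHAP for global and local attribution analysis. The results show that, under the Mann-Whitney U test, text features such as text similarity with the previous year, word count, sentiment tone, number of specialized terms, and sentiment semantic score differ significantly between fraudulent and non-fraudulent companies, suggesting that these indicators may be related to fraud. In addition, manufacturing has the larger number of fraudulent companies, while real estate has the highest proportion of fraudulent companies. Among the basic indicators, net profit ratio of total assets (ROA), return on assets, equity multiplier, debt-to-asset ratio, total asset turnover, current asset turnover, management shareholding ratio, and ownership concentration differ significantly between fraudulent and non-fraudulent companies. The grouped experiments show that introducing financial distress indicators significantly improves predictive performance; integrating text features also brings some improvement, but its effect is relatively limited. Tests of individual and combined text features show that sentiment tone and the share of specialized terms help raise recall and F1, while text similarity with the previous year helps raise accuracy, recall, and F1. The SHAP global attribution analysis reveals the features with the largest influence on model predictions, such as O-Score, downside risk, and current asset turnover; the local attribution analysis uses waterfall plots to show the specific contribution of each feature for individual samples and is cross-checked against actual fraud cases, further confirming the model's effectiveness. |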
Foreign-language abstract: |
In the current uncertain market environment, the accuracy of corporate financial reporting is vital for safeguarding investor rights and market stability. Financial fraud severely undermines market fairness and transparency, yet conventional financial indicators are limited in exposing it. Studies have shown that incorporating non-financial indicators, such as corporate governance structure, can significantly improve the ability to predict financial fraud. With the advancement of big data and information technology, using company reports and other textual information has become an emerging method for detecting financial fraud. This study takes A-share listed companies as its research subjects and selects as samples companies that, from 2010 to 2022, committed violations such as "fabricating profits," "significant omissions," "false entries (misleading statements)," "disclosing falsehoods," and "inflating assets." Based on the principles of similar total assets and the same industry, a total of 2888 samples were drawn at a 1:1 ratio. By analyzing the Management Discussion and Analysis (MD&A) section of listed companies' annual reports, a series of textual indicators was constructed, including sentiment tone, text similarity, and the proportion of adversative words. In addition, financial distress indicators were incorporated and the temporal dimension of the financial indicators was extended, enriching the basic feature set. A preliminary exploration was conducted of the distribution of each indicator and its differences between fraudulent and non-fraudulent companies. After model building and parameter tuning, the study selected the well-performing XGBoost model for financial fraud identification. Grouped experiments explored the impact of different types of indicators on the model and verified its effectiveness. 
For model interpretation, XGBoost was combined with SHAP for global and local attribution analysis. The results show that, according to the Mann-Whitney U test, textual features such as text similarity with the previous year, word count, sentiment tone, the number of technical terms, and sentiment semantic score differ significantly between fraudulent and non-fraudulent companies, suggesting that these indicators may be related to fraud. Moreover, the manufacturing industry had the larger number of fraudulent companies, but the real estate industry had the highest proportion of fraudulent companies. Among the basic indicators, net profit ratio of total assets (ROA), return on assets, equity multiplier, debt-to-asset ratio, total asset turnover, current asset turnover, management shareholding ratio, and ownership concentration differed significantly between fraudulent and non-fraudulent companies. The grouped experiments indicated that introducing financial distress indicators significantly enhanced predictive performance; integrating textual features brought some improvement, but its impact was relatively limited. Tests of individual and combined textual features showed that sentiment tone and the proportion of technical terms helped improve the model's recall and F1 score, while text similarity with the previous year helped increase accuracy, recall, and F1 score. Global attribution analysis with SHAP revealed the features that most strongly affect model predictions, such as O-Score, downside risk, and current asset turnover; local attribution analysis used waterfall plots to show the specific influence of each feature in individual samples and was cross-checked against actual fraud cases, further confirming the model's effectiveness. |
Total references: | 45 |
Holding location: | Main Library B301 |
Call number: | 硕025200/24053Z |
Open-access date: | 2025-06-18 |
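
Illustrative note: the abstract lists "text similarity with the previous year" among the MD&A text features but does not say how it is computed. The sketch below shows one plausible way to build such a feature, as cosine similarity between character-bigram count vectors of two consecutive years' MD&A texts. The bigram representation and the function names `char_bigrams` and `cosine_similarity` are assumptions made for illustration, not the thesis's documented method.

```python
from collections import Counter
from math import sqrt


def char_bigrams(text: str) -> Counter:
    """Count overlapping character bigrams, ignoring whitespace.

    Character bigrams are an assumed stand-in for word segmentation,
    which the record does not describe.
    """
    chars = [c for c in text if not c.isspace()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def cosine_similarity(current: str, previous: str) -> float:
    """Cosine similarity between the bigram count vectors of two texts.

    Returns 0.0 when either text is too short to yield a bigram.
    """
    u, v = char_bigrams(current), char_bigrams(previous)
    dot = sum(count * v[gram] for gram, count in u.items())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Hypothetical MD&A snippets for two consecutive fiscal years.
    mdna_2021 = "公司主营业务收入稳步增长,毛利率保持稳定。"
    mdna_2022 = "公司主营业务收入有所下降,但毛利率保持稳定。"
    print(round(cosine_similarity(mdna_2022, mdna_2021), 3))
```

A value near 1 would mean the MD&A wording changed little from the prior year, and a value near 0 would mean it was largely rewritten. A word-level or TF-IDF representation could be swapped in without changing the cosine step.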