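Chinese title: | 基于众源数据的突发事件信息提取与分类研究 |
Name: | |
Confidentiality level: | Public |
Thesis language: | Chinese |
Discipline code: | 0705Z1 |
Discipline: | |
Student type: | Master's |
Degree: | Master of Science |
Degree type: | |
Degree year: | 2022 |
Campus: | |
School: | |
First supervisor name: | |
First supervisor affiliation: | |
Submission date: | 2022-01-07 |
Defense date: | 2021-12-21 |
Foreign-language title: | Research on Emergency Event Information Extraction and Classification Based on Public Source Data |
Chinese keywords: | |
Foreign-language keywords: | Public source; Interest relevance; Text classification; Public safety |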
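Chinese abstract: |
Events that endanger public safety cause heavy losses to the economy, public security, the natural environment, and human life, and they receive close attention from all sectors of society. Obtaining timely, accurate information about the damage and hazards of a public emergency, and determining where it occurred, can greatly improve the targeting and timeliness of the response measures taken by government departments at every level. Traditional means of collecting public safety information are outdated, slow to respond, and labor-intensive, and they cannot meet the needs of emergency response. Crowdsourced data (众源数据) is current, rich in content, and frequently updated. It can help responders quickly identify the type and location of an emergency and track how it develops, which makes it an important supplementary data source for emergency response.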
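Following an "experiment first, then build the software system" approach, this study divides the Shanghai University corpus into five categories under the classification system of the National Overall Emergency Response Plan for Public Emergencies (《国家突发公共事件总体应急预案》): terrorist attacks, food poisoning, fires, traffic accidents, and earthquakes. Term frequency-inverse document frequency (TF-IDF) is used to extract keywords and assign them weights, and the public safety events in each category are converted into word vectors. Four algorithms are then compared for classifying public safety events: logistic regression, cosine similarity, naive Bayes, and support vector machine. Finally, a software system for public safety events is designed and developed. The main conclusions are as follows. First, TF-IDF accurately extracted twenty keywords for public safety events. The five highest-frequency words for terrorist attacks are, in order, 爆炸 (explosion), 恐怖分子 (terrorist), 袭击 (attack), 事件 (incident), and 恐怖袭击 (terrorist attack); for traffic accidents, 事故 (accident), 一辆 (one vehicle), 轿车 (sedan), 司机 (driver), and 相撞 (collide); for food poisoning, 食物中毒 (food poisoning), 学生 (student), 症状 (symptom), 中毒 (poisoning), and 出现 (appear); for earthquakes, 地震 (earthquake), 级地震 (magnitude earthquake), 震中 (epicenter), 震感 (felt shaking), and 公里 (kilometer); for fires, 火灾 (fire), 起火 (catch fire), 大火 (blaze), 火势 (fire intensity), and 扑救 (firefighting). Second, among the four classification methods, if classification accuracy is the priority, the support vector machine is recommended first, followed by cosine similarity and naive Bayes, with logistic regression last; if classification efficiency is the priority, cosine similarity is recommended first, followed by naive Bayes and the support vector machine, with logistic regression last. The method can be chosen according to actual needs when classifying different public safety events. Finally, based on the results of the extraction and classification experiments, an emergency event classification system was designed in a cross-platform language, verifying the effectiveness of machine-learning-based extraction of public emergency events. |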
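The TF-IDF keyword extraction step named in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name `tfidf_keywords`, the toy documents, and the plain `tf * log(N / df)` weighting are all assumptions made for illustration, and the text is assumed to be segmented into words already (Chinese text would need a word segmenter first).

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=5):
    """Return the top_k highest-scoring TF-IDF terms for each tokenized document.

    Hypothetical helper: uses tf = count / doc length and idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        # Sort by score (descending), then alphabetically so ties are stable.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append([term for term, _ in ranked[:top_k]])
    return results


# Toy, pre-tokenized reports invented for this sketch (not the thesis corpus).
docs = [
    ["earthquake", "epicenter", "magnitude", "earthquake", "reported"],
    ["fire", "blaze", "firefighting", "fire", "reported"],
    ["accident", "driver", "collision", "car", "reported"],
]
keywords = tfidf_keywords(docs, top_k=2)
```

A word such as "reported" that occurs in every document gets an IDF of zero and drops out of the keyword lists, which is why TF-IDF tends to surface category-specific terms rather than merely frequent ones.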
Foreign-language abstract: |
Events that endanger public safety often cause great losses to the economy, public security, the natural environment, and people's lives, and they are taken very seriously by all sectors of society. Timely and accurate information about public safety emergencies and their locations can significantly improve the targeting and timeliness of the response measures taken by government departments at all levels. Traditional means of collecting public safety information are outdated, slow to respond, and labor-intensive, so they cannot meet the needs of emergency response. Crowdsourced (public source) data is current, rich in content, and frequently updated. It can help responders quickly identify the types and locations of emergencies and track how they develop, which makes it an important supplementary data source for emergency response.
Following the approach of "experimenting before building a software system" and the classification system of the National General Emergency Response Plan, the Shanghai University corpus was divided into five categories: terrorist attacks, food poisoning, fires, traffic accidents, and earthquakes. Term frequency-inverse document frequency (TF-IDF) was used to extract keywords and assign them weights. The public safety events in each category were then converted into word vectors and classified with four algorithms: logistic regression, cosine similarity, naive Bayes, and support vector machine. The performance of the four algorithms was compared and analyzed. Finally, a software system for public safety events was designed and developed. The main conclusions of this study are as follows. First, TF-IDF accurately extracted twenty keywords associated with public safety events. The top five high-frequency words for terrorist attacks are, in order, explosion, terrorist, attack, incident, and terrorist attack. The top five for traffic accidents are accident, one (vehicle), sedan, driver, and collide. The top five for food poisoning are food poisoning, student, symptom, poisoning, and appear. The top five for earthquakes are earthquake, magnitude earthquake, epicenter, felt shaking, and kilometer. The top five for fires are fire, catch fire, blaze, fire intensity, and firefighting.
Second, among the four classification methods, when classification accuracy is the priority, the support vector machine is recommended first, followed by cosine similarity and naive Bayes, with logistic regression last. When classification efficiency is the priority, cosine similarity is recommended first, followed by naive Bayes and the support vector machine, with logistic regression last. The method can be chosen according to actual needs when classifying different public safety events. Finally, based on the results of the extraction and classification experiments, an emergency event classification system was designed in a cross-platform language, verifying the effectiveness of machine-learning-based extraction of public emergency events. |
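As a rough illustration of the cosine similarity method that the abstract ranks first for efficiency, the sketch below assigns a document to the category whose term profile it most resembles. The abstract does not describe how the thesis applies cosine similarity, so the class-profile design, the function names, the raw term counts used as weights (the thesis uses TF-IDF weights), and the toy training data are all assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # An empty vector is treated as similar to nothing.
    return dot / (norm_a * norm_b)


def build_profiles(labeled_docs):
    """Sum term counts per category to get one profile vector per class."""
    profiles = {}
    for label, tokens in labeled_docs:
        profiles.setdefault(label, Counter()).update(tokens)
    return profiles


def classify(tokens, profiles):
    """Return the category whose profile is most similar to the document."""
    vec = Counter(tokens)
    return max(profiles, key=lambda label: cosine(vec, profiles[label]))


# Toy labeled reports invented for this sketch (not the thesis corpus).
training = [
    ("earthquake", ["earthquake", "epicenter", "magnitude", "kilometer"]),
    ("fire", ["fire", "blaze", "firefighting", "fire"]),
    ("traffic accident", ["accident", "car", "driver", "collision"]),
]
profiles = build_profiles(training)
predicted = classify(["car", "collision", "driver", "injured"], profiles)
```

Because classification here only needs one dot product per category and no iterative training, a method of this shape is cheap to run, which is consistent with the efficiency ranking reported in the abstract.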
Total references: | 91 |
Call number: | 硕0705Z1/22003 |
Open-access date: | 2023-01-07 |