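Chinese title: | 基于话题建模的交互文本语境分析技术研究 |
Name: | |
Discipline code: | 040110 |
Major: | |
Student type: | Doctoral |
Degree: | Doctor of Science |
Degree year: | 2012 |
Campus: | |
School: | |
Research area: | Knowledge science and knowledge engineering |
First supervisor name: | |
First supervisor affiliation: | |
Submission date: | 2012-06-25 |
Defense date: | 2012-05-30 |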
Foreign-language title: | Research on context analysis technology for interactive transcript based on topic modeling |
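Chinese abstract: |
Computer-supported collaborative learning has become an important mode of learning in information technology environments. Almost all existing learning management systems and learning communities provide discussion forums for learners. The interactive text produced when learners communicate online grows in bursts and diverges in content, which makes the text hard for learners and teachers to read and hinders them from keeping track of a discussion's progress in time. Reducing the reading burden helps collaborative learning proceed effectively. This thesis therefore studies how text mining technology can support the reading comprehension of interactive text. Topic modeling, which is mainly used to extract topic information from a text collection, is a current research focus in text mining. However, improving the interpretability of the mined topic information remains a difficulty in topic modeling. From the perspective of pragmatics, the semantic interpretation of a textual symbol depends on a certain context; adding contextual information about a symbol reduces the uncertainty of its interpretation. The main idea of this research is therefore to improve the interpretability of topic modeling results through contextual information. The research covers the following aspects: (1) A conceptual model of text context and its frame representation are proposed. The context frame representation consists of one focus slot and several context slots. The value of the focus slot is a word denoting an entity or an event and represents the central concept of the context. Context slots are of three kinds: related property slots, related event slots, and related entity slots, which represent three kinds of semantic information related to the focus. A document can contain multiple context frames; different context frames are linked through shared slot values, forming a context frame network that represents the topic of the text. The context frame network can be presented in two ways, as a table or as a visualization. To extract context frames from text, the fCE algorithm is designed; it is suited to text collections with a single topic. (2) A new topic model, fLDA (frame-oriented latent Dirichlet allocation), is proposed for mining multiple topics from a text collection, with each topic represented by a context frame network. A mathematical derivation proves that, conditioned on the same topic, fLDA has the same word probability distribution as LDA, which shows that fLDA can replace LDA for topic modeling. Moreover, fLDA adds contextual information about topic words on top of LDA's topic representation and thus better supports the interpretation of topics. Three sets of experiments (an operability evaluation experiment, a semantic annotation performance experiment, and a topic modeling performance experiment) demonstrate the effectiveness of fLDA in supporting topic interpretation. The operability evaluation experiment shows that text summarization assisted by fLDA receives a higher average user rating for operability than fully manual text summarization; the semantic annotation performance experiment shows no significant difference between the annotations produced by the method and manual annotations; a simulation experiment compares the standard Gibbs sampling algorithm with three modified methods and shows that one of them (the M+ method) has better recall and stability. A case comparison of fLDA-based text summaries with fully manual text summaries finds that an important reason for the differences between them is that the fLDA algorithm does not model user experience. (3) To remedy fLDA's shortcoming in modeling user experience, a semi-supervised topic modeling method, sfLDA (semi-supervised frame-oriented latent Dirichlet allocation), is proposed. sfLDA models user experience with a rule-based approach. A case analysis shows that sfLDA can effectively correct the results of fLDA and yield topic modeling results consistent with user experience. (4) To use sfLDA to support the reading comprehension of interactive text, the visualization tool ContextPreviewer is designed and implemented. ContextPreviewer supports the reader's comprehension process through keyword identification, topic identification, context visualization, and retrieval of context-related content. Case studies show the feasibility of ContextPreviewer in assisting text reading. In summary, the thesis makes two innovative contributions: (1) A conceptual model of text context and a frame representation method for it are proposed. The frame representation supplements a text's keyword set with contextual information and links multiple keywords through that information, thereby improving the interpretability of the text topic that the keyword set represents. (2) A semi-supervised topic model, sfLDA, is proposed on the basis of the LDA model. sfLDA automatically extracts topic information represented as context frames from a text collection and allows rules to constrain the topic modeling results so that they match the user's experience of dividing topics.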
|
Foreign-language abstract: |
Computer-supported collaborative learning (CSCL) has become an important way of learning in information technology environments. Almost all current learning management systems and online learning communities include modules that facilitate learners' discussion. However, learners' interactive transcripts tend to grow in bursts and their topics tend to drift over time, which makes it hard for learners and teachers to follow the progress of a discussion. Reducing the burden of reading discussion transcripts will therefore benefit learners' collaborative learning. This thesis focuses on how to use text mining technology to support the comprehension of interactive transcripts. Topic modeling, which aims to extract topical information from text, is a state-of-the-art technology in text mining. However, one of its challenges is how to improve the interpretability of its results. From the perspective of pragmatics, the interpretation of a symbol relies on the context of that symbol: the more contextual information is known, the less uncertain the interpretation. Accordingly, the main idea of this research is to improve the interpretability of topic modeling results by augmenting them with context. The main body of this research covers four aspects: (1) It proposes a conceptual model of text context and its frame representation. A context frame is made up of one "focus slot" and several kinds of "context slots". The value of the focus slot is a word that refers to an entity or an event and is the central concept of the context. There are three kinds of context slots: related property slots, related event slots, and related entity slots. They represent three semantically different kinds of contextual information associated with the focus. One document may include several context frames. Frames are connected through common slot values and form a network of context frames that represents the topical information of a document.
(2) It proposes a new topic model called fLDA (frame-oriented latent Dirichlet allocation), which extracts multiple topics from text, each represented by context frames. It is proved that, given a specific topic, fLDA has the same word distribution as LDA, which means fLDA can replace LDA for topic modeling. Beyond this, fLDA adds context to the words in a topic modeling result and thereby improves the interpretability of topics. Three experiments are carried out to demonstrate the effectiveness of fLDA in supporting topic interpretation. In the first experiment, subjects summarize text; on average, subjects who use fLDA to assist their summarization give higher operability ratings than those who work without any tool. The second experiment shows that fLDA achieves results similar to those of human coders in semantic labeling. In the third experiment, to choose a suitable inference algorithm, the standard Gibbs sampling algorithm is compared with three modified versions (G+, M and M+); the M+ approach achieves the best recall and stability. The thesis also identifies a main reason for the differences between fLDA-based text summarization and fully manual text summarization: fLDA lacks human knowledge or user supervision. (3) To supervise fLDA's topic modeling process, the thesis proposes another topic modeling approach, sfLDA (semi-supervised frame-oriented latent Dirichlet allocation). This approach uses if-then rules to model users' knowledge of a specific topic. A case study shows that sfLDA can effectively revise the results of fLDA to match the user's knowledge. (4) Finally, a tool called ContextPreviewer is developed to apply sfLDA to supporting reading comprehension. Several case studies show the effectiveness of ContextPreviewer in supporting users' reading comprehension.
To sum up, this thesis makes two main contributions: (1) It proposes a conceptual model of text context and its frame representation. (2) It proposes a semi-supervised topic model based on LDA for extracting the context of text.
|
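Total references: | 118 |
Holding location: | Library thesis and dissertation reading area (Main Library, South Wing, 3rd floor, Zone BC) |
Call number: | 博040110/1206 |
Release date: | 2012-06-25 |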
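
The context frame described in the abstract (one focus slot plus related property, related event, and related entity slots, with frames linked through shared slot values) can be sketched as a small data structure. This is only an illustrative Python sketch, not the thesis's fCE or fLDA implementation: the class name, field names, the `link_frames` helper, and the example words are all hypothetical, and treating the focus word as a linkable slot value is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class ContextFrame:
    """One context frame: a focus word plus three kinds of context slots."""
    focus: str
    related_properties: set[str] = field(default_factory=set)
    related_events: set[str] = field(default_factory=set)
    related_entities: set[str] = field(default_factory=set)

    def slot_values(self) -> set[str]:
        # Assumption: the focus word itself counts as a slot value for linking.
        return ({self.focus} | self.related_properties
                | self.related_events | self.related_entities)


def link_frames(frames: list[ContextFrame]) -> dict[tuple[str, str], set[str]]:
    """Connect every pair of frames that share at least one slot value.

    Returns edges keyed by (focus_a, focus_b), each holding the shared values.
    """
    edges: dict[tuple[str, str], set[str]] = {}
    for a, b in combinations(frames, 2):
        shared = a.slot_values() & b.slot_values()
        if shared:
            edges[(a.focus, b.focus)] = shared
    return edges


# Hypothetical example frames; the words are invented for illustration only.
frames = [
    ContextFrame("forum", related_events={"discuss"}, related_entities={"learner"}),
    ContextFrame("learner", related_properties={"online"}, related_events={"discuss"}),
    ContextFrame("topic", related_properties={"interpretable"}),
]
network = link_frames(frames)
```

Here `network` holds a single edge between the "forum" and "learner" frames because they share slot values, while the "topic" frame stays unconnected. A table or graph view of such edges would correspond to the two presentation modes the abstract mentions.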