Journal of Tsinghua University (Science and Technology), 2024, Vol. 64, Issue 5: 770-779    DOI: 10.16511/j.cnki.qhdxxb.2023.26.059
Special Topic: Social Media Processing
Title generation of knowledge points for classroom teaching
XIAO Siyu1, ZHAO Hui2
1. School of Software, Xinjiang University, Urumqi 830017, China;
2. School of Information Science and Engineering, Xinjiang University, Urumqi 830017, China
Abstract: In the Internet era, the volume of information is vast, and concise titles improve reading efficiency. In classroom settings, generating titles for knowledge points helps users organize and memorize course content and improves learning efficiency. This paper applies title generation to classroom teaching and constructs a dataset of classroom knowledge-point texts paired with titles. It proposes an improved TextRank algorithm, text ranking considering keywords and sentence positions (TKSP), which jointly models the influence of keywords, sentence position, and related factors on sentence weight and therefore extracts the key information of a text more accurately. Evaluated with the recall-oriented understudy for gisting evaluation (ROUGE) method, TKSP scores 51.20%, 33.42%, and 50.48% on ROUGE-1, ROUGE-2, and ROUGE-L, respectively. Combining the extractive TKSP algorithm with the unified language model (UniLM) and fusing text topic information yields the unified language model combined with text ranking considering keywords and sentence positions (UniLM-TK). UniLM-TK scores 73.29%, 58.12%, and 72.87% on the same metrics, improvements of 0.74%, 2.26%, and 0.87% over UniLM, demonstrating that the titles generated by UniLM-TK are more accurate and effective.
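The abstract describes TKSP at a high level: TextRank sentence ranking whose scores are then adjusted by sentence-position and keyword factors. Below is a minimal, self-contained sketch of that scoring idea, not the paper's implementation: it substitutes the word-overlap similarity of the original TextRank paper for the Word2Vec similarity TKSP uses, and the mixing coefficients alpha, beta, and gamma are illustrative assumptions.

```python
# Sketch of TKSP-style sentence scoring: plain TextRank adjusted by
# position and keyword-count factors. Similarity measure and coefficients
# are assumptions, not taken from the paper.
import math

def overlap_similarity(s1, s2):
    """Word-overlap similarity from the original TextRank paper."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    shared = len(w1 & w2)
    if shared == 0 or len(w1) < 2 or len(w2) < 2:
        return 0.0
    return shared / (math.log(len(w1)) + math.log(len(w2)))

def textrank(sentences, d=0.85, iters=50):
    """Power-iteration PageRank over the weighted sentence-similarity graph."""
    n = len(sentences)
    sim = [[0.0 if i == j else overlap_similarity(a, b)
            for j, b in enumerate(sentences)] for i, a in enumerate(sentences)]
    out_sums = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - d) / n + d * sum(
                      sim[j][i] / out_sums[j] * scores[j]
                      for j in range(n) if out_sums[j] > 0)
                  for i in range(n)]
    return scores

def tksp_scores(sentences, keywords, alpha=0.5, beta=0.3, gamma=0.2):
    """TextRank scores adjusted by position and keyword-count factors."""
    base = textrank(sentences)
    adjusted = []
    for i, sent in enumerate(sentences):
        position = 1.0 / (i + 1)  # earlier sentences weigh more
        words = sent.lower().replace(".", "").split()
        kw_count = sum(words.count(k) for k in keywords) / max(len(keywords), 1)
        adjusted.append(alpha * base[i] + beta * position + gamma * kw_count)
    return adjusted

if __name__ == "__main__":
    sents = [
        "A stack is a linear structure that follows last in first out order.",
        "Elements are inserted and removed only at the top of the stack.",
        "Push inserts an element and pop removes the most recent element.",
    ]
    for sent, w in zip(sents, tksp_scores(sents, keywords=["stack", "top"])):
        print(f"{w:.3f}  {sent}")
```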
Abstract: [Objective] In the digital age, brief titles are critical for efficient reading. However, headline generation technology is mostly applied to news rather than to other domains. Generating titles for knowledge points in classroom scenarios can enhance comprehension and improve learning efficiency. Traditional extractive algorithms such as Lead-3 and the original TextRank fail to capture the critical information of an article effectively: they merely rank sentences by factors such as position or text similarity and overlook keywords. To address this issue, an improved TextRank algorithm, text ranking considering keywords and sentence positions (TKSP), is proposed herein. Extractive models extract information without expanding on the original text, whereas generative models produce brief and coherent headlines; however, generative models sometimes misunderstand the source text, resulting in inaccurate and repetitive headings. To address this second issue, TKSP is combined with the UniLM generative model (the UniLM-TK model) to incorporate text topic information.
[Methods] Courses were collected from a MOOC platform, and audio was extracted from the teaching videos. Speech-to-text conversion was performed using an audio transcription tool. The classroom teaching text was organized, segmented by knowledge point, and manually titled to produce the dataset. The proposed TKSP algorithm is then used to generate knowledge-point titles automatically. The algorithm first applies the Word2Vec word vector model to TextRank. TKSP considers four factors that influence sentence weight:
(1) Sentence position: the opening sentences serve as a general introduction to the knowledge point and receive higher weight; subsequent sentences receive decreasing weights according to their position.
(2) Keyword count: sentences containing keywords carry valuable information, and their importance increases with the number of keywords present. The TextRank algorithm generates a keyword list from the knowledge-point text, and sentence weights are adjusted so that sentences containing more keywords receive higher weights.
(3) Keyword importance: keyword weights, arranged in descending order, reflect keyword importance. Sentence weights are adjusted accordingly: the sentence containing the first keyword receives the highest weight, while sentences containing the second and third keywords receive lower weights.
(4) Sentence importance: the first sentence containing a given keyword serves as a general introduction and is more relevant to the knowledge point; its weight is highest and decreases with subsequent occurrences of the keyword.
These four factors are integrated to establish the sentence-weight formula, and the top-ranked sentences are chosen to generate the text title. On this basis, the combination of the TKSP algorithm and the UniLM model, called the UniLM-TK model, is proposed. The TKSP algorithm extracts the key sentences, and the TextRank algorithm extracts a topic word from the knowledge-point text. These are separately embedded into the model input sequence, which undergoes Transformer block processing: the key sentences capture text context through self-attention, while the topic word incorporates topic information through cross-attention. The final attention output is established by weighting and summing these representations.
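The two formulas mentioned here, the integrated sentence weight and the combined attention, are described but not stated in the abstract. One plausible reconstruction consistent with that description (assumed notation, not the paper's) is:

```latex
% Sentence weight: assumed linear integration of the four factors, with
% P(s_i) the position factor, N_k(s_i) the keyword-count factor, I_k(s_i)
% the keyword-importance factor, F_k(s_i) the first-occurrence factor, and
% \alpha, \beta, \gamma, \delta mixing coefficients (all names assumed).
W(s_i) = \alpha\, P(s_i) + \beta\, N_k(s_i) + \gamma\, I_k(s_i) + \delta\, F_k(s_i)

% Combined attention in UniLM-TK: H denotes the hidden states of the key
% sentences, T the topic-word representation, and \lambda an assumed weight.
\mathrm{Attn}(H, T) = \lambda\, \mathrm{SelfAttn}(H) + (1 - \lambda)\, \mathrm{CrossAttn}(H, T)
```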
The attention output is further processed by a feedforward network to extract higher-level features. The key sentences extracted by TKSP effectively reduce the amount of model computation and the difficulty of data processing, allowing the model to focus on extracting and generating the essential information.
[Results] The TKSP algorithm outperformed classical extractive algorithms (maximal marginal relevance, latent Dirichlet allocation, Lead-3, and TextRank) on the ROUGE-1, ROUGE-2, and ROUGE-L metrics, achieving optimal scores of 51.20%, 33.42%, and 50.48%, respectively. In the ablation experiments on the UniLM-TK model, the best performance was achieved by extracting seven key sentences, with scores of 73.29%, 58.12%, and 72.87% on the respective metrics. Compared with headings generated through the GPT-3.5 API, the headings generated by UniLM-TK were brief, clear, accurate, and more readable summaries of the text topic. To test generalization on real titles, the UniLM-TK and ALBERT models were compared on a large-scale Chinese scientific literature dataset; UniLM-TK improved the ROUGE-1, ROUGE-2, and ROUGE-L scores by 6.45%, 3.96%, and 9.34%, respectively.
[Conclusions] Comparisons with other extractive methods demonstrate the effectiveness of the TKSP algorithm, and the titles generated by UniLM-TK show better accuracy and readability.
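For concreteness, the following is a minimal sketch of how ROUGE-1/2/L scores like those reported above can be computed with Google's rouge-score package. The two titles are invented placeholders; evaluating Chinese text additionally requires a Chinese-aware tokenization scheme in place of the package's default English word tokenizer.

```python
# Sketch of ROUGE-1/2/L evaluation with the rouge-score package
# (pip install rouge-score). Titles below are placeholders.
from rouge_score import rouge_scorer

reference = "definition and properties of a stack"  # human-written title (placeholder)
prediction = "basic concept of the stack"           # model-generated title (placeholder)

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
for name, score in scorer.score(reference, prediction).items():
    print(f"{name}: recall={score.recall:.4f}  f1={score.fmeasure:.4f}")
```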
Key words: classroom teaching; title generation; topic information; TextRank; UniLM
Received: 2023-08-22      Published: 2024-04-22
Funding: National Natural Science Foundation of China (No. 62166041)
Corresponding author: ZHAO Hui, Professor, E-mail: 277875592@qq.com
Cite this article:
XIAO Siyu, ZHAO Hui. Title generation of knowledge points for classroom teaching. Journal of Tsinghua University (Science and Technology), 2024, 64(5): 770-779.
Link to this article:
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2023.26.059 or http://jst.tsinghuajournals.com/CN/Y2024/V64/I5/770