Title generation of knowledge points for classroom teaching
XIAO Siyu1, ZHAO Hui2
1. School of Software, Xinjiang University, Urumqi 830017, China; 2. School of Information Science and Engineering, Xinjiang University, Urumqi 830017, China
Abstract: [Objective] In the digital age, brief titles are critical for efficient reading; however, headline generation technology has mostly been applied to news rather than to other domains. Generating titles for the knowledge points in classroom scenarios can enhance comprehension and improve learning efficiency. Traditional extractive algorithms such as Lead-3 and the original TextRank algorithm fail to capture the critical information of a text effectively: they merely rank sentences by factors such as position or textual similarity and overlook keywords. To address this issue, an improved TextRank algorithm, text ranking combining keywords and sentence positions (TKSP), is proposed. Moreover, extractive models select sentences without expanding on the original text, whereas generative models produce brief and coherent headlines but sometimes misunderstand the source text, resulting in inaccurate and repetitive headings. To address this second issue, TKSP is combined with the UniLM generative model (the UniLM-TK model) to incorporate text topic information.
[Methods] Courses are collected from a MOOC platform, and the audio is extracted from the teaching videos. Speech-to-text conversion is performed using an audio transcription tool, and the resulting classroom teaching text is organized, segmented by knowledge point, and manually titled to produce the dataset. The proposed TKSP algorithm is then used to automatically generate knowledge-point titles. The algorithm first incorporates the Word2Vec word-vector model into TextRank. TKSP considers four factors that influence sentence importance: (1) sentence position: the first paragraph serves as a general introduction to the knowledge point and therefore receives a higher weight, and succeeding sentences receive decreasing weights according to their position; (2) keyword count: sentences containing keywords carry valuable information, so the TextRank algorithm generates a keyword list from the knowledge-point content and sentences with more keywords are assigned higher weights; (3) keyword importance: keywords are ranked by weight in descending order, and a sentence containing the top-ranked keyword receives the highest weight, while sentences containing only lower-ranked keywords receive correspondingly lower weights; (4) sentence importance: the first sentence in which a keyword appears serves as a general introduction and is most relevant to the knowledge point, so it receives the highest weight, which decreases for subsequent occurrences of the keyword. These four factors are integrated into a sentence-weight formula (a plausible form is sketched below), and the top-ranked sentences are selected to form the text title. On this basis, the combination of the TKSP algorithm and the UniLM model, called the UniLM-TK model, is proposed. TKSP is employed to extract the critical sentences, and TextRank is employed to extract a topic word from the knowledge-point text. These are separately embedded into the model input sequence, which is processed by Transformer blocks: the critical sentences capture the textual context through self-attention, while the topic word injects topic information through cross-attention, and the final attention output is obtained by weighting and summing the two representations (see the attention sketch below).
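The abstract does not reproduce the exact weighting formula, so the following is only a minimal sketch of one plausible additive form; the coefficients $\lambda_1,\dots,\lambda_4$ and the factor functions are hypothetical placeholders for the paper's definitions:

$$
W(s_i) = \mathrm{TR}(s_i)\left[\lambda_1\,\mathrm{Pos}(s_i) + \lambda_2\,\mathrm{Num}(s_i) + \lambda_3\,\mathrm{Imp}(s_i) + \lambda_4\,\mathrm{First}(s_i)\right],\qquad \sum_{j=1}^{4}\lambda_j = 1,
$$

where $\mathrm{TR}(s_i)$ is the Word2Vec-based TextRank score of sentence $s_i$, $\mathrm{Pos}$ decays with sentence position, $\mathrm{Num}$ grows with the number of keywords in the sentence, $\mathrm{Imp}$ reflects the rank of the keywords the sentence contains, and $\mathrm{First}$ is largest for the first sentence in which a keyword appears.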
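As a concrete illustration, the Python sketch below shows how such a ranking could be assembled from off-the-shelf components (gensim Word2Vec for sentence similarity, networkx PageRank for the TextRank step); all numeric coefficients and the keyword list are invented for illustration, not the paper's values:

```python
# Hypothetical sketch of TKSP-style sentence ranking (not the authors' code):
# Word2Vec sentence similarity feeds a TextRank (PageRank) graph, and the
# base scores are rescaled by the four position/keyword factors.
import numpy as np
import networkx as nx
from gensim.models import Word2Vec

sentences = [
    "textrank builds a graph of sentences and ranks them by importance".split(),
    "keywords indicate which sentences carry the key information".split(),
    "the first sentence usually introduces the knowledge point".split(),
]
w2v = Word2Vec(sentences, vector_size=50, min_count=1, seed=1)

def sent_vec(tokens):
    return np.mean([w2v.wv[t] for t in tokens], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Build the sentence-similarity graph; shift cosine into [0, 1] so that
# PageRank only sees nonnegative edge weights.
vecs = [sent_vec(s) for s in sentences]
g = nx.Graph()
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        g.add_edge(i, j, weight=(1 + cosine(vecs[i], vecs[j])) / 2)
tr = nx.pagerank(g, weight="weight")  # base TextRank scores

keywords = {"keywords": 1.0, "knowledge": 0.8}  # rank-weighted keyword list
first_seen = {}  # keyword -> index of the first sentence containing it
for i, s in enumerate(sentences):
    for t in s:
        if t in keywords and t not in first_seen:
            first_seen[t] = i

def factor(i, tokens):
    pos = 1.0 / (1 + i)                                  # (1) position decay
    num = 1 + 0.1 * sum(t in keywords for t in tokens)   # (2) keyword count
    imp = 1 + max((keywords.get(t, 0.0) for t in tokens), default=0.0)  # (3)
    first = 1 + 0.2 * sum(first_seen.get(t) == i for t in tokens)       # (4)
    return pos * num * imp * first

scores = {i: tr[i] * factor(i, s) for i, s in enumerate(sentences)}
best = max(scores, key=scores.get)
print("top-ranked sentence:", " ".join(sentences[best]))
```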
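The weighted fusion of self-attention over the key sentences and cross-attention to the topic word can be pictured with the following minimal numpy sketch; the dimensions and the mixing weight `lam` are hypothetical stand-ins for the learned Transformer-block parameters in UniLM-TK:

```python
# Minimal sketch of the attention fusion described above (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 8                  # toy sizes
H = rng.normal(size=(n_tokens, d_model))  # hidden states of the key sentences
T = rng.normal(size=(1, d_model))         # embedding of the topic word

self_attn = attention(H, H, H)     # context within the key sentences
cross_attn = attention(H, T, T)    # inject topic information
lam = 0.7                          # hypothetical mixing weight
fused = lam * self_attn + (1 - lam) * cross_attn  # weighted sum of the two
print(fused.shape)                 # (5, 8): one fused vector per token
```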
The attention output is further processed by a feedforward network to extract high-level features. The key sentences extracted by TKSP effectively reduce the amount of model computation and the difficulty of data processing, allowing the model to concentrate on extracting and generating the key information.
[Results] The TKSP algorithm outperformed classical extractive algorithms (namely maximal marginal relevance, latent Dirichlet allocation, Lead-3, and TextRank) on the ROUGE-1, ROUGE-2, and ROUGE-L metrics, achieving optimal scores of 51.20%, 33.42%, and 50.48%, respectively. In the ablation experiments on the UniLM-TK model, the best performance was obtained when seven key sentences were extracted, with ROUGE-1, ROUGE-2, and ROUGE-L scores of 73.29%, 58.12%, and 72.87%, respectively. Compared with the headings generated through the GPT-3.5 API, the headings generated by UniLM-TK were brief, clear, and accurate, and summarized the text topic more readably. To compare the UniLM-TK and ALBERT models on real headings, experiments were performed on a large-scale Chinese scientific literature dataset (CSL); UniLM-TK improved the ROUGE-1, ROUGE-2, and ROUGE-L scores by 6.45%, 3.96%, and 9.34%, respectively.
[Conclusions] Comparisons with other extractive methods demonstrate the effectiveness of the TKSP algorithm, and the headings generated by UniLM-TK show better accuracy and readability.
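For reference, ROUGE-1/2/L scores of the kind reported above can be computed with the open-source `rouge` package; the sketch below uses invented example strings, and Chinese text must be whitespace-tokenized (e.g., with jieba) before scoring:

```python
# pip install rouge jieba
import jieba
from rouge import Rouge

generated = "课堂知识点标题生成"          # hypothetical model output
reference = "面向课堂教学的知识点标题生成"  # hypothetical human title

hyp = " ".join(jieba.lcut(generated))  # rouge expects space-separated tokens
ref = " ".join(jieba.lcut(reference))

scores = Rouge().get_scores(hyp, ref)[0]
print(scores["rouge-1"]["f"], scores["rouge-2"]["f"], scores["rouge-l"]["f"])
```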
References
[1] JIAO L Y, GUO Y, LIU Y, et al. A sequence model for single document headline generation[J]. Journal of Chinese Information Processing, 2021, 35(1): 64-71. (in Chinese)
[2] ZHANG X, MAO X J, ZHAO R M, et al. Study on extractive summarization with global information[J]. Computer Science, 2023, 50(4): 188-195. (in Chinese)
[3] CHENG K, LI C Y, JIA X X, et al. News summarization extracting method based on improved MMR algorithm[J]. Journal of Applied Sciences, 2021, 39(3): 443-455. (in Chinese)
[4] VO T. An approach of syntactical text graph representation learning for extractive summarization[J]. International Journal of Intelligent Robotics and Applications, 2023, 7(1): 190-204.
[5] RAKROUKI M A, ALHARBE N, KHAYYAT M, et al. TG-SMR: A text summarization algorithm based on topic and graph models[J]. Computer Systems Science and Engineering, 2023, 45(1): 395-408.
[6] MALARSELVI G, PANDIAN A. Multi-layered network model for text summarization using feature representation[J]. Soft Computing, 2023, 27(1): 311-322.
[7] BELWAL R C, RAI S W, GUPTA A. Extractive text summarization using clustering-based topic modeling[J]. Soft Computing, 2023, 27(7): 3965-3982.
[8] FENG H. Research and application of bidirectional LSTM based on attention in text title generation[D]. Tangshan: North China University of Science and Technology, 2020. (in Chinese)
[9] GAN C M, TANG H, YANG H L, et al. Abstractive text summarization method incorporating convolutional shrinkage gating[J/OL]. Computer Engineering. [2023-09-05]. https://doi.org/10.19678/j.issn.1000-3428.0066847. (in Chinese)
[10] LA QUATRA M, CAGLIERO L. BART-IT: An efficient sequence-to-sequence model for Italian text summarization[J]. Future Internet, 2023, 15(1): 15.
[11] FEIJO D D, MOREIRA V P. Improving abstractive summarization of legal rulings through textual entailment[J]. Artificial Intelligence and Law, 2023, 31(1): 91-113.
[12] ZHAO G B, ZHANG Y B, MAO C L, et al. A generative summary method of cross-border ethnic culture incorporating domain knowledge[J]. Journal of Nanjing University (Natural Sciences), 2023, 59(4): 620-628. (in Chinese)
[13] BABU G L A, BADUGU S. Deep learning based sequence to sequence model for abstractive Telugu text summarization[J]. Multimedia Tools and Applications, 2023, 82(11): 17075-17096.
[14] VO T. A novel semantic-enhanced generative adversarial network for abstractive text summarization[J]. Soft Computing, 2023, 27(10): 6267-6280.
[15] LIU J. A GPT-2 based method for summarising judicial judgment documents[D]. Guilin: Guangxi Normal University, 2022. (in Chinese)
[16] RAFFEL C, SHAZEER N, ROBERTS A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer[J]. Journal of Machine Learning Research, 2020, 21(140): 1-67.
[17] MIHALCEA R, TARAU P. TextRank: Bringing order into text[C]//Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. Barcelona, Spain: Association for Computational Linguistics, 2004: 404-411.
[18] DONG L, YANG N, WANG W H, et al. Unified language model pre-training for natural language understanding and generation[C]//Proceedings of the 33rd International Conference on Neural Information Processing Systems. Vancouver, Canada: ACM, 2019: 13042-13054.
[19] LI Y D, ZHANG Y Q, ZHAO Z, et al. CSL: A large-scale Chinese scientific literature dataset[C]//Proceedings of the 29th International Conference on Computational Linguistics. Gyeongju, Republic of Korea: International Committee on Computational Linguistics, 2022: 3917-3923.