Abstract: Annotation quality and annotation cost are the two key considerations in data annotation. In natural language processing, annotation typically starts with automatic model annotation, which humans then correct to reduce cost. Few studies have compared how different annotation approaches affect quality and cost. Using a mature annotation team working on dependency tree annotation as a case study, this work compares three approaches: model annotation followed by human correction, double-blind annotation, and human-model double-blind annotation, which fuses the first two. The human-model double-blind approach effectively combines the advantages of the other two: it reduces annotation cost and improves annotation quality by eliminating the identification tendency problem.
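The comparison step shared by the two double-blind approaches can be sketched as follows. This is a minimal illustration, not the authors' procedure: the abstract does not specify how the fusion works, so the sketch assumes that in human-model double-blind annotation the model output takes the place of one of the two independent human annotators, and that disagreeing tokens are passed on for adjudication. The function name `disagreements`, the example sentence, and the dependency labels are all hypothetical.

```python
from typing import List, Tuple

# One token's annotation: (index of its head token, dependency label).
# Head index 0 stands for the artificial root; tokens are numbered from 1.
Arc = Tuple[int, str]


def disagreements(first: List[Arc], second: List[Arc]) -> List[int]:
    """Return 1-based token positions where two annotations of a sentence differ."""
    if len(first) != len(second):
        raise ValueError("both annotations must cover the same tokens")
    return [i for i, (a, b) in enumerate(zip(first, second), start=1) if a != b]


# Hypothetical annotations of the sentence "She reads good books".
model = [(2, "nsubj"), (0, "root"), (2, "advmod"), (2, "obj")]
human_a = [(2, "nsubj"), (0, "root"), (4, "amod"), (2, "obj")]
human_b = [(2, "nsubj"), (0, "root"), (4, "amod"), (2, "dobj")]

# Double-blind: two humans annotate independently; conflicts go to adjudication.
double_blind_conflicts = disagreements(human_a, human_b)

# Human-model double-blind (assumed): the model replaces one of the two humans,
# and the remaining human annotates without seeing the model output.
human_model_conflicts = disagreements(model, human_a)

print(double_blind_conflicts)
print(human_model_conflicts)
```

Under this assumption, only the flagged tokens need a second look, which is where a cost saving over full double-blind annotation would come from, while the human annotator is not influenced by the model output.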
周明月, 龚晨, 李正华, 张民. 数据标注方法比较研究:以依存句法树标注为例[J]. 清华大学学报(自然科学版), 2022, 62(5): 908-916.
ZHOU Mingyue, GONG Chen, LI Zhenghua, ZHANG Min. Comparison of data annotation approaches using dependency tree annotation as a case study[J]. Journal of Tsinghua University (Science and Technology), 2022, 62(5): 908-916.