Comparison of data annotation approaches using dependency tree annotation as a case study

doi:10.16511/j.cnki.qhdxxb.2022.22.010


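Abstract: The most important considerations in data annotation are the quality of the annotated data and the annotation cost. A survey conducted for this paper finds that annotation work in natural language processing usually adopts model annotation followed by human correction to reduce cost, but few studies rigorously compare annotation approaches to examine how the approach affects annotation quality and cost. Relying on an experienced annotation team and using dependency tree annotation as a case study, this paper experimentally compares three approaches: model annotation followed by human correction, double-blind annotation by two independent annotators, and human-model double-blind annotation, which this paper proposes by fusing the first two. The results show that human-model double-blind annotation effectively combines the advantages of the other two approaches: it uses the model to lower the annotation cost while resolving the tendency of proofreaders to agree with the annotation they are checking, thereby improving annotation quality.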
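The abstract does not say how the independently produced human and model annotations are compared, so the following is only a minimal sketch of one plausible step: locating the tokens on which two annotations of the same sentence disagree. The names `Arc`, `find_disagreements`, and `agreement_rate`, the example sentence, and its labels are all hypothetical and are not taken from the paper.

```python
from typing import List, NamedTuple


class Arc(NamedTuple):
    """One token's dependency annotation (hypothetical representation)."""
    head: int   # 1-based index of the head token; 0 stands for the root
    label: str  # dependency relation label


def find_disagreements(first: List[Arc], second: List[Arc]) -> List[int]:
    """Return the 1-based token positions where two annotations differ."""
    if len(first) != len(second):
        raise ValueError("both annotations must cover the same tokens")
    return [i for i, (a, b) in enumerate(zip(first, second), start=1) if a != b]


def agreement_rate(first: List[Arc], second: List[Arc]) -> float:
    """Fraction of tokens whose head and label both match."""
    if not first:
        return 1.0
    return 1 - len(find_disagreements(first, second)) / len(first)


# Assumed example: a four-token sentence annotated once by a human
# and once by a model, each without seeing the other's output.
human = [Arc(2, "nsubj"), Arc(0, "root"), Arc(4, "amod"), Arc(2, "obj")]
model = [Arc(2, "nsubj"), Arc(0, "root"), Arc(2, "amod"), Arc(2, "obj")]

print(find_disagreements(human, model))  # [3]
print(agreement_rate(human, model))      # 0.75
```

In a workflow of this kind, only the flagged positions would need further review. How the paper actually resolves such disagreements is not stated in the abstract.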


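Keywords: data annotation, human-model double-blind annotation, model annotation followed by human corrections, double-blind annotation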

Abstract: The key considerations in data annotation are the quality of the annotated data and the annotation cost. Data annotation in natural language processing usually applies automated model annotation followed by human corrections to reduce the cost, yet few studies have compared how different annotation approaches affect annotation quality and cost. Working with an experienced annotation team and using dependency tree annotation as a case study, this study compares three data annotation approaches: model annotation followed by human corrections, double-blind annotation, and human-model double-blind annotation, which fuses the first two. Human-model double-blind annotation effectively combines the advantages of the other two approaches: it reduces the annotation cost and improves the annotation quality by removing the tendency of proofreaders to agree with the annotation they are checking.

Key words: data annotation, human-model double-blind annotation, model annotation followed by human corrections, double-blind annotation

Received: 2021-10-27  Published: 2022-04-26

Funding: General Program of the National Natural Science Foundation of China (61876116, 62176173)

Corresponding author: LI Zhenghua, Professor, E-mail: zhli13@suda.edu.cn

About the author: ZHOU Mingyue (born 1996), female, master's student.

Cite this article:

ZHOU Mingyue, GONG Chen, LI Zhenghua, ZHANG Min. Comparison of data annotation approaches using dependency tree annotation as a case study[J]. Journal of Tsinghua University (Science and Technology), 2022, 62(5): 908-916.

Link to this article:

http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2022.22.010 or http://jst.tsinghuajournals.com/CN/Y2022/V62/I5/908

