Journal of Xiamen University (Natural Science)

Most current research on Chinese natural language processing treats the word or the character as the basic unit and ignores the internal hierarchical structure of Chinese words. To address this problem, we propose a new standard for defining the internal hierarchical structure of Chinese words. The standard defines the node types of the internal structure and the internal relations between nodes. On this basis, we further propose annotation guidelines for the internal hierarchical structure of Chinese words, and we manually annotate a corpus of 53,918 Chinese words with their internal hierarchical structure. This work is expected to provide new ideas for subsequent fine-grained Chinese natural language processing.

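Introduction
1 Definition standard for the internal hierarchical structure of Chinese words
2 Annotation guidelines
3 Corpus annotation
4 Experiments
5 Conclusion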

Fig.1 Sample diagram of the internal hierarchical structure of words

Fig.2 Sample annotation of a simple word

Fig.3 Sample annotation of a compound word

Fig.4 Sample annotation of a derived word

Fig.5 Sample annotation of an abbreviation

Fig.6 Annotation tool interface

Fig.7 Distribution of corpus words by word length and word frequency

Tab.1 Experimental results

