Establishment of a corpus of internal hierarchical structures for Chinese words

LIN Qian, WEN Huating, YANG Jing, LIU Xin, LIN Huan, WANG Hongji*, SU Jinsong

(School of Informatics, Xiamen University, Xiamen 361005, Fujian, China)

Keywords: Chinese natural language processing; annotation guideline; corpus

DOI: 10.6043/j.issn.0438-0479.201904019

Abstract

Current research on Chinese natural language processing mostly treats the word or the character as the basic unit, ignoring the internal hierarchical structure of Chinese words. To address this problem, we propose a new standard for representing the internal hierarchical structure of Chinese words. The standard defines both the node types and the relations between nodes within a word's internal structure. On this basis, we further propose an annotation guideline for the internal hierarchical structure of Chinese words and manually annotate a corpus of 53,918 Chinese words with their internal hierarchical structures. This work is expected to provide new ideas for subsequent fine-grained Chinese natural language processing.
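As an illustration only, a word-internal structure with typed nodes and labelled relations can be pictured as a small tree over the characters of a word. The sketch below is not the representation or tag set defined in the paper: the class `WordNode`, the node types `WORD`, `MORPH` and `CHAR`, the relation names `modifier-head` and `coordination`, and the analysis of the example word are all hypothetical placeholders chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WordNode:
    """One node of a word-internal tree (hypothetical representation)."""
    node_type: str                    # placeholder label, not the paper's tag set
    relation: Optional[str] = None    # placeholder relation among the children
    text: Optional[str] = None        # set only on leaves (single characters)
    children: List["WordNode"] = field(default_factory=list)

    def surface(self) -> str:
        """Concatenate the leaf characters to recover the word."""
        if not self.children:
            return self.text or ""
        return "".join(child.surface() for child in self.children)

    def depth(self) -> int:
        """Number of levels in the tree, counting the leaf level as 1."""
        if not self.children:
            return 1
        return 1 + max(child.depth() for child in self.children)

    def to_brackets(self) -> str:
        """Render the tree in a bracketed, treebank-like notation."""
        if not self.children:
            return f"({self.node_type} {self.text})"
        inner = " ".join(child.to_brackets() for child in self.children)
        rel = f":{self.relation}" if self.relation else ""
        return f"({self.node_type}{rel} {inner})"


def leaf(ch: str) -> WordNode:
    """Build a character-level leaf node."""
    return WordNode("CHAR", text=ch)


# Illustrative (not authoritative) analysis of the word for "corpus".
corpus_word = WordNode("WORD", "modifier-head", children=[
    WordNode("MORPH", "coordination", children=[leaf("语"), leaf("料")]),
    leaf("库"),
])

print(corpus_word.surface())
print(corpus_word.depth())
print(corpus_word.to_brackets())
```

Keeping the characters at the leaves means the original word can always be recovered from the tree, so an annotation of this shape could sit alongside word-level or character-level processing without replacing either.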