Normalization of Uyghur terms based on subword information

ZHANG Xinlu1,2, WANG Lei1, YANG Yating1*, MI Chenggang1

(1. Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; 2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China)

Keywords: Uyghur; natural language processing; text normalization; word embedding

DOI: 10.6043/j.issn.0438-0479.201811022


Latinized Uyghur text is often written in nonstandard forms. This irregularity is the main cause of ambiguity and related problems, and it seriously constrains natural language processing applications for Uyghur. This paper therefore proposes an unsupervised text normalization method based on subword information, which takes the internal structure of words into account when constructing word vectors. The method can produce vector representations for rare words and can incorporate word-internal morphological information into those representations, enriching the word vectors. The enriched vectors are then used to address the low quality of the normalization candidate sets generated under unsupervised learning. Experiments show that, compared with traditional word vector construction methods, the proposed method improves the recall of normalized words in the text normalization task.
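The idea of building word vectors from subword units and using them to generate normalization candidates can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a fastText-style composition in which a word vector is the sum of its character n-gram vectors, and it replaces learned n-gram embeddings with fixed random placeholder vectors. The names `char_ngrams`, `ngram_vector`, `word_vector`, and `candidates`, the n-gram range 3 to 5, and the sample Latin-script spellings are all hypothetical choices made for this example.

```python
import math
import random
from functools import lru_cache

# Assumption: a large dimension keeps unrelated random placeholder vectors
# close to orthogonal; a trained model would typically use far fewer dimensions.
DIM = 1024


def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of the word wrapped in boundary markers < and >."""
    token = f"<{word}>"
    grams = [token[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(token) - n + 1)]
    return grams + [token]  # also keep the whole token as one unit


@lru_cache(maxsize=None)
def ngram_vector(ngram):
    """Placeholder for a learned n-gram embedding: one fixed random vector per n-gram."""
    rng = random.Random(ngram)  # string seed, so the vector is the same on every run
    return tuple(rng.gauss(0.0, 1.0) for _ in range(DIM))


def word_vector(word):
    """Sum of n-gram vectors, so rare or unseen spellings still get a representation."""
    vec = [0.0] * DIM
    for gram in char_ngrams(word):
        for i, x in enumerate(ngram_vector(gram)):
            vec[i] += x
    return vec


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def candidates(noisy_word, vocabulary, k=3):
    """Rank in-vocabulary words by cosine similarity to the noisy word."""
    query = word_vector(noisy_word)
    ranked = sorted(vocabulary,
                    key=lambda w: cosine(query, word_vector(w)),
                    reverse=True)
    return ranked[:k]


# Illustrative standard spellings and one nonstandard variant (not data from the paper).
vocabulary = ["kitab", "mektep", "rehmet", "salam", "yaxshi"]
print(candidates("yahshi", vocabulary, k=2))
```

Because the nonstandard spelling shares several character n-grams with its standard form, the two vectors overlap even though the nonstandard spelling never appears in the vocabulary. A purely word-level embedding would have no vector for it at all, which is the gap that subword information is meant to close when generating the candidate set.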