Journal of Xiamen University (Natural Science)

Normalization of Uyghur terms based on subword information

ZHANG Xinlu^1,2, WANG Lei^1, YANG Yating^1*, MI Chenggang^1

(1. Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; 2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China)

DOI: 10.6043/j.issn.0438-0479.201811022


Latinized Uyghur text is frequently written in nonstandard forms. This irregularity is the main source of ambiguity and related problems, and it seriously constrains natural language processing applications for Uyghur. This paper proposes an unsupervised text normalization method based on subword information, which takes the internal structure of words into account when constructing word vectors. The method can represent rare words as vectors and can incorporate word-internal morphological information into the representation, enriching the word vectors. These vectors are then used to address the poor quality of the normalization candidate sets generated under unsupervised learning. Experiments show that, compared with traditional word vector construction methods, the proposed method improves the recall of normalized words in the text normalization task.
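The idea of building a word representation from subword units and using it to rank normalization candidates can be illustrated with a minimal sketch. The abstract does not state which subword units or which embedding model the paper uses, so everything here is an assumption: the sketch uses character n-grams with boundary markers (as in fastText-style models), replaces learned embeddings with raw n-gram counts, and the function names `char_ngrams`, `subword_vector`, `cosine`, and `candidates` are hypothetical. The example strings are toy English words, not Uyghur data.

```python
import math
from collections import Counter


def char_ngrams(word, n_min=3, n_max=5):
    """Return the character n-grams of `word`, with < and > marking word boundaries."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


def subword_vector(word):
    """Sparse n-gram count vector (assumption: a stand-in for a learned subword embedding)."""
    return Counter(char_ngrams(word))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[g] * v[g] for g in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def candidates(noisy_word, vocabulary, k=3):
    """Rank vocabulary words by subword similarity to a nonstandard spelling."""
    query = subword_vector(noisy_word)
    ranked = sorted(
        vocabulary,
        key=lambda w: cosine(query, subword_vector(w)),
        reverse=True,
    )
    return ranked[:k]


# Toy usage: a misspelled word never seen before still gets a vector,
# because it shares n-grams with words in the vocabulary.
vocab = ["normalization", "tokenization", "vector"]
print(candidates("normalizaton", vocab, k=2))
```

Because the vector is assembled from n-grams rather than looked up by whole-word identity, an unseen or rare spelling still receives a usable representation, which is the property the abstract relies on for candidate generation.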

Introduction
1 Related work
2 Representation method based on subword information
3 Experiments and analysis
4 Conclusion
