
ZHANG Xinlu, WANG Lei, YANG Yating*, et al. Normalization of Uyghur terms based on subword information[J]. Journal of Xiamen University (Natural Science), 2019, 58(02): 217-224. [doi:10.6043/j.issn.0438-0479.201811022]

Normalization of Uyghur terms based on subword information (基于子词信息的维吾尔语词项规范化)

Journal of Xiamen University (Natural Science) [ISSN: 0438-0479 / CN: 35-1070/N]

Volume:
58
Issue:
No. 02, 2019
Pages:
217-224
Column:
Minority language processing
Publication date:
2019-03-27

Article information

Title:
Normalization of Uyghur terms based on subword information
Article ID:
0438-0479(2019)02-0217-08
Author(s):
ZHANG Xinlu (1,2), WANG Lei (1), YANG Yating (1,*), MI Chenggang (1)
1. Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China; 2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
Keywords:
Uyghur; natural language processing; text normalization; word embedding
CLC number:
TP 391
DOI:
10.6043/j.issn.0438-0479.201811022
Document code:
A
Abstract:
Latinized Uyghur text is frequently written in nonstandard forms. This irregularity is the main source of ambiguity and seriously restricts natural language processing applications for Uyghur. This paper proposes an unsupervised text normalization method based on subword information, which takes the internal structure of words into account when constructing word vectors. The method can produce vector representations for rare words and incorporates word-internal morphological information into those representations, enriching the word vectors. The enriched vectors are then used to address the poor quality of the normalization candidate sets generated under unsupervised learning. Experiments show that, compared with traditional word vector construction methods, the proposed method improves the recall of normalized words in the text normalization task.
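The idea of building word vectors from subword units can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a fastText-style decomposition into character n-grams with boundary markers `<` and `>`, and it replaces trained n-gram embeddings with deterministic pseudo-random vectors, so similarities only reflect how many n-grams two words share. The function names, the dimension, the n-gram range, and the example strings are all hypothetical.

```python
import hashlib
import math
import random


def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def ngram_vector(gram, dim=32):
    """Stand-in for a trained n-gram embedding (assumption: random, not learned)."""
    seed = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16)
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def word_vector(word, dim=32):
    """Compose a word vector as the sum of its n-gram vectors.

    Because the vector is built from n-grams, a rare or unseen spelling
    still receives a representation.
    """
    vec = [0.0] * dim
    for gram in char_ngrams(word):
        for k, x in enumerate(ngram_vector(gram, dim)):
            vec[k] += x
    return vec


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(noisy, vocabulary, top_k=3):
    """Rank vocabulary words by similarity to a noisy token (candidate set)."""
    target = word_vector(noisy)
    scored = [(cosine(target, word_vector(w)), w) for w in vocabulary]
    return [w for _, w in sorted(scored, reverse=True)[:top_k]]


if __name__ == "__main__":
    # Illustrative strings only; not taken from the paper's data.
    print(char_ngrams("kitab", 3, 3))
    print(rank_candidates("kitap", ["kitab", "mektep", "alma"], top_k=2))
```

In a real system the n-gram vectors would be learned from a corpus, and candidate generation would typically query nearest neighbours in the trained embedding space rather than scan a short list.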


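Memo

Received: 2018-11-12; accepted: 2018-12-05
Funding: National Natural Science Foundation of China (U1703133); Major Science and Technology Project of the Xinjiang Autonomous Region (2016A03007-3); "West Light" Talent Training and Recruitment Program of the Chinese Academy of Sciences (2017-XBQNXZ-A-005)
*Corresponding author: yangyt@ms.xjb.ac.cn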