Chinese-Vietnamese low-resource cross-language summarization incorporating keyword probability mapping

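(Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Key Laboratory of Artificial Intelligence of Yunnan Province, Kunming 650500, Yunnan, China)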

low-resource cross-language summarization; cross-language semantic alignment; keywords; probability mapping

Low resource cross-language summarization of Chinese-Vietnamese combined with keyword probability mapping
LI Xiaomeng, ZHANG Yafei*, GUO Junjun, GAO Shengxiang, YU Zhengtao

(Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Key Laboratory of Artificial Intelligence of Yunnan Province, Kunming 650500, China)

low-resource cross-language summarization; cross-language semantic alignment; keywords; probability mapping

DOI: 10.6043/j.issn.0438-0479.202110023

Notes

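In the low-resource Chinese-Vietnamese cross-language summarization task, annotated Chinese-Vietnamese aligned data are scarce, which makes cross-language semantic alignment difficult. To address this, a low-resource cross-language summarization method incorporating keyword probability mapping is proposed. The method first uses source-language keywords to extract key information, then maps the source-language keywords to the target language using probability mapping pairs, and finally integrates the mapped target-language keywords into summary generation through a pointer network. Experiments on a constructed Chinese-Vietnamese cross-language summarization dataset show that, compared with direct end-to-end methods, incorporating keyword probability mapping information effectively improves the quality of low-resource cross-language summaries.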

Objective: Chinese-Vietnamese cross-language summarization is the task of generating a Vietnamese summary for a given Chinese text with a cross-language summarization model. Keywords carry the important content of the source text and therefore enhance it effectively, providing guidance for summary generation, while a bilingual parallel dictionary provides a unified semantic space for bilingual texts. As a continuation of previous research, this paper studies mapping source-language keywords to the target language using Chinese-Vietnamese probability mapping pairs, in order to address the difficulty of semantic alignment and the poor quality of Chinese-Vietnamese cross-language summaries when annotated Chinese-Vietnamese aligned data are scarce.
Methods: The model takes Chinese text and Chinese keywords as input and is built on the Transformer framework and a pointer-generator network. The size of the Chinese dictionary is set to 100 000, the size of the Vietnamese dictionary to 10 000, the size of the probability mapping dictionary to 39 311 according to word frequency, and the number of keywords to 5 according to the average summary length. The experiments are carried out on a single NVIDIA RTX 2070 SUPER GPU. First, a joint representation of the source-language keywords and the source-language text is obtained. Then the source-language keywords are mapped to the target language using the probability mapping pairs. Finally, the pointer network combines the mapped keyword probabilities with the word probabilities generated by the decoder to obtain the final distribution.
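The last two steps of the method can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the gate value `p_gen`, the renormalization step, and all toy probabilities and Vietnamese tokens are assumptions made for illustration. The sketch projects attention weights over source-language keywords onto target-language words through a probability mapping dictionary, then mixes the result with the decoder's vocabulary distribution in the manner of a pointer-generator gate.

```python
from typing import Dict


def map_keywords(keyword_attn: Dict[str, float],
                 prob_map: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Project attention over source keywords onto target-language words.

    prob_map[src][tgt] is the (assumed) probability that source keyword
    `src` maps to target word `tgt`. Keywords missing from the dictionary
    contribute nothing.
    """
    mapped: Dict[str, float] = {}
    for src_kw, attn in keyword_attn.items():
        for tgt_word, p in prob_map.get(src_kw, {}).items():
            mapped[tgt_word] = mapped.get(tgt_word, 0.0) + attn * p
    return mapped


def final_distribution(p_vocab: Dict[str, float],
                       mapped: Dict[str, float],
                       p_gen: float) -> Dict[str, float]:
    """Mix the decoder distribution with the mapped keyword distribution.

    p_gen weights generation from the vocabulary; (1 - p_gen) weights the
    mapped keywords. The result is renormalized (an assumption of this
    sketch) because uncovered keywords can leave the mapped mass below 1.
    """
    words = set(p_vocab) | set(mapped)
    mixed = {w: p_gen * p_vocab.get(w, 0.0) + (1.0 - p_gen) * mapped.get(w, 0.0)
             for w in words}
    total = sum(mixed.values())
    return {w: v / total for w, v in mixed.items()} if total > 0 else mixed


# Toy values, all hypothetical.
keyword_attn = {"经济": 0.7, "增长": 0.3}            # attention over Chinese keywords
prob_map = {
    "经济": {"kinh_tế": 0.9, "tiết_kiệm": 0.1},      # probability mapping pairs
    "增长": {"tăng_trưởng": 1.0},
}
p_vocab = {"kinh_tế": 0.2, "tăng_trưởng": 0.3, "và": 0.5}  # decoder output
mapped = map_keywords(keyword_attn, prob_map)
final = final_distribution(p_vocab, mapped, p_gen=0.6)
print(sorted(final.items(), key=lambda kv: -kv[1]))
```

In this toy setting the mapped keyword mass raises the probability of the target-language keyword above that of the function word the decoder alone preferred, which is the intended effect of the guidance. In a trained model the gate and the attention weights would be predicted at each decoding step rather than fixed.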
Results: A Chinese-Vietnamese cross-language summarization model incorporating keyword probability mapping is constructed and compared with NCLS and other methods on the constructed Chinese-Vietnamese cross-language summarization dataset of 100 000 samples. The results show that a joint representation of the source keywords and the source text yields a better contextual representation, and that fusing keyword probability mapping to guide summary generation effectively improves model performance. It also alleviates the poor summarization quality caused by weak Vietnamese machine translation. We further explore the influence of the number of keywords and the size of the probability mapping dictionary on model performance. The experiments show that guiding the summarization model with prior knowledge such as keywords effectively improves low-resource summarization performance, and that the coverage of the keywords in the probability mapping dictionary affects performance to a certain extent: small probability mapping dictionaries degrade performance because of their low coverage and high noise. Finally, we examine how the fusion method for the keyword probability mapping information influences model performance. Under the same experimental conditions, the pointer-network fusion used in this paper is more effective than direct concatenation.
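The relationship between dictionary size and keyword coverage can be made concrete with a small Python sketch. The helper names, the frequency-based truncation rule, and the toy data are assumptions for illustration only. The abstract states that the dictionary size is chosen according to word frequency but does not give the procedure.

```python
from typing import Dict, List


def truncate_by_frequency(prob_map: Dict[str, Dict[str, float]],
                          freq: Dict[str, int],
                          size: int) -> Dict[str, Dict[str, float]]:
    """Keep only the `size` most frequent source words of the dictionary."""
    kept = sorted(prob_map, key=lambda w: freq.get(w, 0), reverse=True)[:size]
    return {w: prob_map[w] for w in kept}


def keyword_coverage(keywords_per_doc: List[List[str]],
                     prob_map: Dict[str, Dict[str, float]]) -> float:
    """Fraction of extracted keywords that have an entry in the dictionary."""
    total = sum(len(kws) for kws in keywords_per_doc)
    if total == 0:
        return 0.0
    covered = sum(1 for kws in keywords_per_doc for kw in kws if kw in prob_map)
    return covered / total


# Toy data, all hypothetical.
docs = [["经济", "增长"], ["贸易", "经济"]]
full_map = {
    "经济": {"kinh_tế": 1.0},
    "增长": {"tăng_trưởng": 1.0},
    "贸易": {"thương_mại": 1.0},
}
freq = {"经济": 50, "贸易": 20, "增长": 5}

small_map = truncate_by_frequency(full_map, freq, size=1)
print(keyword_coverage(docs, full_map), keyword_coverage(docs, small_map))
```

A smaller dictionary leaves more keywords unmapped, so less guidance reaches the decoder. This matches the coverage effect reported above. The sketch does not model the noise in small dictionaries.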
Conclusions: The above results show that, in the low-resource setting, extracting the keyword information of the source-language text and mapping it to the target language to guide summary generation brings a certain improvement on the Chinese-Vietnamese low-resource cross-language summarization task. The experiments also indicate that keyword probability mapping information provides richer guidance for the cross-language summarization model, and that the proposed method may be more effective for low-resource cross-language summarization tasks. Multimodal and other multi-source information is a high-level generalization of the text content and can complement it well. Therefore, how to use multimodal information to guide cross-language summarization is the focus of future study.