Academic Paper

Data Generation Using Sequence-to-Sequence
Document Type
Conference
Source
2018 IEEE Recent Advances in Intelligent Computational Systems (RAICS), pp. 108-112, Dec. 2018
Subject
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Engineering Profession
General Topics for Engineers
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Keywords
Data models
Decoding
Feeds
Training
Computational modeling
Error analysis
Context modeling
Sequence2Sequence
NLP
transliteration
LSTM
encoder
decoder
Language
English
Abstract
Sequence-to-sequence models have shown a lot of promise for problems such as Neural Machine Translation (NMT), text summarization, and paraphrase generation. Deep Neural Networks (DNNs) work well with large labeled training sets, but in sequence-to-sequence problems the mapping becomes much harder because the source and target differ in syntax, semantics, and length. Moreover, the use of DNNs is constrained by the fixed dimensionality of the input and output, which does not hold for most Natural Language Processing (NLP) problems. Our primary focus was to build transliteration systems for Indian languages. For Indian languages, monolingual corpora are abundantly available, but parallel corpora that can be applied directly to the transliteration problem are scarce. With the available parallel corpus, we could only build weak models. We propose a system that leverages the monolingual corpus to generate a clean, high-quality parallel corpus for transliteration, which is then used iteratively to tune the existing weak transliteration models. Our results confirm the hypothesis that the generation of clean data can be validated objectively by evaluating the models alongside the efficiency with which the system generates data in each iteration.
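This record does not include source code, so the following is only a minimal sketch of the kind of character-level LSTM encoder-decoder that the keywords (LSTM, encoder, decoder, Sequence2Sequence) point to, assuming PyTorch. The class name, vocabulary sizes, and dimensions are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class Seq2SeqTransliterator(nn.Module):
    """Illustrative character-level encoder-decoder for transliteration."""

    def __init__(self, src_vocab, tgt_vocab, emb_dim=128, hid_dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src, tgt_in):
        # Encode the source character sequence; keep only the final (h, c)
        # state, which summarizes the whole word in a fixed-size vector.
        _, state = self.encoder(self.src_emb(src))
        # Decode conditioned on that state (teacher forcing: tgt_in is the
        # gold target sequence shifted right by one position).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), state)
        return self.out(dec_out)  # (batch, tgt_len, tgt_vocab) logits

# Shape check with random ids: a batch of 2 source words of length 10
# decoded against targets of length 12.
model = Seq2SeqTransliterator(src_vocab=64, tgt_vocab=80)
logits = model(torch.randint(0, 64, (2, 10)), torch.randint(0, 80, (2, 12)))
assert logits.shape == (2, 12, 80)
```

Because the encoder compresses any input word into the same fixed-size state, this layout sidesteps the fixed input/output dimensionality constraint the abstract mentions: words of differing lengths all map to one state shape, and the decoder emits one character at a time.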
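The abstract's iterative scheme (use the weak model to transliterate monolingual words, keep only clean outputs, retune on the grown corpus) can be read as a confidence-filtered self-training loop. The sketch below builds on the Seq2SeqTransliterator above; the BOS/EOS ids, the 0.9 threshold, the mean per-character probability as a cleanliness score, and the train_one_pass helper are all assumptions for illustration, since this record does not specify the paper's filtering criterion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

BOS, EOS = 1, 2  # illustrative special-token ids

def train_one_pass(model, pairs, lr=1e-3):
    """One tuning pass over the current parallel corpus (teacher forcing).

    Each pair is (src, tgt), both shape (1, T), with tgt framed by BOS/EOS.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for src, tgt in pairs:
        logits = model(src, tgt[:, :-1])  # predict each next character
        loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                       tgt[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()

@torch.no_grad()
def greedy_decode(model, src, max_len=30):
    """Decode one source word greedily; return target ids and mean confidence."""
    _, state = model.encoder(model.src_emb(src))
    tok = torch.tensor([[BOS]])
    out, probs = [], []
    for _ in range(max_len):
        dec, state = model.decoder(model.tgt_emb(tok), state)
        p = F.softmax(model.out(dec[:, -1]), dim=-1)
        conf, idx = p.max(dim=-1)
        if idx.item() == EOS:
            break
        out.append(idx.item())
        probs.append(conf.item())
        tok = idx.unsqueeze(0)
    return out, sum(probs) / max(len(probs), 1)

def bootstrap(model, seed_pairs, mono_words, rounds=3, threshold=0.9):
    """Grow a parallel corpus from monolingual words, round by round."""
    parallel = list(seed_pairs)
    for _ in range(rounds):
        train_one_pass(model, parallel)   # tune the weak model on what we have
        for src in mono_words:
            tgt, conf = greedy_decode(model, src)
            if tgt and conf >= threshold:  # keep only "clean" pairs
                parallel.append((src, torch.tensor([[BOS] + tgt + [EOS]])))
    return model, parallel
```

Under this reading, evaluating the model after each round against how many new pairs cleared the threshold mirrors the abstract's claim that data cleanliness can be validated objectively alongside the system's per-iteration generation efficiency.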