Academic Paper

Morphological competence in neural natural language processing
Document Type
Electronic Thesis or Dissertation
Subject
case-marking languages
dependency parsing
Estonian
Finnish
linguistically-oriented analysis of neural networks
morphological competence
morphologically rich languages
natural language processing
Polish
Russian
Language
English
Abstract
In case-marking languages (CMLs), such as Polish or Finnish, a substantial portion of grammatical information is expressed at the word level. Word forms provide information about their inherent properties, such as tense or mood, but also encode information about relations between words. This is in contrast to morphologically impoverished languages, like English, which undergo little inflection. The linguistic factors associated with CMLs make them a challenge for data-driven, neural natural language processing (NLP). To successfully process a CML, neural NLP models must be morphologically competent; i.e., they have to capture both the meaning and function of the different components of a word form and recognise the importance of morphological signals within a language.

Despite the importance of morphological competence for language processing, neural NLP models have never been directly tested for this linguistic ability. This gap in the literature is all the more important given that most neural NLP models are developed with English in mind and later applied, without any adaptation, to other languages. It remains unclear whether the architectures and optimisation techniques developed for English can extract all the essential information from the word forms of CMLs and whether they can interpret this information at the clausal level to solve NLP tasks.

In this thesis I investigate whether state-of-the-art neural models for CMLs utilise morphosyntactic information when solving a task for which this information is key: dependency parsing. To answer this question I propose a new evaluation paradigm in which models are evaluated on various counterfactual versions of dependency corpora. Through evaluation of Polish, Russian, Finnish and Estonian dependency parsers, I reveal that the models often fail to recognise morphology as the primary indicator of syntax; instead of generalising based on case and agreement marking, they learn to over-rely on word order and lexical semantics. Following this finding, I experiment with two methods of increasing the models' reliance on morphology: one based on altering the training data and another involving an enhanced training objective. Finally, by creating synthetic CMLs through manipulation of selected typological properties of Polish, I investigate whether the models have a 'preference' for particular means of encoding case information, and reveal that syncretism and a high degree of fusion are amongst the properties that drive the models away from relying on morphology as a signal to subject/objecthood.
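To make the counterfactual-evaluation paradigm concrete, the sketch below builds one plausible kind of counterfactual treebank from a Universal Dependencies corpus in standard CoNLL-U format: it randomly permutes the surface word order of each sentence while keeping lemmas, morphological features and the dependency tree intact, so a parser that genuinely relies on case and agreement marking should be largely unaffected by the permutation. The abstract does not specify the thesis's actual manipulations, so this is an illustrative assumption; all function and file names are hypothetical.

import random

def shuffle_sentence(token_lines):
    # Randomly permute token order; remap ID (column 1) and HEAD (column 7)
    # so the dependency tree is preserved under the new surface order.
    tokens = [line.split("\t") for line in token_lines]
    order = list(range(len(tokens)))
    random.shuffle(order)
    # old 1-based ID -> new 1-based ID; HEAD "0" (root) is left untouched
    new_id = {str(old + 1): str(new + 1) for new, old in enumerate(order)}
    out = []
    for new_pos, old_pos in enumerate(order, start=1):
        fields = list(tokens[old_pos])
        fields[0] = str(new_pos)                      # ID
        fields[6] = new_id.get(fields[6], fields[6])  # HEAD ("0" stays "0")
        fields[8] = "_"                               # drop enhanced deps for simplicity
        out.append("\t".join(fields))
    return out

def counterfactual_corpus(path_in, path_out):
    with open(path_in, encoding="utf-8") as f:
        sentences = f.read().strip().split("\n\n")
    with open(path_out, "w", encoding="utf-8") as f:
        for sent in sentences:
            lines = sent.split("\n")
            comments = [l for l in lines if l.startswith("#")]
            # Keep plain tokens only; multiword ranges ("1-2") and empty
            # nodes ("1.1") would need extra bookkeeping, so they are skipped.
            toks = [l for l in lines
                    if l and not l.startswith("#") and l.split("\t")[0].isdigit()]
            f.write("\n".join(comments + shuffle_sentence(toks)) + "\n\n")

counterfactual_corpus("pl_pdb-ud-train.conllu", "pl_pdb-ud-train.shuffled.conllu")

Comparing a parser's accuracy on the original and the shuffled test set then indicates how much of its performance rests on word order rather than on morphological marking.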
