Academic Paper

Unsupervised grammar induction with simple linguistic constraints
Document Type
Electronic Thesis or Dissertation
Author
Source
Subject
415.01
Language
English
Abstract
This thesis investigates the problem of unsupervised learning of natural language grammar in the context-free grammar formalism, and argues that linguistic notions are beneficial to the task. Like some recent approaches, this thesis employs distributional clustering, which is based on the linguistic notion of distribution. Although grammar induction is conceptually complicated, as it involves both the demarcation between constituents and non-constituents and that between different types of constituents, it is shown in the thesis that these two tasks are in fact two sides of the same coin. That is, non-constituents can also be classified into different clusters, and these clusters are easily separated from those of constituents, so the real problem in grammar induction is how to identify constituents. This thesis provides a generic framework for distributional grammar induction for experimenting with the effect of different criteria for selecting clusters of constituents. Experiments show that a criterion based on the simple principle of minimum variance fails to learn plausible grammars from large amounts of complex data, and that it also leads to inconsistency in syntactic analysis as well as flat parse trees. Another criterion is proposed on the basis of the fragment test, one of the constituency tests proposed in distributional linguistics. This criterion, augmented by a novel grammar-rule rewriting mechanism, is shown to be successful in guarding against many frequently occurring non-constituents, in learning many types of constituents, and in removing redundancy in the grammar while giving rise to highly hierarchical syntactic structures.
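The abstract refers to distributional clustering, i.e. grouping word sequences by the contexts in which they occur. The sketch below is not the thesis's algorithm; it only illustrates the general notion on a toy corpus, using assumed choices (a hypothetical similarity threshold, (left-word, right-word) context counts, and cosine similarity with greedy single-link grouping).

# Illustrative sketch only: distributional clustering of word sequences by
# their surrounding contexts. Spans with similar context distributions end
# up in the same cluster; some clusters correspond to constituents, others
# to non-constituents.
from collections import Counter, defaultdict
import math

corpus = [
    "the dog chased the cat".split(),
    "a dog saw a cat".split(),
    "the cat chased a dog".split(),
]

MAX_SPAN = 2      # only consider candidate sequences up to this length
THRESHOLD = 0.5   # hypothetical similarity cut-off

def context_profiles(sentences, max_span=MAX_SPAN):
    """Map each candidate span to counts of its (left, right) contexts."""
    profiles = defaultdict(Counter)
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        for i in range(1, len(padded) - 1):
            for j in range(i + 1, min(i + max_span, len(padded) - 1) + 1):
                span = tuple(padded[i:j])
                context = (padded[i - 1], padded[j])
                profiles[span][context] += 1
    return profiles

def cosine(c1, c2):
    """Cosine similarity between two sparse context count vectors."""
    dot = sum(v * c2.get(k, 0) for k, v in c1.items())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

profiles = context_profiles(corpus)

# Greedy single-link grouping: a span joins the first cluster containing a
# span whose context distribution is similar enough, else starts a new one.
clusters = []
for span, prof in profiles.items():
    for cluster in clusters:
        if any(cosine(prof, profiles[other]) >= THRESHOLD for other in cluster):
            cluster.append(span)
            break
    else:
        clusters.append([span])

for cluster in clusters:
    if len(cluster) > 1:
        print(cluster)

Running the sketch groups, for example, the spans ("the", "dog") and ("a", "cat") together because they share contexts such as (<s>, chased) and (chased, </s>); selecting which such clusters count as constituents is the problem the thesis's criteria address.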

Online Access