CoNLL-2001 Proceedings

Unsupervised Induction of Stochastic Context-Free Grammars using Distributional Clustering

Alexander Clark

An algorithm is presented for learning a phrase-structure grammar from tagged text. It clusters sequences of tags together based on local distributional information, and selects clusters that satisfy a novel mutual information criterion. This criterion is shown to be related to the entropy of a random variable associated with the tree structures, and it is demonstrated that it selects linguistically plausible constituents. The criterion is incorporated into a Minimum Description Length algorithm. The evaluation of unsupervised models is discussed, and results are presented for the algorithm trained on 12 million words of the British National Corpus.
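The abstract does not spell out the mutual information criterion. As a rough illustration only, the sketch below assumes one plausible reading: for a candidate tag sequence, measure the mutual information between the tag immediately before it and the tag immediately after it. The function names, the boundary markers `<s>` and `</s>`, and the toy sentences are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def context_counts(sentences, target):
    """Count (left tag, right tag) pairs around each occurrence of `target`.

    `sentences` is a list of tag lists; `target` is a tuple of tags.
    Sentence boundaries are padded with assumed markers <s> and </s>.
    """
    n = len(target)
    counts = Counter()
    for sent in sentences:
        padded = ["<s>"] + list(sent) + ["</s>"]
        # i ranges over start positions that leave room for a right context.
        for i in range(1, len(padded) - n):
            if tuple(padded[i:i + n]) == target:
                counts[(padded[i - 1], padded[i + n])] += 1
    return counts


def mutual_information(counts):
    """Mutual information (in bits) between left and right context tags."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    left, right = Counter(), Counter()
    for (l, r), c in counts.items():
        left[l] += c
        right[r] += c
    mi = 0.0
    for (l, r), c in counts.items():
        # p(l, r) * log2( p(l, r) / (p(l) * p(r)) ), written with raw counts.
        mi += (c / total) * math.log2(c * total / (left[l] * right[r]))
    return mi


# Toy data (hypothetical): the left context fully determines the right one.
toy = [
    ["IN", "DT", "NN", "VBZ"],
    ["VB", "DT", "NN", "IN"],
]
print(mutual_information(context_counts(toy, ("DT", "NN"))))  # prints 1.0
```

Under this assumed reading, a sequence whose left and right neighbours are statistically dependent scores high, while one whose neighbours are independent scores zero. How the paper thresholds or combines such scores with clustering and Minimum Description Length is not stated in the abstract and is not modelled here.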


Last update: July 12, 2001.