Previous abstract | Contents | Next abstract

Bootstrapping a multilingual part-of-speech tagger in one person-day

This paper presents a method for bootstrapping a fine-grained, broad-coverage part-of-speech (POS) tagger in a new language using only one person-day of data acquisition effort. It requires only three resources, which are currently readily available in 60-100 world languages: (1) an online or hard-copy pocket-sized bilingual dictionary, (2) a basic library reference grammar, and (3) access to an existing monolingual text corpus in the language. The algorithm begins by inducing initial lexical POS distributions from English translations in a bilingual dictionary without POS tags. It handles irregular, regular and semi-regular morphology through a robust generative model using weighted Levenshtein alignments. Unsupervised induction of grammatical gender is performed via global modeling of context-window feature agreement. Using a combination of these and other evidence sources, interactive training of context and lexical prior models are accomplished for fine-grained POS tag spaces. Experiments show high accuracy, fine-grained tag resolution with minimal new human effort.

Silviu Cucerzan and David Yarowsky, Bootstrapping a multilingual part-of-speech tagger in one person-day. In: Dan Roth and Antal van den Bosch (eds.), Proceedings of CoNLL-2002, Taipei, Taiwan, 2002, pp. 132-138. [ps] [ps.gz] [pdf] [bibtex]

Last update: September 07, 2002. erikt@uia.ua.ac.be