
Cross-dataset clustering: Revealing corresponding themes across multiple corpora

We present a method for identifying corresponding themes across several corpora that focus on related but distinct domains. We approach this task by simultaneously clustering keyword sets extracted from the analyzed corpora. Our algorithm extends the information-bottleneck soft clustering method to a setting that consists of several datasets. Experiments with topical corpora reveal similar aspects of three distinct religions. We evaluate the results by comparing them with clusters constructed manually by an expert.


Ido Dagan, Zvika Marx and Eli Shamir, Cross-dataset clustering: Revealing corresponding themes across multiple corpora. In: Dan Roth and Antal van den Bosch (eds.), Proceedings of CoNLL-2002, Taipei, Taiwan, 2002, pp. 15-21.
Last update: September 07, 2002. erikt@uia.ua.ac.be