ECML-97 MLnet Familiarization Workshop
April 26, Prague, Czech Republic

Empirical Learning of Natural Language Processing Tasks

Electronic Workshop Notes

This page lists the table of contents of the Workshop Notes, containing links to gzipped postscript versions of all papers as they appear in the Notes, and the abstracts of these papers. All participants are invited to download the documents of their choice to get informed about the topics of the talks at the workshop beforehand.

The complete electronic Workshop Notes (all postscript versions) can also be FTPd as one gzipped tar file (909 Kb, unzipped 3,208 Kb). Click here to download.

Please note that the reference for the Workshop Notes is the following:


Notice: corrected version of Kazakov's paper

A corrected version of Dimitar Kazakov's paper Unsupervised Learning of Naive Morphology with Genetic Algorithms has replaced the old version which lacked all letters with diacritics. Especially when you have already downloaded the paper, you are invited to download the new postscript version directly by clicking here.


Table of Contents



Abstracts


Walter Daelemans (Tilburg University, The Netherlands), Antal van den Bosch, and Ton Weijters (Universiteit Maastricht, The Netherlands)
Empirical Learning of Natural Language Processing Tasks
pages 1-10

Language learning has thus far not been a hot application for machine-learning (ML) research. This limited attention for work on empirical learning of language knowledge and behaviour from text and speech data seems unjustified. After all, it is becoming apparent that empirical learning of Natural Language Processing (NLP) can alleviate NLP's all-time main problem, viz. the knowledge acquisition bottleneck: empirical ML methods such as rule induction, top down induction of decision trees, lazy learning, inductive logic programming, and some types of neural network learning, seem to be excellently suited to automatically induce exactly that knowledge that is hard to gather by hand. In this paper we address the question why NLP is an interesting application for empirical ML, and provide a brief overview of current work in this area.


Luc Steels (VUB Artificial Intelligence Laboratory, Brussels, Belgium)
Language Learning and Language Contact
pages 11-24

The evolution of language can only be explained when we take a language learning process into account which is necessarily imperfect due to weak data and the limits of induction. Thus language change yields clues on what the learning processes are and conversely hypothesised learning processes should predict possible language changes. This paper considers this issue by studying lexicon formation processes. It is shown how a process for the construction and acquisition of a lexicon in a single agent leads to various dynamical phenomena when applied to a population of agents with varying degrees of contact.


James Cussens, David Page, Stephen Muggleton, and Ashwin Srinivasan (Oxford University, United Kingdom)
Using Inductive Logic Programming for Natural Language Processing
pages 25-34

We summarise recent work on using Inductive Logic Programming (ILP) for Natural Language Processing (NLP). ILP performs learning in a first-order logical setting, and is thus well-suited to induce over the various structured representations used in NLP. We present Stochastic Logic Programs (SLPs) and demonstrate their use in ILP when learning from positive examples only. We also give accounts of work on learning grammars from children's books and part-of-speech tagging.


Luc DeHaspe and Luc De Raedt (Katholieke Universiteit Leuven, Belgium)
Mining a Natural Language Corpus for Multi-Relational Association Rules
pages 35-48

Association rules are generally recognised as a highly valuable type of regularities and various algorithms have been presented for efficiently mining them in large databases. To the best of our knowledge, the application of these algorithms is so far restricted to databases that consist of a single relation composed of a set of binary attributes. We describe how these restrictions can be overcome through the combination of the available algorithms with standard techniques from the field of inductive logic programming. We present the algorithm APRIORIREL, which extends APRIORI [Agrawal et al, 1996] to mine association rules in multiple relations. Whereas in APRIORI eacg example is described by a single tuple, in APRIORIREL each example is viewed as a separate data base with a selection, from multiple relations, of all tuples related to the example. Accordingly, the association rules discovered by APRIORIREL may combine information from various relations to statements of the form "95% of the employees who have a parent who is in head of one of the administration departments have worked on one of the reorganisation projects". We apply APRIORIREL to the natural language processing task of mining part-of-speech tagging rules in a large corpus of English.


Saso Dzeroski, Tomaz Erjavec (Jozef Stefan Institute, Ljubljana, Slovenia)
Learning Slovene Declensions with FOIDL
pages 49-60

The paper presents results of using FOIDL to learn certain inflectional forms of Slovene nouns. FOIDL learns first-order decision lists, defined as ordered list of clauses; it has been previously tested on the problem of inducing rules for forming the past tense of English verbs. Slovene, unlike English, has rich inflectional morphology, and the paper reports the result of applying FOIDL over a large lexicon of Slovene word-forms to induce rules for the generation of the singular genitive forms of Slovene nouns of the masculine, feminine and neuter gender.


Maria Wolters (University of Bonn, Germany) and Antal van den Bosch (Universiteit Maastricht, The Netherlands)
Automatic Phonetic Transcription of Words Based On Sparse Data
pages 61-70

The relation between the orthography and the phonology of a language has traditionally been modelled by hand-crafted rule sets. Machine-learning (ML) approaches offer a means to gather this knowledge automatically. Problems arise when the training material is sparse. Generalising from sparse data is a well-known problem for many ML algorithms. We present experiments in which connectionist, instance-based, and decision-tree learning algorithms are applied to a small corpus of Scottish Gaelic. The results show that instance-based learning in the IB1-IG algorithm yields the best generalisation performance, and that most algorithms tested perform tolerably well. Given the availability of a lexicon, even if it is sparse, ML is a valuable and efficient tool for automatic phonetic transcription of written text.


Jakub Zavrel, Walter Daelemans (Tilburg University, The Netherlands)
Memory-based learning: using similarity for smoothing
pages 71-84

This paper analyses the relation between the use of similarity in Memory-Based Learning and the notion of backed-off smoothing in statistical language modeling. We show that the two approaches are closely related, and we argue that feature weighting methods in the Memory-Based paradigm can offer the advantage of automatically specifying a suitable domain-specific hierarchy between most specific and most general conditioning information without the need for a large number of parameters. We report two applications of this approach: PP-attachment and POS-tagging. Our method achieves state-of-the-art performance in both domains, and allows the easy integration of diverse information sources, such as rich lexical representations.


Toine Andernach, Mannes Poel, and Etto Salomons (University of Twente, The Netherlands)
Finding Classes of Dialogue Utterances with Kohonen Networks
pages 85-94

In this paper, we present Kohonen Self-Organising Feature Maps (SOMs) as a method for automatically finding classes of dialogue utterances on the basis of superficial utterance features, in particular for dialogues found in the SCHISMA corpus. Furthermore, we discuss some ways for determining the quality of a certain utterance classification. We propose to use supervised classification to gain more insight in the results of the unsupervised generation of Kohonen maps.


Jens-Uwe Moeller (University of Hamburg, Germany)
CLASSITALL: Incremental and Unsupervised Learning in the DIA-MOLE Framework
pages 95-104

The learning algorithm CLASSITALL is a module of DIA-MOLE, a tool that supports an engineering-oriented approach towards dialogue modelling for a spoken-language interface. CLASSITALL is a descendant of the COBWEB conceptual clustering algorithm, and in this paper we especially focus on extensions that are necessary for processing data within a spoken language system environment. While most learning algorithms handle simple data, like e.g. attribute-value pairs, we introduce an approach to use uncomplete and uncertain as well as highly structured knowledge within a learning task. CLASSITALL clustering of structured values is based on sets of least general generalisations. For reasons of efficiency we developed an algorithm with a lazy evaluation strategy for constructing local hierarchies and for propagating information. Some improvements for using classification hierarchies are also introduced.


Dimitar Kazakov (Czech Technical University, Prague, Czech Republic)
Unsupervised Learning of Naive Morphology with Genetic Algorithms
pages 105-112

The morphological lexicon is an important part of NLP systems which is typically hand-written with the help of linguist experts. Even a partial automation of this process could decrease the cost of the lexicon, being of theoretical importance for languages and dialects which have not been well analysed yet. In this work we describe an attempt to use the minimal description length (MDL) as the one bias for deriving lexicons of morphemes from a raw list of words. MDL is used as a fitness function of a simple genetic algorithm. Results are reported for a rich-morphology language corpus (French) and future work is discussed.


Miles Osborne (Cambridge University, UK)
Minimisation, Indifference and Statistical Language Learning
pages 113-124

When applied to probabilistic categorial grammar learning, the Minimum Description Length principle outperforms Maximum Likelihood Estimation. Smoothing does not bridge the gap between the two approaches.


Jan Hajic, Kiril Ribarov (Charles University, Prague, Czech Republic)
Rule-Based Dependencies
pages 125-136

We present a modified version of the Transformation-Based Approach (TBA) and Transformation-Based Error-Driven Learning (TBEDL). We modified the TBA in order to work with a dependency tree structure, which describes more efficiently the syntax of inflective and free-word order languages, such as the Czech language. The major changes and characteristics are described in more detail: they mostly concern the error function and the transformation templates. Variations of these transformations and their relationship to their triggering environments is also discussed. The transformations are designed to be compatible with the structural representation used in the Prague Dependency Treebank, which is under construction.


Khalil Sima'an (Utrecht University, The Netherlands)
Explanation-Based Learning of Partial-Parsers
pages 137-146

This paper presents a method for learning efficient parsers of natural language. The method consists of an Explanation-Based Learning (EBL) algorithm for learning partial-parsers, and a parsing algorithm which combines partial-parsers with existing "full-parsers". The learned partial-parsers, implementable as Cascades of Finite-State Transducers (CFSTs), recognize and combine constituents efficiently, prohibiting spurious overgeneration. The parsing algorithm combines a learned partial-parser with a given full-parser such that the role of the full-parser is limited to combining the constituents, recognized by the partial-parser, and to recognizing unrecognized portions of the input sentence. We exhibit encouraging empirical results using a pilot implementation: parse-space is reduced substantially with minimal loss of coverage.


Last updated: 9 March 2002 by Antal van den Bosch / Antal.vdnBosch@kub.nl