This page lists the table of contents of the Workshop Notes, containing links to gzipped postscript versions of all papers as they appear in the Notes, and the abstracts of these papers. All participants are invited to download the documents of their choice to get informed about the topics of the talks at the workshop beforehand.
The complete electronic Workshop Notes (all postscript versions) can also be FTPd as one gzipped tar file (909 Kb, unzipped 3,208 Kb). Click here to download.
Please note that the reference for the Workshop Notes is the following:
A corrected version of Dimitar Kazakov's paper Unsupervised Learning of Naive Morphology with Genetic Algorithms has replaced the old version which lacked all letters with diacritics. Especially when you have already downloaded the paper, you are invited to download the new postscript version directly by clicking here.
Language learning has thus far not been a hot application for
machine-learning (ML) research. This limited attention for work on
empirical learning of language knowledge and behaviour from text and
speech data seems unjustified. After all, it is becoming apparent that
empirical learning of Natural Language Processing (NLP) can alleviate
NLP's all-time main problem, viz. the knowledge acquisition
bottleneck: empirical ML methods such as rule induction, top down
induction of decision trees, lazy learning, inductive logic
programming, and some types of neural network learning, seem to be
excellently suited to automatically induce exactly that knowledge that
is hard to gather by hand. In this paper we address the question why
NLP is an interesting application for empirical ML, and provide a
brief overview of current work in this area.
The evolution of language can only be explained when we take a
language learning process into account which is necessarily imperfect
due to weak data and the limits of induction. Thus language change
yields clues on what the learning processes are and conversely
hypothesised learning processes should predict possible language
changes. This paper considers this issue by studying lexicon formation
processes. It is shown how a process for the construction and
acquisition of a lexicon in a single agent leads to various dynamical
phenomena when applied to a population of agents with varying degrees
of contact.
We summarise recent work on using Inductive Logic Programming (ILP)
for Natural Language Processing (NLP). ILP performs learning in a
first-order logical setting, and is thus well-suited to induce over
the various structured representations used in NLP. We present
Stochastic Logic Programs (SLPs) and demonstrate their use in ILP
when learning from positive examples only. We also give accounts of work on
learning grammars from children's books and part-of-speech tagging.
Association rules are generally recognised as a highly valuable type
of regularities and various algorithms have been presented for
efficiently mining them in large databases. To the best of our
knowledge, the application of these algorithms is so far restricted to
databases that consist of a single relation composed of a set of
binary attributes. We describe how these restrictions can be overcome
through the combination of the available algorithms with standard
techniques from the field of inductive logic programming. We present
the algorithm APRIORIREL, which extends APRIORI [Agrawal et al, 1996]
to mine association rules in multiple relations. Whereas in APRIORI
eacg example is described by a single tuple, in APRIORIREL each
example is viewed as a separate data base with a selection, from
multiple relations, of all tuples related to the example. Accordingly,
the association rules discovered by APRIORIREL may combine information
from various relations to statements of the form "95% of the employees
who have a parent who is in head of one of the administration
departments have worked on one of the reorganisation projects". We
apply APRIORIREL to the natural language processing task of mining
part-of-speech tagging rules in a large corpus of English.
The paper presents results of using FOIDL to learn certain
inflectional forms of Slovene nouns. FOIDL learns first-order
decision lists, defined as ordered list of clauses; it has been
previously tested on the problem of inducing rules for forming the
past tense of English verbs. Slovene, unlike English, has rich
inflectional morphology, and the paper reports the result of
applying FOIDL over a large lexicon of Slovene word-forms to
induce rules for the generation of the singular genitive forms of
Slovene nouns of the masculine, feminine and neuter gender.
The relation
between the orthography and the phonology of a language has
traditionally been modelled by hand-crafted rule sets.
Machine-learning (ML) approaches offer a means to gather this
knowledge automatically. Problems arise when the training material is
sparse. Generalising from sparse data is a well-known problem for many
ML algorithms. We present experiments in which connectionist,
instance-based, and decision-tree learning algorithms are applied to a
small corpus of Scottish Gaelic. The results show that instance-based
learning in the IB1-IG algorithm yields the best generalisation
performance, and that most algorithms tested perform tolerably
well. Given the availability of a lexicon, even if it is sparse, ML is
a valuable and efficient tool for automatic phonetic transcription of
written text.
This paper analyses the relation between the use of similarity in
Memory-Based Learning and the notion of backed-off smoothing in
statistical language modeling. We show that the two approaches are
closely related, and we argue that feature weighting methods in the
Memory-Based paradigm can offer the advantage of automatically
specifying a suitable domain-specific hierarchy between most specific
and most general conditioning information without the need for a large
number of parameters. We report two applications of this approach:
PP-attachment and POS-tagging. Our method achieves state-of-the-art
performance in both domains, and allows the easy integration of
diverse information sources, such as rich lexical representations.
In this paper, we present Kohonen Self-Organising Feature Maps (SOMs)
as a method for automatically finding classes of dialogue
utterances on the basis of superficial utterance features, in
particular for dialogues found in the SCHISMA corpus. Furthermore, we
discuss some ways for determining the quality of a certain utterance
classification. We propose to use supervised classification to gain
more insight in the results of the unsupervised generation of Kohonen
maps.
The learning algorithm CLASSITALL is a module of DIA-MOLE, a tool that
supports an engineering-oriented approach towards dialogue modelling
for a spoken-language interface. CLASSITALL is a descendant of the
COBWEB conceptual clustering algorithm, and in this paper we
especially focus on extensions that are necessary for processing data
within a spoken language system environment. While most learning
algorithms handle simple data, like e.g. attribute-value pairs, we
introduce an approach to use uncomplete and uncertain as well as
highly structured knowledge within a learning task. CLASSITALL
clustering of structured values is based on sets of least general
generalisations. For reasons of efficiency we developed an algorithm
with a lazy evaluation strategy for constructing local hierarchies and
for propagating information. Some improvements for using
classification hierarchies are also introduced.
The morphological lexicon is an important part of NLP systems
which is typically hand-written with the help of linguist
experts. Even a partial automation of this process could
decrease the cost of the lexicon, being of theoretical
importance for languages and dialects which have not been well
analysed yet. In this work we describe an attempt to use the
minimal description length (MDL) as the one bias for
deriving lexicons of morphemes from a raw list of words.
MDL is used as a fitness function of a simple genetic
algorithm. Results are reported for a rich-morphology language
corpus (French) and future work is discussed.
When applied to probabilistic categorial grammar learning, the
Minimum Description Length principle outperforms Maximum
Likelihood Estimation. Smoothing does not bridge the gap between the
two approaches.
We present a modified version of the
Transformation-Based Approach (TBA) and Transformation-Based
Error-Driven Learning (TBEDL). We modified the TBA in order to work
with a dependency tree structure, which describes more efficiently the
syntax of inflective and free-word order languages, such as the Czech
language. The major changes and characteristics are described in more
detail: they mostly concern the error function and the transformation
templates. Variations of these transformations and their relationship
to their triggering environments is also discussed. The
transformations are designed to be compatible with the structural
representation used in the Prague Dependency Treebank, which is under
construction.
This paper presents a method for learning efficient parsers of natural
language. The method consists of an Explanation-Based Learning (EBL)
algorithm for learning partial-parsers, and a parsing algorithm which
combines partial-parsers with existing "full-parsers". The learned
partial-parsers, implementable as Cascades of Finite-State Transducers
(CFSTs), recognize and combine constituents efficiently, prohibiting
spurious overgeneration. The parsing algorithm combines a learned
partial-parser with a given full-parser such that the role of the
full-parser is limited to combining the constituents, recognized by
the partial-parser, and to recognizing unrecognized portions of the
input sentence. We exhibit encouraging empirical results using a pilot
implementation: parse-space is reduced substantially with minimal loss
of coverage.
Walter Daelemans (Tilburg University, The Netherlands), Antal
van den Bosch, and Ton Weijters (Universiteit Maastricht, The
Netherlands)
Empirical Learning of Natural Language Processing Tasks
pages 1-10
Luc Steels (VUB Artificial Intelligence Laboratory, Brussels,
Belgium)
Language Learning and Language Contact
pages 11-24
James Cussens, David Page, Stephen Muggleton, and Ashwin Srinivasan
(Oxford University, United Kingdom)
Using Inductive Logic Programming for Natural Language
Processing
pages 25-34
Luc DeHaspe and Luc De Raedt (Katholieke Universiteit Leuven,
Belgium)
Mining a Natural Language Corpus for Multi-Relational Association
Rules
pages 35-48
Saso Dzeroski, Tomaz Erjavec (Jozef Stefan Institute,
Ljubljana, Slovenia)
Learning Slovene Declensions with FOIDL
pages 49-60
Maria Wolters (University of Bonn,
Germany) and Antal van den Bosch (Universiteit Maastricht, The
Netherlands)
Automatic Phonetic Transcription of Words
Based On Sparse Data
pages 61-70
Jakub Zavrel, Walter Daelemans (Tilburg
University, The Netherlands)
Memory-based learning: using similarity for smoothing
pages 71-84
Toine Andernach, Mannes Poel, and Etto Salomons (University of Twente,
The Netherlands)
Finding Classes of Dialogue Utterances with Kohonen
Networks
pages 85-94
Jens-Uwe Moeller (University of Hamburg, Germany)
CLASSITALL: Incremental and Unsupervised Learning in the DIA-MOLE
Framework
pages 95-104
Dimitar Kazakov (Czech Technical University, Prague, Czech
Republic)
Unsupervised Learning of Naive Morphology with Genetic
Algorithms
pages 105-112
Miles Osborne (Cambridge University, UK)
Minimisation, Indifference and Statistical Language
Learning
pages 113-124
Jan Hajic, Kiril Ribarov (Charles University, Prague, Czech
Republic)
Rule-Based Dependencies
pages 125-136
Khalil Sima'an (Utrecht University, The Netherlands)
Explanation-Based Learning of Partial-Parsers
pages 137-146