Fourth CLIF Symposium
Symposium on Language and Speech Technology in Flanders
|December 9, 2009|
|Time|Talk|Speaker|
|9.30|Streamlining processing stages in building the parallel corpus DPC|Hans Paulussen|
|10.00|Automatic transcription of Flemish broadcast news shows|Kris Demuynck|
|11.00|Automatic speaker recognition|David van Leeuwen|
|11.50|Proper name recognition using multilingual acoustic and lexical models|Bert Réveil|
|12.20|A Multimodal Approach to Audiovisual Text-to-Speech Synthesis|Wesley Mattheyses|
|14.30|CLARIN -- Language and Speech Infrastructure for Researchers in the Humanities and Social Sciences|Ineke Schuurman|
|15.00|Cross-lingual Word Sense Disambiguation|Els Lefever|
|16.00|Alignment of grammatically divergent parses using interlingual MT techniques|Tom Vanallemeersch|
|16.30|Computational approaches to creativity|Tom De Smedt|
|17.00|Reception: Belgian beers||
Automatic speaker recognition is an area of speech technology that has received much attention from speech researchers in recent years. Some believe that it is the cleanest of all speech-related recognition problems. Although simple in its formulation, the speaker recognition problem appears to have an intricate relation with its application. Text-independent speaker recognition can be seen as a pattern recognition problem, where the features are highly variable sequences related to a single source. The task is to detect whether the source is of known identity.
In this presentation, the typical characteristics of the speaker recognition approach are reviewed, and an overview of the machine learning techniques employed is given. Apart from spectral features, which are the most dominant in speech, techniques involving linguistic models exist that can contribute to the discriminability of speakers. These approaches can be effectively combined with baseline systems using simple fusion and calibration techniques. The regular NIST speaker recognition evaluations provide the framework for measuring the performance of the state of the art in text-independent speaker recognition.
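The fusion and calibration step mentioned above can be sketched as follows. This is a minimal illustration, not the presenter's actual system: the weights and offset are invented, and in practice they would be trained, e.g. by logistic regression on held-out trials.

```python
import math

def fuse_scores(scores, weights, offset):
    """Linear score fusion: a weighted sum of subsystem scores plus an
    offset (weights and offset are hypothetical, normally trained)."""
    return sum(w * s for w, s in zip(weights, scores)) + offset

def llr_to_posterior(llr, prior=0.5):
    """Interpret the fused score as a log-likelihood ratio and convert
    it, given a target prior, into a posterior target probability."""
    log_odds = llr + math.log(prior / (1.0 - prior))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical scores from a spectral and a linguistic subsystem:
llr = fuse_scores([2.1, 0.4], weights=[0.8, 0.5], offset=-0.3)
posterior = llr_to_posterior(llr)
```

A well-calibrated fused score can then be thresholded at a cost-dependent operating point to decide whether the source is of known identity.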
The continuous and steady improvements made over the years in both the accuracy and robustness of large-vocabulary continuous speech recognition have led to systems that can deal with complex tasks such as the automatic transcription of broadcast news shows. In this presentation, we will describe one such system, built on top of the open-source toolkit SPRAAK. Several task- and system-related aspects will be briefly discussed.
The performance and main causes of error of the current system are analyzed on the N-Best benchmark and on some recent Flemish news shows.
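Performance analyses of this kind are conventionally reported as word error rate: the word-level edit distance between reference and hypothesis transcripts, divided by the reference length. A minimal sketch (not the SPRAAK scoring tool itself):

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: Levenshtein distance between the two word
    sequences, divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))      # distances for the empty ref prefix
    for i, r in enumerate(ref, 1):
        prev_diag, d[0] = d[0], i      # prev_diag holds d[i-1][j-1]
        for j, h in enumerate(hyp, 1):
            prev_diag, d[j] = d[j], min(
                d[j] + 1,              # deletion
                d[j - 1] + 1,          # insertion
                prev_diag + (r != h),  # substitution or match
            )
    return d[-1] / len(ref)
```

For example, a single substituted word in a three-word reference gives a WER of 1/3.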
The follow-up and coordination of a multilingual corpus project require a different approach compared to the compilation of monolingual corpora. Unlike the latter type of project, which mainly focuses on successive linear processing stages, multilingual corpora require a parallel follow-up of data processing.
In building the Dutch Parallel Corpus (DPC), we not only had to cope with monitoring the sequential tasks of data acquisition, data processing and data packaging, but we also had to consider the complexities of the different types of data processing: sentence alignment and linguistic annotation, for three languages (Dutch, English and French). In order to manage the different aspects of corpus creation, the whole procedure was monitored through an electronic "matrix" (linked with metadata files), which could be updated flexibly on a daily basis, thus facilitating optimisation of the corpus design requirements.
In this talk we present the approach used in DPC to handle the follow-up of the project and the coordination of the different processing stages.
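Such a monitoring matrix can be thought of as a per-text, per-language grid of pipeline stages. The sketch below is an invented miniature (the stage names and identifiers are not DPC's actual scheme), just to show the idea of advancing cells independently and querying what is still pending:

```python
# Hypothetical pipeline stages for one text in one language:
STAGES = ("acquired", "aligned", "annotated", "packaged")

def advance(matrix, text_id, language):
    """Move one text/language cell to the next processing stage."""
    stage = matrix[text_id][language]
    nxt = STAGES[min(STAGES.index(stage) + 1, len(STAGES) - 1)]
    matrix[text_id][language] = nxt
    return nxt

def pending(matrix, stage):
    """All (text, language) cells that have not yet reached `stage`."""
    cutoff = STAGES.index(stage)
    return [(t, lang) for t, cells in matrix.items()
            for lang, s in cells.items() if STAGES.index(s) < cutoff]

matrix = {"text-001": {"nl": "aligned", "en": "acquired", "fr": "acquired"}}
advance(matrix, "text-001", "en")   # the "en" cell moves on to "aligned"
```

Because each language column advances independently, a daily scan of `pending` reveals exactly where the parallel pipelines diverge.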
Utterances of proper names remain a challenge for voice-driven car navigation and directory assistance applications, as they exhibit a lot of pronunciation variation. The latter is due to the fact that proper names often show archaic spelling or originate (in part) from foreign languages. Furthermore, the above applications usually require the accommodation of non-native users.
In order to address and explore this challenge in the context of Dutch Points Of Interest (POI) recognition, two previously proposed approaches were revisited. First, multiple foreign grapheme-to-phoneme (g2p) transcriptions were added to the lexicon of a monolingual proper name recognition system. In a second step, a multilingual acoustic model was introduced. We found that both measures greatly improve the performance of the recognizer, and analyzed the improvements thoroughly.
However, even though the accuracy gains obtained with our best system were substantial, a cheating experiment with auditorily verified (AV) transcriptions revealed that further significant improvements are possible. Therefore, we are currently deploying so-called phoneme-to-phoneme (p2p) converters that try to transform a set of baseline transcriptions into a pool of transcription variants that lie closer to the “true” AV transcriptions. The first experiments have shown that p2p transcriptions allow us to further improve the recognition accuracy.
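Pooling transcription variants from several g2p sources into one lexicon entry can be sketched as below. The entries and phone strings are invented for illustration, not taken from the POI system:

```python
from collections import defaultdict

# Lexicon mapping each proper name to a set of (source, phones) variants.
lexicon = defaultdict(set)

def add_variants(word, source, transcriptions):
    """Register pronunciation variants from one g2p source (e.g. a
    Dutch or French converter) under the same lexicon entry."""
    for phones in transcriptions:
        lexicon[word].add((source, phones))

# Invented phone strings for one POI name:
add_variants("Wavre", "nl-g2p", ("w a: v r @",))
add_variants("Wavre", "fr-g2p", ("w a v R",))
# The recognizer can now match either a Dutch-like or French-like
# realization of the same name.
```

A p2p converter would operate on top of such entries, rewriting the baseline variants toward the auditorily verified forms.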
Audiovisual text-to-speech systems convert a written text into an audiovisual speech signal. Typically, the visual mode of the synthetic speech is synthesized separately from the audio, the latter being either natural or synthesized speech. Mismatches between these two information streams can be perceived and can degrade the overall quality; this requires experimental exploration. In order to increase the intermodal coherence of synthetic 2D photorealistic speech, we extended the well-known unit selection audio synthesis technique to work with multimodal segments containing original combinations of audio and video.
In this presentation we discuss our synthesis strategy and we summarize the results of listening experiments we conducted.
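The unit selection principle extended here to multimodal segments can be illustrated roughly as follows. Everything in this sketch is invented (the units, the costs, and the greedy left-to-right search standing in for the usual Viterbi search over the full lattice):

```python
def select_units(targets, candidates, target_cost, join_cost):
    """For each target unit, greedily pick the candidate segment that
    minimizes target cost plus the join cost with the previous pick."""
    chosen, prev = [], None
    for t in targets:
        best = min(candidates[t],
                   key=lambda u: target_cost(t, u)
                   + (join_cost(prev, u) if prev is not None else 0.0))
        chosen.append(best)
        prev = best
    return chosen

# Toy candidates: (segment id, pitch) pairs per target phone.
candidates = {"a": [("a1", 100.0), ("a2", 140.0)],
              "b": [("b1", 150.0), ("b2", 90.0)]}
tcost = lambda t, u: 0.0                 # ignore target mismatch here
jcost = lambda p, u: abs(p[1] - u[1])    # penalize pitch discontinuity
select_units(["a", "b"], candidates, tcost, jcost)
```

In the multimodal setting, the join cost would additionally penalize visual discontinuities between consecutive video segments, which is what keeps the two streams coherent.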
We present a multilingual unsupervised Word Sense Disambiguation (WSD) task for a sample of English nouns. The task was formulated within the framework of the SemEval-2010 evaluation exercise. Instead of providing manually sense-tagged examples for each sense of a polysemous noun, our sense inventory is built on the basis of the Europarl parallel corpus. The multilingual setup involves the translations of a given English polysemous noun in five languages, viz. Dutch, French, German, Spanish and Italian.
Organizing this task consists of: (a) the manual creation of a multilingual sense inventory for a lexical sample of English nouns and (b) the evaluation of systems on their ability to disambiguate new occurrences of the selected polysemous nouns.
For the creation of the hand-tagged gold standard, all translations of a given polysemous English noun are retrieved in the five languages and clustered by meaning. Human annotators label each instance with the appropriate cluster and their top-3 translations from this cluster. The frequencies of these translations are used to assign weights to all translations in the gold standard. Systems can participate in some of the five bilingual evaluation subtasks and in a multilingual subtask covering all language pairs.
To score the system output, we perform a "best" evaluation (where the credit for each correct guess is divided by the number of guesses) and a more "relaxed" evaluation allowing a maximum of 10 system guesses (where systems are not penalized for a higher number of guesses). We provide two baselines: the first takes into account the most frequent GIZA++ word alignments, whereas the second uses the most frequent EuroWordNet sense.
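A simplified reading of the "best" scoring rule for a single test instance can be sketched as below; this is an illustration of the principle, not the official SemEval scorer, and the normalization is a simplification:

```python
def best_score(guesses, gold_weights):
    """Credit each guess by its frequency weight in the gold standard,
    normalize by the total gold weight, and divide by the number of
    guesses, so guessing broadly dilutes the score."""
    total = sum(gold_weights.values())
    credit = sum(gold_weights.get(g, 0) for g in guesses)
    return credit / (total * len(guesses))

# Invented gold translations with annotator frequencies:
gold = {"bank": 3, "oever": 1}
best_score(["bank"], gold)           # one confident, correct guess
best_score(["bank", "oever"], gold)  # both correct, but credit is split
```

Under this rule a single high-frequency correct guess can outscore an exhaustive list, which is exactly the behaviour the "relaxed" variant removes.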
Alignment of nodes across parse trees is useful for several purposes, including the creation or tuning of MT systems and computer-assisted translation. Tree alignment approaches combine features such as lexical equivalences, syntactic labels, tree levels and inside/outside scores. A non-trivial problem is the alignment of divergences, such as equivalent words with different syntactic categories, and paraphrases.
We propose an approach for aligning grammatical divergences which is based on interlingual MT techniques from the Eurotra system, abstracts away from surface linguistic properties and creates semantic hypotheses. These divergences involve equivalences between verbs and deverbal nouns (e.g. "during their meeting" and "terwijl ze vergaderden" involve a semantic subject "they"/"ze" and an action "meet"/"vergaderen"), differences in tense and aspect, and differences in diathesis (e.g. passivisation).
For three languages (Dutch, French and English), we create a reference corpus with semantically annotated sentences, parse the sentences and associate subtree patterns with semantic hypotheses. We test the patterns and hypotheses by applying them to parses of Europarl sentence pairs and aligning hypotheses based on their similarity and a bilingual lexicon. We extend the bilingual sentence alignment in the Europarl corpus to a trilingual one in order to align hypotheses between three languages.
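The step of aligning hypotheses "based on their similarity and a bilingual lexicon" could look roughly like the sketch below. The data and the similarity measure (reduced here to bilingual-lexicon overlap of the lemmas in each hypothesis) are invented stand-ins for the actual semantic hypotheses:

```python
def similarity(src_words, tgt_words, lexicon):
    """Fraction of source lemmas with a lexicon translation found
    among the target hypothesis' lemmas."""
    hits = sum(1 for w in src_words if lexicon.get(w, set()) & tgt_words)
    return hits / max(len(src_words), 1)

def align_hypotheses(src, tgt, lexicon):
    """Greedy one-to-one alignment by descending similarity."""
    scored = sorted(((similarity(sw, tw, lexicon), s, t)
                     for s, sw in src.items() for t, tw in tgt.items()),
                    reverse=True)
    used_s, used_t, links = set(), set(), []
    for score, s, t in scored:
        if score > 0 and s not in used_s and t not in used_t:
            links.append((s, t))
            used_s.add(s)
            used_t.add(t)
    return links

# Toy hypotheses from a Dutch and an English parse:
src = {"h1": {"vergaderen"}, "h2": {"ze"}}
tgt = {"m1": {"meet"}, "m2": {"they"}}
lex = {"vergaderen": {"meet"}, "ze": {"they"}}
align_hypotheses(src, tgt, lex)
```

Extending this from a bilingual to a trilingual setting amounts to chaining such pairwise links through the common pivot sentences.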
Traditionally, software applications for computer graphics have been based on real-world analogies. Each icon in the application's user interface represents a concrete object: a pen, an eraser, scissors. This model imposes creative limitations: features can only be used as implemented by the developers; the screen is too small to display all the features (some are never discovered); actions are mouse-based, so the user's decision-making process is literally lost in translation.
"NodeBox" is an ongoing effort to produce software that allows more people to express themselves creatively. One of our areas of interest is the way creative ideas are established and how these ideas can be mined from text. Using a number of NLP techniques (shallow parser, semantic network) and drawing inspiration from cognitive processes such as analogy and concept fluidity, the system is able to translate graphically underspecified concepts into something that can be used in a visual representation. For example, "creepy" has no direct visual representation; instead the system could propose you use an image of an octopus for your creepy design. For a given property (e.g. "creepy") and a range of concepts (e.g. animals) it yields the concepts from the range that best resemble the property (the creepiest animals). In this particular example the system will suggest such animals as octopus, bat, crow, locust, mayfly, termite, tick, amphibian, arachnid... No fluffy bunnies or frolicking ponies there!
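The property-to-concept ranking can be illustrated with a toy feature-overlap model. The features below are invented for the example and are not NodeBox's actual semantic network, which is far richer:

```python
# Invented feature sets standing in for a real semantic network:
features = {
    "creepy":  {"dark", "crawling", "nocturnal", "many-legged"},
    "octopus": {"dark", "crawling", "tentacled"},
    "bat":     {"dark", "nocturnal"},
    "bunny":   {"fluffy", "cute"},
}

def rank_by_property(prop, concepts):
    """Order concepts from a range by how many features they share
    with the target property (the 'creepiest' come first)."""
    target = features[prop]
    return sorted(concepts,
                  key=lambda c: len(features[c] & target),
                  reverse=True)

rank_by_property("creepy", ["bunny", "octopus", "bat"])
```

With richer network relations (analogy, spreading activation) the same ranking idea generalizes well beyond literal feature overlap.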
|Organisation: CLiPS University of Antwerp|
Organizing team: Walter Daelemans, Patrick Wambacq, Vincent Van Asch
Sponsored by CLIF
Last updated: 15th of December 2009