Computational Linguistics

Can we capture language understanding, production, learning, and translation in computational models? Computational linguistics research at CLiPS is concerned with the study of computational methods for the representation, acquisition, and use of language knowledge.

We focus on the application of statistical and machine learning methods, trained on corpus data, to explain human language acquisition and processing data, and to develop automatic text analysis systems that are accurate, efficient, and robust enough to be used in practical applications. We develop specific machine learning algorithms suited to the properties of language data (few regularities, many irregularities and exceptions), and develop new methodologies for the simulation of these language data.

Our application-oriented research is in the domain of Language Technology, the development of language processing tools to solve concrete problems. The research focus here has been on text mining (extracting knowledge from unstructured text data). We develop new approaches combining machine learning and automatic text analysis to solve generic problems in text mining (automatic summarization, question answering, information extraction, smart search, ontology learning, etc.). We build these generic solutions into prototypes for specific applications. Recently, the group has also developed research initiatives on language technology for African languages, and on Digital Humanities (especially the areas of computational stylometry and language technology for the study of old variants of Dutch).

In the dominant approach in stylometry, superficial linguistic characteristics are often used, e.g. frequencies of words and character sequences. Although such features have proven to work well on various tasks in stylometry, the issue of explanation often arises: it can be difficult to explain why certain superficial features perform well in a complex task such as authorship attribution. Moreover, such shallow features can be difficult to interpret from a linguistic point of view....
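As a rough illustration of the superficial features mentioned above, the following minimal Python sketch computes relative frequencies of words and of character sequences (character n-grams) for a text. The function names, the tokenization by a simple regular expression, and the default n-gram length of 3 are assumptions made for this example; they are not taken from any CLiPS tool.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Relative frequency of each lowercased word in the text."""
    words = re.findall(r"\w+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}


def char_ngram_frequencies(text, n=3):
    """Relative frequency of each character sequence of length n."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}
```

Frequency tables like these can serve as feature vectors for a classifier, which also shows the interpretation problem: a high weight on a character sequence such as "th " says little about the linguistic habit behind it.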
Our aim is to implement commercial web services for automatic opinion detection and author profiling in text. In this project we will develop the core technology: data mining and annotation, machine learning and setting up the server. In a follow-up project we will then launch a spin-off company. This kind of language technology is useful for a wide range of big data applications, and does not yet exist for Dutch, and only in part for English. https://www.textgain.com
The acquisition of abstract linguistic categories is investigated. Computational models of bootstrapping operations are constructed in order to investigate how knowledge from one domain can be instrumental in acquiring knowledge of another domain. In our simulations the language addressed to very young children is used in an attempt to elucidate how grammatical categories and grammatical gender are acquired given a combination of distributional, phonological and morphological bootstrapping.
The AMiCA (“Automatic Monitoring for Cyberspace Applications”) project aims to mine relevant social media (blogs, chat rooms, and social networking sites) and collect, analyse, and integrate large amounts of information using text and image analysis. The ultimate goal is to trace harmful content, contact, or conduct in an automatic way. Essentially, we take a cross-media mining approach that allows us to detect risks “on-the-fly”. When critical situations are detected (e.g. a very...
The research project focuses on the authorship, composition and textual interconnectedness of three 16th-century mystical texts, all of which are believed to have emerged from a group of female/male writers we now call “the Arnhem mystics”. These texts, Die evangelische peerle, Vanden tempel onser sielen and the Arnhem mystical sermons, are all in some way connected to the St. Agnes convent in Arnhem. Similarities between the three can be found on a lexical, semantic, conceptual and stylistic...
In this project, we investigate a methodology for the automatic extraction and analysis of style that we want to apply to both individual authors (authorship attribution, both fiction and non-fiction) and groups of authors (extraction of stylistic characteristics associated with gender and age). This methodology covers several aspects: (1) Automatic linguistic analysis of documents by means of available text analysis tools on the level of morphological structure, part of speech, global syntactic...
In this project we investigate the applicability of machine learning techniques (supervised and unsupervised methods) to various language technology problems for African languages. 
We conduct research into text analytics (e.g., do adults use more punctuation than adolescents?) and its real-world applications (e.g., can we predict age by punctuation?). Many of our resources are freely available. Here are some reads on how we constructed or applied such resources, for example for sentiment analysis, demographic prediction, and detection of subversive behavior (cyberbullying, grooming, hate speech, ...). We frequently release open source tools, such as...
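To make the punctuation question above concrete, here is a minimal Python sketch of one possible measurement: the share of non-whitespace characters in a text that are punctuation marks. The function name and the choice of ASCII punctuation (Python's `string.punctuation`) are assumptions for illustration only, not a description of the group's actual tools.

```python
import string


def punctuation_rate(text):
    """Fraction of non-whitespace characters that are ASCII punctuation."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    punct = sum(1 for c in chars if c in string.punctuation)
    return punct / len(chars)
```

Comparing the average of such a rate across texts by adults and by adolescents would address the first question; using it as one input feature to a classifier would be a starting point for the second.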

Past Projects

In many human language technology applications (e.g. machine translators, spelling checkers), it often happens that concatenatively written compounds (e.g. “skrywerspen”/“schrijverspen” ‘writer’s pen’) are processed incorrectly. From a technological perspective, these segmentation errors are particularly problematic, since concatenative compounding is a highly productive process in many languages, including Dutch and Afrikaans. Although a compound splitter has already been developed for...