Resources

Many useful resources have been developed at CLiPS, and a large share of them are available to a wider audience. We have categorised these resources into three pages.

 

Below is a selection of our software, corpora and datasets.

TwiSty is a corpus developed for research in author profiling. It contains personality (MBTI) and gender annotations for a total of 18,168 authors spanning six languages. We distribute the Twitter IDs of these authors as well as the IDs of their...
The Personae corpus was collected for experiments in Authorship Attribution and Personality Prediction. It consists of 145 Dutch-language essays, written by 145 different students (BA in Linguistics and Literature at the University of Antwerp,...
Pattern is a web mining module for Python. It bundles tools for data retrieval (Google + Twitter + Wikipedia, web spider, HTML parser), text analysis (rule-based shallow parser, WordNet interface, syntactical + semantical n-gram search algorithm, tf...
MBSP is a text analysis system based on the TiMBL and MBT memory-based learning applications developed at CLiPS and ILK. It provides tools for Tokenization and Sentence Splitting, Part-of-Speech Tagging, Chunking, Lemmatization, Relation Finding...
The deLearyous dataset is a Dutch (Flemish) dataset for emotion classification following the framework of Leary's Rose, also known as the Interpersonal Circumplex. The dataset contains 11 conversations that were annotated on the sentence level with...
The CSI corpus is a corpus of student texts in two genres, essays and reviews, that is expanded every year. It is intended primarily for stylometric research, but other applications are possible. There is a vast amount of metadata available,...
The AuCoPro-Semantics dataset supports the automatic semantic analysis of compounds. It contains semantically annotated noun-noun compounds (NN) from Dutch and Afrikaans, split into two annotation rounds per language. The semantic annotation was...