Deep linguistic features for computational stylometry


Stylometry is the quantitative study of writing style. The field is rich in linguistic applications that investigate the correlation between a text's writing style and its metadata. Empirical studies have shown that, for instance, the gender of an author can be predicted fairly reliably from their writing style. Other applications include authorship attribution and the prediction of age and personality.


Project information

The dominant approach in stylometry relies on superficial linguistic characteristics, e.g. frequencies of words and character sequences. Although such features have proven to work well on various stylometric tasks, the issue of explanation often arises: it can be difficult to explain why certain superficial features perform well, e.g. in a complex task such as authorship attribution. Moreover, such shallow features can be difficult to interpret from a linguistic point of view.
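As a rough illustration of what such shallow features look like, the following minimal Python sketch computes relative frequencies of words and of character sequences (n-grams) for a text. The function names, the regex tokenisation and the default n-gram length are assumptions made for this example, not the project's implementation.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return the relative frequency of each lowercased word token."""
    # Assumption: a simple regex tokeniser is good enough for illustration.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()} if total else {}


def char_ngram_frequencies(text, n=3):
    """Return the relative frequency of each character n-gram."""
    # Slide a window of length n over the raw text, spaces included.
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.items()} if total else {}
```

Each text is thus reduced to a table of frequencies. Such tables can feed a classifier, but they say little about why a given word or character sequence is characteristic of an author.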


In this project, we will explore the use of deeper linguistic features in computational stylometry. Following a line of recent research, we hypothesise that more complex features will provide complementary information about writing style. We will propose methods for constructing new features (i.e. finding quantifiable aspects of the text) related to the semantics and the discourse of the text, two types of linguistic knowledge that are currently under-researched in stylometry.


01/10/2014 - 30/09/2018

FWO Research Foundation - Flanders

Welcome to Enrique Manjavacas

CLiPS welcomes Enrique Manjavacas as a visiting researcher for the next couple of months. Enrique is an MA student at the Freie Universität Berlin, where he studies European Languages and is writing an MA thesis on computational stylometry. He will be working in the computational linguistics group in close collaboration with Prof. Walter Daelemans and Ben Verhoeven.

Verhoeven, B., Daelemans, W., & Plank, B. (2016). TwiSty: a multilingual Twitter Stylometry corpus for gender and personality profiling. Proceedings of the 10th Annual Conference on Language Resources and Evaluation (LREC 2016).
Verhoeven, B., & Daelemans, W. (In Press). Discourse features for computational stylometry. Poster to be presented at the TextLink Second Action Conference, Budapest, Hungary.
Verhoeven, B., Plank, B., & Daelemans, W. (In Press). Multilingual personality profiling on Twitter. To be presented at DHBenelux 2016, Belval, Luxembourg.
Verhoeven, B., & Daelemans, W. (2016). Discourse features for computational stylometry. Presented at University of Groningen, The Netherlands.
Plank, B., Verhoeven, B., & Daelemans, W. (2015). Personality traits on Twitter for less-resourced languages. Presented at the 26th Meeting of Computational Linguistics in the Netherlands (CLIN 26).