Project team at CLiPS (University of Antwerp)
PhD student: Kim Luyckx
Promotor: Prof. dr. Walter Daelemans
Co-promotors: dr. Guy De Pauw and Edward Vanhoutte
Duration
January 2007 - end of December 2010
Abstract
Project funded by the Research Foundation - Flanders (FWO)
In this project, we investigate a methodology for the automatic extraction and analysis of style that we want to apply both to individual authors (authorship attribution, in fiction and non-fiction) and to groups of authors (extraction of stylistic characteristics associated with gender and age). This methodology covers several aspects:
- Automatic linguistic analysis of documents by means of available text analysis tools on the level of morphological structure, part of speech, global syntactic structures and semantic roles (subject, object, temporal, location) for the construction of potentially relevant stylistic characteristics.
- Unsupervised and supervised learning techniques for selecting characteristics with high information value and constructing a model of authorial style.
- Evaluation of these models by (a) comparison with stylistic analyses in linguistics and literary studies and (b) empirical testing of the predictive power of the models.
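The three aspects above (feature construction, learning a model of authorial style, and empirical testing of predictive power) can be illustrated with a minimal sketch. It is not the project's implementation: the function-word list, the nearest-centroid classifier, the leave-one-out evaluation and the toy corpus are all assumptions chosen to keep the example small, and the project itself relies on much richer linguistic features (part of speech, syntactic structure, semantic roles).

```python
from collections import Counter
import math
import re

# Assumption: a tiny, hypothetical function-word list stands in for the
# much richer linguistic features described in the methodology.
FUNCTION_WORDS = ["the", "of", "and", "a", "in", "to", "i", "it", "but", "not"]


def features(text):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def centroid(vectors):
    """Average a list of equal-length feature vectors component-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def train(labelled_texts):
    """Build one centroid per author from (author, text) pairs."""
    by_author = {}
    for author, text in labelled_texts:
        by_author.setdefault(author, []).append(features(text))
    return {author: centroid(vecs) for author, vecs in by_author.items()}


def predict(model, text):
    """Attribute the text to the author with the nearest centroid."""
    vec = features(text)
    return min(model, key=lambda author: math.dist(vec, model[author]))


def leave_one_out_accuracy(labelled_texts):
    """Hold out each text in turn, train on the rest, and score the prediction."""
    if len(labelled_texts) < 2:
        raise ValueError("need at least two labelled texts")
    correct = 0
    for i, (author, text) in enumerate(labelled_texts):
        rest = labelled_texts[:i] + labelled_texts[i + 1:]
        correct += predict(train(rest), text) == author
    return correct / len(labelled_texts)


# Toy data invented for illustration only; "A" and "B" are hypothetical authors.
toy_corpus = [
    ("A", "the history of the city and the fall of the empire"),
    ("A", "the rise of the state in the age of the kings"),
    ("B", "i did not go but i did not stay"),
    ("B", "i was not sure but i went"),
]
```

Calling `predict(train(toy_corpus), some_text)` attributes a new text to one of the toy authors, and `leave_one_out_accuracy(toy_corpus)` gives a rough estimate of predictive power. On realistic corpora, a feature selection step and a stronger learner would replace the centroid comparison.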
Expected results
- An operational methodology for constructing stylistic models of individual authors and groups of authors, delivered as a software package that combines reusable, freely available text analysis tools for research purposes with tools for style feature extraction by means of machine learning techniques.
- Corpora for future research in stylistic characteristics, authorship attribution and detection of plagiarism.
- An answer to the following fundamental research questions:
  - Is the proposed methodology effective for extracting stylistic characteristics from corpora with a constant theme and register?
  - Do the stylistic characteristics keep their value when the models are applied to texts with a non-constant theme and register?
  - Does the proposed methodology provide the same insights as manual (literary) style analysis?
  - Are dependencies between words better predictors of style than other syntactic features?
  - Can the methodology be applied to tasks such as authorship attribution or gender identification with reliable results?
Selected stylometry bibliography
- Baayen, H., Van Halteren, H. and Tweedie, F. (1996), "Outside the Cave of Shadows: Using Syntactic Annotation to Enhance Authorship Attribution", Literary and Linguistic Computing, 11(3), 121-131.
- Koppel, M., Argamon, S. and Shimoni, A. (2002), "Automatically Categorizing Written Texts by Author Gender", Literary and Linguistic Computing, 17(4), 401-412.
- Sebastiani, F. (2002), "Machine Learning in Automated Text Categorization", ACM Computing Surveys, 34(1), 1-47.
- Stamatatos, E., Fakotakis, N. and Kokkinakis, G. (2000), "Automatic Text Categorization in Terms of Genre and Author", Computational Linguistics, 26(4), 471-495.
- Stamatatos, E. (2009), "A Survey of Modern Authorship Attribution Methods", Journal of the American Society for Information Science and Technology, 60(3), 538-556.
- Van Halteren, H. (2007), "Author verification by linguistic profiling: An exploration of the parameter space", ACM Transactions on Speech and Language Processing, 4(1), 1-17.