Claude, F., Galaktionov, D., Konow, R., Ladra, S., & Pedreira, Ó. (2017). Competitive author profiling using compression-based strategies. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 25(Suppl. 2), 5-20.
Author profiling consists of determining demographic attributes of the author of a given document, such as gender, age, nationality, language, or religion. The task, which has applications in fields such as forensics, security, and marketing, has been approached from different areas, especially linguistics and natural language processing, by extracting various types of features from training documents, usually content- and style-based features.
In this paper the authors address the problem using several compression-inspired strategies that build different models without analyzing or extracting specific features from the textual content, making them style-oblivious approaches. They analyze the behavior of these techniques, combine them, and compare them with other state-of-the-art methods, showing that they can be competitive in accuracy, giving the best predictions in some domains, and efficient in running time.
Identifying the Dataset
Highlighting the importance of author profiling for research and industry, this study uses the TIRA experimentation platform, available to participants in the PAN International Author Profiling competitions (hereafter PAN). The researchers drew training and test data from the PAN datasets of 2014-2016, which include blogs, social media, hotel reviews, and Twitter content in English, Spanish, Italian, and Dutch.
Models are built from the training data, and each available model compresses the documents fed to it; the algorithm determines which model compresses a given document most effectively, and the document is classified under the model that suits it best. The proposed classifiers are: word counting; Move to Front (which reduces redundancy); entropy (descriptive texts have higher entropy); and the compressor (based on the training data, content written by a similar group should be more predictable to a compressor). These techniques are then combined and tested using an SVM.
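The compressor-based classifier described above can be sketched with an off-the-shelf compressor. The snippet below uses zlib as a stand-in for whichever compressor the paper employs, and the class labels and toy corpora are purely hypothetical; the point is only to show the "best compression wins" decision rule:

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8")))


def classify(document: str, class_corpora: dict) -> str:
    """Assign the document to the class whose training corpus 'predicts'
    it best, i.e. whose compressed size grows least when the document
    is appended to that corpus."""
    def extra_bytes(corpus: str) -> int:
        return compressed_size(corpus + " " + document) - compressed_size(corpus)

    return min(class_corpora, key=lambda label: extra_bytes(class_corpora[label]))


# Toy per-class training corpora (hypothetical, for illustration only).
corpora = {
    "under_25": "lol omg brb idk smh tbh fyi imo irl nvm " * 5,
    "over_25": "the quarterly report indicates steady revenue growth " * 5,
}

print(classify("lol omg idk tbh smh brb fyi", corpora))  # prints "under_25"
```

The document's vocabulary overlaps with the first corpus, so appending it there costs only a few back-references, whereas the second corpus must encode most of it as fresh literals; the same intuition drives the paper's compressor classifier.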
Key Assumptions Stated by Authors
According to the authors, compression methods need not take as long as complex natural language processing classifiers, and compression models are simple enough to be language-independent. Because the method takes both efficacy and efficiency into account, the authors claim it scales to large datasets and can be applied to various domains without substantial adaptation.
This study echoes a piece of received wisdom among author profiling researchers: that women write more descriptively and use more words. Whatever complexity the method has is undermined by the binary nature of its possible conclusions. Practically speaking, a binary classification is of course much simpler to compute, but the approach could do a better job of acknowledging its own failures and its limited applicability to other domains.
Unsurprisingly, the combined compression-and-SVM models outperformed the others on the majority of tests, with some exceptions, such as the word-count feature performing well, especially on Spanish text.
Along with measuring efficiency (elapsed time is reported on the TIRA platform) and accuracy, the authors compared their results with those of previous PAN competition winners. Although this comparison means most to readers already familiar with the competition, it is a strategic evaluation technique (as opposed to simply listing other studies as related work).