Peersman, C., Daelemans, W., & Van Vaerenbergh, L. (2011, October). Predicting age and gender in online social networks. In Proceedings of the 3rd international workshop on Search and mining user-generated contents (pp. 37-44).
A common characteristic of communication on online social networks is that it happens via short messages, often using nonstandard language variations. These characteristics make this type of text a challenging text genre for natural language processing. Moreover, in these digital communities it is easy to provide a false name, age, gender and location in order to hide one’s true identity, providing criminals such as pedophiles with new possibilities to groom their victims. It would therefore be useful if user profiles can be checked on the basis of text analysis, and false profiles flagged for monitoring. This paper presents an exploratory study in which we apply a text categorization approach for the prediction of age and gender on a corpus of chat texts, which we collected from the Belgian social networking site Netlog. We examine which types of features are most informative for a reliable prediction of age and gender on this difficult text type and perform experiments with different data set sizes in order to acquire more insight into the minimum data size requirements for this task.
Identifying the Dataset
The motivation behind this study is to make online social networks (OSNs) safer for children by making it easier to identify and flag pedophiles. Because the paper focuses on detecting pedophiles, age and gender serve as the key latent attributes. In a case study of the Belgian OSN Netlog, the authors collect 1,537,283 Flemish Dutch posts containing 18,713,627 tokens. Because communication on this site is short, non-standard, and constantly changing (colloquial Dutch evolves particularly rapidly), the authors state that obtaining a training dataset is difficult.
Rather than fixing the classifier features in advance, the authors select them using statistical techniques and evaluate them with machine learning classifiers. The data is prepared by saving the last post of each interaction as a separate document. The posts are then tokenized (in a style- and language-independent way), normalised, and grouped using profile data. Four combined age and gender classes were created: male and female classes for users below and above 25 (e.g. plus25_male). Experiments following these steps were run with 10,000, 5,000, and 1,000 posts to test the minimum amount of data required for accurate results. Classification was carried out with a LIBLINEAR-based SVM package to determine which types of features are most informative.
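The two feature families and the combined class labels described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the whitespace tokenizer, the trigram size, the function names, the `under25` label prefix, and the choice to place users aged exactly 25 in the `plus25` band are all assumptions made for illustration. Only the `plus25_male` label format comes from the summary.

```python
from collections import Counter


def word_unigrams(post: str) -> Counter:
    """Count lowercased, whitespace-separated tokens (a simplified tokenizer)."""
    return Counter(post.lower().split())


def char_ngrams(post: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams, spaces included."""
    text = post.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def class_label(age: int, gender: str, threshold: int = 25) -> str:
    """Combine an age band and a gender into one of four class names.

    Assumption: users aged exactly `threshold` fall in the 'plus25' band,
    and 'under25' is a hypothetical name for the younger band.
    """
    band = "plus25" if age >= threshold else "under25"
    return f"{band}_{gender}"


if __name__ == "__main__":
    post = "hey hoe ist hey"           # toy chat message, not from the corpus
    print(word_unigrams(post))         # token (word unigram) counts
    print(char_ngrams(post, n=3))      # character trigram counts
    print(class_label(31, "male"))     # combined age-gender class name
```

Count dictionaries like these would then be turned into feature vectors and passed to a linear SVM. Repeating the run on subsets of 10,000, 5,000, and 1,000 posts would mirror the data-size experiments.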
Key Assumptions Stated by Authors
The authors acknowledge that profiles cannot be checked manually, so some of the stated gender and/or age information could be false. Given the nature of Netlog, the three main limitations outlined by the authors are the feasibility of gender prediction on chat text, the usefulness of this approach with short texts, and the minimum training data size required for accurate results.
This study does not presume accuracy based on a predetermined model; instead, it tries various classification setups to arrive at the best model. However, ‘robustness’ is determined by a category classification system set up for a specific objective - flagging potential pedophiles. Therefore, the applicability of this approach (in conjunction with the intention) is limited.
Moreover, beyond the gender binary assumption (which is common across most studies), the authors assume that tokens, characters, and word unigrams are inherently indicative of demographic identity. While this may hold in the case of falsification, where pedophiles adopt stereotypical gendered conventions, these stereotypes remain problematic when applied to the population at large - especially considering the variety of language and communication features on Netlog.
The 4-way age-gender classification task was run with an SVM, which yielded an accuracy of 66.3%. According to the experiments, token features outperformed character features, and unigrams were the most effective feature for predicting gender with post metadata. The highest accuracy was achieved with the largest dataset. These evaluations are based on F-scores, precision, and recall. The best results were obtained when both gender and age were balanced in the dataset - the accuracy reached 88% in this case.
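For readers unfamiliar with the metrics named above, here is a minimal Python sketch of how per-class precision, recall, and F-score relate to overall accuracy. The gold and predicted labels are invented toy data, not results from the paper, and the function names and the `under25_female` label are hypothetical.

```python
def per_class_scores(gold, predicted):
    """Return {label: (precision, recall, f1)} for every label seen."""
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        # Guard against division by zero when a class is never predicted or never occurs.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores


def accuracy(gold, predicted):
    """Fraction of posts whose predicted class matches the gold class."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


if __name__ == "__main__":
    # Toy labels for illustration only.
    gold = ["plus25_male", "plus25_male", "under25_female", "under25_female"]
    pred = ["plus25_male", "under25_female", "under25_female", "under25_female"]
    print(accuracy(gold, pred))
    for label, (p, r, f) in per_class_scores(gold, pred).items():
        print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Reporting precision and recall per class alongside accuracy matters here because accuracy alone can look strong on an unbalanced dataset even when one class is rarely predicted correctly.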
Because the training dataset and the classifier features were determined ad hoc, it would be difficult to replicate this experiment across a wide range of scenarios and/or industries. However, in the context of language-independent text analysis, comparing the precision and recall of unigrams, tokens, and characters could prove useful.