Kokkos, A., & Tzouramanis, T. (2014). A robust gender inference model for online social networks and its application to LinkedIn and Twitter. First Monday, 19(9).
Online social networking services have come to dominate the dot com world: Countless online communities coexist on the so-named Social Web. Some typically characteristic user attributes, such as gender, age group, sexual orientation, are not automatically part of the profile information that is provided by default in some of these online services, and in some cases they can even be deliberately and maliciously falsified. This paper looks into how the automated inference of the gender of a user of online social networks can be achieved by means of the application to this user's written text data of a combination of natural language processing and classification techniques. The paper aims to show that the use of a support vector machine classifier and a part-of-speech tagger makes it possible to infer the gender of an online social network user and that this can be done through the classification of down to one single short message included in a profile, quite independently of whether this message does follow a structured and standardized format (as with the attribute summary in LinkedIn) or does not (as with the micro-blogging postings in Twitter). Extensive experimentation on LinkedIn and Twitter has indicated a very marked degree of accuracy of the proposed gender identification scheme of up to 98.4%. This represents a higher level of accuracy than had been achieved by research in the field. The approach proposed herewith may lend itself to applications that could prove significant in a number of fields such as the advertising world; personalization and recommendation; security, forensics and cyber investigation; etc. To the best of the authors' knowledge this paper is actually the first to address the gender identification problem in online social networks that are used for business-only interactions, such as LinkedIn.
Samples of text from LinkedIn (1,000 profile summaries) and Twitter (1,000 tweets) were converted into multidimensional feature vectors to identify latent attributes. The training data consisted of 5,000 tweets for Twitter, along with Brown Corpus and Reuters-21578 newswire data for LinkedIn. By comparing LinkedIn and Twitter user content, the experiment demonstrates the possibilities of attribute inference from data characterised by limited textual variation.
Like many other studies, the task at hand consists of training a machine learning classifier on the data specified above and then applying it to the test data. The strategy is to employ content-based and style-based features for wider applicability to future studies. A further necessary component is the part-of-speech tagger, which assigns a tag to each token and thereby creates predictive features for each token based on a holistic view of the text. For Twitter, the user's gender was manually checked for all 5,000 tweets, with names, tags, and URLs removed from the dataset. The training samples were kept separate to produce a linear SVM model for Twitter and an analogous classifier for LinkedIn using the LingPipe toolkit.
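The pipeline described above can be sketched roughly as follows. This is not the authors' implementation (they used LingPipe and a trained tagger); it is a minimal scikit-learn illustration in which a toy tag lookup stands in for a real part-of-speech tagger, and the texts and labels are invented placeholders for real training data.

```python
# Sketch of the paper's idea: content features (word counts) combined with
# style features (POS-tag n-grams), feeding a linear SVM classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# Toy POS lookup; a real system would use a trained tagger instead.
TOY_TAGS = {"love": "VB", "wonderful": "JJ", "great": "JJ",
            "engineer": "NN", "manager": "NN", "day": "NN", "so": "RB"}

def to_tag_string(text):
    # Map each token to its (toy) POS tag so a standard n-gram
    # vectorizer can count tag patterns as style-based features.
    return " ".join(TOY_TAGS.get(tok.lower(), "OTH") for tok in text.split())

# Illustrative training texts and gender labels only.
texts = ["I love this wonderful day so much",
         "experienced engineer and project manager",
         "such a wonderful and great evening",
         "senior engineer and people manager"]
labels = ["F", "M", "F", "M"]

features = FeatureUnion([
    ("words", CountVectorizer()),                          # content-based
    ("tags", CountVectorizer(preprocessor=to_tag_string,   # style-based
                             token_pattern=r"\S+",
                             ngram_range=(1, 2))),
])
model = Pipeline([("features", features), ("svm", LinearSVC())])
model.fit(texts, labels)
print(model.predict(["wonderful great day"]))
```

`FeatureUnion` simply concatenates the two sparse feature matrices, so the SVM weighs word-choice and tag-pattern evidence jointly, mirroring the combination of content-based and style-based features the study relies on.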
Key Assumptions Stated by Authors
This paper stands out in comparison to others due to the authors' awareness of the limitations of the gender binary. At the end of the paper, the authors include a “caveat” which explicitly states that they did not apply a nuanced understanding of gender to their work. Furthermore, the authors discuss privacy and attribute disclosure. Arguably, the success of their LinkedIn study shows that even in scenarios where the privacy of the user is kept intact, there are ways to obtain that information through textual means.
The caveat also contextualises the paper's references to cognitive psychology and its sociolinguistic assumptions that gender identities inform behavioural patterns and that men and women tend to conform to gender roles, e.g. that women tend to express their thoughts more passionately or use more pleasant diction. While the caveat rationalises these assumptions, a justification for continuing to use this approach, beyond convenience, remains absent. Thus, the model and its algorithms continue to perpetuate these social conventions.
With regard to privacy concerns, rather than questioning the ethics of big data research, the authors simply strive to extract private user data automatically. In certain instances, such as a criminal investigation, this may be useful; however, a line needs to be drawn between useful big data research and the right to privacy (especially if the platforms follow certain privacy regulations).
The quality of the training dataset and the classifier features are essential to the success of this study. Here, accuracy is defined as the number of tweets/summaries correctly classified divided by the total number of tweets/summaries. The rate of accuracy increases with the number of words; therefore, the LinkedIn summary tests outperformed the tweet tests, although both achieved accuracy rates above 90%.
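The accuracy measure used here is simply the fraction of correct predictions. A quick sketch, with invented predicted and actual labels for illustration:

```python
# Accuracy as defined in the study: correctly classified messages
# divided by all classified messages. Labels below are illustrative.
predicted = ["M", "F", "F", "M", "F"]
actual    = ["M", "F", "M", "M", "F"]

correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)
print(accuracy)  # → 0.8 (4 of 5 correct)
```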
The outstanding success of this study would not translate to text data in other languages, or to features for which substantive training data is unavailable.