‘Say it with Colors: Language-Independent Gender Classification on Twitter’ by Jalal S. Alowibdi, Ugo A. Buy, & Philip S. Yu (2014)

Alowibdi J. S., Buy U. A., & Yu P. S.: Say it with colors: language-independent gender classification on Twitter. In: Kawash J. (ed.) Online Social Media Analysis and Visualization. Springer, pp. 47-62 (2014)

URL: https://link.springer.com/chapter/10.1007/978-3-319-13590-8_3#citeas

Abstract

Online Social Networks (OSNs) have spread at stunning speed over the past decade. They are now a part of the lives of dozens of millions of people. The onset of OSNs has stretched the traditional notion of community to include groups of people who have never met in person but communicate with each other through OSNs to share knowledge, opinions, interests and activities. Here we explore in depth language independent gender classification. Our approach predicts gender using five color-based features extracted from Twitter profiles such as the background color in a user’s profile page. This is in contrast with most existing methods for gender prediction that are language dependent. Those methods use high-dimensional spaces consisting of unique words extracted from such text fields as postings, user names, and profile descriptions. Our approach is independent of the user’s language, efficient, scalable, and computationally tractable, while attaining a good level of accuracy.

Critical Annotation

Alowibdi et al. explore gender classification methods that are independent of language. Using color-based features extracted from Twitter profiles, they attempt to infer gender from a user’s color preferences. They harvest color data from Twitter profiles in 34 different languages, specifically the five fields that let users choose colors for their profiles: (1) background color, (2) text color, (3) link color, (4) sidebar fill color and (5) sidebar border color. They normalize this data through a quantization and sorting procedure that reduces the number of distinct color features, then apply several classifiers to the resulting dataset to test the validity of their approach. They also test different subsets of the data to account for the predefined designs available in the Twitter user setup. The highest accuracy comes from a Naive Bayes/decision tree hybrid, with the best results achieved when profiles using the 19 predefined designs are excluded. Although a large portion of Twitter users never change the default colors of their profile, the authors conclude that color features can infer gender with reasonable accuracy.
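
To make the quantization idea concrete, the following minimal Python sketch shows one way the five profile colors could be reduced to a small set of feature values. It is an illustration only: the bit depth (3 bits per RGB channel), the field names in FIELDS, and the helper names quantize_hex and profile_features are assumptions made for this example and are not taken from the chapter, and the authors' sorting step is not reproduced.

```python
# Hypothetical field names for the five color settings listed in the annotation.
FIELDS = ("background", "text", "link", "sidebar_fill", "sidebar_border")


def quantize_hex(hex_color, bits=3):
    """Map a 24-bit hex color such as 'FF0000' to a coarse integer bucket.

    Assumption: keep only the top `bits` bits of each RGB channel, so
    bits=3 yields 2**9 = 512 possible buckets instead of 2**24 colors.
    """
    hex_color = hex_color.lstrip("#")
    if len(hex_color) != 6:
        raise ValueError(f"expected a 6-digit hex color, got {hex_color!r}")
    # Split the string into its red, green, and blue byte values.
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    shift = 8 - bits
    # Drop the low-order bits of each channel, then pack the three
    # reduced channels into a single integer.
    return ((r >> shift) << (2 * bits)) | ((g >> shift) << bits) | (b >> shift)


def profile_features(profile, bits=3):
    """Turn a dict of five hex colors into a fixed-order feature vector."""
    return [quantize_hex(profile[field], bits) for field in FIELDS]
```

A feature vector built this way has only five low-cardinality values per user, which is what makes the approach cheap compared with word-based feature spaces; any standard classifier could then be trained on such vectors.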