Liu, W., & Ruths, D.: What’s in a name? Using first names as features for gender inference in Twitter. In: AAAI Spring Symposium Series (2013)
Despite significant work on the problem of inferring a Twitter user's gender from her online content, no systematic investigation has been made into leveraging the most obvious signal of a user's gender: first name. In this paper, we perform a thorough investigation of the link between gender and first name in English tweets. Our work makes several important contributions. The first and most central contribution is two different strategies for incorporating the user's self-reported name into a gender classifier. We find that this yields a 20% increase in accuracy over a standard baseline classifier. These classifiers are the most accurate gender inference methods for Twitter data developed to date. In order to evaluate our classifiers, we developed a novel way of obtaining gender-labels for Twitter users that does not require analysis of the user's profile or textual content. This is our second contribution. Our approach eliminates the troubling issue of a label being somehow derived from the same text that a classifier will use to infer the label. Finally, we built a large dataset of gender-labeled Twitter users and, crucially, have published this dataset for community use. To our knowledge, this is the first gender-labeled Twitter dataset available for researchers. Our hope is that this will provide a basis for comparison of gender inference methods.
In this paper, Liu and Ruths study the incremental value of using a Twitter user’s first name as a feature for gender inference. They do so by designing and implementing two gender inference methods that use the user’s self-reported first name in different ways. Both methods rely on a gender-name association score derived from the name distributions collected in the US census. The score reflects the observation that some names are highly specific to females or to males, while others are common to both genders. In the first method, the score is included as a feature and integrated into a classifier based on text features; in the second, the score is used as a threshold that determines whether or not to follow up with a classifier (different thresholds were tested). The authors found that both methods that incorporate name information outperform the baseline classifier, which has no name features. The threshold approach yielded only a modest improvement in accuracy over the integrated method. The authors explain this by suggesting that the majority of names are either strongly gendered or unknown.
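To make the two strategies concrete, the following is a minimal sketch and not the authors' implementation. It assumes the association score takes the form (M - F) / (M + F), where M and F are male and female counts for a name, so that values near +1 or -1 indicate strongly gendered names. The toy counts, the function names, the threshold value `tau`, and the choice of 0.0 for unknown names are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical toy (male, female) counts; these are NOT real census figures.
NAME_COUNTS: Dict[str, Tuple[int, int]] = {
    "james": (990, 10),
    "mary": (5, 995),
    "taylor": (450, 550),
}


def association_score(name: str) -> Optional[float]:
    """Assumed score in [-1, 1]: +1 is male-only, -1 is female-only.

    Returns None when the name is not in the lookup table.
    """
    counts = NAME_COUNTS.get(name.lower())
    if counts is None:
        return None
    male, female = counts
    return (male - female) / (male + female)


def integrated_features(name: str, text_features: List[float]) -> List[float]:
    """Integrated strategy: append the score to the text feature vector.

    Unknown names are mapped to 0.0 (an assumption made for this sketch).
    """
    score = association_score(name)
    return text_features + [0.0 if score is None else score]


def threshold_classify(
    name: str,
    text_features: List[float],
    fallback: Callable[[List[float]], str],
    tau: float = 0.9,  # hypothetical threshold value
) -> str:
    """Threshold strategy: trust the name when it is strongly gendered,
    otherwise defer to a text-based classifier passed in as `fallback`."""
    score = association_score(name)
    if score is not None and abs(score) >= tau:
        return "M" if score > 0 else "F"
    return fallback(text_features)
```

In this sketch a strongly gendered name such as "Mary" is labeled from the name alone, while an ambiguous name such as "Taylor" or a name missing from the table is passed to the text-based classifier. If most names are strongly gendered or unknown, as the authors suggest, the two strategies would often agree, which is consistent with the small gap reported between them.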