Tang C., Ross K., Saxena N., & Chen R. (2011). What’s in a Name: A Study of Names, Gender Inference, and Gender Behavior in Facebook. In: Xu J., Yu G., Zhou S., Unland R. (eds) Database Systems for Advanced Applications. DASFAA 2011. Lecture Notes in Computer Science, vol 6637. Springer, Berlin, Heidelberg.
In this paper, by crawling Facebook public profile pages of a large and diverse user population in New York City, we create a comprehensive and contemporary first name list, in which each name is annotated with a popularity estimate and a gender probability. First, we use the name list as part of a novel and powerful technique for inferring Facebook users’ gender. Our name-centric approach to gender prediction partitions the users into two groups, A and B, and is able to accurately predict genders for users belonging to A. Applying our methodology to NYC users in Facebook, we are able to achieve an accuracy of 95.2% for group A, consisting of 95.1% of the NYC users. This is a significant improvement over recent results of gender prediction, which achieved a maximum accuracy of 77.2% based on users’ group affiliations. Second, having inferred the gender of most users in our Facebook dataset, we learn several interesting gender characteristics and analyze how males and females behave in Facebook. We find, for example, that females and males exhibit contrasting behaviors while hiding their attributes, such as gender, age, and sexual preference, and that females are more conscious about their online privacy on Facebook.
Identifying the Dataset
In July of 2009, the authors created a crawler to visit Facebook profiles and compile a list of all users in New York City. The crawler did this by finding associated geographical networks and the friends/mutual friends of each user. At the time, this location information was available and there were a total of 2 million Facebook users in the city. Out of the 2 million users, 1.67 million were crawled. After eliminating users who were inactive, the dataset amounted to 1,282,563 users, of which 679,351 made their gender known. Of the 679,351, 52.97% were male.
The methods are split into three categories. The first, the “Facebook-generated name list,” annotates each name with a popularity estimate and a gender probability based on newborn-baby-name and name-trend studies, as well as the gender stated on user profiles. All names that are not “real names” (e.g. names that have no vowels, or that are only one letter) were removed from the list, leaving a total of 23,363 names. After accounting for canonical names, each name was assigned a high gender probability (94%).
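The construction of such an annotated list can be sketched as follows. This is a minimal illustration of the idea, not the authors’ implementation: the filtering rules, field names, and toy profile records here are assumptions.

```python
from collections import Counter, defaultdict

def build_name_list(profiles):
    """Annotate each first name with a popularity estimate and a gender
    probability, in the spirit of the Facebook-generated name list.
    `profiles` is an iterable of (first_name, stated_gender) pairs."""
    counts = defaultdict(Counter)
    for name, gender in profiles:
        name = name.strip().lower()
        # Drop strings unlikely to be real names (one letter, no vowels).
        if len(name) < 2 or not any(v in name for v in "aeiouy"):
            continue
        counts[name][gender] += 1

    total = sum(sum(c.values()) for c in counts.values())
    name_list = {}
    for name, c in counts.items():
        n = sum(c.values())
        name_list[name] = {
            "popularity": n / total,   # share of all retained users
            "p_male": c["male"] / n,   # gender probability for the name
        }
    return name_list

names = build_name_list([
    ("John", "male"), ("John", "male"), ("Mary", "female"),
    ("Pat", "male"), ("Pat", "female"), ("X", "male"),
])
```

On this toy input, "X" is filtered out as a non-name, "John" comes out fully male, and "Pat" is ambiguous with a gender probability of 0.5.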
The second category is “Name-Centric Gender Inference.” After splitting the dataset into two groups, a comparison of group-affiliation (77.2% accuracy) and name-based (95.2% accuracy) techniques showed a dramatic difference in accuracy. The gender predictors include single and hybrid predictors ranging from offline name lists (USA baby names), the annotated Facebook list mentioned above, local-information predictors, and friend-information predictors, to combinations of the above. Each predictor was assigned a corresponding feature vector, split between training and test datasets. Classifiers were built using the Weka toolkit; depending on the predictor, either a Multinomial Naive Bayes or a decision-tree classifier was used.
Thirdly, “Gender Behaviour and Characteristics” contributes to the authors’ ability to draw behavioural trends. For example, the authors find that females are more conscious about their online privacy and more likely to hide their demographic information than males.
Key Assumptions Stated by Authors
Even when users have privacy settings in place, statistical techniques and machine learning can still infer attributes that are meant to be private. Furthermore, third-party crawlers, and Facebook itself, could buy and sell this information for advertising, revenue, or even malicious purposes.
Although less information is publicly available on Facebook by default, the variety of predictors used in this experiment demonstrates that obtaining names and friend lists can still be useful and fruitful in making gender predictions.
Regarding methods, unlabeled and ambiguous names, such as unregistered or unisex names, would be difficult to classify with the generated name database alone. Such ambiguous names must therefore be classified via machine learning techniques.
Throughout the paper, with the exception of some assumptions stated by the authors, privacy concerns mostly revolve around whether males or females are more reluctant to share their latent attributes, as well as the implications of this for targeted advertising. Despite expressing concern about the potential for malicious actors to cause harm, the significance of user privacy and data ownership is barely acknowledged. In fact, the choice to remain private is used as an indicator for gender. Not only does this contribute to perpetuating binary stereotypes, but it also weaponises users’ privacy choices against them, by becoming the very signal through which they can be identified. This is something the authors should consider more seriously from an ethics framework.
The effectiveness of each predictor was evaluated against accuracy rates as well as the fraction of users covered at various confidence thresholds. In addition to measuring the accuracy of their results based on predictors and group comparisons, the authors extrapolate behaviours and preferences of each gender on Facebook. Understandably, when dealing with datasets this large, it is wise to limit geographic areas as a control variable. Therefore, a similar approach could be used in other localised contexts.
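The evaluation described above pairs two quantities: accuracy over the users a predictor is confident enough to label, and coverage, the fraction of all users it labels. A simplified sketch of that measurement, under the assumption that uncovered users are marked `None`, not the authors’ exact procedure:

```python
def accuracy_and_coverage(predictions, truths):
    """Accuracy is computed only over the users the predictor covers
    (group A); coverage is the fraction of all users it labelled.
    Raising the confidence threshold typically trades coverage for
    accuracy."""
    labeled = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    coverage = len(labeled) / len(predictions)
    accuracy = (sum(p == t for p, t in labeled) / len(labeled)
                if labeled else 0.0)
    return accuracy, coverage

acc, cov = accuracy_and_coverage(
    ["male", None, "female", "male"],      # None = uncovered (group B)
    ["male", "female", "female", "female"],
)
```

Here three of four users are covered (coverage 0.75), and two of those three labels are correct (accuracy 2/3), mirroring the paper’s accuracy-for-a-fraction-of-users reporting style.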
However, the methodology does not effectively account for names (such as uncommon English names) that would lack ground-truth data within the external databases consulted. If the names that were listed were mostly common names, then it is likely that the friends and friends of friends of those users would also have networks with relatively few uncommon names. This may leave out a large portion of the population and contribute to the cycle of machine classifiers being trained on a limited and narrow understanding of gender in relation to names. Additionally, the authors note that Facebook privacy policies have changed; the applicability of this method is therefore highly vulnerable to company policies, and the study may not be possible to replicate (or may be difficult to track because of the increasing complexity of algorithms) as a result of ongoing changes.