‘Private Traits and Attributes are Predictable from Digital Records of Human Behavior’ by Michal Kosinski, David Stillwell, & Thore Graepel (2013)

Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110(15), 5802-5805.

URL: https://doi.org/10.1073/pnas.1218772110

Abstract

We show that easily accessible digital records of behavior, Facebook Likes, can be used to automatically and accurately predict a range of highly sensitive personal attributes including: sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parental separation, age, and gender. The analysis presented is based on a dataset of over 58,000 volunteers who provided their Facebook Likes, detailed demographic profiles, and the results of several psychometric tests. The proposed model uses dimensionality reduction for preprocessing the Likes data, which are then entered into logistic/ linear regression to predict individual psychodemographic profiles from Likes. The model correctly discriminates between homosexual and heterosexual men in 88% of cases, African Americans and Caucasian Americans in 95% of cases, and between Democrat and Republican in 85% of cases. For the personality trait “Openness,” prediction accuracy is close to the test–retest accuracy of a standard personality test. We give examples of associations between attributes and Likes and discuss implications for online personalization and privacy.

Critical Annotation

Identifying the Dataset

Because Facebook likes are publicly available, the study uses them to examine a variety of traits and attributes and to gauge how accurately predictive models can glean personal information and how intrusive such inference can be. Gender and sexual orientation were verified via user profiles. The experimental datasets included individuals whose profiles contained between 1 and 700 likes, with a median of 68 likes. Gender was examined for 57,505 profiles, of which 62% belonged to female users. Sexual orientation (limited to heterosexual and homosexual labels) was determined from the “interested in” field on individual profiles.

Methods

To handle sparse data in which the number of likes varies widely between users, the authors ran predictive models on 500 users who each had at least 300 likes, using randomly selected subsets of those likes. These trial experiments showed that even a few likes provide enough information for accurate predictions; each additional like improves prediction accuracy, but with diminishing returns.
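The subsampling step described above can be sketched roughly as follows. This is an illustration only, not the authors' code: the function name `random_like_subsets`, the placeholder like IDs, and the chosen subset sizes are all assumptions made for the example.

```python
import random


def random_like_subsets(likes, sizes, seed=0):
    """Draw one random subset of a user's likes for each requested size.

    `likes` is a list of like identifiers for a single user; `sizes` is an
    iterable of subset sizes. Both are hypothetical inputs for illustration.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    subsets = {}
    for n in sizes:
        if n > len(likes):
            raise ValueError(f"user has only {len(likes)} likes, cannot draw {n}")
        # Sample without replacement, so a like appears at most once per subset.
        subsets[n] = rng.sample(likes, n)
    return subsets


# Hypothetical user with exactly 300 likes, named by placeholder IDs.
user_likes = [f"like_{i}" for i in range(300)]
subsets = random_like_subsets(user_likes, sizes=[1, 10, 50, 300])
```

Each subset would then be passed to the predictive model, and accuracy plotted against subset size to see where additional likes stop helping much.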

Both gender and sexual orientation are treated as dichotomous variables, and the results present them as such. Prediction accuracy for these variables is reported as the area under the receiver-operating characteristic curve (AUC). The AUC equals the probability that the model correctly distinguishes between two randomly chosen users, one from each class, and therefore serves as the measure of accuracy.
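To make the AUC measure concrete, here is a minimal sketch that computes it directly from its pairwise interpretation: the share of (class 1, class 0) user pairs in which the class 1 user receives the higher score. The function name `pairwise_auc` and the scores and labels are invented for illustration; they are not data or code from the paper.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    `scores` are model outputs (higher means "more likely class 1");
    `labels` are 1 or 0. Tied scores count as half a correct pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one user from each class")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Made-up scores for five hypothetical users (three in class 1, two in class 0).
scores = [0.9, 0.7, 0.4, 0.6, 0.2]
labels = [1, 1, 1, 0, 0]
auc = pairwise_auc(scores, labels)  # 5 of the 6 pairs are ranked correctly
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so a figure such as 0.93 describes how well the two classes are separated rather than the share of profiles labelled correctly at a fixed threshold.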

Key Assumptions Stated by Authors

The authors are cognisant of the tension between the vastness of the digital records of individuals’ online behaviour available for consumer and citizen research, and the necessity of the right to privacy and data ownership. If a company mines data to run effective ad campaigns, both targeting and mis-targeting ads can invade privacy and harm people’s day-to-day lives. Because a digital presence has become necessary for most essential and non-essential services in many places, it is difficult for folks to keep track of what information about themselves exists in cyberspace. Given the pervasiveness of digital IDs, it may be easy for others to draw connections and make predictions, even if one chooses not to explicitly state or express a certain aspect of one’s identity. The lack of consent in conducting research on and mining this data is apparent. The authors claim that there is hope in transparency as a solution, and in placing control of information and data ownership in the hands of users rather than platforms and companies.

Inferred Assumptions

The fact that gender and sexual orientation are automatically modelled as dichotomous variables is the first red flag. Granted, a lot of variables are being examined here; however, if this breadth compromises how well some variables are addressed, the number of factors considered should be limited. Like the article on demographic prediction via smartphone applications, the authors of this paper acknowledge the tensions between privacy, data ownership, and the protection of users’ identities and their ability to express them. Even after acknowledging this, the authors fail to provide any information (in the methods section or elsewhere) about how they managed the ethics of their own experiment - especially since they are dealing with sensitive traits and attributes, many of which are difficult to quantify. Furthermore, there are very few steps between likes, categories, and the traits or characteristics extrapolated from them, signalling that the data processing and analysis can and should be more robust.

Evaluating Results

Measured by AUC, the model discriminated between male and female users in 93% of cases. For sexual orientation, predicting homosexuality among males was easier than predicting it among females. According to the authors, distinguishable attributes such as gender or sexual orientation are easier to correlate with likes than attributes such as intelligence. However, in the case of heterosexual men, the authors acknowledge the caveat that correlations must be drawn from indirect or non-explicit likes. If correlations and accuracy are partly evaluated on the basis of ease, it is unlikely that this approach would be relevant in social and cultural contexts where gendered conventions do not follow Western norms. Moreover, the breadth of the attributes studied in this paper makes it difficult to focus on the accuracy and depth of any single attribute. The study may show that a lot of information can come from likes, but this information (and its accuracy) is largely contingent upon the particularities of this experiment - the scalability of this exact method, temporally and sectorally, is questionable.