Schwartz, H. A., et al. (2013). Personality, gender, and age in the language of social media: The open-vocabulary approach. PLoS ONE, 8(9), e73791.
We analyzed 700 million words, phrases, and topic instances collected from the Facebook messages of 75,000 volunteers, who also took standard personality tests, and found striking variations in language with personality, gender, and age. In our open-vocabulary technique, the data itself drives a comprehensive exploration of language that distinguishes people, finding connections that are not captured with traditional closed-vocabulary word-category analyses. Our analyses shed new light on psychosocial processes yielding results that are face valid (e.g., subjects living in high elevations talk about the mountains), tie in with other research (e.g., neurotic people disproportionately use the phrase ‘sick of’ and the word ‘depressed’), suggest new hypotheses (e.g., an active life implies emotional stability), and give detailed insights (males use the possessive ‘my’ when mentioning their ‘wife’ or ‘girlfriend’ more often than females use ‘my’ with ‘husband’ or ‘boyfriend’). To date, this represents the largest study, by an order of magnitude, of language and personality.
Identifying the Dataset
The dataset described in the abstract above was drawn from a larger corpus of 19 million Facebook status updates written by 136,000 volunteers. It was narrowed down by the following criteria: volunteers had to list English as their primary language, have at least 1,000 words in their status updates, be under 65 years old, and clearly indicate their gender and age.
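The filtering step can be sketched as follows; the record fields are illustrative assumptions, not the paper's actual data schema:

```python
# Hypothetical volunteer records; field names are illustrative, not the paper's schema.
volunteers = [
    {"id": 1, "primary_language": "en", "word_count": 4200, "age": 23, "gender": "female"},
    {"id": 2, "primary_language": "en", "word_count": 650,  "age": 30, "gender": "male"},
    {"id": 3, "primary_language": "de", "word_count": 9000, "age": 41, "gender": "female"},
    {"id": 4, "primary_language": "en", "word_count": 1500, "age": 70, "gender": "male"},
    {"id": 5, "primary_language": "en", "word_count": 2100, "age": 35, "gender": None},
]

def passes_filters(v):
    """Apply the stated inclusion criteria: English as primary language,
    at least 1,000 words of status updates, under 65, gender recorded."""
    return (
        v["primary_language"] == "en"
        and v["word_count"] >= 1000
        and v["age"] < 65
        and v["gender"] is not None
    )

kept = [v for v in volunteers if passes_filters(v)]
print([v["id"] for v in kept])  # → [1]
```

Each record here fails or passes on exactly one criterion, mirroring how the original pool of 136,000 volunteers was reduced.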
Aiming to go beyond computational-linguistic contributions, this experiment compares open- and closed-vocabulary approaches in the hope of providing insights into individual characteristics. Broadly, the steps of Differential Language Analysis (DLA), the open-vocabulary method, are: collecting volunteer data, extracting linguistic features, correlating words, phrases, and topics with gender, personality, location, and age, and visualising the results. Feature extraction is mechanised via an emoticon-aware tokeniser, collocation filters (to retain meaningful multi-word phrases), and Latent Dirichlet Allocation (to derive topics). The closed-vocabulary method, by contrast, counts how frequently an individual uses words from predefined categories, and relates the percentage of that individual's words falling in each category to outcomes via least squares regression analysis.
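The core of the correlation step can be sketched with toy data: the relative frequency of a target token per user, correlated with a trait score. This is a simplified stand-in for the paper's full pipeline (which controls for covariates and corrects for multiple comparisons); the texts and scores below are invented for illustration:

```python
import math

# Toy data: per-user status text and an invented extraversion score.
users = [
    ("party tonight was amazing party again soon", 4.5),
    ("reading a quiet book at home", 2.0),
    ("party with friends then more party", 4.8),
    ("home alone watching documentaries", 1.8),
]

def relative_freq(text, token):
    """Share of a user's words that are the target token."""
    words = text.split()
    return words.count(token) / len(words)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

freqs = [relative_freq(text, "party") for text, _ in users]
scores = [score for _, score in users]
r = pearson(freqs, scores)
print(round(r, 3))  # strong positive correlation on this toy data
```

DLA repeats a computation of this shape for every word, phrase, and topic in the open vocabulary, then keeps and visualises the features whose correlations survive significance testing.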
Key Assumptions Stated by Authors
Situating the social sciences within the “age of data science,” the authors emphasise the significance of quantifying, measuring, and evaluating differences and similarities between groups of people via the language they use on social media. Acknowledging the dynamic nature of language, the authors compare determining latent attributes through a priori (fixed) word sets with DLA, which analyses the words actually present in a given dataset. Because this study extends beyond gender to personality, the potential to draw rich psychological insights from any given word set is especially desirable.
The premise of this study rests on the assumption that language is the most common and reliable means of communicating internal thoughts. While this may be true in some instances, it is not always the case on social media. For individuals whose genders exist outside the binary, it is unlikely (depending on the safety of the social media platform, their immediate political and cultural contexts, etc.) that they would communicate their truest and most internal thoughts via language. Furthermore, language is often coded and can carry meaning beyond its semantic definition, a common occurrence on online platforms. In this case, the dataset only considers those who have clearly marked their gender; no consideration is therefore made for safety, and those who are uncomfortable with, or not represented by, the gender options on Facebook are automatically excluded.
This paper can be evaluated against the contributions its authors propose. The size of the dataset, together with the incorporation of phrases and topics along with words, makes this the largest study of personality and language. Because of this scale, the authors were able to correlate use of the possessive ‘my’ ahead of relationship words (e.g., girlfriend, brother, wife) with what they identify as gendered behaviours. In terms of evaluating the results themselves, model accuracy was determined by out-of-sample prediction indicators. The DLA model offers more accurate predictions on out-of-sample datasets and yields high accuracy for gender prediction specifically, at 91.9%. In addition to supporting their hypothesis that DLA would perform better, the results of this experiment can serve as datasets for further analysis and creative output (e.g., word clouds). While the authors can attribute these contributions to their work, they should further contextualise the significance of their results.
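Out-of-sample accuracy of the kind reported (91.9% for gender) is computed by scoring a trained model's predictions on users held out from training. A minimal sketch of that scoring step, with invented labels rather than the paper's data:

```python
# Illustrative held-out gender labels and model predictions (invented, not the paper's data).
held_out_truth = ["f", "m", "f", "f", "m", "m", "f", "m", "f", "m"]
predictions    = ["f", "m", "f", "m", "m", "m", "f", "m", "f", "m"]

def accuracy(truth, preds):
    """Fraction of held-out users whose predicted label matches the true one."""
    correct = sum(t == p for t, p in zip(truth, preds))
    return correct / len(truth)

print(accuracy(held_out_truth, predictions))  # → 0.9
```

Because the scored users contributed nothing to training, this metric estimates how the model generalises rather than how well it memorises, which is why out-of-sample indicators are the appropriate basis for the authors' accuracy claims.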
Furthermore, the authors take the connection between data science and the social sciences for granted: integrating quantitative methods involves not only using different datasets, but also asking why those datasets are being used, who they represent, and what implications the experiment could have in reproducing data biases. DLA may provide more accurate results; however, it does not question the correlations themselves or the stereotypes and narratives they perpetuate.