Bamman, D., Eisenstein, J., & Schnoebelen, T. (2014). Gender identity and lexical variation in social media. Journal of Sociolinguistics, 18(2), 135-160.
We present a study of the relationship between gender, linguistic style, and social networks, using a novel corpus of 14,000 Twitter users. Prior quantitative work on gender often treats this social variable as a female/male binary; we argue for a more nuanced approach. By clustering Twitter users, we find a natural decomposition of the dataset into various styles and topical interests. Many clusters have strong gender orientations, but their use of linguistic resources sometimes directly conflicts with the population‐level language statistics. We view these clusters as a more accurate reflection of the multifaceted nature of gendered language styles. Previous corpus‐based work has also had little to say about individuals whose linguistic styles defy population‐level gender patterns. To identify such individuals, we train a statistical classifier, and measure the classifier confidence for each individual in the dataset. Examining individuals whose language does not match the classifier's model for their gender, we find that they have social networks that include significantly fewer same‐gender social connections and that, in general, social network homophily is correlated with the use of same‐gender language markers. Pairing computational methods and social theory thus offers a new perspective on how gender emerges as individuals position themselves relative to audiences, topics, and mainstream gender norms.
Identifying the Dataset
The authors use Twitter data for several reasons: the data is publicly available through Twitter’s API, the format allows for mass data collection, and Twitter has a wide reach across various strata of society (ethnicity, gender, income level). The dataset comprises 14,464 Twitter users (9,212,118 tweets) gathered over a 6 month period, January-July 2011, and is restricted to users in the United States who use common English terms. Moreover, each individual in the dataset must be actively engaging with their network, and the ties in that network must be mutual (determined via mutual mentions on Twitter). Only individuals with 4-100 mutual friends were selected. Within this dataset, gender is determined by first name. According to the authors, 95% of users have a name that is at least 85% associated with its majority gender. This property of the dataset matters for the later discussion and for the accuracy evaluations, because the name-derived label is what the classifier's predictions are compared against.
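The selection criteria described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy mention pairs, and the `name_counts` lookup table are all hypothetical, and the review does not say which name records the authors used. The sketch only shows how a mutual-mention filter and a name-based majority-gender share could be computed.

```python
from collections import defaultdict

def mutual_friends(mentions):
    """Map each user to the set of users they exchange mentions with.

    `mentions` is an iterable of (author, mentioned_user) pairs (assumed format).
    A tie counts as mutual only if mentions exist in both directions.
    """
    directed = set(mentions)
    mutual = defaultdict(set)
    for author, target in directed:
        if author != target and (target, author) in directed:
            mutual[author].add(target)
    return dict(mutual)

def select_users(mentions, min_friends=4, max_friends=100):
    """Keep users whose mutual-friend count lies in [min_friends, max_friends]."""
    mutual = mutual_friends(mentions)
    return {
        user for user, friends in mutual.items()
        if min_friends <= len(friends) <= max_friends
    }

def majority_gender(name_counts, first_name):
    """Return (majority label, share of that label) for a first name, or None.

    `name_counts` is a hypothetical table: name -> {"F": count, "M": count}.
    """
    counts = name_counts.get(first_name.lower())
    if not counts:
        return None
    total = sum(counts.values())
    label, n = max(counts.items(), key=lambda item: item[1])
    return label, n / total

def share_of_users_above(name_counts, first_names, threshold=0.85):
    """Fraction of users whose name is at least `threshold` associated with its majority gender."""
    shares = [majority_gender(name_counts, name) for name in first_names]
    known = [s for s in shares if s is not None]
    return sum(1 for _, share in known if share >= threshold) / len(known)

# Toy data, invented for illustration only.
toy_mentions = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"), ("a", "d"), ("b", "c")]
toy_names = {"mary": {"F": 99, "M": 1}, "alex": {"F": 30, "M": 70}}

selected = select_users(toy_mentions, min_friends=2, max_friends=100)
coverage = share_of_users_above(toy_names, ["Mary", "Alex"])
```

On the real dataset the thresholds would be the 4-100 range and the 85% association figure reported in the paper; the toy call lowers `min_friends` only because the example network is tiny.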
The authors' methods aim to address two gaps in previous quantitative studies. The first method clusters users by linguistic style across the population of individuals in the dataset. Clustering makes it possible to examine aggregate correlations between language and gender, and to understand cases in which a group's language conflicts with the aggregate findings. The second method pinpoints individuals who do not conform to population-level gender statistics: the authors train a statistical classifier to predict gender from language, measure its confidence for each user, and examine the users whose language does not match the classifier's model for their gender. Here, the authors find that homophily plays a large role in the extent to which users adopt mainstream gendered language.
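The second step, relating classifier confidence to network homophily, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the classifier itself is left out, and the sketch takes a hypothetical per-user probability `p_female` as given. All names and the toy values are invented.

```python
from math import sqrt

def label_confidence(p_female, label):
    """Probability the classifier assigns to the user's name-derived label ("F" or "M")."""
    return p_female if label == "F" else 1.0 - p_female

def same_gender_share(user, friends, gender):
    """Fraction of a user's labeled mutual friends who share the user's gender label."""
    labeled = [f for f in friends if f in gender]
    if not labeled:
        return None
    same = sum(1 for f in labeled if gender[f] == gender[user])
    return same / len(labeled)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy data, invented for illustration only.
gender = {"a": "F", "b": "F", "c": "M", "d": "M"}
network = {"a": {"b", "c"}, "b": {"a"}, "c": {"a", "d"}, "d": {"c"}}
p_female = {"a": 0.6, "b": 0.9, "c": 0.4, "d": 0.1}  # assumed classifier output

users = sorted(gender)
confidences = [label_confidence(p_female[u], gender[u]) for u in users]
homophily = [same_gender_share(u, network[u], gender) for u in users]
r = pearson(confidences, homophily)
```

A positive `r` on real data would correspond to the pattern the authors report: users whose language the classifier confidently matches to their gender tend to have more same-gender connections. The toy values are constructed to show that pattern and carry no empirical weight.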
Key Assumptions Stated by Authors
The authors acknowledge and address how fixating on binaries, and assuming that all individuals conform to linguistic gender norms, narrows the conclusions a study can reach. One of the most significant assumptions the authors point out is that the complex definitions and roles of gender within personal identity pose problems for anyone pursuing a quantitative analysis of gender. Gender, by definition, is determined by social categories and performances and varies across contexts. The authors conclude that studying the social construction of gender is a qualitative undertaking; however, quantitative analysis can examine these phenomena at a much larger scale, and exploratory quantitative analysis can lay the groundwork for qualitative analysis of the meaning and depth of the findings.
The authors' focus on the relationship between quantitative and qualitative analysis is distinctive. Additionally, I would encourage the authors to unpack their understanding of homophilous networks, the implications of studying them, and the intersections at which such networks might exist.
The authors conclude that predictive models “permit exploratory analysis that reveals patterns and associations that might have been rendered invisible by less hypothesis-driven analysis.” Unlike many other papers, this one recognizes the descriptive inadequacy that necessarily burdens predictive models. Despite these pitfalls, the authors are optimistic that collaboration between quantitative and qualitative analysis can expand the scale of qualitative work and improve the ability to identify patterns. These conclusions are valuable across a variety of contexts and sectors and deserve serious consideration in discussions of data and development.