Cesare, Nina & Grant, Christan & Nsoesie, Elaine. (2017). Detection of User Demographics on Social Media: A Review of Methods and Recommendations for Best Practices.
URL: https://oudalab.github.io/papers/cesare2017detection.pdf
Researchers in fields such as sociology, demography and public health have used data from social media to explore a diversity of questions. In public health, researchers use data from social media to monitor disease spread, assess population attitudes toward health-related issues, and to better understand the relationship between behavioral changes and population health. However, a major limitation of the use of these data for population health research is a lack of key demographic indicators such as age, race and gender. Several studies have proposed methods for automated detection of social media users’ demographic characteristics. These range from facial recognition to classic supervised and unsupervised machine learning methods. We seek to provide a review of existing approaches to automated detection of demographic characteristics of social media users. We also address the applicability of these methods to public health research, focusing on the challenge of working with highly dynamical, large-scale data to study health trends. Furthermore, we provide an overview of work that emphasizes scalability and efficiency in data acquisition and processing, and make best practice recommendations.
Identifying the Dataset
This paper conducts a critical review of existing methods, their limitations, and how they can be improved. The aim of this review is to understand whether and how Big Data research can contribute to and advance public health research. The curated studies focus on social media sites and analyse methods such as SVMs, neural networks, and maximum entropy models.
Methods
In October 2016, the authors searched Google Scholar using keywords such as “classification,” “machine learning,” and “gender.” From the first 20 pages of results, they selected 60 relevant articles published between 2006 and 2016 based on title and summary. Most of these articles (47 out of 60) predicted gender.
Key Assumptions Stated by Authors
The authors acknowledge a risk implicit within big data research: the perpetuation of inequality. They explicitly state that “big data research raises questions around truth, power, and control.” Because of these inequities, the authors implore researchers to consider how they can ensure that they are conducting scientific and replicable research in an ethical fashion.
One of the major ethical concerns is privacy. Because participants are active on social media or networking sites that are technically public, researchers often believe that no additional considerations are needed when using user data. Of course, these users have chosen and agreed to take part in the platform, but nobody signs up for a platform as a means of consenting to being part of a multitude of research studies.
Unlike many other studies, this paper makes visible an assumption common to most experiments: that gender can be identified fairly easily and that people manifest it in a predictable, normative, and binary way.
In addition to these necessary considerations, the authors suggest that there are benefits to this kind of research when it is done using thoughtful, automated, and low-cost methods.
Inferred Assumptions
With public health research as their focus, the authors assume that studying and evaluating online demographic prediction studies could be useful because of the vast availability of data and the fidelity people have to their own opinions online (as opposed to when answering a survey or questionnaire). While surveys and online statuses are certainly different media, it is unfair to assume that all people project their true thoughts and feelings online. For some minority groups and opinion holders, the internet and social media can be an unsafe space. Operating under this reality, it is presumptuous to believe that online responses and voices are not also carefully curated.
Evaluating Results
Based on the studies surveyed, the authors highlight general public-health-related challenges that many of these studies must overcome. The lack of ground-truth data, the tendency to favour accuracy over precision and recall, and the disregard for applicability across research domains are weakening the majority of ongoing predictive efforts.
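The concern about favouring accuracy over precision and recall can be illustrated with a small sketch. The data below are hypothetical (they are not taken from the paper or from any of the reviewed studies), and the helper name `accuracy_precision_recall` is invented for this example. It assumes a sample in which one demographic group makes up only 10% of users and a classifier that always predicts the majority group.

```python
def accuracy_precision_recall(y_true, y_pred, positive):
    """Compute accuracy, plus precision and recall for the `positive` label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when the label is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical toy sample: 90 users in majority group "A", 10 in minority group "B".
y_true = ["A"] * 90 + ["B"] * 10
# Hypothetical classifier that always predicts the majority group.
y_pred = ["A"] * 100

acc, prec, rec = accuracy_precision_recall(y_true, y_pred, positive="B")
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```

Under these assumptions the classifier scores 90% accuracy while never identifying a single member of the minority group, so precision and recall for that group are both zero. Reporting accuracy alone would hide exactly the kind of failure that matters when demographic labels feed into public health analyses.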
For public health efforts specifically, the authors recommend more efficient and scalable approaches via text-independent machine learning, data matching, and facial detection, though each method comes with its own challenges. To materialise these changes, the authors suggest using open source tools. Moreover, they suggest that computer and data scientists collaborate with social scientists in order to flag and address their biases.
These results are certainly helpful for highlighting major flaws in the majority of research studies; however, the paper does not provide concrete steps for framing ethics, nor does it describe what ethical research may look like in this context.