Rao, D., Yarowsky, D., Detecting latent user properties in social media. In: Proceedings of the NIPS MLSN Workshop (2010)
URL: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.364.4449
The ability to identify user attributes such as gender, age, regional origin, and political orientation solely from user language in social media such as Twitter or similar highly informal content has important applications in advertising, personalization, and recommendation. This paper includes a novel investigation of stacked-SVM-based classification algorithms over a rich set of original features, applied to classifying these four user attributes. We propose new sociolinguistics-based features for classifying user attributes in Twitter-style informal written genres, as distinct from the other primarily spoken genres previously studied in the user-property classification literature. Our models, singly and in ensemble, significantly outperform baseline models in all cases.
Rao and Yarowsky set out to infer the gender, age, regional origin and political orientation of Twitter users using a stacked SVM classification method (one that combines SVM models) over a set of features. The authors emphasize the importance of language in identifying latent user attributes and propose a mixture of lexical and sociolinguistic features for classifying Twitter users based on their informal textual communication. They use users' status messages as the basis for inference. They test and compare three classification models: (1) a sociolinguistic feature model, in which digital sociolinguistic cues such as emoticons or certain punctuation (ellipses and exclamation marks) are used as features; (2) a lexical n-gram model, in which unigrams and bigrams are derived from the tweet text; and (3) a stacked model whose features are derived from the predictions of the previous two models. For gender, they found that their sociolinguistic model (71.76% accuracy) performed better than their lexical model (68.70% accuracy) on the status text alone; their stacked model did marginally better, achieving 72.33% accuracy. It should be noted that the authors also examined social network structure (e.g. follower-followee ratio) and communication behaviour (e.g. reply rate) and determined that they were not valuable in inferring latent author attributes.
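To make the two base feature families concrete, the following is a minimal Python sketch of how sociolinguistic cues and lexical n-grams might be extracted from a single status message. It is not the authors' implementation: the function names, the emoticon pattern and the tokenization rule are all assumptions chosen for illustration, and the paper's actual feature inventory is richer than what is shown here.

```python
import re
from collections import Counter

# Hypothetical patterns; the paper's real emoticon and punctuation
# inventories are not reproduced in this summary.
EMOTICON_RE = re.compile(r"[:;=][\-o']?[)(DPp|\]\[]")
ELLIPSIS_RE = re.compile(r"\.{3,}|…")
EXCLAIM_RE = re.compile(r"!+")
TOKEN_RE = re.compile(r"[a-z0-9']+")


def sociolinguistic_features(text):
    """Count a few sociolinguistic cues in one status message."""
    return {
        "emoticons": len(EMOTICON_RE.findall(text)),
        "ellipses": len(ELLIPSIS_RE.findall(text)),
        # A run such as "!!!" is counted once.
        "exclamation_runs": len(EXCLAIM_RE.findall(text)),
    }


def ngram_features(text):
    """Count lowercase unigrams and bigrams in one status message."""
    tokens = TOKEN_RE.findall(text.lower())
    counts = Counter(tokens)
    # Bigrams are formed over adjacent tokens, ignoring punctuation between them.
    counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts


status = "omg so excited!!! see you soon... :)"
cues = sociolinguistic_features(status)
ngrams = ngram_features(status)
```

In a stacked setup of the kind described above, each feature family would feed its own classifier, and the predictions of those two classifiers would then serve as the input features of a third model.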