Ciot, M., Sonderegger, M., & Ruths, D.: Gender inference of Twitter users in non-English contexts. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 19-21 (2013)
While much work has considered the problem of latent attribute inference for users of social media such as Twitter, little has been done on non-English-based content and users. Here, we conduct the first assessment of latent attribute inference in languages beyond English, focusing on gender inference. We find that the gender inference problem in quite diverse languages can be addressed using existing machinery. Further, accuracy gains can be made by taking language-specific features into account. We identify languages with complex orthography, such as Japanese, as difficult for existing methods, suggesting a valuable direction for future research.
Ciot et al. set out to infer gender from non-English Twitter content, with the aim of understanding how men and women use language differently across cultures. They work with a sample of four genetically unrelated languages (Japanese, Indonesian, Turkish, and French) and run an SVM classifier to compare its performance across them. They call their model language-agnostic because, given a labeled set of users and tweets in a particular language, a model can be built without any knowledge of the structure or content of the language itself. Each language's gender model comprised a different set of language-agnostic features (such as top words, top n-grams, top hashtags, and tweet and retweet frequency). Classifier accuracy varied across languages: it was comparable for French and Indonesian and highest for Turkish. In contrast, gender could not be inferred with reasonable accuracy for Japanese, and the authors call for better ways to accommodate languages with complex orthography and syntax-based mechanisms. The authors conclude that language-specific features should be considered when developing a gender model for a given language, as their inclusion boosts accuracy.
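
To make the idea of language-agnostic features concrete, here is a minimal sketch of how top-word, top-hashtag, and tweet/retweet-count features could be built from raw tweets without any knowledge of the language. It is an illustration only, not the authors' pipeline: the function names (`tokenize`, `top_k`, `build_vocab`, `featurize`), the whitespace tokenizer, the "RT " prefix test for retweets, and the toy tweets are all assumptions made for this example. The resulting vectors would then be passed to an SVM from a library of one's choice.

```python
from collections import Counter


def tokenize(tweet):
    # Assumption: lowercase and split on whitespace only,
    # so no language-specific rules are involved.
    return tweet.lower().split()


def top_k(items, k):
    # The k most frequent items; ties are broken alphabetically
    # so the feature order is deterministic.
    counts = Counter(items)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


def build_vocab(users, k=3):
    # users: one list of tweets per user (hypothetical input layout).
    words, hashtags = [], []
    for tweets in users:
        for tweet in tweets:
            for tok in tokenize(tweet):
                (hashtags if tok.startswith("#") else words).append(tok)
    return top_k(words, k), top_k(hashtags, k)


def featurize(tweets, vocab):
    # One vector per user: counts of top words, counts of top hashtags,
    # number of tweets, number of retweets.
    top_words, top_tags = vocab
    counts = Counter(tok for t in tweets for tok in tokenize(t))
    n_retweets = sum(1 for t in tweets if t.lower().startswith("rt "))
    vec = [counts[w] for w in top_words] + [counts[h] for h in top_tags]
    vec.append(len(tweets))
    vec.append(n_retweets)
    return vec


# Toy, made-up data: two users with two tweets each.
users = [
    ["bonjour le monde #paris", "RT @ami bonjour à tous"],
    ["le match ce soir #foot", "le monde du #foot"],
]
vocab = build_vocab(users, k=3)
vectors = [featurize(tweets, vocab) for tweets in users]
```

The sketch is only language-agnostic up to its own assumptions. Whitespace tokenization, for example, would not segment Japanese text, which is written without spaces between words, so even a simple pipeline like this one can quietly favour some writing systems over others.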