Wang J., Xue Y., Li S., Zhou G. (2015) Leveraging Interactive Knowledge and Unlabeled Data in Gender Classification with Co-training. In: Liu A., Ishikawa Y., Qian T., Nutanong S., Cheema M. (eds) Database Systems for Advanced Applications. DASFAA 2015. Lecture Notes in Computer Science, vol 9052. Springer, Cham
Conventional approaches to gender classification much rely on a large scale of labeled data, which is normally hard and expensive to obtain. In this paper, we propose a co-training approach to address this problem in gender classification. Specifically, we employ both non-interactive and interactive texts, i.e., the message and comment texts, as two different views in our co-training approach to well incorporate unlabeled data. Experimental results on a large data set from micro-blog demonstrate the appropriateness of leveraging interactive knowledge in gender classification and the effectiveness of the proposed co-training approach in gender classification.
Identifying the Dataset
Rather than focusing solely on user-generated content, this study also incorporates interactive text. Put simply, interactive text is the comments section, where users communicate with and address one another. Because labeled data is limited and costly to obtain, the authors chose to co-train on message and comment data to infer gender. The authors crawled thousands of profiles from Sina Weibo, a popular microblogging site in China. Verified users and users with fewer than 50 followings were excluded; from the remaining dataset, 2000 male and 2000 female users were chosen. A total of 400 samples each were used as training and test data, and the rest of the data was left unlabeled.
Drawing on previous work by Liu and Mukherjee, the features adopted for the classifier include the text itself (bag-of-words), how implicit or contextual the writing is (F-measure), and stylistic choice (POS sequence patterns). ICTCLAS was used to segment the Chinese text into words. The semi-supervised model used a maximum entropy classifier, with the Mallet toolkit used to obtain the results. In broad strokes, the process involved training classifiers on the labeled data, using them to label the unlabeled data, and feeding the newly labeled samples back into training across the message and comment views.
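The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it swaps the maximum entropy classifier for a small Naive Bayes model over bag-of-words tokens, uses placeholder tokens instead of segmented Weibo text, and combines the two views by multiplying their probabilities. The names `NaiveBayes`, `co_train`, `predict`, `rounds`, and `per_class` are hypothetical.

```python
import math
from collections import Counter


class NaiveBayes:
    """Tiny multinomial Naive Bayes over whitespace-separated tokens.

    Assumption: a stand-in for the maximum entropy classifier in the study.
    """

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(doc.split())
        self.vocab = set().union(*self.counts.values())
        return self

    def predict_proba(self, doc):
        n = sum(self.prior.values())
        logp = {}
        for c in self.classes:
            total = sum(self.counts[c].values()) + len(self.vocab)
            lp = math.log(self.prior[c] / n)
            for w in doc.split():
                if w in self.vocab:  # unseen tokens are ignored
                    lp += math.log((self.counts[c][w] + 1) / total)  # add-one smoothing
            logp[c] = lp
        m = max(logp.values())
        z = sum(math.exp(v - m) for v in logp.values())
        return {c: math.exp(v - m) / z for c, v in logp.items()}


def train(data):
    """Train one classifier per view: index 0 = messages, index 1 = comments."""
    return [NaiveBayes().fit([x[v] for x in data], [x[2] for x in data]) for v in (0, 1)]


def co_train(labeled, unlabeled, rounds=5, per_class=1):
    """labeled: (message, comment, label) tuples; unlabeled: (message, comment) tuples."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        picked = {}  # pool index -> pseudo-label
        for v, clf in enumerate(train(labeled)):
            by_label = {}
            for i, x in enumerate(pool):
                probs = clf.predict_proba(x[v])
                label = max(probs, key=probs.get)
                by_label.setdefault(label, []).append((probs[label], i))
            # Each view contributes its most confident samples per class.
            for label, cands in by_label.items():
                for _conf, i in sorted(cands, reverse=True)[:per_class]:
                    picked.setdefault(i, label)
        labeled += [(pool[i][0], pool[i][1], y) for i, y in picked.items()]
        pool = [x for i, x in enumerate(pool) if i not in picked]
    return train(labeled)


def predict(views, message, comment):
    """Combine both views by multiplying class probabilities (an assumption)."""
    pm, pc = views[0].predict_proba(message), views[1].predict_proba(comment)
    scores = {c: pm[c] * pc[c] for c in pm}
    return max(scores, key=scores.get)


# Placeholder tokens; real input would be segmented message and comment text.
labeled = [("m1 m2", "c1 c2", "male"), ("f1 f2", "d1 d2", "female")]
unlabeled = [("m1 m3", "c1 c3"), ("f1 f3", "d1 d3"),
             ("m3 m4", "c3 c4"), ("f3 f4", "d3 d4")]
views = co_train(labeled, unlabeled)
print(predict(views, "m4", "c4"))
```

In this toy data the tokens `m4` and `c4` never occur in the labeled seed set, so a classifier trained on the seed alone has no evidence for them. They become informative only after the pseudo-labeled users are added to the training data, which is the effect co-training relies on.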
Key Assumptions Stated by the Authors
Given that many previous studies depend on resource-heavy machine learning techniques and on non-interactive text alone, the authors posit that their method is more accessible and easier to carry out, and that it draws on more relevant (and reliable) data to improve its gender predictions.
Although interactive text can prove useful, and does take into account the social aspect of online social networks, it operates on two assumptions: 1) that other users will explicitly address a fellow user with gendered pronouns or phrases such as ‘wife’, and 2) that implicit cues (based on stereotypes), such as conversations about shopping, will most likely be indicative of one gender. If interactive text is used solely for binary gender identification, the classifiers and models are likely to miss nuances of context.
Individually, both the comment and message classifiers perform poorly. When the two are combined, gender predictions improve significantly, demonstrating that adding unlabeled data can enhance the performance of this co-training method.
Here, it is important to be cognisant of what counts as a single-view training approach. While other studies may train on a single view, their classifiers and models may still draw features from various aspects of a blog site. Therefore, in evaluating the results, drawing a strict distinction between single-view and two-view approaches may not be entirely fruitful.