Z. Qin et al., "Demographic information prediction based on smartphone application usage," 2014 International Conference on Smart Computing, Hong Kong, 2014, pp. 183-190.
Demographic information is usually treated as private data (e.g., gender and age), but has been shown great values in personalized services, advertisement, behavior study and other aspects. In this paper, we propose a novel approach to make efficient demographic prediction based on smartphone application usage. Specifically, we firstly consider to characterize the data set by building a matrix to correlate users with types of categories from the log file of smartphone applications. By considering the category-unbalance problem, we predict users’ demographic information and propose an optimization method to further smooth the obtained results with category neighbors and user neighbors. The evaluation is supplemented by the dataset from real world workload. The results show advantages of the proposed prediction approach compared with baseline prediction. In particular, the proposed approach can achieve 81.21% of Accuracy in gender prediction. While in dealing with a more challenging multi-class problem, the proposed approach can still achieve good performance (e.g., 73.84% of Accuracy in the prediction of age group and 66.42% of Accuracy in the prediction of phone level).
Identifying the Dataset
Network-enabled technologies closely link users' online activity to their demographic attributes. Demographic information can therefore be inferred from app and smartphone usage and used to inform behaviour studies, user interfaces, and advertising efforts.
The dataset covers four months of smartphone application usage obtained from 31,660 users registered with a single network service provider. More specifically, the log records the requests that apps make for resources on the internet, 179,954,181 entries in total. Each entry is identified by a numerical code and an internet resource. Internet resources are mapped to categories through regular expression matching, which in turn assigns each entry to a category. Categories of user interest are modelled on a weighted graph, and the frequency of category requests by users is modelled via an adjacency matrix.
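The matching step described above can be illustrated with a minimal sketch. The category names, regular expressions, and log entries below are hypothetical, since the rules actually used by the authors are not reproduced in this summary; the sketch only shows how requested resources could be mapped to categories and tallied per user.

```python
import re
from collections import defaultdict

# Hypothetical category patterns; the paper's actual rules are not given here.
CATEGORY_PATTERNS = {
    "shopping": re.compile(r"shop|store|cart"),
    "sports": re.compile(r"sport|football|nba"),
    "news": re.compile(r"news|headline"),
}


def categorise(resource):
    """Return the first category whose pattern matches the resource, else None."""
    for category, pattern in CATEGORY_PATTERNS.items():
        if pattern.search(resource):
            return category
    return None


def build_user_category_counts(log_entries):
    """Count, per user, how many requests fall into each category."""
    counts = defaultdict(lambda: defaultdict(int))
    for user_id, resource in log_entries:
        category = categorise(resource)
        if category is not None:  # unmatched entries are skipped
            counts[user_id][category] += 1
    return {user: dict(cats) for user, cats in counts.items()}


# Made-up log entries: (numerical user code, requested internet resource).
sample_log = [
    (101, "api.example-shop.com/cart"),
    (101, "cdn.example-shop.com/img"),
    (101, "live.example-sports.com/scores"),
    (202, "m.example-news.com/headline/42"),
    (202, "unknown.example.com/ping"),
]
user_category_counts = build_user_category_counts(sample_log)
```

Each row of the resulting structure corresponds to one user and each key to one category, which is the shape of the user-by-category frequency matrix the summary refers to.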
This method follows a series of steps. First, the authors define specific, granular user interests so that the classifier can correlate them with different demographic groups. They then build a matrix that correlates search entries on smartphone apps with entry categories; prior probabilities of interest categories are weighted against request times and frequency to mitigate prediction bias. Additionally, the authors evaluate the stability of baseline predictions and ensure that the dataset is sufficient in terms of volume and period of study. After a Bayes-based classifier is applied to predict demographic attributes, the results are optimised using category and user neighbour information through an adjacency matrix. To limit the noise introduced by repeatedly leveraging neighbouring users' data, the authors use Singular Value Decomposition (SVD) to factorise and simplify the sparse matrix data. Finally, they evaluate the approach against a ground-truth dataset of web application usage and searches.
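To make the Bayes-based prediction step more concrete, the following is a minimal sketch of a multinomial naive Bayes classifier over per-user category counts. It is not the authors' implementation: the Laplace smoothing, the function names, the group labels, and the toy data are all assumptions made for illustration, and the neighbour smoothing and SVD steps are omitted.

```python
import math
from collections import defaultdict


def train_bayes(user_counts, labels, categories):
    """Estimate log priors and Laplace-smoothed log likelihoods per group."""
    class_sizes = defaultdict(int)
    category_totals = defaultdict(lambda: defaultdict(int))
    for counts, label in zip(user_counts, labels):
        class_sizes[label] += 1
        for category, n in counts.items():
            category_totals[label][category] += n

    model = {}
    for label, size in class_sizes.items():
        total = sum(category_totals[label].values())
        # Add-one smoothing so unseen categories never get zero probability.
        log_likelihood = {
            c: math.log((category_totals[label][c] + 1) / (total + len(categories)))
            for c in categories
        }
        model[label] = (math.log(size / len(labels)), log_likelihood)
    return model


def predict_group(model, counts):
    """Return the group with the highest posterior log score."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(
            n * log_likelihood[c] for c, n in counts.items() if c in log_likelihood
        )
    return max(model, key=score)


# Toy training data (invented): category request counts per user.
categories = ["shopping", "sports", "news"]
training_users = [
    {"shopping": 5, "news": 1},
    {"shopping": 3, "sports": 1},
    {"sports": 6, "news": 1},
    {"sports": 4, "shopping": 1},
]
training_labels = ["group_a", "group_a", "group_b", "group_b"]
model = train_bayes(training_users, training_labels, categories)
```

In this sketch the prior is simply the share of training users in each group. The paper's weighting of priors against request times and frequency would replace that term.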
Key Assumptions Stated by Authors
The authors claim that topic searches within applications are distinct from those on web pages. Search options are limited, and people of different ages or genders are likely to have specific motivations for using the same app.
In terms of methods, the authors posit that category and user smoothing, neighbour data, and larger datasets yield better results for gender than for age because gender is a more definitive attribute. Moreover, the correlation between interest categories and gender may be stronger than that between interest categories and age.
The authors neglect to clearly outline how and why certain interest categories and user-neighbour networks are assigned to certain genders. For the most part, the categorisations are based entirely on stereotypes, reverting to conventional norms such as shopping interests for females and sports for males. The 10-fold cross validation, intermittent evaluations, and adjacency matrices serve to reiterate these assumptions throughout the experimentation process; the only factors that even the results out are frequency, request time, and prior probability (which is also grounded largely in stereotypes). Furthermore, because this dataset is new relative to web browsing history and text-based analysis, the differences in accuracy between classic classifiers and this approach are potentially exaggerated.
Evaluation was built into the entire design process; the approach was disaggregated in order to accommodate this iterative evaluation technique. Prediction performance is evaluated using accuracy, precision, recall, and F1 scores.
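For reference, the four scores named above can be computed as in the following minimal sketch for a single target group. The function name, labels, and predictions are invented for illustration and are not results from the paper.

```python
def binary_metrics(y_true, y_pred, positive):
    """Compute accuracy, precision, recall, and F1 for one target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented ground truth and predictions for eight users.
y_true = ["a", "a", "a", "b", "b", "b", "b", "a"]
y_pred = ["a", "a", "b", "b", "b", "a", "b", "a"]
scores = binary_metrics(y_true, y_pred, positive="a")
```

For a multi-class task such as age group, these per-class scores would typically be averaged across classes; which averaging the authors use is not stated in this summary.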
The authors note, after testing various datasets collected over a range of durations, that increasing the duration of training datasets yields better results.
When the dataset was tested with classic classifiers (Logistic Regression, Naive Bayes, and Decision Tree), all three performed similarly when predicting gender, at around 60% accuracy. Meanwhile, the proposed smartphone usage analysis resulted in greater prediction accuracy. Of course, smartphone usage has proliferated widely in the recent past. However, this is only the case in a limited number of contexts. The study focuses on a niche of the population that has access to data and to a wide array of apps, can afford a smartphone, and subscribes to one particular service provider, which may further price out potential research subjects. Nor can we assume that phone plan consumers consent to this kind of research and data collection. These exclusion criteria underline the limited relevance and applicability of this study in less affluent and increasingly censored contexts.