Hu, J., Zeng, H. J., Li, H., Niu, C., & Chen, Z. (2007, May). Demographic prediction based on user's browsing behavior. In Proceedings of the 16th international conference on World Wide Web (pp. 151-160).
Demographic information plays an important role in personalized web applications. However, it is usually not easy to obtain this kind of personal data such as age and gender. In this paper, we made a first approach to predict users’ gender and age from their Web browsing behaviors, in which the Webpage view information is treated as a hidden variable to propagate demographic information between different users. There are three main steps in our approach: First, learning from the Webpage click-through data, Webpages are associated with users’ (known) age and gender tendency through a discriminative model; Second, users’ (unknown) age and gender are predicted from the demographic information of the associated Webpages through a Bayesian framework; Third, based on the fact that Webpages visited by similar users may be associated with similar demographic tendency, and users with similar demographic information would visit similar Webpages, a smoothing component is employed to overcome the data sparseness of web click-through log. Experiments are conducted on a real web click-through log to demonstrate the effectiveness of the proposed approach. The experimental results show that the proposed algorithm can achieve up to 30.4% improvements on gender prediction and 50.3% on age prediction in terms of macro F1, compared to baseline algorithms.
Identifying the Dataset
Relying on web-browsing information, the dataset used in this experiment covers web browsing behaviour during one week in 2006. Each row in the web-page click-through log is treated as an individual datum. Users who browsed fewer than 10 pages during that week were removed. After this filtering, the experiment included a total of 189,480 users and 223,786 pages.
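The filtering step described above can be illustrated with a minimal sketch. The log layout (one `(user_id, page_url)` pair per row), the function name `filter_active_users`, and the choice to count distinct pages rather than raw page views are assumptions made for illustration; the summary does not specify how the authors implemented the cut-off.

```python
from collections import defaultdict


def filter_active_users(click_log, min_pages=10):
    """Keep only rows from users who browsed at least `min_pages` pages.

    `click_log` is a list of (user_id, page_url) rows. Counting *distinct*
    pages is an assumption; the source only says users who browsed
    fewer than 10 pages were removed.
    """
    pages_by_user = defaultdict(set)
    for user_id, page_url in click_log:
        pages_by_user[user_id].add(page_url)

    # Users meeting the threshold survive; everyone else is dropped.
    active_users = {
        user_id
        for user_id, pages in pages_by_user.items()
        if len(pages) >= min_pages
    }
    return [(u, p) for u, p in click_log if u in active_users]


# Hypothetical toy log: user "a" views 10 pages, user "b" views 3.
toy_log = [("a", f"page{i}") for i in range(10)] + [("b", f"page{i}") for i in range(3)]
filtered = filter_active_users(toy_log)
```

In this toy log only user "a" reaches the threshold, so every row belonging to user "b" is discarded.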
Based on user profiles, known latent attributes, and browsing history, demographic information from known users is first associated with the pages they browsed; the demographics of unknown users are then predicted from those pages using a Bayesian framework. To group similar users and webpages, the authors use latent semantic indexing, followed by linear interpolation for the grouped, category-based predictions. The neighbourhoods of users were also included in the computation by means of vector calculations.
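To make the prediction and smoothing steps more concrete, here is a small sketch, not the authors' implementation. It assumes that each page already carries a probability of being visited by a female user, that a user's score is obtained by marginalising over visited pages weighted by visit share (treating the page as the hidden variable), and that smoothing is a simple linear interpolation with a neighbourhood average. The function names, the weight `lam`, and the fallback prior are all hypothetical.

```python
def predict_user_gender(page_probs, visit_counts, prior_female=0.5):
    """Estimate P(female | user) as sum over pages of P(female | page) * P(page | user).

    `page_probs` maps page -> P(female | page); `visit_counts` maps
    page -> number of visits by this user. P(page | user) is taken to be
    the user's visit share, which is an illustrative assumption.
    """
    known = {p: c for p, c in visit_counts.items() if p in page_probs and c > 0}
    total = sum(known.values())
    if total == 0:
        # No usable evidence: fall back to the (assumed) prior.
        return prior_female
    return sum(page_probs[p] * c / total for p, c in known.items())


def interpolate(own_estimate, neighbour_mean, lam=0.8):
    """Linear interpolation between a user's own estimate and the
    mean estimate of similar users (hypothetical smoothing weight `lam`)."""
    return lam * own_estimate + (1.0 - lam) * neighbour_mean


# Hypothetical page tendencies and one user's visit counts.
page_probs = {"recipes": 0.9, "cars": 0.2}
visits = {"recipes": 3, "cars": 1, "unknown-page": 5}

raw = predict_user_gender(page_probs, visits)
smoothed = interpolate(raw, neighbour_mean=0.5)
```

Pages without an estimated tendency are skipped, which is one way sparse logs lose evidence; the interpolation step pulls the estimate toward that of similar users to compensate.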
For the experiment, 1,546,441 pages were collected from the Open Directory Project, spanning hundreds of thousands of website categories. Over a thousand categories remained after the indexing process, and a hierarchical SVM classifier was trained on them.
Key Assumptions Stated by Authors
To communicate the strength of their study, the authors distinguish between the small share of internet users who blog (8%) and the vast majority of internet users, who browse pages relatively frequently. On this basis, the authors assume that their experimental results and dataset are widely representative. One of the challenges they mention is the data sparseness that arises from relying on web browsing datasets. Additionally, each website may contain several web pages, which makes grouping users and computing neighbourhood vectors a challenging task. To address this, the authors assume that Collaborative Filtering (CF) would aid in training the classifier from the user’s side.
The authors state that people of a similar gender may have similar interests and preferences, which lead them to similar websites. This assumption undergirds the entire dataset and methodology. Although the study stands out from others, which usually rely on language, style, or character features, its use of web browsing data still assigns behavioural patterns on the basis of a narrow understanding of gender. Moreover, not only are user genders predicted, but the websites and webpages themselves become gendered, potentially resulting in safety, privacy, and ad-targeting vulnerabilities for these sites and the users who visit them.
In line with the inferred assumptions, the study concludes that female users prefer visiting pages about movies, kids, food, etc., while male users browse sites related to money, girls, cars, etc. The results were evaluated using precision, recall, and F1 scores. The experiments showed an improvement of up to 30.4% in gender prediction, measured in macro F1 against baseline algorithms, when both category-based and content-based features of web pages were examined. Ultimately, the authors propose that there is a dialectical relationship between webpage and gender predictions, claiming that strengthening the prediction of one will strengthen the prediction of the other.
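For readers unfamiliar with the metric, the following is a minimal sketch of how macro F1 is computed from precision and recall: F1 is calculated per class and then averaged with equal weight. The labels and predictions are invented for illustration and are not data from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Invented example: four users, one "female" user misclassified as "male".
y_true = ["female", "female", "male", "male"]
y_pred = ["female", "male", "male", "male"]
score = macro_f1(y_true, y_pred)
```

Because every class counts equally, macro F1 penalises a classifier that performs well only on the majority class, which matters when demographic groups are unevenly represented in the log.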
It is important to note that this approach is sensitive to political contexts in which internet shutdowns, censorship, and/or surveillance are widely practiced. In such contexts, it is unlikely that gender predictions could strengthen the analysis of web page browsing, or vice versa. Moreover, browsing web pages, and owning the devices needed to do so, requires a certain amount of disposable time and income. The methods and results therefore have built-in political and economic exclusions.