Inferring an attribute from a dataset requires defining a set of descriptive parameters that best separate the dataset into categories associated with that attribute. We refer to these parameters as features.
Take, for example, a dataset composed entirely of ice cream cones and ice cream carts. How would a robot that has never seen either an ice cream cone or an ice cream cart infer whether each item is a cone or a cart?
At the simplest level, this robot would want to determine a parameter -- or feature -- that would best differentiate these items. It can utilize a number of features (such as shape, size, and texture) in combination with each other for the purposes of inference.
We should keep in mind that although using a number of features can result in more thorough decision making, the robot would have to spend more time considering each feature (size, shape and texture) as it relates to an item.
Conversely, the robot can identify a single feature with significant discriminatory power and use only that parameter to make its decisions. Weight, for example, could act as this feature, as ice cream carts weigh significantly more than ice cream cones. Although this method is more efficient, there is always the possibility (however minute) of the robot finding a very large ice cream cone and determining that it’s an ice cream cart.
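One way to make "discriminatory power" concrete is to score each feature by how far apart the two groups sit relative to how much each group varies. The sketch below is only an illustration under stated assumptions: the measurements, the feature names (`weight_g`, `height_cm`, `roughness`) and the scoring rule are all hypothetical and are not taken from any of the works surveyed.

```python
from statistics import mean, pstdev

# Hypothetical toy measurements; none of these numbers come from real data.
cones = [
    {"weight_g": 90, "height_cm": 12, "roughness": 0.7},
    {"weight_g": 110, "height_cm": 20, "roughness": 0.6},
    {"weight_g": 100, "height_cm": 16, "roughness": 0.8},
]
carts = [
    {"weight_g": 80000, "height_cm": 100, "roughness": 0.5},
    {"weight_g": 95000, "height_cm": 150, "roughness": 0.7},
    {"weight_g": 88000, "height_cm": 125, "roughness": 0.6},
]


def separation_score(feature, group_a, group_b):
    """Gap between the two group means, scaled by the combined spread."""
    a = [item[feature] for item in group_a]
    b = [item[feature] for item in group_b]
    spread = pstdev(a) + pstdev(b)
    # A zero spread means the feature never varies within either group.
    return abs(mean(a) - mean(b)) / spread if spread else float("inf")


def most_discriminating_feature(group_a, group_b):
    """Return the feature name with the highest separation score."""
    features = group_a[0].keys()
    return max(features, key=lambda f: separation_score(f, group_a, group_b))


print(most_discriminating_feature(cones, carts))
```

With these toy numbers, weight scores highest, which matches the intuition in the text. A different scoring rule or different data could rank the features differently.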
While features are used to separate the items in a dataset, it’s the classifier that determines how the data inputs will be separated. A classifier is the method by which the robot observes ice cream cones and ice cream carts and subsequently categorizes them.
The robot, for example, can use the same feature (weight) but different methods of classification. It can plot the ice cream carts and ice cream cones by weight and make its decision based on where an item falls on the graph. Alternatively, it can decide on a specific weight beforehand, say 500 grams, and ask whether each item it encounters weighs more than this threshold (for specific classifiers, please check out our glossary).
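The two methods just described can be sketched side by side. This is a minimal illustration, not a definitive implementation: the training weights are hypothetical, and reading "where an item falls on the graph" as "which group average it sits closer to" (a nearest-centroid rule) is our own assumption about one way to do it.

```python
# Hypothetical training weights in grams; illustrative values only.
CONE_WEIGHTS = [90.0, 100.0, 110.0]
CART_WEIGHTS = [80_000.0, 88_000.0, 95_000.0]


def classify_by_threshold(weight_g, threshold_g=500.0):
    """Fixed cut-off chosen beforehand (500 g, as in the text)."""
    return "cart" if weight_g > threshold_g else "cone"


def classify_by_nearest_centroid(weight_g):
    """Place the item on the weight axis and pick the closer group average."""
    cone_center = sum(CONE_WEIGHTS) / len(CONE_WEIGHTS)
    cart_center = sum(CART_WEIGHTS) / len(CART_WEIGHTS)
    if abs(weight_g - cone_center) <= abs(weight_g - cart_center):
        return "cone"
    return "cart"


# A very large (hypothetical) 2 kg cone: same feature, two classifiers.
giant_cone_g = 2000.0
print(classify_by_threshold(giant_cone_g))
print(classify_by_nearest_centroid(giant_cone_g))
```

For the 2 kg cone, the fixed threshold labels the item a cart, while the nearest-centroid rule labels it a cone. The feature is the same in both cases; only the classifier differs.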
We offer a categorization of the data sources used for gender and sexuality inference in the works surveyed. Using Hong Yang and Yali Yaun’s A General Gender Inference Method Based on Web as a reference, we have identified five broad categories of data types used:
Text-based approach: This method analyzes the language choices of a user from sources such as their blog posts or profiles.1 It depends on the assumption that language is intrinsically linked to demographics, with different groups of people using language differently. Research in sociolinguistics has demonstrated differences in lexical choice and other linguistic features in discourse based on factors such as age, gender, race, class, sexuality and geography. Such analysis builds on previous language work that used sociolinguistic models to discover latent speaker/author attributes from various media. Garera and Yarowsky, for example, examined conversational speech and writing styles.2 Bocklet et al. examined audio recordings.3 Yan and Yan4 as well as Mukherjee and Liu5 classified authors’ gender based on their blog postings. More recently, textual user-generated content from social media sites such as Twitter and Facebook has been used for demographic inference. It should be noted that such methods often use a large number of features, which increases computational cost.
Visual-based approach: A language-independent approach that uses visual information created or selected by users. In the case of gender and sexuality classification, this approach generally relies on images (such as those posted by users on their social media accounts) or colours. Alowibdi et al., for example, discern gender by analyzing the background colour of a user’s profile page on Twitter.6
Name-based approach: Using people’s names to infer their gender. This approach requires a pre-existing database of names and associated genders to which a classifier can refer.
Physical features approach: Analyzing a person’s facial features, body features, or their gait for demographic inference.
Combination features approach: Aggregating and combining the approaches listed above, on the assumption that doing so will offer greater accuracy in inference.
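To illustrate the dependency noted for the name-based approach in the list above, the sketch below shows how such a classifier reduces to a lookup against a pre-existing table. Everything here is a hypothetical assumption for illustration: the table contents, the labels, and the `infer_from_name` helper are not drawn from any surveyed system.

```python
# Hypothetical toy lookup table. A real system would load a much larger
# name database, and many names are not associated with a single gender.
NAME_TABLE = {
    "maria": "female",
    "john": "male",
    "alex": None,  # recorded as ambiguous in this toy table
}


def infer_from_name(full_name, table=NAME_TABLE):
    """Return the label stored for the first name, or None if unknown or ambiguous."""
    parts = full_name.strip().lower().split()
    if not parts:
        return None
    # dict.get returns None for names missing from the table.
    return table.get(parts[0])


print(infer_from_name("Maria Lopez"))
print(infer_from_name("Zed Example"))
```

The sketch makes the limitation visible: the classifier can only be as complete and as accurate as the table it refers to, and it returns no label for names that are missing or ambiguous.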