5. MATERIALS AND METHODS
5.8. Study instruments
Having discussed text-based and social network-based approaches, we now turn to methods that combine the two for better accuracy.
Abrol and Khan (2010) estimated an unknown user u’s location based on the locations of u’s friends. Each friend’s location is estimated from gazetted terms in their tweets and is represented as a distribution over locations, because gazetted terms may be ambiguous and a friend may mention many places. For example, a friend’s location distribution might be 80% Melbourne, Australia, 15% Sydney, Australia, and 5% other cities. In the prediction stage, the location distributions of u’s friends are summed, and the final prediction is the most likely location in the resulting distribution.
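The prediction stage described above can be illustrated with a minimal sketch. It assumes each friend's location distribution is already available as a dictionary from location name to probability. The function name `predict_location` and the example numbers are hypothetical and are not taken from Abrol and Khan (2010).

```python
from collections import defaultdict


def predict_location(friend_distributions):
    """Sum the friends' location distributions and return the top location.

    friend_distributions: list of dicts mapping location -> probability
    (a hypothetical input format chosen for this sketch).
    """
    totals = defaultdict(float)
    for distribution in friend_distributions:
        for location, probability in distribution.items():
            totals[location] += probability
    if not totals:
        return None  # no friends with an estimated location
    return max(totals, key=totals.get)


# Illustrative data only: three friends of the unknown user u.
friends = [
    {"Melbourne": 0.80, "Sydney": 0.15, "Other": 0.05},
    {"Melbourne": 0.30, "Sydney": 0.60, "Other": 0.10},
    {"Melbourne": 0.55, "Sydney": 0.25, "Other": 0.20},
]
prediction = predict_location(friends)
print(prediction)  # Melbourne
```

Summing the distributions, rather than voting on each friend's single best location, lets a friend with an ambiguous location contribute partial evidence to several candidates.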
Ren et al. (2012) built a generative model that uses local words and named entities as the text-based part of the geolocation model. In the social network-based part, they combined three sources of social relationships by majority vote for a user u: (1) the locations of u’s followers; (2) the locations of users that u is following; and (3) the locations of the followers of u’s siblings, where a sibling s is a user who follows someone that u also follows. The text-based and social network-based prediction scores are both scaled to [0, 1] and combined linearly to select the final prediction.
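The final combination step can be sketched as follows. The source does not state how the scores are scaled or weighted, so the min-max scaling and the mixing weight `alpha` are assumptions made for illustration, as are the function names and the example scores.

```python
def min_max_scale(scores):
    """Scale a dict of raw scores to [0, 1] (assumed scaling scheme)."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {location: 0.0 for location in scores}
    return {loc: (s - low) / (high - low) for loc, s in scores.items()}


def combine(text_scores, network_votes, alpha=0.5):
    """Linearly combine scaled text and network scores.

    alpha is a hypothetical weight on the text-based part;
    (1 - alpha) is applied to the social network-based part.
    """
    text = min_max_scale(text_scores)
    network = min_max_scale(network_votes)
    combined = {
        loc: alpha * text.get(loc, 0.0) + (1 - alpha) * network.get(loc, 0.0)
        for loc in set(text) | set(network)
    }
    best = max(combined, key=combined.get) if combined else None
    return best, combined


# Illustrative data only: log-likelihoods from a text model and
# majority-vote counts from the three social relationship sources.
text_scores = {"London": -12.0, "Manchester": -10.0, "Leeds": -15.0}
network_votes = {"London": 7, "Manchester": 2, "Leeds": 1}
best, combined = combine(text_scores, network_votes)
print(best)  # London
```

In this example the text model alone prefers Manchester, but the network votes shift the combined prediction to London. Setting `alpha=1.0` recovers the text-only prediction.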
Sadilek et al. (2012a) jointly predicted social relationships and user locations in Twitter. They made use of co-friendships (i.e., two users who share the same friend), word choice and temporal activity overlap in a Bayesian propagation framework. The model recovers hidden social relationships and user locations based on partially observed data. Although promising results have been achieved, the approach requires users to be actively posting GPS-labelled tweets, limiting its applicability to densely populated areas and users with more tweet data. Furthermore, their results suggest that co-friendships are effective in locating users, in contrast to the finding of Rout et al. (2013). One potential reason is the different data used in the two studies. Sadilek et al. (2012a) primarily focused on active users with at least 100 geotagged posts per month in big cities. These active users often have higher social connectivity, and consequently the social graph is relatively dense. The co-friendships in Rout et al. (2013), on the other hand, are country-wide (i.e., within the UK), and also include less active users who tweet only once a month.
Similarly, Li et al. (2012b) jointly combined user tweet data and social relationships in a directed graphical model. They considered both users and locations as nodes, and these nodes are connected by two types of edges which represent: (1) a user tweeting about a place; and (2) a user following another user, corresponding to the text-based part and social network-based part, respectively, in their model. All location nodes themselves are associated with geolocations, and the user nodes are partially observed (i.e., some users have canonical unambiguous locations). The social network-based inference then propagates the location information from observed nodes to unobserved user nodes (i.e., users whose locations are not known). As for the text-based part, a user’s location is estimated based on geolocation references in their tweets such as gazetted terms. Their experiments suggest optimising location predictions over the whole graph outperforms inferring a user’s location based on nearby nodes.
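To make the idea of propagating locations from observed to unobserved user nodes concrete, the following is a deliberately simplified sketch. It is not the inference procedure of Li et al. (2012b): it labels each unknown user by a majority vote over already-labelled neighbours, which corresponds more closely to the "nearby nodes" baseline than to optimisation over the whole graph. The function name, the edge format and the tie-breaking rule are assumptions.

```python
from collections import Counter


def propagate_locations(follows, known, max_rounds=10):
    """Spread known locations along following edges by majority vote.

    follows: iterable of (follower, followee) pairs.
    known: dict mapping user -> observed location.
    Edges are treated as undirected here, which is a simplification.
    """
    neighbours = {}
    for follower, followee in follows:
        neighbours.setdefault(follower, set()).add(followee)
        neighbours.setdefault(followee, set()).add(follower)

    labels = dict(known)
    for _ in range(max_rounds):
        updates = {}
        for user in sorted(neighbours):
            if user in labels:
                continue  # observed or already inferred
            votes = Counter(
                labels[n] for n in neighbours[user] if n in labels
            )
            if votes:
                # Highest count wins; ties are broken alphabetically.
                updates[user] = min(votes, key=lambda loc: (-votes[loc], loc))
        if not updates:
            break  # nothing new could be inferred
        labels.update(updates)
    return labels


# Illustrative data only: u1 follows three located users, u2 follows u1.
follows = [("u1", "a"), ("u1", "b"), ("u1", "c"), ("u2", "u1")]
known = {"a": "Melbourne", "b": "Melbourne", "c": "Sydney"}
result = propagate_locations(follows, known)
print(result["u1"], result["u2"])  # Melbourne Melbourne
```

Here u2 has no located neighbours in the first round and only receives a label after u1 has been inferred, which shows how information travels more than one hop from the observed nodes.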
As a probabilistic generalisation of Li et al.’s (2012b) method, Li et al.’s (2012a) method allows users to have multiple locations in their model. A user might tweet about a location if they are there, and the user’s friends may stay in multiple places. As such, they assume a user has a primary location and some temporary locations forming a multinomial distribution over locations. The goal is to estimate the location distribution for the user based on partially observed data, i.e., some users with known primary locations. They incorporated these intuitions in a generative process in LDA. The tweeting and following edges are generated based on: (1) the background random model, and (2) the location assigned from the user’s multinomial location distributions.
2.3.5 Summary
In this section, we discussed the benefits and challenges of geolocation awareness in social media. We further categorised geolocations by granularity and location type. After that, we reviewed mainstream approaches to geolocation prediction. Off-the-shelf tools are often ineffective due to non-standard and ambiguous geographical references in social media text. Most existing work has moved to geolocation prediction using less reliable but more abundant information. For instance, social network-based methods predict a user’s location based on the user’s social relationships (e.g., friends’ locations), and text-based methods rely on geospatial references (e.g., gazetted terms, dialectal words, local topics) embedded in the text to disambiguate the locations. Combining all these methods improves geolocation prediction; however, integrating different approaches also increases the computational burden, which is a non-trivial factor when processing high volumes of social media data. To balance the efficiency and effectiveness of geolocation prediction, both features and learning algorithms require careful selection.
In this thesis, we focus exclusively on improving text-based methods for geolocation prediction. In particular, we extend the reach of existing text-based methods, and examine a range of factors that influence prediction accuracy in Chapter 4, such as a more detailed exploration of feature selection methods (in Section 4.3) and the impact of user tweeting language. Making sense of the impact of these factors is crucial, because they often influence one another and may drastically change the prediction accuracy in practice.