Feature-driven approaches treat popularity as a non-decomposable process and take a bottom-up approach. The most important task is identifying a set of informative features that best represent the content and yield high prediction accuracy. The intuition behind this class of models is that learning algorithms can identify and capture latent dependencies between popularity and an extensive set of features.

Feature-driven methods formulate popularity prediction in two ways: regression and classification. Next, we give details about the two settings along with examples from previous work.

Regression vs. Classification

The regression problem is to predict the exact popularity of an item up to some time t, where t = ∞ is equivalent to predicting the final popularity of the online item. Examples include predicting the number of views for a YouTube video [Szabo and Huberman, 2010; Pinto et al., 2013] and predicting the final size of a URL cascade on Twitter [Bakshy et al., 2011; Martin et al., 2016].

However, at times predicting the exact value is unnecessary, and we only need to separate popular items from unpopular ones. In this setup, instead of predicting the exact size, we predict whether the popularity of a particular item will cross a predefined threshold. Examples include predicting whether a cascade will double its size [Cheng et al., 2014], whether an item will reach 10 million views [Shamma et al., 2011], or whether it will be among the top 5% of the most popular items [Yu et al., 2014]. The classification setting is relatively easier than regression [Bandari et al., 2012].
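The two formulations differ only in how the prediction target is built from the same popularity counts. The following is a minimal sketch, not taken from any of the cited works: the function names are hypothetical, "crossing" a threshold is assumed to mean greater than or equal to it, and ties at the top-fraction cutoff are all labeled popular.

```python
import math
from typing import List


def threshold_labels(final_popularity: List[int], threshold: int) -> List[int]:
    """Label 1 if an item's final popularity reaches the threshold, else 0.

    Assumption: "crossing" the threshold means >= threshold.
    """
    return [int(p >= threshold) for p in final_popularity]


def top_fraction_labels(final_popularity: List[int], fraction: float) -> List[int]:
    """Label 1 for items among the top `fraction` most popular items.

    Assumption: items tied with the cutoff value are all labeled 1.
    """
    if not final_popularity:
        return []
    k = max(1, math.ceil(len(final_popularity) * fraction))
    cutoff = sorted(final_popularity, reverse=True)[k - 1]
    return [int(p >= cutoff) for p in final_popularity]


def doubling_labels(observed_size: List[int], final_size: List[int]) -> List[int]:
    """Label 1 if a cascade at least doubles from its observed size."""
    return [int(f >= 2 * o) for o, f in zip(observed_size, final_size)]


# Regression keeps the raw counts as the target; classification
# replaces them with one of the binary labelings above.
final = [5, 100, 20, 7, 300, 1, 50, 9, 2, 80]
regression_target = list(final)
popular_by_threshold = threshold_labels(final, threshold=50)
popular_by_rank = top_fraction_labels(final, fraction=0.2)
```

The same feature matrix can be paired with either target, which is what makes it possible to compare the two settings on one dataset.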

Features

Most work on feature-driven methods centers on finding and constructing as many informative features as possible using human expertise and domain knowledge. The features used can be divided into four main categories: content, user, temporal and structural. We detail each of the four categories below.

Content features. The most basic feature responsible for driving diffusion is the content itself. Content features are readily available in most scenarios as they do not depend on the network under study.

On Twitter, tweet content is used to derive features such as the number of URLs, mentions or hashtags in the tweet [Tsur and Rappoport, 2012; Suh et al., 2010]. Wu et al. [2018b] use Freebase topics describing a YouTube video, along with the category of the video, as features. Recent studies have generally found content to be, at best, a very weak predictor compared to temporal, user or structural features [Cheng et al., 2014; Martin et al., 2016].
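As an illustration of the tweet-level counts mentioned above, the sketch below extracts the number of URLs, mentions and hashtags from raw text. The regular expressions are simplifying assumptions rather than Twitter's own entity-parsing rules, and the function name is hypothetical.

```python
import re
from typing import Dict

# Simplified patterns (assumptions, not Twitter's official entity rules).
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")


def content_features(text: str) -> Dict[str, int]:
    """Count URLs, mentions and hashtags in a tweet's text."""
    n_urls = len(URL_RE.findall(text))
    # Remove URLs first so that fragments such as ".../page#top"
    # are not miscounted as hashtags.
    text_without_urls = URL_RE.sub(" ", text)
    return {
        "n_urls": n_urls,
        "n_mentions": len(MENTION_RE.findall(text_without_urls)),
        "n_hashtags": len(HASHTAG_RE.findall(text_without_urls)),
    }


example = content_features(
    "Breaking: @alice and @bob share https://example.com/a#frag #news #AI"
)
```

Because these counts depend only on the text, they can be computed without any access to the follower graph.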

User features. These relate to the users who are part of the cascade. On Twitter, the most straightforward user feature is the number of friends or followers. Petrovic et al. [2011] found that the features of the author who posted the tweet are more important than the features of the tweet itself (content). Bakshy et al. [2011] and Martin et al. [2016] found the past success of a user to be an informative feature. Cheng et al. [2014] constructed a variety of user features on Facebook, such as whether the user is a page or a person, age, gender, and time on Facebook. Not all user features are available on all platforms.
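A minimal sketch of how such user features might be assembled for one cascade follows. The `User` record, its field names, and the choice of the mean size of past cascades as a proxy for "past success" are all assumptions made for illustration; the cited works define their own variants.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List


@dataclass
class User:
    """Hypothetical user record; field names are illustrative only."""
    followers: int
    friends: int
    past_cascade_sizes: List[int] = field(default_factory=list)


def user_features(participants: List[User]) -> Dict[str, float]:
    """Build user features for a cascade.

    Assumption: participants[0] is the author who started the cascade.
    """
    if not participants:
        raise ValueError("a cascade needs at least one participant")
    author = participants[0]
    past = author.past_cascade_sizes
    return {
        "author_followers": float(author.followers),
        "author_friends": float(author.friends),
        # Past success proxy: mean size of the author's earlier cascades,
        # set to 0.0 when the author has no recorded history.
        "author_past_mean_size": float(mean(past)) if past else 0.0,
        "mean_followers": float(mean(u.followers for u in participants)),
        "max_followers": float(max(u.followers for u in participants)),
    }


cascade_users = [User(100, 50, [10, 30]), User(10, 5), User(40, 20, [2])]
features = user_features(cascade_users)
```

On a platform that does not expose one of these fields, the corresponding entry would simply be dropped.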

Temporal features. Temporal features aim to capture the unfolding of a cascade. Cheng et al. [2014] reported them to be the best performing of all feature categories. Surprisingly, Szabo and Huberman [2010] and Pinto et al. [2013] found early popularity, the most straightforward of all temporal features, to be a robust and powerful predictor of the final number of views of a YouTube video. Note that temporal features can be created without access to the underlying network.
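Since temporal features need only event timestamps, they can be sketched in a few lines. The example below assumes timestamps are given in seconds since the start of the cascade and that only events up to an observation cutoff are visible; the feature names and the particular choice of features are illustrative.

```python
from typing import Dict, List


def temporal_features(event_times: List[float], observe_until: float) -> Dict[str, float]:
    """Compute simple temporal features from events seen before a cutoff.

    Assumption: event_times are seconds since the cascade started.
    """
    observed = sorted(t for t in event_times if t <= observe_until)
    gaps = [later - earlier for earlier, later in zip(observed, observed[1:])]
    late = [t for t in observed if t > observe_until / 2]
    return {
        # Early popularity: number of events seen within the window.
        "early_popularity": float(len(observed)),
        # Mean time between consecutive events; 0.0 is a placeholder
        # when fewer than two events were observed.
        "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
        # Share of observed events in the second half of the window,
        # a rough signal of whether activity is speeding up.
        "second_half_share": len(late) / len(observed) if observed else 0.0,
    }


# Observe the first hour of a cascade; the event at 4000 s is not yet visible.
first_hour = temporal_features([0, 10, 30, 60, 500, 4000], observe_until=3600)
```

None of these quantities requires the follower graph, which is why they remain available when only a public event stream is accessible.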

Structural features. These relate to the shape and status of the underlying network on which diffusion unfolds. They are mostly extracted by building a user's friendship graph and then extracting graph-level attributes from it. Cheng et al. [2014] and Romero et al. [2013] reported structural features to be better than user and content features. However, they also observed that structural features are not as useful as temporal features. Structural features are costly to create in both time and space, as they require extra information that is generally not available through public APIs. Hence, our work in this thesis focuses on predicting the popularity of cascades without considering the underlying network structure.

In summary, there has been much work on predicting popularity with feature-driven methods. However, there still seems to be a disconnect within the community in identifying features that make predictions more accurate across different settings, i.e., regression and classification. Another unclear aspect is which features are transferable between different online social networks. Moreover, many of these features are constructed on proprietary datasets. Hence, there is little understanding of the wider applicability of features. In Chapter 3, we curate the News dataset from the Twitter public API and construct a set of features that can be built on any free public online social network. Our work in Chapter 4 evaluates both regression and classification tasks over the News dataset to understand the performance of features across different settings.