• In Section 7.2, we describe a machine learning approach to news query classification that leverages both newswire and user-generated content streams to classify user queries according to their news-related intent in real time. In particular, we define a framework for the extraction of news-related features from parallel content streams, enabling a model to be learned that can be used to classify incoming user queries.
• Section 7.3 describes our methodology for evaluating the proposed news query classification approach. In particular, we discuss the crawling and preparation of the news and user-generated content corpora from different time-frames, from which we subsequently extract features, and how we train our news query classifier.
• In Section 7.4, we summarise the features that we extract from each of our news and user-generated corpora that are used later under FANS for news query classification. Detailed descriptions of these features can be found in Appendix C.
• Section 7.5 lists the research questions investigated through experimentation in the following two sections.
• In Section 7.6, we experimentally evaluate our news query classification approach using our first experimental dataset from May 2006.
• Section 7.7 evaluates our news query classification approach using our second dataset from April 2012.
• In Section 7.8, we provide conclusions regarding the effectiveness of our proposed news query classification approach and whether the user-generated content can facilitate accurate real-time classification of user queries.
7.2 Feature Aggregation from News Streams (FANS)
To tackle the challenging task of identifying news-related queries in real-time, we propose a new classification approach that combines evidence from multiple news and user-generated sources. In particular, we propose to learn a classification model for distinguishing between news-related and non-news-related queries, which can then be deployed on a live stream. We use the incoming query to retrieve recent and related content from parallel news and user-generated content streams, from which we derive features that either relate the query to an ongoing news story or describe the current interest shown in the query topic. The classification model combines these features for a query into a probability score indicating whether it is a news-related query or not. We refer to this proposed approach as Feature Aggregation from News Streams (FANS).
Figure 7.1: A hypothetical distribution of documents published over time that are about a newsworthy event in three streams.
We begin by describing how FANS classifies queries over time. FANS classification has three stages:
• Features relating the query to be classified to recent publications in one or more document streams are extracted, referred to as stream features.
• To these initial features, additional query-only features are added.
• Finally, all of these features are expressed as a feature vector describing the query. A machine learned classification model uses the feature vector to estimate whether the query is news-related or not. For more information about machine learned classification approaches see Section 2.5.1.
Importantly, the stream features vary over time, as new documents are published within each source. This means that the classification returned for a given query can change over time, based upon whether the stream features indicate that a related event is currently being discussed. To illustrate, Figure 7.1 shows a hypothetical distribution of documents published over a 5-hour period that are about a newsworthy event in three streams, namely: Twitter, Digg and the blogosphere. From Figure 7.1, we see that the distribution of documents is not the same across streams. In particular, at 11:30am, only relevant tweets have been published, while an hour later users start posting relevant documents to the Digg social news aggregator. Finally, around 12:45pm, the first blog post is published. Assume that FANS was to classify a query relating to the event that these documents discuss. If the query was submitted at 11:30am, then features extracted from Twitter would indicate that the query is news-related, while features tracking the Digg and blogosphere streams would not. Depending upon how much weight the pre-prepared classification model assigns to Twitter-derived features, the query might not be considered to be news-related at this point. However, if the same query was submitted at 1pm, then features from all three of the sources would indicate that related discussions are being posted in each stream. Hence, an effective classification model would classify the query as news-related at 1pm.
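The three classification stages and the time-dependence of the result can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the publication times loosely mimic Figure 7.1, the single stream feature (a count of related documents published so far) and the query-length feature are stand-ins for the features in Appendix C, and the logistic weights are hand-picked rather than learned.

```python
import math
from datetime import datetime

DAY = (2012, 4, 12)  # arbitrary illustrative date

def at(hour, minute):
    return datetime(*DAY, hour, minute)

# Hypothetical publication times of documents about one event,
# already matched to the query, one list per stream.
STREAMS = {
    "twitter": [at(11, 10), at(11, 20), at(11, 25), at(11, 50), at(12, 20), at(12, 40)],
    "digg": [at(12, 30), at(12, 40), at(12, 50)],
    "blogs": [at(12, 45)],
}

# Hand-picked weights standing in for a learned classification model.
WEIGHTS = {"twitter": 0.3, "digg": 0.5, "blogs": 0.8, "query_length": -0.1}
BIAS = -2.0

def stream_features(query_time):
    """Stage 1: per stream, count related documents published before the query."""
    return {name: sum(1 for ts in times if ts < query_time)
            for name, times in STREAMS.items()}

def query_only_features(query):
    """Stage 2: features that depend on the query text alone (hypothetical)."""
    return {"query_length": len(query.split())}

def news_probability(query, query_time):
    """Stage 3: combine the feature vector into a probability score."""
    features = {**stream_features(query_time), **query_only_features(query)}
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def is_news_related(query, query_time, threshold=0.5):
    return news_probability(query, query_time) >= threshold

early = news_probability("volcano eruption", at(11, 30))
late = news_probability("volcano eruption", at(13, 0))
```

With these illustrative weights, the Twitter evidence alone at 11:30am leaves the score below the 0.5 threshold, whereas at 1pm the evidence from all three streams pushes it above the threshold, mirroring the scenario described above.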
In general, under our FANS approach, a query classifier learns a classification model composed of multiple time-dependent features that are indicative of news-related queries, extracted from current newswire and user-generated content streams from the same time-frame. The classification model is then used to classify a test set of unseen queries (once those queries have had their associated features extracted), facilitating performance evaluation. When extracting features from multiple streams in a real-time setting, each feature can be expressed in terms of three components, namely: the stream from which it is extracted; the time window, representing the subset of the stream used to calculate the feature; and the stream feature, i.e. the notion that is being measured from the stream. Hence, the score for one feature can be expressed formally as follows:
\[ \mathrm{score}_f(S, t_{start}, t_{end}) = \$\left(f, S_{t_{start} \rightarrow t_{end}}\right) \tag{7.1} \]
where f is the stream feature, S is the stream from which to extract the feature, t_start is the start of the time window and t_end is the end of the time window. $() calculates a score given a stream feature and a stream (or a subset of that stream). Figure 7.2 illustrates the feature score generation process when using three streams (BBC News Articles), two time windows (1 hour and 24 hours preceding the time of the query) and two stream features (the DF and TF-IDF for the query in the stream). Note that the number of possible features that can be generated is the product of the number of streams, time windows and stream features; in this illustration, 3 × 2 × 2 = 12. However, it is often not necessary to generate every feature combination, because not every feature will provide unique and/or useful evidence for classification.
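As a rough illustration of Equation 7.1, the following Python sketch enumerates every stream, time window and stream feature combination for one query. It rests on several assumptions: the stream names and documents are invented, the query is passed explicitly although it is implicit in Equation 7.1, DF is taken to be the number of documents containing every query term, and a plain term-frequency count is used as a simpler stand-in for TF-IDF.

```python
from datetime import datetime, timedelta
from itertools import product

def window(stream, t_start, t_end):
    """S_{t_start -> t_end}: texts of documents published in [t_start, t_end)."""
    return [text for published, text in stream if t_start <= published < t_end]

def df(query, docs):
    """Number of documents containing every query term (assumed definition)."""
    terms = query.lower().split()
    return sum(1 for doc in docs if all(t in doc.lower().split() for t in terms))

def tf(query, docs):
    """Total occurrences of query terms (a stand-in for TF-IDF)."""
    terms = set(query.lower().split())
    return sum(1 for doc in docs for token in doc.lower().split() if token in terms)

def score(feature, query, stream, t_start, t_end):
    """Equation 7.1: apply a stream feature to a time-windowed substream."""
    return feature(query, window(stream, t_start, t_end))

query, query_time = "volcano", datetime(2012, 4, 12, 13, 0)

# Hypothetical streams of (publication time, text) pairs.
streams = {
    "news": [(query_time - timedelta(minutes=30), "volcano eruption grounds flights"),
             (query_time - timedelta(hours=5), "volcano ash cloud spreads")],
    "blogs": [(query_time - timedelta(hours=2), "my holiday photos"),
              (query_time - timedelta(hours=26), "volcano hiking trip")],
    "tweets": [(query_time - timedelta(minutes=10), "volcano volcano wow"),
               (query_time - timedelta(minutes=20), "stuck at airport")],
}
windows = {"1h": timedelta(hours=1), "24h": timedelta(hours=24)}
features = {"df": df, "tf": tf}

# One score per (stream, window, feature) combination: 3 x 2 x 2 = 12.
vector = {
    (s, w, f): score(features[f], query, streams[s], query_time - windows[w], query_time)
    for s, w, f in product(streams, windows, features)
}
```

In practice, a subset of these combinations could be selected by restricting the three dictionaries, in line with the observation that not every combination provides useful evidence.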
As we discussed earlier in Section 2.5, data-driven approaches like FANS are particularly advantageous when using features from multiple diverse corpora, because of their generality. In particular, unlike within a statistical text classification scenario (Nigam et al., 2000), our features are not individual terms, but are rather dependent on recent historical news content provided by different newswire and user-generated content corpora. Data-driven learning provides a convenient framework by which multiple types of evidence from these corpora can be evaluated in terms of their effectiveness in distinguishing news-related queries. Furthermore, once a model has been learned for a particular set of features, that model can be used to classify unseen queries in real-time, subject to the extraction of those features for the new queries.
Figure 7.2: Illustration of feature generation for news query classification.
7.3 Experimental Methodology
In this section, we describe our experimental methodology for evaluating our proposed FANS approach to news query classification (see Section 7.2). In particular, to evaluate FANS, we need parallel news and user-generated content streams, as well as both news-related and news-unrelated queries for FANS to classify. However, there exists no standard dataset of this form that we could use. Instead, as we described earlier in Section 5.2.2, we develop two new datasets from different time frames, namely NQC_May2006, which spans the month of May 2006, and NQC_Apr2012, which covers the period from the 11th to the 23rd of April 2012.
Recall from our dataset discussion in Section 5.2.2 that each news query classification dataset is comprised of three components. The first is a series of corpora spanning a fixed time-frame that represent different content streams. The features that FANS uses to classify each query are extracted from these corpora. The second is a set of queries that are to be classified. Each query has a timestamp from within the period of the dataset, corresponding to the time that the query was made. The features from each corpus for a query are extracted with respect to that query's timestamp, i.e. only documents published before the timestamp are considered. Finally, for each query, a classification label is provided, indicating whether that query was news-related or not at the time denoted by its timestamp.
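The three dataset components and the timestamp restriction can be sketched in Python as follows. The class and function names (`Document`, `LabelledQuery`, `visible_documents`) and the example data are hypothetical and are not taken from the datasets themselves.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Document:
    published: datetime  # publication time within the stream
    text: str

@dataclass(frozen=True)
class LabelledQuery:
    text: str
    issued: datetime    # time at which the query was made
    news_related: bool  # classification label for that point in time

def visible_documents(corpus, query):
    """Only documents published before the query's timestamp may be used."""
    return [doc for doc in corpus if doc.published < query.issued]

# A tiny hypothetical corpus and one labelled query.
corpus = [
    Document(datetime(2006, 5, 3, 9, 0), "election results announced"),
    Document(datetime(2006, 5, 3, 15, 0), "election recount requested"),
    Document(datetime(2006, 5, 4, 8, 0), "election recount completed"),
]
query = LabelledQuery("election recount", datetime(2006, 5, 3, 16, 0), True)

usable = visible_documents(corpus, query)
```

Here only the first two documents are visible to feature extraction, since the third was published after the query was issued.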
Notably, May 2006 was chosen for our first dataset to match the time period of the most recent available query log, i.e. MSN_May2006 (see Table 5.1). In contrast, April 2012 was chosen such that the Twitter and Digg user-generated content sources, which were not available in 2006, could be used. A description of the corpora, queries and relevance assessments contained within each of these datasets can be found in Section 5.2.2. Notably, the corpora used within each of the two news query classification datasets were collected in different manners due to the differing time-frames. In particular, NQC_May2006 was generated retroactively, by crawling archives of content from May 2006. Meanwhile, the corpora for NQC_Apr2012 were crawled live. For this reason, the corpora contained within the two datasets are not identical. Table 7.1 provides a summary of the content streams that are available within each of the datasets.
Streams | NQC_May2006 | NQC_Apr2012
BBC Articles ✓
Guardian Articles ✓
Telegraph Articles ✓
Blekko News Snippets ✓
Blekko News Articles ✓
Blog Posts ✓ ✓
Blog Snippets ✓
Digg Snippets ✓
Digg Referred Articles ✓
MSN Queries ✓
Tweets ✓
WikiNews Headlines ✓
Wikipedia Updates ✓ ✓
Table 7.1: News streams available within each of the two news query classification datasets.
To train the classification model used by FANS, we created a set of classification instances to train on (see Section 2.5.1). In particular, for a set of selected training queries, FANS extracts a fixed set of features about those queries using information from the streams within the dataset. Features for a query can only use documents from each stream that were published before the timestamp of that query, simulating a real-time streaming setting. Each of the features that we extract is detailed in the next section (Section 7.4). Each query, its features and the classification label (whether it is news-related or not) form a single classification instance. FANS builds a classification model by training upon a subset of all classification instances and then testing upon the other instances. Following standard practice in machine learning for classification, we use 10-fold cross-validation (Witten & Frank, 2005).
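The train/test partitioning described above can be sketched as a plain 10-fold split in Python. This is an illustrative, unstratified split under assumed names (`k_fold_splits`, `seed`); the integers below merely stand in for (features, label) classification instances, and the exact fold construction used in the experiments is not specified here.

```python
import random

def k_fold_splits(instances, k=10, seed=0):
    """Yield (train, test) pairs so that each instance is tested exactly once."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Placeholder instances; real ones would pair a feature vector with a label.
instances = list(range(25))
splits = list(k_fold_splits(instances))
```

Each of the ten models is trained on nine folds and evaluated on the remaining fold, so that effectiveness can be measured over all instances.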
We measure the classification effectiveness of FANS in terms of the precision, recall and combined F1 metrics described previously in Section 2.4.1.1. Furthermore, for the majority of the experiments, we down-sample our query dataset due to the high degree of class imbalance between the news and non-news classes (Chawla et al., 2004). This imbalance results from the naturally small ratio of news to non-news queries submitted to Web search engines (Bar-Ilan et al., 2009). Specifically, on NQC_May2006, from the original 176 news and 2092 non-news queries, we randomly removed 1917 non-news queries