An event is a significant occurrence associated with a specific time and location (Brants et al., 2003). On social media platforms, where large numbers of people are present online, the occurrence of an event has also been defined as an increase in the volume of messages around a particular topic (Dou et al., 2012). Events have been categorised as new events, specified events, unspecified events, and small-scale events (Atefeh & Khreich, 2015; Castillo, 2016). A new event is one that is not similar to any previously noted event. Specified events are of a predetermined type that can be monitored. Unspecified events are any events detected in the incoming data streams. Small-scale events are generally those that do not generate much traction; for example, a crisis event that lasts for a long time may include smaller-scale sub-events or similar independent events.
New Event Detection, also termed First Story Detection (FSD), is a sub-task within Topic Detection and Tracking (TDT) (Allan, 2002). Event detection in TDT was traditionally meant for newswire data, where each new topic was matched against previous entries. The voluminous and streaming nature of social media platforms such as Twitter warrants the use of streaming algorithms. A streaming algorithm is a data processing model in which the incoming data is chronologically arranged and processed in bounded space and time as each new entry arrives (Muthukrishnan et al., 2005). Petrović and colleagues (Petrović et al., 2010) applied the first story detection methodology, along with a clustering approach, to Twitter data to identify new events. Becker and colleagues (Becker et al., 2011) also exploited clustering methods for identifying real-world events. They created clusters of related tweets and then classified each cluster as event or non-event. They extracted different types of features: temporal (messages posted within an hour are used to create clusters), social (clusters refined using user interactions such as retweets and replies), and topical. For each incoming message, they measure the cosine similarity between the new message and each cluster. They hypothesised that a high percentage of retweets and replies does not indicate an event, and that events are built around a central topic, whereas non-event clusters form around terms that do not reflect a central theme (e.g. work, sleep, monday). Phuvipadawat and Murata (Phuvipadawat & Murata, 2010) proposed grouping and ranking messages collected via search queries (e.g. #breakingnews and/or #breaking news). Messages similar to each other are grouped together, forming a cluster of news articles for a particular story. Message similarity was measured using TF-IDF, the weight of nouns, and hashtags.
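The cluster-based detection described above can be illustrated with a minimal sketch. It is not the implementation of any of the cited systems: the whitespace tokeniser, the summed-count centroids, the function names (`vectorise`, `cosine`, `detect_first_stories`), and the similarity threshold of 0.5 are all assumptions made for illustration. Each incoming message is compared with every existing cluster by cosine similarity; if no cluster is similar enough, the message is treated as a first story and starts a new cluster.

```python
import math
from collections import Counter


def vectorise(text):
    """Turn a message into a term-count vector (naive whitespace tokeniser)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_first_stories(messages, threshold=0.5):
    """Process messages in arrival order.

    Returns one (cluster_index, is_first_story) pair per message.
    The threshold is an illustrative assumption, not a published value.
    """
    clusters = []  # each cluster is the summed term counts of its members
    labels = []
    for text in messages:
        vec = vectorise(text)
        best_idx, best_sim = None, 0.0
        for idx, centroid in enumerate(clusters):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx is None or best_sim < threshold:
            # Nothing similar seen so far: start a new cluster.
            clusters.append(Counter(vec))
            labels.append((len(clusters) - 1, True))
        else:
            # Similar to an existing story: merge into that cluster.
            clusters[best_idx].update(vec)
            labels.append((best_idx, False))
    return labels


stream = [
    "earthquake hits city centre",
    "earthquake hits the city",
    "new phone released today",
]
labels = detect_first_stories(stream)
```

Note that this sketch compares each message with every cluster, so it does not satisfy the bounded space and time requirement of a true streaming algorithm; it only shows the similarity test itself.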
Another basic approach is the word-frequency-based method, which detects an event when there is a rapid increase in the frequency of single-word or multi-word tokens. Periodic counters of the number of messages are maintained, and as soon as the message count in a particular window rises above a threshold, an event is said to be observed. Frequency-based analysis can be extended to other activities that reflect a sudden change in the masses’ behaviour, for example web traffic. Osborne and colleagues (Osborne et al., 2012) took the previous approach (Petrović et al., 2010) as a baseline and enhanced it by considering the traffic on the relevant Wikipedia* pages in the same time intervals. They termed their approach multi-stream FSD. However, one potential limitation of this approach is its dependency on web page traffic on third-party platforms such as Wikipedia. As the authors themselves point out, Wikipedia lags behind Twitter in terms of activity by a few hours, and hence the approach might not be suited to real-time event detection. Also, such approaches are aimed at identifying broad events, rather than identifying or classifying individual text documents into classes or labels.
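The periodic-counter idea at the start of this paragraph can be sketched in a few lines. The function name `windows_over_threshold`, the use of timestamps in seconds, and the example window size and threshold are assumptions for illustration only; no cited system is claimed to use these values.

```python
from collections import Counter


def windows_over_threshold(timestamps, window_seconds, threshold):
    """Count messages per fixed-size time window and return the indices
    of windows whose count rises above the threshold."""
    counts = Counter(int(ts // window_seconds) for ts in timestamps)
    return sorted(w for w, n in counts.items() if n > threshold)


# Hypothetical message arrival times in seconds since the start of monitoring.
arrivals = [0, 10, 20, 65, 70, 75, 80, 130]
busy_windows = windows_over_threshold(arrivals, window_seconds=60, threshold=3)
```

With these inputs, only the second one-minute window (index 1) holds more than three messages, so it is the only window reported.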
Another related work is the TwitInfo system by Marcus and colleagues (Marcus et al., 2011), which collects posts based on an input keyword such as ‘earthquake’. The system kept track of the frequency of tweets per minute and reported a potential event when the frequency in a particular time window exceeded the average frequency by two standard deviations. In a multi-word frequency-based approach, the TwitterMonitor system proposed by Mathioudakis and Koudas (Mathioudakis & Koudas, 2010) identifies events by first detecting a rise in the frequency of individual words and then grouping those words based on co-occurrence (in the same tweets). Some variations of this approach exploit multiple hashtags from the tweets (Corley et al., 2013). Another system, Twevent (Li et al., 2012a), relies on determining the frequency of tweets that contain data segments, which are generated by segmenting the text into unigrams or bigrams and extending them using the Microsoft Web N-Gram service*. An expected frequency of segments is estimated using a Gaussian distribution model†. Segments whose actual frequency exceeds the expected frequency are termed bursty segments. An obvious limitation of these approaches is that they are bound by a frequency threshold, which curbs the applicability of such systems in scenarios where the crisis-related information is below the threshold and/or does not carry the relevant vocabulary. Also, these approaches do not take into consideration different types of (crisis) events or the content language, which we focus on in this thesis.
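The two-standard-deviation rule described for TwitInfo can be illustrated as follows. This is a simplified sketch under stated assumptions, not the system's actual implementation: the average and standard deviation are taken over a fixed number of preceding windows, and the function name `flag_peaks` and the parameters `history` and `k` are hypothetical.

```python
from statistics import mean, pstdev


def flag_peaks(counts, history=5, k=2.0):
    """Return indices of windows whose count exceeds the mean of the
    preceding `history` windows by more than `k` standard deviations."""
    peaks = []
    for i in range(history, len(counts)):
        past = counts[i - history:i]
        mu = mean(past)
        sigma = pstdev(past)  # population standard deviation of the history
        if counts[i] > mu + k * sigma:
            peaks.append(i)
    return peaks


# Hypothetical tweets-per-minute counts for one keyword.
per_minute = [10, 12, 11, 9, 10, 40, 11]
peaks = flag_peaks(per_minute)
```

In this example the sixth window (index 5) is flagged. The example also shows the limitation noted above: a burst is only reported once the volume is large relative to recent history, so low-volume but relevant messages pass unnoticed.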
*https://www.microsoft.com/en-us/research/project/web-n-gram-services/
The multi-word frequency approach can be extended further by generating graphs in which nodes are words or phrases and edge weights indicate the cross-correlation between nodes. These graphs can then be segmented into clusters of nodes (Sayyadi et al., 2009). Weng and Lee (Weng & Lee, 2011) proposed a system, EDCOW, which computes subgraphs from the cross-correlation graph and labels a subgraph as an event when there is a high cross-correlation between its nodes (which are the words). Interestingly, the cross-correlation graph is built on the criterion of words exhibiting a similar burst pattern, i.e., a similar frequency pattern. This system focused on events from sports, music, politics, etc. A similar burst detection approach was used to detect earthquakes (Robinson et al., 2013), where the frequency of posts was monitored for search queries such as ‘#earthquake’ and ‘#eqnz’. We have already highlighted, in the previous paragraph, the difference between frequency-based methods and the approaches we have adopted, when comparing with the other work (Li et al., 2012a).
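The graph construction described above can be sketched in a heavily simplified form. This is not EDCOW itself: plain Pearson correlation between raw word-frequency series stands in for the system's signal processing, and connected components stand in for its graph partitioning. The function names, the example series, and the `min_corr` threshold of 0.8 are all assumptions for illustration.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length frequency series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlated_word_groups(series, min_corr=0.8):
    """Link words whose frequency series correlate strongly, then return
    the connected components of that graph as candidate events."""
    adjacency = {word: set() for word in series}
    for a, b in combinations(series, 2):
        if pearson(series[a], series[b]) >= min_corr:
            adjacency[a].add(b)
            adjacency[b].add(a)

    groups, seen = [], set()
    for word in series:
        if word in seen or not adjacency[word]:
            continue  # skip visited words and words with no strong link
        stack, component = [word], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        groups.append(sorted(component))
    return groups


# Hypothetical per-window frequencies for three words.
frequencies = {
    "goal": [1, 1, 8, 9, 1],
    "match": [2, 1, 9, 10, 2],
    "rain": [5, 4, 1, 1, 5],
}
groups = correlated_word_groups(frequencies)
```

Here "goal" and "match" burst in the same windows and are grouped together, while "rain" follows a different pattern and is left out.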
From the event detection perspective, Twitter has also been considered a sensor network in which the users are social sensors. Sakaki and colleagues (Sakaki et al., 2010) used Twitter social sensors (users) to detect earthquake events. They collected tweets and performed semantic analysis for phrases such as earthquake, shaking, and now it is shaking. They also used classification approaches to assign tweets to a positive or negative class, i.e., related to an earthquake event or not. A potential limitation of this work lies in its assumption that people share relevant information in only certain variations of the text; it does not consider semantics at a more conceptual level. Nevertheless, it is an example of specified event detection. Another domain-specific event detection system, the Twitter-based Event Detection and Analysis System (TEDAS), was proposed by Li and colleagues (Li et al., 2012b). The system specifically detected crime and disaster events. TEDAS was partially a rule-based system that crawled tweets based on certain rules, such as specific keywords and hashtags. The tweets were then classified using supervised learning. Within the event detection approaches, these works focus on crisis-specific data, targeting either specific crisis events (earthquakes) or specific vocabulary (keywords and hashtags), and therefore do not scale to multiple crisis types or multilingual crisis data. Table 2.5 shows a comparison of various works with regard to the research scope of this thesis.
Some works focused on extracting events in the form of entities, dates, etc. Ritter and colleagues (Ritter et al., 2012) developed a system, TwiCal, to extract multi-type events relating to sports, politics, and music releases from Twitter and to generate an open-domain calendar. They used an in-domain trained entity tagger (Ritter et al., 2011) instead of the Stanford Tagger. The system extracted entities, dates, and event phrases from the Twitter data. Natural Language Processing techniques have been exploited in further works to perform event detection. Elloumi and colleagues (Elloumi et al., 2013) designed a two-step model for event detection. The first step performs relation extraction and creates binary relations between entities in the text. The second step arranges these relations in a template, which can define an event. Popescu and Pennacchiotti (Popescu & Pennacchiotti, 2010) applied supervised machine learning, using Gradient Boosted Decision Trees (Friedman, 2001), to detect controversial events. For this they used a controversy lexicon from Wikipedia, a bad words lexicon, and an English dictionary. The English dictionary comprised 100k part-of-speech tagged English words, trained over the Wall Street Journal and Brown Corpora*. The work reported promising results; however, it did not cater for controversial events from diverse domains. Alsaedi and colleagues (Alsaedi et al., 2016a) proposed a two-stage classification system for identifying real-world events from Twitter in the Arabic language. The first stage was a classification task in which the data is categorised into events or non-events. The second stage was a clustering stage that groups the data into multiple potential events. For the supervised classification task, a sample of 5000 Arabic tweets was manually annotated into the categories event and non-event. While the work focused on Arabic data, it only demonstrated the event detection problem from a single-language perspective. They used an Arabic-specific stemmer for pre-processing the data.
So far, we have covered the segment of the literature in which the focused domain of information on social media is treated as an event, along with the various approaches to detecting such events. Next, we survey the works that have focused on identifying crisis-oriented information from social media.