In this study, tweets have been characterized based on the following three main components: (i) tweet textual content, (ii) tweet temporal feature, and (iii) tweet spatial feature. The last two components are contextual features describing the time and location at which the tweet was posted. An example tweet for the reference case study addressed in this work is reported in Table 3.4. The tweet components are described below.
Tweet textual content. Tweets are short, user-generated textual messages of at most 140 characters, publicly visible by default. Due to their limited size, these messages are inherently sparse. Moreover, they are usually very noisy because they include a wide variety of Unicode data, symbols, numbers and links. Thus, tweet messages should be properly cleaned and prepared before the data analysis phase. In this study, the textual content has been represented using the Bag-of-Words (BOW) representation as described in Section 2.2.2.
Tweet spatial feature. The spatial feature can be acquired as geo-coordinates, as the location specified in the user profile, or as a location mentioned in the tweet textual content. Geo(graphical)-coordinates (i.e., latitude and longitude) are available when a GPS-enabled device was used to access Twitter. They specify the spatial position of the user when posting the tweet. Instead, the location specified in the user profile is free-text information provided by posters. It usually corresponds to the place (such as city, state or country) where people come from. Since our aim is discovering tweets with similar textual content that were posted in nearby geographical areas (and time periods), we focused on the spatial information provided by geo-coordinates.
Component          Value
Geo-coordinates    52.076171,-1.363145
Creation time      Fri Jun 20 09:26:53 +0000 2014
Textual content    England 2-0 I still believe
Table 3.4: Tweet example
Tweet temporal feature. The temporal feature corresponds to the timestamp, including date and hour, at which the tweet was posted. In this study, we neglect any temporal information appearing in the tweet message itself because it is less relevant for discovering tweets posted close together in time.
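As an illustration of the two contextual features, the following minimal sketch parses the geo-coordinates and creation time of the example tweet in Table 3.4. The function name `parse_context` and the assumption that both fields arrive as plain strings in exactly the format shown in the table are ours, not part of the described system.

```python
from datetime import datetime, timezone

# Format of the creation time as shown in Table 3.4,
# e.g. "Fri Jun 20 09:26:53 +0000 2014".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_context(raw_coordinates, raw_created_at):
    """Return (latitude, longitude, timestamp) for one tweet.

    Hypothetical helper: assumes "lat,lon" text and the timestamp
    layout above. The timestamp is returned timezone-aware, in UTC.
    """
    latitude, longitude = (float(v) for v in raw_coordinates.split(","))
    posted = datetime.strptime(raw_created_at, CREATED_AT_FORMAT)
    return latitude, longitude, posted.astimezone(timezone.utc)


lat, lon, posted = parse_context(
    "52.076171,-1.363145", "Fri Jun 20 09:26:53 +0000 2014"
)
print(lat, lon, posted.isoformat())
```

Keeping the timestamp timezone-aware makes a later conversion to a common reference time zone unambiguous.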
3.2.2.1 Twitter Data Collection and Preprocessing
Tweets are retrieved from twitter.com via Twitter’s Streaming Application Programming Interfaces (APIs). The Streaming APIs provide low-latency access to Twitter’s global stream of tweet data by establishing and maintaining a continuous connection with the stream endpoint. A Java crawler has been used to collect and parse tweets in real time based on a predefined set of keywords (e.g., “worldcup2014” and “fifaworldcup” in our case study), matched case-insensitively. Among the crawled tweets, we retained English tweets only.
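The selection criteria above (case-insensitive keyword match, English only) can be sketched as a simple filter over already-parsed tweets. This is not the Java crawler used in the study: the connection to the stream is omitted, and the record layout (a dictionary with `text` and `lang` keys) and the function names are assumptions made for illustration.

```python
# Keywords of the case study; matching ignores case.
KEYWORDS = ("worldcup2014", "fifaworldcup")


def matches_keywords(text, keywords=KEYWORDS):
    """True if any keyword occurs in the text, ignoring case."""
    lowered = text.casefold()
    return any(keyword.casefold() in lowered for keyword in keywords)


def keep_tweet(tweet):
    """Keep English tweets that mention at least one keyword.

    `tweet` is assumed to be a dict with "text" and "lang" keys.
    """
    return tweet.get("lang") == "en" and matches_keywords(tweet.get("text", ""))


sample = {"text": "Come on England! #WorldCup2014", "lang": "en"}
print(keep_tweet(sample))
```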
To prepare the raw tweet textual content for the subsequent mining process, some preliminary data cleaning and processing steps have been applied. First, numbers, usernames and URLs mentioned in the content have been removed. Then, after converting the letters to lowercase, tweet messages are cleaned by eliminating stop words (such as “is”, “at” and “the”) and represented according to the Bag-of-Words representation. Collected tweets may be posted from different time zones. To compute the temporal distance between messages, the tweet time information has been preliminarily converted to a reference time zone. For example, in the use case considered in this study, the Brazil (America/Sao_Paulo) time zone, where the 2014 FIFA World Cup was held, has been selected as the reference.
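The cleaning steps above can be sketched as follows. The regular expressions, the three-word stop list and the function names are illustrative assumptions rather than the exact rules applied in the study. To keep the sketch free of time zone database dependencies, a fixed UTC-3 offset stands in for America/Sao_Paulo; this holds for the June-July 2014 tournament period, whereas `zoneinfo.ZoneInfo("America/Sao_Paulo")` would be the general choice.

```python
import re
from datetime import datetime, timedelta, timezone

# Illustrative stop list; a real one would be much longer.
STOP_WORDS = {"is", "at", "the"}

# Assumption: fixed UTC-3 offset for the tournament period.
REFERENCE_TZ = timezone(timedelta(hours=-3))

URL_RE = re.compile(r"https?://\S+|www\.\S+")
USERNAME_RE = re.compile(r"@\w+")
NUMBER_RE = re.compile(r"\d+")
WORD_RE = re.compile(r"[a-z]+")


def clean_text(text):
    """Return the list of words left after cleaning one tweet."""
    # Remove URLs first so their fragments are not mistaken for words.
    for pattern in (URL_RE, USERNAME_RE, NUMBER_RE):
        text = pattern.sub(" ", text)
    words = WORD_RE.findall(text.lower())
    return [word for word in words if word not in STOP_WORDS]


def to_reference_time(posted):
    """Convert a timezone-aware timestamp to the reference time zone."""
    return posted.astimezone(REFERENCE_TZ)


print(clean_text("England 2-0 I still believe"))
print(to_reference_time(datetime(2014, 6, 20, 9, 26, 53, tzinfo=timezone.utc)))
```

The returned word list keeps repetitions, so it can be turned into Bag-of-Words counts afterwards.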
Tweets may be spread over a vast geographical area and/or time frame. In this case, the tweet collection can be preliminarily partitioned into segments covering the same geographical area and/or time period. Then, the cluster analysis is performed locally in each segment.
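The text does not specify how segments are defined. One simple possibility, shown here purely as an assumption, is to bucket tweets by a regular latitude/longitude grid cell and a fixed-length time window; the cell size, window length, tuple layout and function names below are all hypothetical.

```python
import math
from collections import defaultdict
from datetime import datetime, timezone


def segment_key(latitude, longitude, posted, cell_deg=1.0, window_hours=24):
    """Return a (grid cell, time window) key for one tweet.

    Hypothetical scheme: square cells of `cell_deg` degrees and
    windows of `window_hours` hours counted from the Unix epoch.
    """
    cell = (math.floor(latitude / cell_deg), math.floor(longitude / cell_deg))
    window = int(posted.timestamp() // (window_hours * 3600))
    return cell, window


def partition(tweets, cell_deg=1.0, window_hours=24):
    """Group tweets, given as (posted, (lat, lon), text), by segment."""
    segments = defaultdict(list)
    for posted, (latitude, longitude), text in tweets:
        key = segment_key(latitude, longitude, posted, cell_deg, window_hours)
        segments[key].append((posted, (latitude, longitude), text))
    return dict(segments)


day = datetime(2014, 6, 20, 9, 26, 53, tzinfo=timezone.utc)
tweets = [
    (day, (52.076171, -1.363145), "england 2-0 i still believe"),
    (day.replace(hour=20), (52.9, -1.1), "come on england"),
    (day, (-23.55, -46.63), "vai brasil"),
]
for key, members in partition(tweets).items():
    print(key, len(members))
```

Each resulting segment could then be clustered independently.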
3.2.2.2 Twitter Data Representation
This section formalizes the adopted data representation for the textual, temporal and spatial information of tweets.
Definition 3.1. Tweet data collection. Let $D$ be a collection of tweets and $\Sigma = \{w_1, \ldots, w_k\}$ the set of words appearing in at least one tweet in $D$. An arbitrary tweet $tw_i \in D$ is represented as a triplet $tw_i = (t_{tw_i}, s_{tw_i}, W_{tw_i})$, where $t_{tw_i}$ and $s_{tw_i}$ are respectively the temporal and spatial features of $tw_i$, while $W_{tw_i}$ is the tweet textual content.
According to the tweet characterization in Section 3.2.2, the temporal feature $t_{tw_i}$ is the timestamp at which tweet $tw_i$ was posted, while the spatial feature $s_{tw_i}$ is the pair of geo-coordinates (latitude, longitude) reporting where tweet $tw_i$ was posted. $W_{tw_i}$ represents the set of words $w_j$, $w_j \in \Sigma$, appearing in tweet $tw_i$.
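The triplet of Definition 3.1 maps naturally onto a small record type. The sketch below is one possible encoding, with class and field names chosen for illustration; the words are stored as a tuple rather than a set so that repetitions remain available for later frequency counts.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Tweet:
    """One tweet tw_i = (t, s, W) as in Definition 3.1."""

    t: datetime               # temporal feature: posting timestamp
    s: tuple[float, float]    # spatial feature: (latitude, longitude)
    W: tuple[str, ...]        # textual content: words after cleaning


def vocabulary(collection):
    """Return Sigma, the set of words appearing in at least one tweet."""
    return {word for tweet in collection for word in tweet.W}


D = [
    Tweet(
        t=datetime(2014, 6, 20, 9, 26, 53, tzinfo=timezone.utc),
        s=(52.076171, -1.363145),
        W=("england", "still", "believe"),
    ),
    Tweet(
        t=datetime(2014, 6, 20, 9, 30, 0, tzinfo=timezone.utc),
        s=(52.9, -1.1),
        W=("england", "goal", "goal"),
    ),
]
print(sorted(vocabulary(D)))
```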
The Term Frequency (TF) - Inverse Document Frequency (IDF) scheme [2], commonly used in text mining, has been adopted to highlight the relevance of specific words for each tweet. This scheme reduces the importance of terms that are common across the collection. It allows the tweet matching in the subsequent clustering phase to focus on words specific to each set of tweets rather than on words appearing in most tweets.
To weight word relevance based on the TF-IDF scheme, the tweet textual content is transformed using the Vector Space Model (VSM) representation [6]. Each tweet is a vector in the word space. Each vector element corresponds to a different word and holds the TF-IDF weight describing the relevance of that word for the tweet.
Definition 3.2. Tweet textual content representation. Let $tw_i = (t_{tw_i}, s_{tw_i}, W_{tw_i})$ be an arbitrary tweet in collection $D$. The tweet textual content $W_{tw_i}$ is a vector of $|\Sigma|$ cells in the word space $\Sigma$ of $D$. Each vector element $W_{tw_i}[w_j]$ contains the TF-IDF weight of word $w_j$ for tweet $tw_i$. $W_{tw_i}[w_j]$ is computed as $W_{tw_i}[w_j] = TF_{tw_i,w_j} \cdot IDF_{w_j}$, where terms $TF_{tw_i,w_j}$ and $IDF_{w_j}$ are defined as follows.
1. $TF_{tw_i,w_j}$ is the relative frequency of word $w_j$ in $tw_i$: $TF_{tw_i,w_j} = f_{tw_i,w_j} / \sum_{1 \le k \le |\Sigma|} f_{tw_i,w_k}$, where $f_{tw_i,w_j}$ is the number of times word $w_j$ appears in tweet $tw_i$ and $\sum_{1 \le k \le |\Sigma|} f_{tw_i,w_k}$ is the total number of words contained in $tw_i$.
2. $IDF_{w_j}$ is the inverse document frequency of word $w_j$ in $D$: $IDF_{w_j} = \log\left( |D| / |\{tw_k \in D : f_{tw_k,w_j} \neq 0\}| \right)$, where $|D|$ is the number of tweets in $D$ and $|\{tw_k \in D : f_{tw_k,w_j} \neq 0\}|$ is the number of tweets in $D$ which contain word $w_j$ at least once.
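The weighting in Definition 3.2 can be sketched directly from the two formulas. The code below is an illustrative implementation, not the one used in the study: it takes each tweet as a list of already-cleaned words, uses the natural logarithm (an arbitrary choice of base), and stores each vector sparsely as a dictionary, so words absent from a tweet implicitly have weight 0. The function name `tf_idf` is ours.

```python
import math
from collections import Counter


def tf_idf(collection):
    """Return one {word: TF-IDF weight} dict per tweet.

    `collection` is a list of word lists, one per tweet in D.
    """
    n_tweets = len(collection)

    # Document frequency: number of tweets containing each word.
    document_frequency = Counter()
    for words in collection:
        document_frequency.update(set(words))

    weights = []
    for words in collection:
        counts = Counter(words)           # f_{tw_i, w_j}
        total = sum(counts.values())      # total words in tw_i
        weights.append({
            word: (count / total) * math.log(n_tweets / document_frequency[word])
            for word, count in counts.items()
        })
    return weights


D = [
    ["england", "still", "believe"],
    ["england", "goal"],
    ["goal", "goal", "brazil"],
]
for vector in tf_idf(D):
    print(vector)
```

A word occurring in every tweet of the collection receives weight 0, since the ratio inside the logarithm equals 1.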
Mathematically, the base of the log function for the IDF computation in Definition 3.2 does not matter, since changing it only scales the overall result by a constant multiplicative factor. The TF-IDF weight $W_{tw_i}[w_j]$ for word $w_j$ in tweet $tw_i$ is high when $w_j$ appears with high frequency in tweet $tw_i$ but in few tweets of the collection $D$. When word $w_j$ appears in more tweets, the ratio inside the IDF's log function approaches 1, and the $IDF_{w_j}$ value and the TF-IDF weight $W_{tw_i}[w_j]$ become close to 0. Hence, the approach tends to filter out common words. In short messages such as tweets, the TF-IDF weighting score could actually boil down to a pure IDF due to the limited word frequency within each tweet. Nevertheless, we preserved the TF-IDF approach to also account for possible word repetitions.
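A small worked example may make this behaviour concrete. The numbers are hypothetical and not taken from the case study: a collection of $|D| = 1000$ tweets, a rare word contained in 10 tweets, a common word contained in 900 tweets, and a tweet of 5 words in which each of the two words appears once. The natural logarithm is used.

```latex
\begin{align*}
IDF_{\text{rare}}   &= \ln\frac{1000}{10}  = \ln 100 \approx 4.605, &
W[\text{rare}]   &= \frac{1}{5} \cdot 4.605 \approx 0.921, \\
IDF_{\text{common}} &= \ln\frac{1000}{900} \approx 0.105, &
W[\text{common}] &= \frac{1}{5} \cdot 0.105 \approx 0.021.
\end{align*}
```

Both words have the same TF of $1/5$, so the difference between the two weights comes entirely from the IDF term, which illustrates why the score behaves almost like a pure IDF on short messages.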