The features of candidates’ media content, including both news coverage and social media activity, are at the core of this study. Because of the volume of modern-day media data needed for such an analysis, gathering the data was a complex process that required a combination of manual and computational tools. Moreover, gathering news media and social media data required different methods, and each posed unique challenges. The following sections detail the processes used to gather data for both media channels.

To analyze the news coverage of the candidates in the 2008-2016 U.S. Senate elections, all coverage of these candidates was downloaded using LexisNexis. For each candidate, a search was performed in the LexisNexis database using the candidate’s full name, with a time frame of six months prior to the election. A Python script was then used to parse the results into separate articles and to collect article-level metadata.

The search was performed in the LexisNexis database for all U.S. Newspapers and Wire Services (LexisNexis code: 140954). This strategy was chosen over the more common strategy of searching only major newspapers, which is unlikely to identify local coverage; for Senate elections, local coverage might be of critical importance. However, the database has some limitations that need to be considered. LexisNexis is by no means identical to the print editions of newspapers and as such may provide a somewhat biased sample of news coverage (Ridout et al., 2012). The database does not contain all outlets publishing in the U.S. and might thus be missing both large and small news outlets. However, this problem is more common for wire services data and for international news. Therefore, in the context of this study, for which I rely on local news, such problems should be less acute. Further, while some local sources might be missing from the sample, candidates are compared to their counterparts, so any biases in the LexisNexis database are likely to influence both candidates in similar ways, thus limiting the overall bias in estimating media features.

Additional limitations are the inclusion of duplicate items (for example, a wire service article that was printed verbatim by another news outlet) and of server test items in the database. Test items were identified as extremely short items and excluded. Duplicate articles were identified and removed using a random 200-character string taken from the middle of each article. If that exact 200-character string was found in another, already archived article, the article was deemed a duplicate and was not archived again. I elaborate on these issues when addressing the pre-processing procedures.
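
The filtering logic described above can be sketched as follows. This is an illustration under stated assumptions rather than the script used in the study: the length cutoff `MIN_CHARS`, the choice of the middle third of the article as the region from which the string is drawn, and the function names are all hypothetical. Only the 200-character string length comes from the procedure itself.

```python
import random

MIN_CHARS = 300    # hypothetical cutoff for "extremely short" test items
SNIPPET_LEN = 200  # string length stated in the text


def middle_snippet(text, rng, length=SNIPPET_LEN):
    """Return a slice of `length` characters starting at a random offset
    in the middle third of the article (the window is an assumption)."""
    if len(text) <= length:
        return text
    last_start = len(text) - length
    lo = min(len(text) // 3, last_start)
    hi = min(2 * len(text) // 3, last_start)
    start = rng.randint(lo, hi)
    return text[start:start + length]


def deduplicate(articles, min_chars=MIN_CHARS, seed=0):
    """Drop very short items, then archive an article only if its middle
    snippet does not occur in any article archived before it."""
    rng = random.Random(seed)
    archive = []
    for text in articles:
        if len(text) < min_chars:
            continue  # treated as a server test item
        snippet = middle_snippet(text, rng)
        if any(snippet in kept for kept in archive):
            continue  # exact snippet already archived: duplicate
        archive.append(text)
    return archive
```

Drawing the string from the middle rather than from the start makes the check robust to outlet-specific headers and bylines, which often differ between a wire article and its verbatim reprints. The linear scan over the archive is kept for clarity; a large corpus would need an index.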

Data on social media activity were gathered for a more limited time frame (2012-2016) due to the limited availability of the data. Several off-the-shelf tools and packages are available for mining social media data in general and Twitter data in particular. These tools allow researchers and developers to search the Twitter database (REST API) as well as to observe the ongoing stream of tweets that are constantly uploaded to the website (Streaming API). However, both services come with rate and size limitations, and research has shown that they might deliver non-representative samples of the requested content (Tromble, Storz, & Stockmann, 2017). Thus, a non-API approach was chosen for the data retrieval in this study.

To gather content for all candidates, the Twitter username (handle) of each candidate needed to be obtained. This was done using a combination of methods offered in previous studies (Bode et al., 2016; Bright et al., 2018; Jungherr, 2016). First, official Twitter pages were gathered from Wikipedia, Ballotpedia, and the candidates’ websites. A Google search was also performed using the candidate’s name, state, and the keywords “Twitter,” “campaign” and the year of the race. The first two pages of Google search results were manually examined to identify additional viable Twitter pages and to assess whether each related to the candidate or instead to another individual with the same name, a parody account, a hijacked account, or another non-genuine campaign page. For this task, timing was found to be critical. Several candidates’ pages had been removed from Twitter or hijacked by a third party by the time the search was conducted, as could be seen from the content of the page feed. For example, the Twitter handle “@SadlerTX” was previously attached to candidate Paul Sadler but has since been claimed by a Russian-speaking individual. Therefore, I decided to focus on more recent elections (2012-2016), for which more pages still existed online.

Following the identification of candidate-related Twitter usernames, all activity on these pages (all tweets written by the user) was downloaded and parsed into separate tweets using a custom-built Python script. The resulting data included the text of the tweets along with any tweet-level metadata supplied by the page. I chose this more direct approach over other search methods because those methods can limit the amount of data gathered from Twitter pages or even skew search results due to unknown criteria for inclusion (Tromble, Storz, & Stockmann, 2017). This is especially the case for candidates with a large volume of Twitter activity during the elections. Finally, as was done for the news data, duplicate tweets were identified. These tweets were not removed at every stage or for every method, as will be elaborated in the topic modeling section in this chapter.
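
Because duplicate tweets are kept for some analyses and dropped for others, one convenient design is to flag them instead of deleting them. The sketch below illustrates that idea only. It assumes each tweet is a dictionary with a "text" field and treats two tweets as duplicates when their text matches after lower-casing and whitespace normalization; both the field name and the matching rule are assumptions, not the procedure used in the study.

```python
def flag_duplicates(tweets):
    """Return copies of the tweets with an added `is_duplicate` flag.

    The first occurrence of a text is flagged False; later tweets with
    the same normalized text are flagged True. Nothing is removed.
    """
    seen = set()
    flagged = []
    for tweet in tweets:
        # Normalization rule is an assumption: lower-case, collapse whitespace.
        key = " ".join(tweet["text"].lower().split())
        flagged.append({**tweet, "is_duplicate": key in seen})
        seen.add(key)
    return flagged
```

A later analysis step can then filter on `is_duplicate` when a method calls for unique tweets and ignore the flag otherwise.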
