SÍNTESIS - Núm. 42 (2008-1) CUADERNOS GEOGRÁFICOS

As Webster & Watson (2002, pXIII) make clear, ‘A review of prior, relevant literature is an essential feature of any academic project. An effective review creates a firm foundation for advancing knowledge. It facilitates theory development, closes areas where a plethora of research exists, and uncovers areas where research is needed.’

In an era in which ever-increasing numbers of journals and journal articles examine emergent phenomena, such as Online Social Networks, new techniques are required to search for, select and synthesise academic material (JISC, 2012). Ware & Mabe (2015, p6) have estimated that around 2.5 million peer-reviewed scientific, technical and medical articles were published in the English language in 2014. As the rate of production of scholarly output increases and the ease with which electronic documents can be stored and searched improves, nascent text-mining and knowledge discovery technologies – used elsewhere in this research to interrogate social media data (Chapter 5, p186) – now offer high degrees of utility when analysing large corpora of published academic work.

Fully ‘systematic’ literature reviews, particularly popular in the medical research community and increasingly being applied by ‘early adopters’ in the social sciences and humanities (JISC, 2012, p4), rely upon searches for literature, based on key terms, executed against multiple online academic repositories or databases, e.g., JSTOR, Web of Science, PubMed etc. The approach adopted here, more fully detailed in Section 2.2.1 below, may best be described as ‘semi-systematic’; over 1,250 articles have been selected for inclusion in the research literature corpus over a period of more than 7 years, based upon searches executed on popular online repositories and publishers’ websites as well as alerts set up on, and emails received from, Google Scholar and learned societies such as the Political Studies Association. Articles selected for inclusion in the research literature corpus are stored in Mendeley Desktop bibliographic management software and have been read either in full or sectionally, by searching for key terms and the paragraphs or sections that contain them. The literature review which follows (Sections 2.4 to 2.7, pp64-88) therefore mixes a conventional scholarly approach to the task, with a synopsis of key themes given in Section 2.3 (p61), alongside several computerised methods outlined in the following paragraph and more fully described in Sections 2.2.1 and 2.2.2, below.

Usai, Pironti, Mital, & Aouina Mejri (2018) have suggested that a systematic review of literature may be conducted ‘by applying “text mining at the term level, in which knowledge discovery takes place on a more focused collection of words and phrases that are extracted from and label each document” (Feldman et al., 1998, p1). This approach involves extracting labels which correspond to keywords, which consequently represent the main topic of an article.’ Term labelling may be achieved manually (e.g., by the researcher marking up article text identifying key terms based upon their own domain expertise) for training in machine learning applications or automatically, as here, through the use of algorithmic processing, e.g., the creation of Term Frequency – Inverse Document Frequency (TF-IDF) matrices (Section 2.2.2.2, p58). Either approach is designed to ‘find nuggets in mountains of textual data’ (Dörre, Gerstl, & Seiffert, 1999), helping to identify key themes running through substantial bodies of literature and to organise the review process accordingly.

The following sections of this chapter describe the methods used to search for literature (Section 2.2.1) and present results of data and text-mining analysis (Section 2.2.2, p57). Through this work 136 key terms, each mentioned over 4,000 times within the research literature corpus, have been identified programmatically. Using acquired domain knowledge (Alexander, 1992), based upon a reading of these articles, identified terms have been assigned (Section 2.2.2.3, p59) to four disciplinary categories: political, communications, geographical and technical. The literature relating to Online Social Network usage, as it applies to each category, is discussed below, starting in Section 2.4 (p64). First, the methods used to search for and store literature are set out.

2.2.1 Methods

This section details the methods used in the literature review, considering the scope of the review, sources of literature and the potential for bias in study selection (Section 2.2.1.1). Section 2.2.1.2 (p56) outlines the specific methods used to interrogate the research literature corpus held in Mendeley (2016) bibliographic management software, further details of which are given in Appendix 3 (p414).

2.2.1.1 Scope of the review, sources of literature and potential for bias

Recognising that a literature review is inherently a ‘retrospective, observational’ task (A. F. Smith & Carlisle, 2015), and that the search process is ‘no more free from the impact of human subjectivity than other research’ (Okoli & Schabram, 2010, p2), the approach adopted here is semi-systematic, aiming to be as ‘explicit’, ‘comprehensive’ and ‘reproducible’ as possible, in line with Fink's (2005) recommendations. The literature search uses a range of Web-hosted databases and email-based alert tools, as advocated by Dunleavy (2003), Trafford & Leshem (2008) and M. Wallace & Wray (2011). Various systems have been used, including those developed by the University of Portsmouth Library, the Joint Information Systems Committee (JISC) and the British Library. Searches have also been conducted on the Web of Science, Google Scholar and on websites developed by academic publishers including Sage Publishing, Taylor & Francis and John Wiley & Sons, amongst others. Regular email alerts from Google Scholar, each covering specific topic areas and typically returning around ten potentially relevant articles per email, have also been used.

In this study:

• Searches are conducted in the English language, although non-English texts have not been specifically excluded;

• Preference for inclusion is shown towards published works, particularly works published in journals in recent years;

• Search terms used in alert services have evolved iteratively with the longest running searches on Google Scholar (>1,200 emails since 2011) indexing:

o [ politics "social network" ]

o [ intitle:"geo tagging" ]

• Cross-referencing and article recommender systems have also been used to expand the ‘pool’ of available literature (Teppan & Zanker, 2015).

In ‘screening for inclusion and exclusion’ (Okoli & Schabram, 2010) consideration has been given to:

• The quality of academic writing including the use of English (grammar, spelling, punctuation) and the accuracy and extent of referencing;

• The quantitative measurement of ‘relevance’ as exhibited by citation and/or other bibliometric scores (e.g., journal ‘impact factor’).

Altogether, over 1,250 bibliographic references have been saved to Mendeley during the course of this research. The following section briefly describes how features in Mendeley Desktop, allied to third-party technologies, usefully enable bibliometric analysis of the research literature corpus.

2.2.1.2 Bibliometric analysis

Using the Help -> Create Backup… menu in Mendeley Desktop it is possible to create a backup of stored references. These are saved to an SQLite (2016) database file that may be opened, viewed and queried using open-source software (DB Browser for SQLite, 2016). Figure 2-1, based on analysis from this workflow, illustrates the number of references by publication type (journal article, book etc.) selected for inclusion in the literature search and stored in Mendeley Desktop.

Figure 2-1 – Number of references by publication type selected for inclusion

Mallig (2010) has highlighted the benefits of using ‘relational databases […] in the field of bibliometrics.’ RDBMSs such as SQLite, the underlying storage technology used by Mendeley Desktop, not only store data (e.g., year of publication etc.) in tables but enable queries to be executed against this stored data. Figure 2-1, which shows that journal articles comprise the majority (68.7%) of saved references in the research literature corpus, and Figure 2-2 (p57), which shows the number of references by publication type by year, could not have been created within Mendeley Desktop itself, but can be graphed using Mendeley’s SQLite backup file and a SQL query run in DB Browser for SQLite (Appendix 11 listing 3, p479). Further details of this, and alternative, techniques for querying bibliographic data are given in Appendix 3 (p414) of this thesis. Pertinent results from this bibliometric analysis exercise are detailed in the following section, alongside a report of the results of text-mining operations conducted in R (The R Foundation, 2018).
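The kind of query described above can be illustrated with a minimal sketch. It is written in Python rather than in DB Browser for SQLite, and it runs against a toy in-memory database rather than a real Mendeley backup. The table name `Documents`, the columns `type` and `year`, and the sample rows are all assumptions made for illustration; the query actually used in this research is the one listed in Appendix 11.

```python
import sqlite3

# Toy stand-in for a Mendeley Desktop backup file. The table and column
# names (Documents, type, year) and the rows below are assumptions for
# illustration only; they are not taken from the thesis.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Documents (id INTEGER PRIMARY KEY, type TEXT, year INTEGER)"
)
rows = [
    ("JournalArticle", 2013),
    ("JournalArticle", 2014),
    ("JournalArticle", 2014),
    ("Book", 2011),
    ("ConferenceProceedings", 2014),
]
conn.executemany("INSERT INTO Documents (type, year) VALUES (?, ?)", rows)


def references_by_type(connection):
    """Count references per publication type, with each type's share in %."""
    total = connection.execute("SELECT COUNT(*) FROM Documents").fetchone()[0]
    query = """
        SELECT type, COUNT(*) AS n
        FROM Documents
        GROUP BY type
        ORDER BY n DESC, type ASC
    """
    return [(t, n, round(100.0 * n / total, 1)) for t, n in connection.execute(query)]


def references_by_year_and_type(connection):
    """Count references per (year, publication type) pair."""
    query = """
        SELECT year, type, COUNT(*) AS n
        FROM Documents
        GROUP BY year, type
        ORDER BY year, type
    """
    return connection.execute(query).fetchall()


by_type = references_by_type(conn)
by_year = references_by_year_and_type(conn)
```

Under these assumptions, the first function yields the kind of per-type breakdown plotted in Figure 2-1, and the second yields the per-year, per-type counts plotted in Figure 2-2.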

2.2.2 Results

2.2.2.1 Literature recency by publication type

A data-based approach provides useful information about the shape (Figure 2-1, p56 and Figure 2-2, below) and composition (Table 1-1, p38) of the research literature corpus. It is possible to draw two key conclusions from this analysis:

1. The literature search exhibits a strong degree of recency, and;

2. The literature search exhibits a strong degree of cross-disciplinarity.

Outwith academic ‘Geography’ many of the references collected as part of this search have been published in ‘Political’, ‘Communications’ or ‘Computer Science’ journals, some of which, e.g., Mobile Media & Communication (Volume 1, 2013), have only recently been established.

Figure 2-2 – Number of references by year by type selected for inclusion

These conclusions support the view a) that growth in OSNs and other forms of mobile communication is leading to new forms of scholarship (R. M. Chang, Kauffman, & Kwon, 2014; Cresswell, 2014), and b) that plenty of geographically relevant content can be found in journals outside the traditional publication bounds of the discipline of geography itself (Miller & Goodchild, 2015).

2.2.2.2 Literature mining for key terms

Around 90% of the ~1,250 references stored in Mendeley Desktop include a PDF file containing source literature content. Using computer programs (Appendix 3, p414) developed in R (The R Foundation, 2018) the content of 1,111 PDF files has been text-mined using a Term Frequency – Inverse Document Frequency (TF-IDF) algorithm.

TF-IDF scores account for ‘the frequency of terms appearing in a document, the length of the document in which any particular term appears, and the overall uniqueness of the terms across documents in the entire corpus’ (Russell, 2011, p151). A large number of terms (185,577) from 1,111 PDFs containing 159.6MB of written text (28.5 times more than the seminal, 5.6MB, Complete Works of William Shakespeare digitised by Project Gutenberg) have been identified using R’s Text Mining (TM) package (Feinerer, Hornik, & Artifex Software Inc, 2016).
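For reference, one common formulation of the weighting is given below. It is stated as an assumption, since the exact variant applied in this research is not specified here. The three factors named by Russell map onto the term count, the document-length normalisation and the logarithmic inverse document frequency respectively.

```latex
\[
\mathrm{tfidf}(t, d) \;=\; \frac{f_{t,d}}{\sum_{t'} f_{t',d}} \times \log_2 \frac{N}{n_t}
\]
```

Here $f_{t,d}$ is the number of times term $t$ occurs in document $d$, the denominator is the total number of terms in $d$, $N$ is the number of documents in the corpus and $n_t$ is the number of documents containing $t$. A term that appears in every document therefore scores zero.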

In pseudo-code the steps involve:

• Mounting Mendeley’s PDF document repository as a ‘shared folder’ on a Linux Virtual Machine (VM) set up with the R and RStudio packages, and;

• Running scripts written in R to create a corpus from the PDF files, converting all text to lower case, removing punctuation, numbers and English-language stop words (‘and’, ‘the’ etc.) before performing statistical analysis.
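The second step above can be sketched as follows. The sketch is a minimal illustration in Python using only the standard library, not the R scripts used in this research (for which see Appendix 3). It assumes that text has already been extracted from the PDF files. The stop-word list, the sample documents and the function names `preprocess` and `tf_idf` are hypothetical, and the weighting (length-normalised term frequency multiplied by a base-2 logarithmic inverse document frequency) is one common variant, assumed here.

```python
import math
import re
from collections import Counter

# A deliberately tiny, illustrative stop-word list (an assumption); a real
# run would use a fuller English-language list.
STOP_WORDS = {"and", "the", "of", "in", "a", "to", "is", "on"}


def preprocess(text):
    """Lower-case, strip punctuation and numbers, then drop stop words."""
    text = text.lower()
    # Keep only a-z and whitespace: removes punctuation and digits.
    # This English-only simplification also drops accented letters.
    text = re.sub(r"[^a-z\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: score} dictionary per input document."""
    tokenised = [preprocess(doc) for doc in documents]
    n_docs = len(tokenised)

    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        length = len(tokens) or 1  # guard against empty documents
        scores.append({
            term: (count / length) * math.log2(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Made-up sample texts standing in for extracted PDF content.
docs = [
    "Twitter and the political tweets of 2014.",
    "Spatial analysis of political geography.",
    "The social media of Twitter!",
]
scores = tf_idf(docs)
```

In this toy corpus ‘tweets’ occurs in one document only and so scores higher in the first document than ‘twitter’, which occurs in two. Results of this kind can then be filtered against a threshold before being tabulated or visualised.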

Results may be tabulated or, as in Figure 2-3 (p59), visualised using a Word Cloud. Top ranked terms include, as expected, the words ‘political’, ‘tweets’, ‘twitter’, ‘social’ and ‘media’. Less prominent terms include ‘spatial’, ‘geography’, ‘analytics’ and more.

Figure 2-3 – Key terms (TF-IDF frequency >4,000) identified in the literature corpus

The successful identification of key literature concepts through TF-IDF analysis of the stored research repository supports JISC's (2012, p3) assertion that ‘text mining and analytics of […] scholarly literature and other digitised text affords a real opportunity to support innovation and the development of new knowledge.’

2.2.2.3 Literature categorisation for thematic analysis

Using acquired ‘domain knowledge’ (Alexander, 1992) the top 136 categorisable terms (frequency > 4,000, Figure 2-3) identified by TF-IDF analysis of the research literature corpus have been ‘hand-coded’ (Swanson & Holton, 2005) into four thematic classes: 65 terms are coded technical, 33 political, 24 communications and 14 geographical. While some overlap between terms (e.g., ‘respondents’) in classes is inevitable, and twelve other terms (e.g., ‘business’) could not easily be categorised, thematic analysis helps to identify prominent concepts in four inter-related bodies of material derived from the literature search.
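Hand-coding of this kind amounts to a lookup from term to class followed by a tally. The minimal Python sketch below illustrates the idea. The `TERM_CLASSES` mapping and the function `tally_classes` are hypothetical, and the class given to each sample term is an illustrative guess, not the coding applied in this research. Only the four class totals are taken from the text.

```python
from collections import Counter

# Hypothetical hand-coded lookup. The class assigned to each term here is
# illustrative only and does not reproduce the thesis's actual coding.
TERM_CLASSES = {
    "political": "political",
    "twitter": "technical",
    "media": "communications",
    "spatial": "geographical",
    "analytics": "technical",
}


def tally_classes(terms, lookup):
    """Count terms per thematic class; unknown terms are 'uncategorised'."""
    return Counter(lookup.get(term, "uncategorised") for term in terms)


sample_terms = ["political", "twitter", "media", "spatial", "analytics", "business"]
tally = tally_classes(sample_terms, TERM_CLASSES)

# Class sizes as reported in the text; they sum to the 136 categorised terms.
reported = {"technical": 65, "political": 33, "communications": 24, "geographical": 14}
total_reported = sum(reported.values())
```

Terms absent from the lookup, such as ‘business’, fall into an ‘uncategorised’ bucket, mirroring the terms that could not easily be categorised.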

Figure 2-4 – Percentage publication titles by class

Figure 2-4 shows the percentage of 556 distinct publication titles (e.g., New Media & Society, The Guardian) allocated to each of the same four classes. As with key terms, technical publications comprise the majority (> 53%) of all references held. Geographical, communications and political classes together comprise ~42% of all references. A fifth class, news, of which there has been a great deal in the subject area during the research programme, accounts for just under 5% of all references selected for inclusion by the literature search.

2.2.2.4 Value and benefit of text-mining

The quantitative and thematic analyses detailed above, together with a great deal of reading, have helped to distil several key contextual leitmotifs from a large research literature corpus examining OSN usage across four cross-disciplinary boundaries, further illustrating the ‘value and benefit of text-mining’, identified by JISC (2012), in conducting literature reviews (Section 2.2, p52). A contextual synopsis and overview of the curated research literature corpus follows in Section 2.3, after which key terms, concepts and select papers from each of the four main thematic classes shown in Figure 2-4 are discussed consecutively in Sections 2.4 to 2.7 (pp64-88). Technical terms, and the technical literature, are covered last in this synthesis, as politics, communications and geography do most to conceptually frame the current research project.
