Having established my aims and research question, in this section, I will outline how I designed and conducted this research, as well as the methodological considerations involved in the selection and analysis of data in this thesis. These were primarily driven by my aims and research questions in terms of:
• How are social media used to self-manage CD?
• How might modes of gamification be used to explore and visualise the self-management of CD and comorbid illness?
In what follows, I break the discussion into eight sections, as a way of providing a thorough overview of the various methodological issues intrinsic in doing the research. Firstly, I explain how I have selected samples from social media. I then introduce readers to the notion and use of APIs in this study, including interface restrictions and the problem of different timelines across different platforms. Thirdly, I discuss the issues involved in using cross-platform media. Here I also outline how and why I have concentrated on two main cities, mainly London and New York. Fourthly, I explain my use of hashtags throughout the study. Fifthly, I zoom in on Facebook comments and why I have analysed those, specifically in the writing of Chapter 4. Sixthly, I discuss of Instagram and the issues involved in reading digital images. And finally, I summarise the ethical issues involved in this study, and how I have tackled them, before concluding.
The chapter is divided into two parts. In the first part, I talk about the general methods of data collection and sampling that I have used, using hashtags.
Selecting Social Media Data Samples
When researching the health-related behaviour of individuals on social media, one must be cognisant of the variety of ways that users use social media, as well as the different kinds of devices they employ to do so (Marres, Gerlitz & Studies, 2016). As social media platforms have started to
be used to share similar sorts of information, such as images and text across both Twitter and Instagram, social researchers are increasingly having to adapt their research methods to catch-up with the myriad of ways that information on key aspects of social life (and in turn chronic social life) are communicated (Michael & Lupton, 2015). This involves understanding how data is channelled, stored and propagated via social media platforms like Twitter and Instagram, as well as how the data produced by individuals can be changed, re-shared and ultimately communicated. Indeed, this thesis is somewhat unique in the way it uses a range of digital data to explore a particular area of social life.
Most traditional social science theses are structured in a way that allows them to collect and analyse data from a specific source, be it interviewees, archived data, or data from statistical databases. However, the way that my methodology is structured is somewhat different, due to the evolving nature of the topics that were captured and followed over the period of time that specific events played out over social media. With regards to the structure of data output by different social media platforms, for the purposes of answering my research questions, the data from Twitter and Instagram were interpreted in different ways. Within the context of researching the representation of CD/comorbidity, the anatomy of the text/images within a tweet or Instagram post were categorised as either statements about a Coeliac’s experience of self-managing CD/comorbidity, as questions about self-managing CD/comorbidity or the GFD, or as part of an ongoing conversation about their experience of CD/comorbidity or the GFD within the Coeliac community or to their social network audience at large. Some of these conversations were short, or extended over a prolonged period of time. For example, when looking at data informed by analysis of my first research question of how some Coeliacs use social media to self-manage CD, and in particular how they dealt with risk at the time of a gluten free food recall – the usual 31 day data collection period had to extended over the course of two years. This was because it took that amount of time for both the incidents in both the UK and the US to play out, and for me to fully capture the events, statements and conversations as they happened.
A key tool I used was Netlytic.org (Gruzd, 2016) to maintain a data collection feed of Twitter and Instagram feeds in relation to key Coeliac and co-related risk hashtags for a continuous period of time between January 2014 and September 2016. This was due to the very nature of these particular social media platforms. That is, due to a cap on free historical data, I had to maintain a period of continuous sampling to make sure I collected a robust and active data sample. This is because Twitter and Instagram do not allow free retroactive data sampling, and because in my situation (and potentially those of other social researchers) there was no budget to pay for expensive and large amounts of historical data.
Table 1 gives an overview of each main data collection period, and how it relates to each chapter in this thesis. It also shows the sub-sample periods for each topic discussed, as well as the sample methods used for each dataset. There are two main datasets, the Symptoms dataset, and the Risk dataset. Both datasets were sampled and used to answer my two research questions:
• How are social media used to self-manage CD?
• How might modes of gamification be used to explore and visualise the self-management of CD and comorbid illness?
The Symptoms dataset sampled a total of 18.15k tweets and 15.7k Instagram posts between January 2014 and September 2016. This data was used to answer both research questions, in terms of looking at how the discourse within tweets and Instagram posts mentioning symptoms could help explore and visualise how social media can be used to communicate the self-management of CD and comorbid symptoms.
Chapter 4 explores the first research question of asking how social media is used to share the symptoms and self-management of CD. With the analysis of 10.7k Instagram posts and 13k tweets, it specifically looks at how certain hashtags like #NoCureNoChoice or #glutened can be created or utilised by Coeliacs and those with similar symptoms as way of spreading awareness of their illness. Chapter 4 also looks at how co-occurring hashtags like #symptoms + #Coeliac OR #Celiac can also be used to explore how the symptoms specific chronic diseases like CD are being managed.
The hashtag and social media discourse data from Chapter 4, is further utilised in Chapter 6, which uses gamification to answer the second research question, of how modes of gamification may be used to explore and visualise the self-management of CD and comorbid illness. In Chapter 6, the 10.7 k Instagram posts and 13k tweets from Chapter 4 are further analysed and visualised using data visualisation, games mechanics and gamification techniques, in the build of eHealth gaming apps ‘Gluten Fighters’ and ‘Coeliac Sam’. The same hashtag and discourse data is also further visualised in Chapter 7, with the build and deployment of the visual research tool, the Spoonie Living app. The Spoonie Living app is used to explore how individuals visually communicate their experiences of symptoms and feelings of biographical disruption with CD and comorbid illnesses.
The methodology table (Table 1) also shows the data output that occurred from the smartphone applications that were built for Chapters 6 and 7. The data from the apps in Chapter 6 (Gluten Fighters and Coeliac Sam) is represented in terms of the number of downloads of each app. This is because as prototype gaming apps, they were built more as a visualisation of the concept of illness and quest narratives first explored in the way that hashtags relating to symptoms were shared in Chapter 4. The Gluten Fighters and Coeliac Sam apps were created to address the research question of how modes of gamification can be used to visualise the social media data shared with regards to the self-management of CD and comorbid illnesses, but not as a way to investigate user interaction with each app. Impact for these two apps was measured through the number of downloads for each app, and any reviews of them in the Apple or Android app stores, and this is further explored in Chapter 6.
In Chapter 7, further analysis of the self-management of CD in the 15.7k Instagram posts of the Symptoms dataset, showed that Coeliacs were also using images to visualise their illness and quest narratives in terms of diagnosis, biographical disruption, and adjusting to life on the GFD. However, analysis of these images found it hard to differentiate between general selfies and photos taken, and those taken by users with CD and comorbid illness. To better explore how modes of gamification could be
the rest of the visual data stream, I devised a way to re-use the hashtags that were visualised for the Gluten Fighters and Coeliac Sam app, and turn them into visual tags and markers of symptoms and biosocial identity with chronic illness. This was done by the creation of the Spoonie Living app, which incorporated making the same hashtag data used in Chapter 4 and 6, into visual stickers that enabled new users to self-label their symptoms and experiences of CD and comorbid illness. It was proposed that allowing users to visually tag their symptoms with previously collected hashtag data might provide new insights into how users felt about their symptoms, and how different modes of biosocial identity can be visualised as co-existing with comorbid illness.
The build of the visual research tool, the Spoonie Living app incorporated the utilisation and analysis of data output in the form of the visual chronic illness stickers added to images, that were shared from the app to the social media platforms, Instagram, Twitter and Tumblr, as well as shared to the internal wall of the app for more private data. This data was also classified by posts and tweets linked with the hashtag #spoonielivingapp, as well as an analysis of the internal posts submitted within the Spoonie Living app.
A separate social media dataset was used with Chapter 5, to explore how Coeliacs communicated how they dealt with the risk of cross-contamination of food with gluten while self-managing CD with the GFD. The use of a separate dataset was because the social media activity that occurred around the Genius Foods and Gluten Free Cheerios incidents between 2014 and 2016, had a distinct set of co-occurring hashtags and keywords related to each incident. These were #cheerios AND #glutenfree or #geniusfoods, with keywords without hashtags, like “Cheerios” or “Genius Foods”. It was also found that within the time period of 2014 - 2016, that there were more tweets and Instagram posts about the two food recall incidents that mentioned the following co-occurring hashtags in conversation: #coeliac OR #celiac AND #risk; and #glutenfree AND # recall, in addition to #cheerios OR #geniusfoods. As discussed in the next section on APIs, because at the time of writing, the free and public Twitter API only output a small sample rate of all conversations on Twitter and Instagram (boyd & Crawford, 2011), it was important to specify the terms
that were more likely to co-occur in conversations connected with risk and coeliac self-care on the GFD, to ensure a more relevant and robust sample of data was retrieved. Thus, in the knowledge that the two incidents of gluten free recall and controversy were playing out, it was decided that a live and continuous period of data sampling from Twitter was the best course of action. As further discussed in this chapter, Facebook was also a source of data, mainly because Genius Foods and Cheerios pointed customers to their Facebook pages, which then made it possible for me to sample data from ongoing conversations on these pages, within the time period of the food recall.
Ta bl e 1. S o ci al me d ia d at a an d t o o ls u se d in t h es is
APIs
An API (Application Programming Interface) is a set of routines, protocols, and tools for building software applications. The API specifies how various software components should interact. For example, an API controls how a link to another website is shortened and shown on Twitter, so that it fits within the 140 character limit set. Access to APIs is the primary way for many researchers to retrieve user data from a wide variety of social media platforms.
Although APIs make it easier to collect data, for many researchers, the proprietary nature of APIs can mean it can be a struggle understand each API’s functionalities and constraints imposed by the terms of service related to data collection, storage and dissemination practices. While there are several proprietary tools that researchers can use to get data from social media networks (Gnip, 2016; Pulsar, 2016), a few free and low subscription platforms have emerged, which utilise the public APIs, and offer ways of downloading consecutive samples of Twitter, Instagram, Facebook and other social media data over periods of time.
The impact that different historical access rules for different social media platforms can have on data collection, is just one of a range of things that can affect how students and social researchers with little to know Big Data collection budgets can follow incidents that occur across different social media platforms over a prolonged period of time. These and other factors have an impact on the quality and type of data harvested from the Application Programme Interfaces (APIs, see below) of social media platforms, and how they are understood. This also impacts on how additional digital research toolkits are designed to fit within user communication practices in this area, as well as how the data output from such toolkits are analysed and understood for further user research. For the purposes of harvesting data about Coeliacs across social media, I chose to use four different pieces of software. This is because social media platforms like Twitter (2015a) and Instagram (2015) do not allow free historical searches of data via their APIs. In this case, I had to find a piece of software that would allow me to harvest data from these platforms for a
70
maximum of 31 days, and then allow me to download and store this data onto a secure server.For harvesting Twitter and Instagram data, I primarily used Netlytic (Gruzd, 2016), a free (tier-based) data mining and social network analysis platform for academic research. The pros of using this software, were that the data harvesting facilities for Netlytic worked across different social media networks, and were thus very useful for monitoring the same hashtag at the same time on both Twitter and Instagram (i.e. #NoCureNoChoice). The platform also provides comprehensive analytical software, that can exported and further analysed externally. However, the 31 day cut-off point for harvesting each data stream meant that the data feed had to be reset each month, so a continuous stream of data was not possible.
Ideally, it would have been good to use Netlytics to harvest data I needed from Facebook, and Tumblr. However, Netlytic’s Facebook capture feature does not currently allow historical timeline capture of data, but only live data from Facebook Pages, and did not harvest from the Tumblr API. Instead for Facebook data, I used Netvizz (Rieder, 2013). Netvizz is a tool that allows the user to add a historical date range, the name of a public user group or Facebook Page, and then harvest all comments/likes for that period. The pros about Netvizz are that one can download data from Facebook’s timeline retroactively makes it a powerful resource, without the need for constant monitoring. One of the main drawbacks is that each dataset has to be manually downloaded, and cannot be saved in a user account. This means that, although it allows the exporting of data into spreadsheet files for further analysis, the tool has to be used in isolation, and not (like Netlytic) as a platform that exports data into an online repository. For the small data samples of Tumblr, I used the Digital Method Initiative’s co- hashtag and post data tool (Borra, 2015). This tool was used mainly because it was the easiest tool to use for capturing small amounts of data from Tumblr’s API. Again, it is a tool that can only be used in isolation, and needs some knowledge about how to process .tab and .gdf output files. However, as a resource for grabbing a small amount of data, it worked quite efficiently.
For more in-depth graph analysis of Twitter, I also used NVivo’s NCapture tool for real-time harvesting and coding of Twitter hashtags (NVivo, 2014). I will go into more detail about how each of these were used within each chapter project, with a summary of any challenges encountered below. Because preliminary research at the start of my thesis found that Coeliacs discuss self-care across different social media networks (e.g. A tweet could have links out to an image and longer post on Instagram, or a blog post on Tumblr or a complaint in a Facebook post to a food manufacturer about cross-contamination), such different data outputs could confuse results. It was therefore important to take into account the effect that these different types of data formats from each of these platforms would have on each illustrative study.
User interface restrictions occur when some social media platforms like Twitter impose character restrictions or proprietary image linking restrictions, which act to skew data output. I found this out early on in my Twitter data collection (2013 -2014), where some tweets that had been tweeted via the Instagram platform, failed to show the attached image, and cut off part of the text posted with the image. In these cases, to avoid large chunks of null data, the link to the original post had to be followed, to fully understand the whole context of the post. However, I also noted that the 2015 introduction of a new sharing algorithm via third party tool called “If This Then That” (IFTTT.com, 2015), meant that users could automate the sharing of their image posts on Instagram to Twitter without losing the overall image. Where this tool was used, it seemed that fewer images were being lost. This only related to those users who were cognisant of this tool;