

4.3.1.1 State 1. Envisioned

The idea that data must first be envisioned is inspired by De Vaus’s (2001) discussion of research design. He argues that the issue of which subjects and which of their features data must represent precedes that of particular data acquisition practices and data analysis methods. From this perspective, research design is a combination of the broad research questions (given as a prerequisite in Table 4.1) and the initial vision of data that would support answering those questions conclusively. Following De Vaus’s logic, the initial vision of data is vital regardless of whether data are generated in a controlled environment or are naturally occurring, whether they are qualitative or quantitative, big or small. Therefore, envisioning data is as vital for social data science projects as for more traditional forms of research.

The conditions of this state are derived from a combination of the fieldwork evidence and De Vaus’s (ibid.) thinking on research design. Under such views, social research is often interested in studying particular aspects (primary attributes) of particular real-world entities (primary research subjects). For example, in the InfoMigrants and Shakespeare Lives evaluation case studies, we studied, among other research tasks, the strength of online audience engagement (the attribute) shown by the InfoMigrants and Shakespeare Lives audiences (the subjects). In line with De Vaus’s logic, the strength of the conclusions that we could make about their levels of engagement (whether they were low, high or normal) depended on our ability to identify other comparable programmes and initiatives (secondary research subjects), and to make comparisons with the levels of audience engagement those managed to provoke.

In the Shakespeare Lives evaluation, we faced an inability in principle to do so, as that cultural programme was unique in many regards. This caused significant struggles for interpretation of findings, as discussed in Section 3.2.5.3. In the subsequent InfoMigrants evaluation, we did identify comparable initiatives. However, since none of the compared initiatives performed exactly the same activities as InfoMigrants, we had to elicit their secondary attributes (qualitative differences such as a focus on a specific group of immigrants, provision of specific types of services, etc.) that could interfere with the primary attribute of our interest (see Section 3.4.1.2).

| DATA: ENVISIONED | There is a tentative idea of what kind of data are required for the project needs. |
|---|---|
| Primary subjects of research interest determined | The real-world entities that are the subject of the research questions and that must be represented in data are determined. |
| Secondary subjects of research interest determined | If relevant, secondary groups of real-world entities that should be covered by data for the needs of baseline / benchmark comparison are determined. |
| Primary attributes of research interest determined | The characteristics of the objects of interest, of their relationships and of their interactions that are of direct concern for the research questions and that must be represented in data are determined. |
| Secondary attributes of research interest determined | If relevant, secondary characteristics of real-world entities that should be studied in conjunction with the primary ones are determined. |
| Research Questions: Outlined | The research questions are formulated as a broad framework that sets the direction for the project. |

4.3. CONTENT OF THE SCORECARDS 133

4.3.1.2 State 2. Operationalised

The fieldwork shows how close examination of data sources and a corresponding refinement of research questions (reflected in both prerequisites and conditions in Table 4.2) help to progress from a vision of data to an operational understanding of the required data. In the Hit List production example, the particularities of each online platform and of their data acquisition modes helped to establish the features that must be represented in the data and the level of granularity of the acquired data records (i.e. data demands), as well as the precise criteria of data acquisition formulated as queries to a platform’s API or feed (i.e. data boundaries) (see Section 3.3.3.1).
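To illustrate what precisely formulated data boundaries might look like once written down, the following is a minimal sketch rather than a description of the tooling used in the studied projects. The class name `DataBoundaries`, its fields, the source label, the dates and the keyword are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DataBoundaries:
    """Hypothetical record of the acquisition criteria for one data source."""

    source: str
    start: date
    end: date
    keywords: tuple

    def admits(self, posted_on: date, text: str) -> bool:
        """Return True if a record falls inside the time window and
        mentions at least one of the keywords (case-insensitive)."""
        in_window = self.start <= posted_on <= self.end
        lowered = text.lower()
        return in_window and any(k.lower() in lowered for k in self.keywords)


# Illustrative values only; they are not taken from the fieldwork.
boundaries = DataBoundaries(
    source="example-platform",
    start=date(2016, 1, 1),
    end=date(2016, 12, 31),
    keywords=("shakespeare",),
)
```

Keeping such criteria in one explicit object, instead of scattering them across individual queries, makes it easier to document later which records were in scope and why.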

Table 4.2 also highlights the role of the compliance strategy in the operationalisation of data demands. It is worth noting that in the studied projects compliance never actually limited how a team had to operationalise data. However, a match between data demands and compliance had to be ensured when studying the Dark Net. It resulted in an elaborate self-evaluation of the ethical and risk management compliance of the team’s data collection strategy against a complex risk-assessment framework suggested by Martin and Christin (2016).

| DATA: OPERATIONALISED | Precise specification of which data should be acquired and used in the project is in place. |
|---|---|
| Practicalities of data sources considered | The evaluation of potential data sources informs the details of what the practically available data may consist of. |
| Research questions considered | Any further specifications to the research questions, especially in regards to the relevant features that the data must contain, are considered. |
| Data demands outlined | Decisions are made on how much data are required, of what types and for which forms of processing. |
| Boundaries established | The time-, spatial and other boundaries on the data to be acquired are precisely formulated. |
| Compliance strategy factored in | The compatibility of using the outlined data with the compliance strategy is confirmed. |
| Data Sources: Evaluated | The merits and limitations of each source and of their combination are assessed to formulate a firm selection of data sources. |
| Compliance: Strategised | The team members share a clear understanding of how they can achieve ethical and legal compliance. |
| Research Questions: Refined | The questions are redefined and specified to an operational level. |

4.3.1.3 State 3. Acquired

After the data demands and boundaries are defined to an operational level, data can be acquired from active data sources into a working project infrastructure (hence the prerequisites in Table 4.3). The specific conditions suggested for this alpha follow the discussion in our chapter on data curation (Voss et al., 2016). The fieldwork positively confirms the importance of these conditions. For example, new restrictions to the Facebook Graph API that were faced in the second year of the InfoMigrants evaluation project show that data cannot always be reacquired as easily as it may seem (see Section 3.4.1.1). The Shakespeare Lives evaluation project shows that keeping an extensive “paper trail” of data characteristics may be the best way to track provenance of the research decisions and thus ultimately have confidence in the findings (see Sections 3.2.6.2 and 3.2.5.3). An appropriate data storage structure was especially important for the Hit List production as some of the data pre-processing had to be applied automatically to newly arriving data for the production cycle to finish on time (see Section 3.3.6.2).

The importance of storing the original data rather than only a cleaned and processed version was reaffirmed in one of the subsequent studies of cryptomarkets. As data were acquired through scraping HTML pages to JSON files, the acquisition outcomes relied on our knowledge of the pages’ DOM trees and on the consistency of those DOM trees over time. Sometimes we would randomly encounter special cases when the DOM structure was different. For example, banned cryptomarket traders had an additional HTML element on their profile page, which we did not know about for the first several months of scraping and thus had not factored into the scraping procedure. The only way for us to repair this omission in the earlier scraping iterations was to apply the revised scraping approach to the original raw HTML source files, which we luckily stored alongside the processed JSON.
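The repair described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the scraper used in the study: the class names `trader-name` and `banned-notice`, the file names and the page contents are hypothetical, and inline strings stand in for the raw HTML files kept on disk. Only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser


class ProfileParser(HTMLParser):
    """Collects a trader name and a 'banned' flag from one profile page.

    The CSS class names 'trader-name' and 'banned-notice' are hypothetical.
    """

    def __init__(self):
        super().__init__()
        self.record = {"name": None, "banned": False}
        self._in_name = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "trader-name" in classes:
            self._in_name = True
        # Revised rule: the extra element that earlier scraping runs missed.
        if "banned-notice" in classes:
            self.record["banned"] = True

    def handle_endtag(self, tag):
        self._in_name = False

    def handle_data(self, data):
        if self._in_name and data.strip():
            self.record["name"] = data.strip()


def reparse(raw_html: str) -> dict:
    """Re-derive a processed record from preserved raw HTML."""
    parser = ProfileParser()
    parser.feed(raw_html)
    parser.close()
    return parser.record


# Two preserved raw pages (inline strings stand in for stored files).
raw_pages = {
    "trader_a.html": '<div><span class="trader-name">alice</span></div>',
    "trader_b.html": (
        '<div><span class="trader-name">bob</span>'
        '<p class="banned-notice">Banned</p></div>'
    ),
}

reprocessed = {name: reparse(html) for name, html in raw_pages.items()}
```

Because the revised parser runs over the preserved originals, the processed records from the earlier iterations can be regenerated without going back to the source, which may no longer serve the same pages.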

4.3.1.4 State 4. Quality Assured

The name of this state could be interchanged with such synonyms as “Cleaned” and “Pre-processed”; however, “Quality Assured” sounds more inclusive than the former and shows a clearer intention than the latter (at the end of the day, the line between “pre-processing” and “processing” is quite thin). The contents of the state’s scorecard (see Table 4.4) mostly follow the experience of Hit List collating. Before the analyst could perform the core data analysis, selection had to be applied to Twitter data and aggregation had to be applied to data from all studied platforms bar Facebook (see Section 3.3.3.1). As these quality assurance activities were done to facilitate the core analysis task, they were designed with this task in mind (hence the “Analysis Methods: Selected” prerequisite). On the other hand, the failure to bring the Twitter data to a unified schema with data from other platforms was what caused low visibility of Twitter data for the production team compared to the other studied platforms (see Section 3.3.5.2).

| DATA: ACQUIRED | The data are acquired, fully documented and appropriately stored. |
|---|---|
| Original data kept safely | If possible, the originally acquired data are fully preserved. |
| Data storage mode chosen | Data storage mode (flat files, database, etc.) is chosen. |
| Data structure in place | Appropriate data structure, convenient for the team, is developed (e.g. folder structure for flat files, schema for relational database). |
| Security concerns addressed | The sensitive data are sufficiently protected to the levels that satisfy the compliance requirements. |
| Data fully documented | The data are supported with human-readable documentation and machine-processable metadata that describe the data and track their provenance. |
| Back-ups available | The data are stored redundantly with an agreed redundancy factor. |
| Compliance: Secured | The compliance resources required for the project are secured to the degree that the team can fully engage in the main body of the research work. |
| Data Sources: Active | A data acquisition procedure is in place and successfully acquires the required data. |
| Infrastructure: Operational | The infrastructure is in use in an operational environment. |

Table 4.3: Data: Acquired. Conditions and prerequisites.

The importance of dealing with data veracity issues seems self-evident. The Shakespeare Lives project showed that, interestingly, even interim project data created by the research team pose veracity issues. I encountered those when visualising human data coding (see Section 3.2.6.4). Even though the Excel spreadsheets that the researchers had used to annotate the acquired social media postings had data validation rules applied, somehow several researchers still managed to make spelling mistakes in the names of the variable levels. Furthermore, it was quite common among the researchers to accidentally skip a question and thus create a missing value. Finally, one researcher once changed the ordering of the columns in her spreadsheet, which caused my code, hardwired to a precise data schema, to “swap” the values of two variables in the resulting visualisations. This bug was easy to repair, but hard to notice. Luckily, I corrected the mistake before the researcher had time to use the wrong plots for her analysis.
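The three problems described above (misspelt variable levels, skipped questions and reordered columns) can all be caught by a small validation step before any plotting. The following is a minimal sketch under stated assumptions, not the code used in the project: the variable names, the allowed levels and the sample rows are hypothetical, and a CSV export stands in for the Excel spreadsheets. Reading columns by name rather than by position is what makes the column order irrelevant.

```python
import csv
import io

# Hypothetical coding scheme: variable name -> allowed levels.
ALLOWED_LEVELS = {
    "sentiment": {"positive", "neutral", "negative"},
    "topic": {"theatre", "film", "education"},
}


def load_coded_rows(csv_text):
    """Read rows by column name and report values that break the scheme."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing_columns = set(ALLOWED_LEVELS) - set(reader.fieldnames or [])
    if missing_columns:
        raise ValueError(f"missing columns: {sorted(missing_columns)}")

    rows, problems = [], []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for variable, levels in ALLOWED_LEVELS.items():
            value = (row[variable] or "").strip().lower()
            if not value:
                problems.append((line_no, variable, "missing value"))
            elif value not in levels:
                problems.append((line_no, variable, f"unknown level: {value}"))
            row[variable] = value
        rows.append(row)
    return rows, problems


# The columns appear in a different order than the scheme lists them,
# one level is misspelt and one answer was skipped.
sample = (
    "post_id,topic,sentiment\n"
    "1,film,positive\n"
    "2,theatre,negativ\n"
    "3,,neutral\n"
)
rows, problems = load_coded_rows(sample)
```

In this sketch the problems are collected and returned rather than silently corrected, so that the researcher who coded the data can decide how each one should be resolved.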

The need to take compliance measures when performing quality assurance of data was never high in the studied projects. I decided to include the respective condition as a logical next step from the compliance-related conditions in the previous states and as an extra precautionary measure.

| DATA: QUALITY ASSURED | The data are brought to the required level of veracity, completeness and usability. |
|---|---|
| Data veracity assured | Missing values, unreliable data entries and data glitches resulting from the imperfections of the data acquisition procedure are dealt with. |
| Selection criteria applied | Criteria for what data to consider for further analysis are in place and applied to the data. |
| Data aggregation performed | If relevant, individual data records are aggregated. |
| Data record structuring performed | If relevant, unstructured data records are brought to a defined schema. |
| Compliance measures taken | The data records and elements that violate the compliance requirements are dealt with (e.g. removed or appropriately anonymised). |
| Analysis Methods: Selected | The final list of research methods is compiled. |

Table 4.4: Data: Quality Assured. Conditions and prerequisites.

4.3.1.5 State 5. Utilised

In its final state, data are utilised for the project needs. The conditions suggested for this state (see Table 4.5) are trivial. They add an extra layer of confidence in that the measures that should have been already taken to bring the “Data” alpha through some of its earlier states (such as data documentation and preservation) are in place. As such, they are informed by the same episodes in the fieldwork. They also suggest archiving the data where possible to strengthen research reproducibility and to support future studies (Potthast et al., 2016).

| DATA: UTILISED | Data are put to use for the goals of the project. |
|---|---|
| Full value extracted | The potential of the data to inform the research questions is reached. |
| Interpretability ensured | Descriptions of the data and of their provenance are used to interpret the findings. |
| Data archived | If possible, the data are archived for further utilisation. |
| Analysis Methods: Executed | The methods are utilised for the needs of the project. |


4.3.2 Data Sources

4.3.2.1 State 1. Identified

As the Essence of Social Data Science (see Section 4.1) treats data sources as a resource for research, the Social Data Science Scorecard Deck suggests that envisioning data is a prerequisite to the “Identified” state of the “Data Sources” alpha (see Table 4.6). Putting “Data: Envisioned” as a prerequisite does not imply a particular linear process of doing social data science in which the search for appropriate data sources strictly follows designing the research. In fact, the example of the Hit List production suggests that social data science projects can be first motivated by an opportunity to harness data from some source rather than by any preconceived research question. Arguably, the same may happen in academic work as well.

Rather, this prerequisite tries to convey the idea that it only makes sense to talk about the appropriateness and completeness of the identified list of candidate sources when there is a defined purpose for them. This corresponds well with the aforementioned idea of Gitelman (2013) that data are not data unless they are imagined as such. To follow on the Hit List production example, until the initial idea of using online platforms to learn “what’s buzzing” turned into a vision for data on popularity of stories among the UK public online (see Section 3.3.1), it was not possible to claim with confidence that, say, Google Trends was a relevant data source while VK.com was not. By contrast, in the Shakespeare Lives evaluation project, VK.com was very relevant for demographic reasons (see Section 3.2.4.1), while Google Trends was not since the analysis focused on audience engagement on social media and therefore search counts were not a relevant data type (see Section 3.2.1.1).

Be it to cross-validate findings across audiences in different countries (Shakespeare Lives) or to cover a broad spectrum of popular topics from hard news to viral social media content (Hit List), both these projects had to identify and employ a multitude of sources to allow for triangulation and complementation. Data sources with privileged access played a key role in the two studied evaluation projects. For example, when evaluating InfoMigrants, some members of our team were given editor rights to their public Facebook pages so that we could access Facebook Analytics.

The data sources and the resulting data, according to the definition of the alpha, are primarily (born) digital but not necessarily so. As the fieldwork shows, social data science projects sometimes use traditional social science data sources: interviews, surveys, focus groups, etc. For example, these were heavily incorporated into the work on the “Cultural Value” Strand of the Shakespeare Lives study (see Section 3.2.1.2). Using traditional social science data sources may involve trade-offs in effort and expenses with the new data sources. Also, thinking about traditional and data-driven components of the research simultaneously (be it data sources, analytic methods, analysis infrastructure, etc.) can lead to a greater synergy between the two. For example, the traditional methods may prove to be especially valuable for finding explanations of the social phenomena that can be observed using the naturally occurring digital data.

| DATA SOURCES: IDENTIFIED | The possible relevant data sources are identified and a preliminary list of appropriate data sources is compiled. |
|---|---|
| Relevant types of data considered | The relevant types of data are imagined and the sources for them are identified. E.g., for user-generated online data, data types could be short posts, long posts, threaded discussions, pictures, videos, etc. |
| Relevant demographics considered | The sources that are populated with data by / on relevant geographical, age or other demographic groups are considered. |
| Available data sources with privileged control / access considered | Opportunities for privileged access to relevant data that are not in the public domain are considered. Examples may include website visit statistics, customer relationship data, etc. |
| Traditional sources considered | Traditional social science data acquisition sources such as interviews and focus groups are considered as a data source. |
| Complementation and triangulation assured | The identified data sources allow acquiring data that cover different aspects of the studied problem and cross-validate each other. |
| Data: Envisioned | There is a tentative idea of what kind of data are required for the project needs. |

Table 4.6: Data Sources: Identified. Conditions and prerequisites.

4.3.2.2 State 2. Evaluated

The Social Data Science Scorecard Deck places strong emphasis on the evaluation of potential data sources, as their diversity means that they have varied limitations and particularities. As suggested by the fieldwork, these limitations and particularities can, in turn, have profound consequences for methodology and for the employed infrastructure. This leads to quite an elaborate system of conditions in the respective scorecard (see Table 4.7).

The issues of degree of access and level of control are related to each other. It is tempting to think about them in terms of a dichotomy between the internal (and thus fully-accessible and well-controlled) versus external, problematic data sources. The fieldwork, however, suggests that this is not the case and, in fact, different modes of acquisition allow for a trade-off between the two. While the commercial tools employed in the Shakespeare Lives evaluation study to acquire data from Twitter and Instagram provided privileged access to the data, the control over the export functions of those systems and, therefore, over the ability to subject the accessed data to our team’s own analysis methods, was quite limited (see Section 3.2.6.1).

Different data sources may lead to different levels of data veracity that have to be considered. As already mentioned, two of the studied projects (evaluations of Shakespeare Lives and InfoMigrants) used traditional sources of data alongside digital platforms. The veracity issues obviously differ between these two groups. However, even within digital platforms, data veracity strongly depended on the available modes of acquisition. For example, as there were restrictions to accessing the Facebook Graph API in the second year of the InfoMigrants evaluation (see Section 3.4.1.4), some limited-scope scraping had to be performed. Since Facebook’s website is highly dynamic, the scraping outcomes were not always as expected and had to be partially manually repaired, which would have been problematic had the scope of
