Below I provide a short review of the literature on secondary data in research, to explore its affordances and the debates regarding ownership, extent of coverage and convergence between original and new purposes (Creswell and Tashakkori 2007, Boslaugh 2007, Driscoll et al. 2007).
Flaxman (2014:12) defines secondary data analysis as “any analysis where the data were collected by someone other than the analyst”, while Cheng and Phillips (2014) draw important distinctions within secondary data analysis as they explore the benefits and challenges of working with such data. Cheng and Phillips distinguish between the research-driven approach, whereby the researcher has a hypothesis or research question in mind and looks for datasets to address it, and the data-driven approach, in which the researcher teases out a research question by looking at the available data. Cheng and Phillips note that “the two approaches are often used jointly and iteratively. Researchers typically start with a general idea about the question or hypothesis and then look for available datasets which contain the variables needed to address the research questions of interest.” (Cheng and Phillips, 2014:373). Similarly, in my research I set off with a set of research questions seeking to explore the factors that foster drop-in among at-risk secondary school students in Eritrea.
There were compelling reasons for deciding to use secondary data given my insider/outsider position. The first reason was practical: to use the data that were readily available to me by virtue of my association with the Eritrea education sector as a UNICEF worker. This was consistent with the thrust of the Ed.D. programme of UCL/IoE. Ordinarily, I should have collected raw data and processed it using routine procedures of data cleaning, analysis, interpretation and reporting. However, being a foreigner, I had to abide by host government guidelines on internal travel by foreigners, which made it difficult and unpredictable to access rural schools to collect my own data. Therefore, an existing database was the most pragmatic alternative in the circumstances. Aside from the practical considerations, I also sought to tap the benefits of economy and low cost, as suggested by Cheng and Phillips (2014). Devine (2003) also commends secondary data for the breadth and depth offered by large databases. Moreover, Boslaugh (2007) and Cheng and Phillips (2014) argue that secondary data are usually cleaned by highly qualified professionals, and that the data collection processes for large datasets benefit from expertise and professionalism that individuals or smaller teams may not easily command.
That notwithstanding, scholars have identified some disadvantages of analysing already existing data (Cheng & Phillips, 2014; Boslaugh, 2007). The main one is the fact that the data were collected for a different purpose and may not suit the new research purpose and question. Secondly, additional variables may not have been captured and the data may not cover all subgroups or geographical areas. Furthermore, since identifier variables are usually deleted to protect the identities of respondents (apart from generic identifiers like sex and age ranges), it may be difficult to pick out the more nuanced identifiers which would be of interest to other researchers (Cheng & Phillips, 2014). For instance, a decision to maintain political correctness, or to mask inequities, may result in the data entrenching disadvantage by not highlighting the sources of vulnerabilities. As I note in section 4.3, these might have been the considerations that underpinned the anomalies detected with the EMIS data for Eritrea. The fourth major disadvantage is that since the secondary analyst was not part of the original team, they may be unaware of, or unintentionally insensitive to, the study-specific glitches in the data collection process and may not be in a position to remedy evident faults. Boslaugh (2007) therefore suggests that once the researcher has located a secondary dataset for analysis, they should address the following questions:
1. What was the original purpose for which the data were collected?
2. What kind of data are they, and when and how were the data collected?
3. What cleaning and/or recoding procedures have been applied to the data?
My preliminary responses to the foregoing questions were as follows:
1. The various Eritrean MOE reports (2003-2015) indicated that the original purpose of collecting data was to provide useful, relevant, reliable and up-to-date information on education for various stakeholders within and outside the education sector.
2. The data collected annually were on Early Childhood Education (ECE) and General Education (Elementary, Middle and Secondary level).
3. Scholars including Broeck, Argeseanu, Cunningham, Eeckels and Herbst (2005), Hellerstein (2008) and Winkler (2003) note that even with the most carefully designed studies, errors are bound to occur. Broeck et al. (2005) argue that all experimental and observational research must deal with errors from various sources and their effects on study results. As part of quality assurance, researchers therefore undertake data cleaning to minimise the impact of such errors on the results of the study. Data cleaning is the process of detecting or screening, diagnosing and editing faulty data. However, as Broeck et al. (2005) note, data cleaning is not without controversy. They argue, “Data cleaning is emblematic of the historical lower status of data quality issues and has long been viewed as a suspect activity, bordering on data manipulation” (2005: 966). That caveat notwithstanding, data cleaning is supposed to deal with data problems that have already occurred, and it can be conducted at any stage of the process since errors can be detected at any point. Broeck et al. (2005) advise that once a suspicion about data is aroused, it is more efficient to search for the errors systematically.
In systematic screening, the researcher tries to distinguish four basic types of oddities in the data. These are lack of data or excess data; outliers, including inconsistencies; strange patterns in distributions; and unexpected analysis results and other types of inferences and abstractions. In the diagnostic stage of data cleaning the researcher seeks “to clarify the true nature of the worrisome data points, patterns, and statistics” (Broeck et al. 2005: 968). The diagnostic possibilities for each data point are erroneous, true extreme, true normal (i.e. the prior expectation was incorrect), or idiopathic (i.e. no explanation found, but still suspect). They note that some data points are clearly logically or biologically impossible. Thus, Broeck et al. (2005) recommend diagnostic procedures such as going back to the previous stages of the data flow to see whether a value is consistently the same; looking for information that could confirm the true extreme status of an outlying data point; and collecting additional data, e.g. by questioning the data collector about what may have happened and, where possible, repeating the measurement. However, they also note that these procedures can only help if the data cleaning starts soon after data collection.
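
By way of illustration only, the screening stage described above could be sketched in Python as follows. The field names (`school_id`, `enrolment`), the sample figures and the 1.5 × IQR rule for outliers are my own assumptions for the purpose of the example; they are not taken from Broeck et al. (2005) or from the EMIS data. The sketch covers only three of the oddities (lack of data, excess data and outliers) and does no more than flag records. Deciding whether a flagged value is erroneous, a true extreme, true normal or idiopathic remains a matter for the diagnostic stage.

```python
from statistics import quantiles


def screen_records(records, field, key="school_id"):
    """Flag missing values, duplicated keys and outliers in `field`.

    Returns a dict mapping each oddity type to a list of record indices.
    The 1.5 * IQR fence is an assumed rule of thumb, not a prescribed one.
    """
    flags = {"missing": [], "duplicate": [], "outlier": []}

    # Lack of data (missing values) and excess data (repeated keys).
    seen = set()
    for i, rec in enumerate(records):
        if rec.get(field) is None:
            flags["missing"].append(i)
        if rec.get(key) in seen:
            flags["duplicate"].append(i)
        seen.add(rec.get(key))

    # Outliers: values lying beyond 1.5 * IQR from the quartiles.
    values = [r[field] for r in records if r.get(field) is not None]
    if len(values) >= 4:
        q1, _, q3 = quantiles(values, n=4)
        low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
        for i, rec in enumerate(records):
            value = rec.get(field)
            if value is not None and not (low <= value <= high):
                flags["outlier"].append(i)

    return flags


# Hypothetical enrolment returns, invented for this example.
records = [
    {"school_id": "A", "enrolment": 120},
    {"school_id": "B", "enrolment": 135},
    {"school_id": "C", "enrolment": None},   # lack of data
    {"school_id": "D", "enrolment": 128},
    {"school_id": "D", "enrolment": 128},    # excess data (repeated return)
    {"school_id": "E", "enrolment": 140},
    {"school_id": "F", "enrolment": 1310},   # possible keying error
]

flags = screen_records(records, "enrolment")
print(flags)
```

In this invented example the record for school F is flagged as an outlier, but the flag alone cannot show whether 1310 is a keying error for 131 or a genuinely large school. That question can only be settled by the diagnostic procedures that Broeck et al. (2005) recommend.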