The GEO label research comprising this thesis commenced with an initial investigation into how users and producers of geospatial data evaluate the quality and trustworthiness of datasets. The investigation took the form of a series of face-to-face and telephone interviews; the intention was not to elicit specific requirements for the GEO label, but rather to uncover initial information about dataset selection, use and production within representative application areas in order to inform further research into the design and development of the GEO label. The interviews were relatively informal, and discussion was guided by the following set of high-level questions or prompts:
1. Please describe a current area of your work in which you use external data sources.
2. What data do you use in your work, and where does it come from?
3. How do you choose which datasets to use in your work? What are the reasons for your decisions?
4. Are you aware of any data certificates or seals in selecting your data? Do you look for specific certificates or meta-information in a data set you use? How do you know whether to trust the data?
5. Does the data you use come with sufficient supporting information to allow you to make an informed judgement about which one(s) to choose? How much information do you need?
Based on interviewee responses to the above prompts, follow-up and clarification discussion ensued as appropriate to generate a rich set of qualitative information regarding dataset production, selection and use. Where interviewees used community- or domain-specific jargon, they were asked for further explanation to eliminate any misunderstandings. The interviews were directed to capture enough contextually rich, qualitative information to identify the informational aspects that interviewees perceive as significant when judging dataset trustworthiness and when assessing dataset quality for fitness-for-purpose-based dataset selection.
Representative interviewees – including geospatial data users, researchers, data archivists, academics and data producers – were identified and contacted to participate in the initial interviews. The diversity of interviewees supported the elicitation of a broad and inclusive picture of user needs as they relate to quality assessment of geospatial datasets. A total of six participants were recruited for telephone or face-to-face interviews with the author; each interview took between 30 and 60 minutes.
Table 4.1: Profiles of initial investigation interviewees.
Interviewee 1 is a data archivist who works as part of a science network of people, organisations, and, most importantly, observation platforms, that performs Long-Term Ecological Research (LTER).
Interviewee 2 is a researcher data user who is a part of a group that is working on projects to monitor forests and the tropics or, specifically, changes in the forest cover or tree cover in the tropics.
Interviewee 3 is a land use researcher who works for a government department that covers areas such as: the natural environment; biodiversity; plants and animals; sustainable development and the green economy; food, farming and fisheries; animal health and welfare; environmental protection and pollution control; and rural communities and issues.
Interviewee 4 is a climate forecaster who works on climate forecasting for protected areas using climate data which he pre-processes himself to get more descriptive data for his own needs. Interviewee 4 does not use a lot of external data.
Interviewee 5 is primarily a data provider who typically takes low level data (typically oceanography-related) and works it up into higher levels to arrive at some physical product.
Interviewee 6 is an academic researcher in earth and environmental sciences who uses external data sources “across the board”.
The six interviews were audio-recorded to allow for detailed and accurate data capture as well as in-depth post-interview analysis; the interviewer also took written notes during the course of each interview.
Data saturation refers to the point at which consulting additional participants would not provide new information or identify new themes in the data (Guest et al., 2006; Francis et al., 2010). Data saturation points are specific to each study, but 73% of thematic discovery can occur from as few as six interviews (Guest et al., 2006). In this study, despite the diversity of interviewees, data saturation occurred after completing the six interviews. This was evident from the repetition of themes as the interviews progressed and from the fact that the sixth interview did not reveal any new themes. Additionally, data from a further 12 interviews conducted by other GeoViQua partners was used to validate the information collected first hand. Once the data analysis was complete, the interview notes from the other partners (no full transcripts were available) were carefully reviewed to identify any additional themes. The notes did not reveal new themes or requirements, which further supported the validity of the interview results and the occurrence of data saturation in this study.
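The saturation check described above can be illustrated with a minimal sketch. It assumes each interview has already been coded into a set of theme labels; the function name and the theme labels below are hypothetical and invented purely for illustration, and they are not the themes or the analysis procedure used in this study. The sketch simply counts, for each interview in order, how many themes had not appeared in any earlier interview; a run of zeros at the end is the pattern that suggests saturation.

```python
from typing import List, Set


def new_themes_per_interview(themes_by_interview: List[Set[str]]) -> List[int]:
    """Count the themes in each interview that no earlier interview raised."""
    seen: Set[str] = set()
    counts: List[int] = []
    for themes in themes_by_interview:
        fresh = themes - seen      # themes not seen in any earlier interview
        counts.append(len(fresh))
        seen |= themes             # remember everything raised so far
    return counts


# Hypothetical coded themes (placeholders, not data from this study).
coded = [
    {"theme_a", "theme_b"},
    {"theme_b", "theme_c"},
    {"theme_a", "theme_d"},
    {"theme_d", "theme_b"},
]
counts = new_themes_per_interview(coded)
print(counts)  # [2, 1, 1, 0]
```

In this toy input the final interview contributes no new themes, which is the kind of evidence the paragraph above relies on; in practice the judgement also depends on the qualitative repetition of themes, which a simple count does not capture.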
Verbatim transcripts of the six interview recordings (see Appendix A.2) were generated to support detailed data analysis. A first pass of analysis was conducted in order to derive user stories, that is, very high-level informal statements of requirements that capture what users want to achieve. User stories typically follow the template: "As a <role>, I want <goal/desire> so that <benefit>". These user stories (see Appendix A.3) helped identify high-level requirements for a GEO label. The transcripts were further analysed to identify, in greater detail, the informational facets of importance to users when assessing dataset fitness-for-purpose and to derive detailed user requirements that relate specifically to quality and trustworthiness assessment of datasets for the purpose of making dataset selection decisions (see Appendix A.4). The remainder of this chapter discusses the identified informational facets as they relate to the design of the GEO label.