The qualitative interviews were conducted with six experts who carry out research in the social and political sciences. The interviews were recorded and afterwards transcribed verbatim. We then conducted a qualitative content analysis of the interviews.
Analysis Categories
Following the guidelines for qualitative content analysis, we have defined four analysis categories. We have analysed the interviews considering these categories in the context of the integration and connection of research data and information.
1 http://no23.de/no23web/MP3_OGG_Aufnahme_Software.aspx?smi=1
3 Linked Data Consumption in the Social Sciences
All of the presented categories are required in order to investigate a beneficial use of Linked Data for social scientific research. The categories have been abstracted from our guideline and are presented in the following list:
• Working Tasks. Statements about concrete working tasks deriving from the information needs of the interviewed experts are considered in this category. This supports an understanding of the context in which Linked Data standards and technologies can be applied beneficially.
• Problems and Challenges. In this category, statements regarding problems and challenges during the integration and connection of data and information are examined. An analysis of these statements aims to identify connection points for technical solutions and implementations of Linked Data standards and technologies.
• Benefits of Linked Data³. Statements regarding the potential benefits of a linked and connected representation and availability of research data and information are analysed in this category. This also includes requests of the participants regarding what would be helpful or supportive for their work. In this category, we aim to identify true benefits of Linked Data for social science research, which are generated by the use of Semantic Web standards and technologies.
A major result of the interviews is that two experts had heard of Linked Data, but none had ever consumed Linked Data sets for research. Further results of the interviews are ordered by the three analysis categories.
Working Tasks
All six experts referred to various precise information needs that emerge directly from their research or service work. Summarizing these research intentions as working tasks, four central tasks can be identified. The two most frequent tasks are the design of a custom data set that comprises and integrates variables and data from distributed data sets, and the merging and accumulation of data according to a specific variable, mostly over a range of time. Five of the six experts mentioned that they typically integrate or merge data by enriching a first data set with additional data from a second data set via a specific key variable. This key variable usually contains entries of taxonomies or code lists such as geographical codes, political parties, etc. Additionally, one expert each mentioned the extension and documentation of data sets with context information such as literature, and the search for data.
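The two most frequent tasks can be illustrated with a minimal sketch. It is not taken from the interviews: the data sets, the field names, the helper functions `enrich` and `mean_by`, and all values are hypothetical placeholders chosen only to show enrichment via a key variable followed by an accumulation over time.

```python
from collections import defaultdict

# Hypothetical first data set: one record per country code and year.
# The rates are made-up placeholder values, not real statistics.
first_data_set = [
    {"country": "DE", "year": 2009, "rate": 4.0},
    {"country": "DE", "year": 2010, "rate": 6.0},
    {"country": "FR", "year": 2009, "rate": 8.0},
    {"country": "FR", "year": 2010, "rate": 10.0},
]

# Hypothetical second data set, keyed by entries of the same code list.
second_data_set = {
    "DE": {"name": "Germany"},
    "FR": {"name": "France"},
}


def enrich(records, lookup, key):
    """Return copies of the records, extended by the lookup entry for record[key]."""
    enriched = []
    for record in records:
        extra = lookup.get(record[key], {})  # stays empty if the key has no match
        enriched.append({**record, **extra})
    return enriched


def mean_by(records, key, value):
    """Average `value` per distinct `key`, e.g. per country over all years."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    return {k: sum(v) / len(v) for k, v in groups.items()}


merged = enrich(first_data_set, second_data_set, "country")
averages = mean_by(merged, "country", "rate")
```

The sketch only works this smoothly because both data sets share the same code list for the key variable; once the entries differ between providers, they have to be mapped to each other first.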
Problems and Challenges
Although the statements of the experts were similar in the first analysis category, their personal problems and challenges during these tasks show a broad variety. This could have been expected because of the variety of research questions and interests in social science research. Most of the problems can be grouped into four topics: data retrieval, data access, data documentation and modelling, and data matching and integration.
3 In order to avoid misunderstandings of the technical idea and concepts of LOD, we have used an abstracted notion of interlinked and connected data during the interviews.
3.1 Qualitative Interviews
• Data Retrieval. All six experts mentioned that gathering research-relevant data is one of the most time-consuming tasks during research. This problem occurs not only on the web, where it is not always known which data is available where and to what extent. Four of the experts also related the problem to the non-digital world, where agencies and organizations have to be asked whether specific data is available. The problem is complicated by the fact that published data is often incomplete, e.g. some values have been proven wrong, have been lost over time, or have never been collected for a specific country.
• Data Access. Four experts complained that relevant data is not always available at the required level of aggregation, i.e. data for a particular variable or indicator is not always available for, e.g., countries, districts, and cities simultaneously. Another problem (stated by two experts) is that some information is only available for payment. This especially includes data mappings whose creation has been expensive and extensive, e.g. geographical coordinates with specific geographical context information. Three experts mentioned that when working with data on the individual level, data privacy restrictions hinder researchers from accessing and, especially, reusing particular collected data. This is especially the case when aiming to connect sensitive data with further context information, which might allow for an identification of individual persons. Five experts concluded that when specific data is unavailable or when information is missing, e.g. inside a time series, the researcher can either leave it out, taking the original research intention into account, or look for alternatives or try to reconstruct or calculate the missing data value or variable himself or herself. Again, five experts mentioned that such a reconstruction or calculation is common practice.
• Data Documentation and Modelling. Additional problems concerning the search and intended use of data lie in their lack of documentation. Often, not all the specific attributes of data items are included in the documentation and are, therefore, unavailable to the user. However, changes in variable definitions or questions, e.g. over time, are of high relevance for researchers. Three experts complained of this problem. Also, differences in the definitions of variables, e.g. unemployment, between different data providers are referred to as relevant decision criteria by two participants of the interviews. It is not unusual that such specific information about data is not available in its documentation because the data has not yet been preprocessed for scientific use, i.e. necessary information about the process of data collection, how variables are constructed and defined, or information on question filtering is missing (stated by one expert).
• Data Matching and Integration. Major challenges regarding the mapping of variables and, specifically, the mapping of entries of code lists for these variables can be identified. Since data is typically matched or enriched with context information according to a key variable, e.g. countries or political parties, not only does the key variable itself have to be identified within the involved data sets, but its possible entries also have to be mapped to each other. Three experts referred to this problem. Moreover, this challenge varies in complexity. In some cases, it can be carried out very easily and clearly, e.g. the mapping of country names. But problems can occur during such mappings if there are ambiguous or incomplete mappings (stated by two experts), which is, for instance, the case with administrative districts and electoral wards. In some cases, entries of such code lists describe different granularities or summarize attributes, which also increases the complexity.
• Data Preprocessing and Data Reuse. In general, two further aspects have been mentioned by the experts as being problematic and time-consuming. Before analysing data, major effort has to be put into the preprocessing of data (three experts), which typically means the conversion of data from formats like PDF or printed documents into formats required by statistical tools. However, all experts also emphasize that this effort need not be made in many cases because a lot of data is available in processable formats. The second aspect is that a lot of work, especially regarding the mapping of variables or structural information, is carried out repeatedly by researchers, although one can be sure that this specific work has already been done by others. However, the results of this work are mostly not available, as two experts mentioned.
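The matching problem described above, where entries of one code list map clearly, ambiguously, or not at all to entries of another, can be sketched as follows. This is an illustration under stated assumptions, not a procedure reported by the experts: the function `match_code_lists`, the ward and district names, and the mapping table are all hypothetical.

```python
def match_code_lists(source_entries, mapping):
    """Split source entries into clear, ambiguous, and unmapped matches.

    `mapping` maps a source entry to a list of candidate target entries.
    """
    clear, ambiguous, unmapped = {}, {}, []
    for entry in source_entries:
        candidates = mapping.get(entry, [])
        if len(candidates) == 1:
            clear[entry] = candidates[0]      # exactly one target: safe to merge
        elif candidates:
            ambiguous[entry] = candidates     # several targets: needs a decision
        else:
            unmapped.append(entry)            # no target: mapping is incomplete
    return clear, ambiguous, unmapped


# Hypothetical mapping from electoral wards to administrative districts.
ward_to_district = {
    "Ward 1": ["District A"],
    "Ward 2": ["District A", "District B"],  # ward spans two districts
}

clear, ambiguous, unmapped = match_code_lists(
    ["Ward 1", "Ward 2", "Ward 3"], ward_to_district
)
```

Separating the three cases makes explicit which entries can be merged automatically and which require the manual work that, according to the experts, is repeated by many researchers.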
Benefits of Linked Data
In order to focus on the main ideas behind LOD, and since none of the participants had ever used Linked Data sets, technical details were not discussed with the participants. Instead, alternative scenarios that accord with their research tasks were presented and discussed. In these scenarios, specific data sets, variables or sources of context information are already connected and interlinked. Also, a detailed and fine-grained description of data and information is available.
Three major benefits have been identified by the experts.
1. Data documentation. The first contribution is seen in a more detailed and fine-grained description of data with respect to the specific information necessary for scientific use. This has been stated by three experts.
2. Data linking. Three experts stated that the enrichment of data, e.g. statistics, with context information would add value for researchers. However, they also have doubts about how to choose context information to link to that is relevant to as large a group of researchers as possible. This once again relates to the variety of research interests.
3. Data matching. The third expected benefit, which has also been named by three experts, is the creation and availability of mappings between entries of code lists, which are used when enriching or integrating data according to specific key variables. This especially refers to the mappings of structural information like geographical regions, electoral wards, etc. that have to be done repeatedly and which could be omitted if such mapping information were available for reuse.
3.2 Prototypical Semantic Data Library for the Social Sciences
An additional benefit identified by the participants lies in the possibility of easily searching for available and relevant data based on detailed data documentation. This was named by two experts. The following benefits were each stated by one expert: a technical possibility for easily connecting data, better cross-search over data sets for, e.g., similar variables, and access to everything that they need at once, i.e. data that simultaneously offers precise and detailed documentation, relevant literature and context information.
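To make the data matching benefit more concrete, the following sketch shows how reusable mappings between code list entries could be represented as interlinked statements and then looked up instead of being recreated. It is an assumption-laden illustration and was not discussed with the participants: the `example.org` identifiers, the triples, and the function `equivalents` are hypothetical. The only real vocabulary term used is the SKOS property `skos:exactMatch`, which the sketch treats as symmetric and transitive.

```python
from collections import defaultdict, deque

# SKOS property for linking equivalent concepts across schemes.
EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch"

# Hypothetical published mappings between region codes of three data providers.
triples = [
    ("http://example.org/provider-a/region/0815", EXACT_MATCH,
     "http://example.org/provider-b/region/R-15"),
    ("http://example.org/provider-b/region/R-15", EXACT_MATCH,
     "http://example.org/provider-c/region/xyz"),
]


def equivalents(entry, triples):
    """Return all entries reachable from `entry` via exactMatch links."""
    neighbours = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == EXACT_MATCH:
            # Treat the link as symmetric so it can be followed in both directions.
            neighbours[subject].add(obj)
            neighbours[obj].add(subject)
    seen, queue = {entry}, deque([entry])
    while queue:  # breadth-first traversal follows chains of mappings
        for nxt in neighbours[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {entry}


matches = equivalents("http://example.org/provider-c/region/xyz", triples)
```

In this sketch, a researcher who starts from the code of the third provider obtains the corresponding codes of the other two providers without creating any mapping by hand, which is the kind of reuse the experts asked for.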