COMPARISON WITH THE STANDARDS REQUIRED BY THE NEXT LINK IN THE CHAIN
3.16 ANALYSIS OF HEALTH, FOOD SAFETY AND CERTIFICATIONS
The lack of ground truth in developing countries against which to establish the quality and credibility of VGI has motivated the investigation of data provenance as an indicator of VGI quality. Data provenance can provide a valuable dimension when multiple records of the same entity are aggregated to define a final label and improve semantic accuracy. Metadata created in open labelling systems for collaborative projects such as VGI can be examined with semantic accuracy measures to collect folksonomies. While the flexible, collaborative approach of VGI allows rich descriptions of geospatial objects that capture local meanings, it also creates semantic heterogeneity: many users may contribute diverse and conflicting attributes to describe the same object. Unlike the thematic accuracy measure, which examines the extent to which individuals correctly identify and classify objects in the VGI application, semantic accuracy determination is concerned with aggregating multiple records of the same entity from different contributors to define its final label and improve semantic rigour.
Semantic heterogeneity was addressed here with Human Computation (HC) methods (Ballatore et al., 2013; Celino, 2013; Ronzhin, 2015), a technique whereby some computational processes are ‘outsourced’ to humans. In HC, a computer asks a person or group of people to solve a problem, then collects, interprets and integrates their solutions. In VGI this can consolidate contributed datasets from a variety of sources (Law and von Ahn, 2011; Celino, 2013) and address the heterogeneous information collection and semantic accuracy challenges common in VGI. Here, HC consolidates land parcel labels with similar lexical vocabulary, contributed by different volunteers, into a single label. The tag with the highest number of aggregated values is then assigned as the final land parcel label for that entity. HC thus relies on the ‘many eyes’ principle: a label applied to an entity by many contributors can be regarded as its correct classification.
HC is structured in three steps (Figure 5-8): 1) Task definition, where contribution tasks and requirements are clarified to participants; 2) Task execution, where multiple participants are given similar tasks to contribute information; and 3) Task solution, where individual contributions are consolidated and harmonized into a central solution (Celino, 2013). HC addresses the semantic heterogeneity of VGI by consolidating similar contributions into single labels, thus improving quality. In this study, HC was implemented in three stages, as shown in Figure 5-8: a) the researcher gave participants brief demonstrations of the tasks and requirements so that they understood what was required of them when interacting with the VGI application; b) to obtain multiple records, participants were then given similar tasks of identifying, classifying and digitizing land parcels of different land uses in the study area; and c) one of the tasks involved classifying the occupancies of pre-defined land parcels, where similar contributions were later aggregated into a single label.
Figure 5-8. The Human Computation workflow for VGI collection and consolidation (adapted from Celino, 2013).
Contributed datasets from volunteers were converted into the Resource Description Framework (RDF), a World Wide Web Consortium (W3C) specification since 2004, so that they can be manipulated and integrated with other external data over the Web. RDF enables source data (Shapefiles) to be converted into a set of triples (subject, predicate and object) in the task solution stage for semantic accuracy determination. Here, the subject is the unique identifier (parcel number) of the contributed entity, the predicate is the attribute of the entity (e.g. land use), and the object is the value of that attribute (e.g. commercial).
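The subject, predicate and object mapping described above can be illustrated with a minimal Python sketch. It is not the conversion used in the study (which was done in Datalift); the `http://example.org/vgi/` namespace, the `to_ntriples` helper and the sample record are assumptions introduced purely for illustration, and the Shapefile attribute table is assumed to have been read into plain dictionaries already.

```python
from urllib.parse import quote

# Hypothetical namespace for illustration only; not taken from the study.
BASE = "http://example.org/vgi/"


def to_ntriples(records):
    """Turn attribute records into N-Triples lines (subject, predicate, object)."""
    lines = []
    for rec in records:
        # Subject: the unique parcel identifier of the contributed entity.
        subject = f"<{BASE}parcel/{quote(str(rec['parcel_id']))}>"
        for attribute, value in rec.items():
            if attribute == "parcel_id":
                continue
            # Predicate: the attribute of the entity (e.g. land use).
            predicate = f"<{BASE}attr/{quote(attribute)}>"
            # Object: the attribute value, written as a plain literal.
            # Backslashes and double quotes are escaped to keep the literal valid.
            literal = str(value).replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'{subject} {predicate} "{literal}" .')
    return lines


# One assumed attribute row: parcel 101 with land use "commercial".
records = [{"parcel_id": 101, "land_use": "commercial"}]
triples = to_ntriples(records)
for line in triples:
    print(line)
```

Each printed line is one triple, so a parcel with several attributes yields several lines that share the same subject.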
Semantic accuracy was computed using Datalift, an open platform for publishing and interlinking datasets on the Web. SPARQL, a semantic query language for databases, was used in Datalift to query, retrieve and manipulate data contributed by volunteers. SPARQL, a recursive acronym for SPARQL Protocol and RDF Query Language, is a W3C specification (since 2008) and was used here to aggregate and consolidate VGI based on the tags that contributors provide for the same land parcel. The merging process is facilitated by a Natural Language Processing (NLP) technique that uses a text mining algorithm known as Named Entity Disambiguation (NED) to automatically combine similar text snippets from multiple sources into a summary (Manning and Schütze, 1999; Barzilay, 2003).
[Figure 5-8 diagram labels: Task Definition, Task Execution, Task Solution; Clarification of requirements; Contributors 1 to 4; VGI collector; VGI aggregation and consolidation; Consolidated VGI; Visualization in web map.]
SPARQL uses an aggregation algorithm to consolidate VGI tags based on a simple agreement mechanism: as soon as two contributions with similar text content from two different volunteers are recorded, the algorithm is triggered and the contributions are consolidated into a single occupancy label. Figure 5-9 shows a sample SPARQL query on the provenance data that counts and concatenates all land parcel occupancy classifications (‘Occupant_N’) with similar lexical terms into a single output value.
Figure 5-9. Sample SPARQL query on the provenance data.
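The query in Figure 5-9 is not reproduced in the text, so the following Python sketch only mirrors the count-and-concatenate step it is described as performing. The `normalise` and `count_and_concat` helpers, the parcel number 7 and the volunteer identifiers are hypothetical, and lower-casing with whitespace collapsing is an assumed stand-in for the matching of "similar lexical terms".

```python
from collections import defaultdict


def normalise(label):
    """Lower-case and collapse whitespace so trivially different spellings match."""
    return " ".join(label.lower().split())


def count_and_concat(contributions):
    """Group labels per parcel, then count and concatenate matching labels.

    contributions: iterable of (parcel_id, volunteer, label) tuples.
    Returns {parcel_id: {normalised_label: (count, "label | label | ...")}}.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for parcel_id, _volunteer, label in contributions:
        groups[parcel_id][normalise(label)].append(label)
    return {
        parcel: {key: (len(raw), " | ".join(raw)) for key, raw in labels.items()}
        for parcel, labels in groups.items()
    }


# Assumed sample contributions for one parcel.
contributions = [
    (7, "v1", "Pilane Brick Moulding"),
    (7, "v2", "pilane brick moulding"),
    (7, "v3", "Ga Thabo"),
]
result = count_and_concat(contributions)
print(result[7])
```

In SPARQL terms this corresponds roughly to a `GROUP BY` over the parcel and label with `COUNT` and `GROUP_CONCAT` aggregates, although the study's actual query may differ.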
Every time a new contribution is made, the algorithm compares it with previously stored labels to determine whether consolidation should occur. The aggregated results are then displayed as HTML as the final label (for a land parcel, this could describe ‘occupancy’ or ‘land use’). The HC approach shows how VGI provenance can be leveraged in a data aggregation and consolidation activity to improve VGI quality, based on the similar words that volunteers use to describe land parcels in the study area. It then creates a hierarchy of the remaining land parcel labels.
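The incremental behaviour described above, where each new contribution is compared with stored labels and consolidation is triggered once two different volunteers agree, might be sketched as follows. This is an illustrative assumption rather than the Datalift implementation: the `LabelConsolidator` class, the 0.85 similarity threshold and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all introduced here for the example.

```python
from difflib import SequenceMatcher


class LabelConsolidator:
    """Keeps clusters of similar labels for one land parcel."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold  # assumed similarity cut-off
        self.clusters = []  # each cluster: {"label": str, "volunteers": set}

    @staticmethod
    def _similarity(a, b):
        # Case-insensitive character-level similarity in the range [0, 1].
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def add(self, volunteer, label):
        """Store a contribution; return the consolidated label once two
        different volunteers have supplied similar text, otherwise None."""
        for cluster in self.clusters:
            if self._similarity(cluster["label"], label) >= self.threshold:
                cluster["volunteers"].add(volunteer)
                if len(cluster["volunteers"]) >= 2:
                    return cluster["label"]
                return None
        # No similar label stored yet: start a new cluster.
        self.clusters.append({"label": label, "volunteers": {volunteer}})
        return None

    def ranking(self):
        """Labels ordered by number of supporting volunteers (a simple hierarchy)."""
        pairs = [(c["label"], len(c["volunteers"])) for c in self.clusters]
        return sorted(pairs, key=lambda pair: -pair[1])


consolidator = LabelConsolidator()
print(consolidator.add("v1", "Pilane Brick Moulding"))  # first record, nothing to merge
print(consolidator.add("v2", "pilane brick moulding"))  # second volunteer agrees
print(consolidator.add("v3", "Ga Thabo"))               # dissimilar label, new cluster
print(consolidator.ranking())
```

A set of volunteer identifiers is used so that repeated submissions by the same person do not trigger consolidation on their own.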
Implementation of Human Computation in the study area
For the semantic accuracy measure, multiple labels from the tagging process were used as input data. This information was then converted into RDF and analysed using SPARQL. For example, 15 participants were asked to label the occupancy of a brick moulding plant in Pilane. The actual name of the plant is Pilane Brick Moulding. However, it is the norm in the area for some community members, particularly older people, to describe land parcel occupancies by the owner's name or by the land use of the place rather than by its actual name. Despite the different labels received, some participants labelled the plant correctly. The tags received for this entity were therefore: a) Pilane Brick Moulding (8), b) Ga Thabo (Thabo’s place) (3), and c) Ko Diteneng (the place where bricks are made) (4).
The SPARQL aggregation algorithm counted and concatenated all land parcel occupancy labels with similar lexical terms into a single output value. The consolidated outputs can later be incorporated into the VGI application as the final occupancy labels of the land parcels. This increases the semantic accuracy of a land parcel’s classification, because the occupancy label with the highest number of similar contributions is consolidated as its final classification label.
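As a worked illustration of this final step, the tag counts reported for the Pilane brick moulding plant (8, 3 and 4 out of 15) can be tallied in a few lines of Python. The `rank_labels` helper is hypothetical, and the order of the individual tags in the list is an assumption; only the totals come from the text.

```python
from collections import Counter


def rank_labels(tags):
    """Return (label, count) pairs ordered from most to least frequent."""
    return Counter(tags).most_common()


# Tag totals as reported for the 15 participants; the ordering is assumed.
tags = (
    ["Pilane Brick Moulding"] * 8
    + ["Ga Thabo"] * 3
    + ["Ko Diteneng"] * 4
)

ranking = rank_labels(tags)
final_label, votes = ranking[0]
share = votes / len(tags)

print(final_label, votes, round(share, 2))  # Pilane Brick Moulding 8 0.53
print(ranking[1:])  # remaining labels, in descending order of support
```

Here the correct name wins with 8 of 15 contributions, slightly over half. A tie between labels would need an explicit rule, which the text does not specify.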