RDF was designed to provide a general model for data representation. As a formalism, it specifies very little about the nature of the content to be represented, limiting itself to the structure used to represent it.
Publishing data on the Semantic Web requires the use of shared models, or vocabularies, which allow us to define unambiguously the meaning of the information contained in the datasets. Vocabularies can be described using a set of primitives declared in the RDF Schema (RDFS) specification [Guha and Brickley (2014)]. These primitives are:
– rdfs:Resource - the class of anything referred to by an IRI or a blank node, or being a data value.
– rdfs:Class - the class of all classes. For example foaf:Person is an rdfs:Class.
– rdfs:Literal - the class of data values.
– rdfs:Datatype - the class of literal datatypes. For example xsd:int is an rdfs:Datatype.
– rdf:Property - the class of properties. For example foaf:name is an rdf:Property.
– rdf:XMLLiteral - the class of literals of type XML. In other words, the datatype of XML literals:
rdf:XMLLiteral
    rdf:type rdfs:Datatype ;
    rdfs:subClassOf rdfs:Literal .
– rdf:type - the property that assigns an rdfs:Class to a resource.
– rdfs:label - assigns a literal as a human-readable placeholder for the resource.
– rdfs:comment - annotates the resource with a comment.
– rdfs:subClassOf - links a subclass to a superclass. For example foaf:Person rdfs:subClassOf foaf:Agent.
– rdfs:subPropertyOf - links a property to a super property. For example skos:prefLabel rdfs:subPropertyOf rdfs:label.
– rdfs:domain - specifies the type of the source of a property. For example:
dcat:downloadURL
    rdf:type rdf:Property ;
    rdfs:domain dcat:Distribution .
– rdfs:range - similarly, specifies the type of the target of a property. For example:
dcat:downloadURL
    rdf:type rdf:Property ;
    rdfs:range rdfs:Resource .
– rdfs:seeAlso - annotates the resource linking it to another resource of interest.
– rdfs:isDefinedBy - connects a resource to the resource defining it. It can be used to link a class to the document where it is described.
(We leave out other primitive vocabulary elements, like containers, collections or reification, as we will not make use of them in this work.)
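The effect of rdfs:subClassOf, rdfs:domain and rdfs:range on a dataset can be illustrated with a minimal sketch. It uses only the Python standard library and represents triples as plain tuples of prefixed names. The function `rdfs_closure` and the resources `ex:alice`, `ex:dist1` and `ex:file.csv` are hypothetical names introduced for illustration, and only a small subset of the RDFS entailment rules is covered; this is not a complete reasoner.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"


def rdfs_closure(triples):
    """Apply a small subset of RDFS rules until no new triple is derived."""
    inferred = set(triples)
    changed = True
    while changed:
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                if p == SUBCLASS:
                    # subClassOf is transitive
                    if p2 == SUBCLASS and s2 == o:
                        new.add((s, SUBCLASS, o2))
                    # an instance of the subclass is an instance of the superclass
                    if p2 == RDF_TYPE and o2 == s:
                        new.add((s2, RDF_TYPE, o))
                elif p == DOMAIN and p2 == s:
                    # the subject of the property gets the domain class
                    new.add((s2, RDF_TYPE, o))
                elif p == RANGE and p2 == s:
                    # the object of the property gets the range class
                    new.add((o2, RDF_TYPE, o))
        changed = not (new <= inferred)
        inferred |= new
    return inferred


# Schema statements taken from the examples in the text.
schema = {
    ("foaf:Person", SUBCLASS, "foaf:Agent"),
    ("dcat:downloadURL", DOMAIN, "dcat:Distribution"),
    ("dcat:downloadURL", RANGE, "rdfs:Resource"),
}
# Hypothetical instance data (the ex: prefix is an assumption).
data = {
    ("ex:alice", RDF_TYPE, "foaf:Person"),
    ("ex:dist1", "dcat:downloadURL", "ex:file.csv"),
}

closure = rdfs_closure(schema | data)
for triple in sorted(closure - schema - data):
    print(triple)
```

Under these assumptions, the derived triples state that `ex:alice` is a foaf:Agent, that `ex:dist1` is a dcat:Distribution, and that `ex:file.csv` is an rdfs:Resource, none of which was asserted explicitly in the data.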
Vocabularies are descriptions of classes and properties to be used when developing RDF data on the Web. The value of these descriptions lies in the fact that they are (a) shared conceptualisations that can be used by different systems independently and in communication, (b) openly available on the Web, so that the meaning of each property and class is explicit and can be acquired from the data by following its IRI, and (c) usable in combination, so that data publishers can pick terms from several vocabularies to define new data models that fit their needs. Vocabularies are generally developed to fit specific use cases. The Friend of a Friend (foaf) vocabulary was developed to describe persons and their social networks on the Web, and the Simple Knowledge Organization System (skos) to support the development of conceptual taxonomies, like the ones used to organize media objects in content management systems or to publish scientific terminology in RDF (such as thesauri in the library domain). The Dublin Core Metadata Initiative (DCMI) maintains the definition of a set of terms widely used across many domains, such as dc:creator or dc:publisher, to mention just two prominent examples. The Linked Data (LD) promise of developing an interlinked Web of Data that can be queried as a giant distributed database is based on shared vocabularies able to represent any type of knowledge. Vocabularies therefore have different levels of sophistication. The Creative Commons Rights Expression Language (cc) includes the terms necessary to link resources to Creative Commons licenses on the Web. With the Time Ontology (time) [Cox and Little (2016)] it is possible to express temporal entities like dates, time points, time sequences and intervals. The RDF Data Cube Vocabulary [Reynolds and Cyganiak (2014a)] has been developed to support the publishing of multidimensional data, a common information metamodel, particularly in the domain of statistics.
However, shared vocabularies are only one part of data understanding and reuse. An equally important one is the capability of linking resources between datasets. In Linked Data, resources are linked from one dataset to another by means of the owl:sameAs property. This technique is at the core of the Linked Data infrastructure. As a result, some public databases take on an important role as entity naming systems. This is the case of DBpedia, a large database generated from Wikipedia that is one of the hubs of Linked Data. Geonames, similarly, publishes a list of geographical toponyms that can be directly used or linked. Shared identifiers are as crucial as shared schemas for content reuse. For example, Named Entity Recognition systems can be deployed to link textual content (like web pages) and data.
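The way owl:sameAs links let a consumer treat identifiers from different datasets as one entity can be sketched as follows. This is an illustration under stated assumptions, not a description of how any particular service works: the function `same_as_clusters` is a hypothetical helper, the two `example` IRIs are placeholders, and the grouping uses a simple union-find structure over triples held as tuples.

```python
def same_as_clusters(triples):
    """Group identifiers connected, directly or indirectly, by owl:sameAs."""
    parent = {}

    def find(node):
        # Return the representative of the cluster that contains node.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for s, p, o in triples:
        if p == "owl:sameAs":
            root_s, root_o = find(s), find(o)
            if root_s != root_o:
                parent[root_o] = root_s

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())


# Placeholder IRIs; only the DBpedia one follows a real naming pattern.
links = [
    ("http://data.example.org/city/paris", "owl:sameAs",
     "http://dbpedia.org/resource/Paris"),
    ("http://dbpedia.org/resource/Paris", "owl:sameAs",
     "http://other.example.net/id/42"),
    ("http://data.example.org/city/paris", "rdfs:label", "Paris"),
]

clusters = same_as_clusters(links)
for cluster in clusters:
    print(sorted(cluster))
```

In this sketch the first and third identifiers end up in the same cluster even though no triple links them directly, which is the effect that hub datasets have when many publishers link to the same identifier.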
The W3C developed a set of guidelines for Linked Data publishers [Atemezing et al. (2014)]:
STEP #1 PREPARE STAKEHOLDERS: Prepare stakeholders by explaining the process of creating and maintaining Linked Open Data.
STEP #2 SELECT A DATASET: Select a dataset that provides benefit to others for reuse.
STEP #3 MODEL THE DATA: Modelling Linked Data involves representing data objects and how they are related in an application-independent way.
STEP #4 SPECIFY AN APPROPRIATE LICENSE: Specify an appropriate open data license. Data reuse is more likely to occur when there is a clear statement about the origin, ownership, and terms related to the use of the published data.
STEP #5 GOOD URIs FOR LINKED DATA: The core of Linked Data is a well-considered URI naming strategy and implementation plan, based on HTTP URIs. Consideration for naming objects, multilingual support, data change over time and persistence strategy are the building blocks for useful Linked Data.
STEP #6 USE STANDARD VOCABULARIES: Describe objects with previously defined vocabularies whenever possible. Extend standard vocabularies where necessary, and create vocabularies (only when required) that follow best practices whenever possible.
STEP #7 CONVERT DATA: Convert data to a Linked Data representation. This is typically done by a script or other automated processes.
STEP #8 PROVIDE MACHINE ACCESS TO DATA: Provide various ways for search engines and other automated processes to access data using standard Web mechanisms.
STEP #9 ANNOUNCE NEW DATA SETS: Remember to announce new data sets on an authoritative domain. Importantly, remember that as a Linked Open Data publisher, an implicit social contract is in effect.
STEP #10 RECOGNIZE THE SOCIAL CONTRACT: Recognize your responsibility in maintaining data once it is published. Ensure that the dataset(s) remain available where your organization says it will be and is maintained over time. [Atemezing et al. (2014)]
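Step #7 notes that conversion is typically done by a script. A minimal sketch of such a script is given below, using only the Python standard library. It turns CSV rows into N-Triples lines typed with foaf:Person and described with foaf:name. The base IRI `http://data.example.org/person/`, the column names `id` and `name`, the sample rows and the function names are all assumptions made for illustration; a real conversion would follow the URI strategy chosen in Step #5 and the vocabularies selected in Step #6.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://data.example.org/person/"  # hypothetical base IRI
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(value):
    """Escape the characters that N-Triples requires to be escaped in literals."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def rows_to_ntriples(rows):
    """Convert dict-like rows with 'id' and 'name' keys into N-Triples lines."""
    lines = []
    for row in rows:
        # Percent-encode the identifier so it is safe inside an IRI.
        subject = "<" + BASE + quote(row["id"], safe="") + ">"
        name = escape_literal(row["name"])
        lines.append(f"{subject} <{RDF_TYPE}> <{FOAF}Person> .")
        lines.append(f'{subject} <{FOAF}name> "{name}" .')
    return lines


# Made-up sample data standing in for the source dataset.
SAMPLE_CSV = 'id,name\n1,Alice\n2,"Bob ""Bobby"" Smith"\n'

lines = rows_to_ntriples(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print("\n".join(lines))
```

N-Triples is used here because each line is a complete statement, which keeps the script simple; the output can later be loaded into a triple store or converted to another serialisation.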
Linked Open Data (LOD) consists not only of data and the machines exchanging them, but also of the practices and developers involved in maintaining it. These include services that support anyone who wants to interact with Linked Data as a publisher or consumer. Some of them are:
1. http://datahub.io - the Open Knowledge Foundation catalogue of open datasets
2. http://lov.okfn.org/ - the Linked Open Vocabularies project, to find and choose vocabulary terms
3. http://yasgui.org/ - A user interface to query SPARQL endpoints
4. http://sameas.org/ - To find co-references between different datasets
5. http://prefix.cc - A service to obtain popular namespaces and their prefixes
6. http://lodlaundromat.org/ - A database made of the crawling and cleaning of the LOD
Examples of Linked Open Datasets that have a central position in the LOD cloud because of their role as named entity systems are:
2. http://www.geonames.org/ - a geographical database with millions of place names
3. http://sparql.europeana.eu/ - European Cultural Heritage data as LD
Semantic Web data can be published in various ways, from downloadable files to annotations embedded in HTML (using techniques such as RDFa and vocabularies such as http://schema.org/). The follow-your-nose approach is based on dereferencing the IRIs mentioned in RDF datasets in order to find related information. This method, also called graph traversal, enables the discovery of new data in much the same way as a user navigates the Web by following links in HTML pages. HTTP content negotiation allows programs to request the data in a specific serialisation format, choosing from the many available for RDF: RDF/XML, Turtle, JSON-LD, N-Triples, TriX, and so on. Probably the primary method to consume RDF datasets is the SPARQL Protocol and Query Language.
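The two access methods mentioned above, content negotiation and the SPARQL Protocol, can be sketched with the Python standard library. The block only builds the HTTP requests and does not send them. The helper names `negotiate` and `sparql_get` and the endpoint `http://example.org/sparql` are assumptions introduced for illustration, and whether a given server honours the requested media type depends on that server.

```python
from urllib.parse import urlencode
from urllib.request import Request


def negotiate(iri, media_type="text/turtle"):
    """Build a GET request asking for a specific RDF serialisation of an IRI."""
    return Request(iri, headers={"Accept": media_type})


def sparql_get(endpoint, query):
    """Build a SPARQL Protocol query request using the 'query' URL parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# Dereferencing an IRI while asking for Turtle instead of an HTML page.
req = negotiate("http://dbpedia.org/resource/Paris")

# A query against a hypothetical endpoint.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"
sparql_req = sparql_get("http://example.org/sparql", QUERY)

print(req.get_method(), req.full_url, req.get_header("Accept"))
print(sparql_req.get_method(), sparql_req.full_url)
# To actually send a request, pass it to urllib.request.urlopen(...).
```

A follow-your-nose client would repeat the first kind of request for each IRI found in the data it has already retrieved, while a SPARQL client delegates the selection of relevant triples to the endpoint.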