As noted in Section 2.1, RDF datasets consist of triples, each containing three terms (subject, predicate and object) that are represented as strings. In large datasets, these string representations occupy many bytes and take up a large amount of storage space. This is particularly true of datasets in N-Triples format that have long URI references (URIrefs) or literals. Additionally, transferring such data over the network incurs increased latency. Although Gzip compression can be used to compress RDF datasets, it is difficult to parse and process these datasets without decompressing them first, which imposes a computational overhead. There is, therefore, a need for a compression mechanism that maintains the semantics of the data. Consequently, many large-scale reasoners such as BigOWLIM [56] and WebPIE [20] adopt dictionary encoding. Dictionary encoding encodes each of the unique URIs in an RDF dataset using numerical identifiers, such as integers that occupy only 8 bytes each.
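The basic idea can be illustrated with a minimal sketch. It is not the implementation used by any of the cited systems: the function names are hypothetical, IDs are assigned in order of first appearance, and, as a simplifying assumption, every term (not only URIs) is encoded.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
EncodedTriple = Tuple[int, int, int]


def encode_triples(triples: List[Triple]) -> Tuple[List[EncodedTriple], Dict[str, int]]:
    """Replace every term with an integer ID (assumption: IDs follow first appearance)."""
    dictionary: Dict[str, int] = {}
    encoded: List[EncodedTriple] = []
    for triple in triples:
        ids = []
        for term in triple:
            # A term seen for the first time receives the next free ID.
            if term not in dictionary:
                dictionary[term] = len(dictionary)
            ids.append(dictionary[term])
        encoded.append((ids[0], ids[1], ids[2]))
    return encoded, dictionary


def decode_triples(encoded: List[EncodedTriple], dictionary: Dict[str, int]) -> List[Triple]:
    """Invert the dictionary to recover the original string terms."""
    reverse = {term_id: term for term, term_id in dictionary.items()}
    return [(reverse[s], reverse[p], reverse[o]) for s, p, o in encoded]
```

Because each repeated term is stored as a string only once in the dictionary, the encoded triples can be compared and joined as integers and decoded only when the original strings are needed.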
2.4.1 MapReduce-based Dictionary Encoding
In order to perform large-scale reasoning, Urbani et al. adopt an upfront dictionary encoding [67] based on MapReduce to reduce the data size. The creation of the dictionary and the encoding of the data are distributed between a number of nodes running the Hadoop framework. Initially, the most popular terms are sampled and encoded into a dictionary table; since this table is small, it is held in main memory on each of the nodes. The system then deconstructs the statements and encodes each of the terms whilst building up the dictionary table. To avoid ID clashes, each node assigns IDs from the range of numbers allocated to it. The first 4 bytes of the identifier store the identifier of the task that processed the term, and the last 4 bytes are used as an incremental counter within the task.
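The 8-byte identifier layout described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the names `make_id`, `split_id` and `TaskIdAllocator` are hypothetical, and the "first 4 bytes" are assumed to be the most significant ones.

```python
COUNTER_BITS = 32                        # last 4 bytes: per-task counter
TASK_BITS = 32                           # first 4 bytes: task identifier
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def make_id(task_id: int, counter: int) -> int:
    """Pack a task identifier and a counter into one 8-byte integer."""
    if not 0 <= task_id < (1 << TASK_BITS):
        raise ValueError("task_id does not fit in 4 bytes")
    if not 0 <= counter <= COUNTER_MASK:
        raise ValueError("counter does not fit in 4 bytes")
    return (task_id << COUNTER_BITS) | counter


def split_id(term_id: int) -> tuple:
    """Recover (task_id, counter) from a packed identifier."""
    return term_id >> COUNTER_BITS, term_id & COUNTER_MASK


class TaskIdAllocator:
    """Hands out IDs for one task; tasks with different IDs cannot clash."""

    def __init__(self, task_id: int) -> None:
        self.task_id = task_id
        self._next_counter = 0

    def next_id(self) -> int:
        term_id = make_id(self.task_id, self._next_counter)
        self._next_counter += 1
        return term_id
```

Since the task identifier occupies the upper half of every ID, two tasks can assign identifiers concurrently without coordinating, at the cost of limiting each task to 2^32 distinct terms.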
A similar approach was adopted for decompression, and the authors experimented with a number of settings such as the popular-term cache. They report the compression of 1.1 billion triples of the LUBM [66] dataset in 1 hour and 10 minutes, with a 1.9 GB dictionary. Although this approach is scalable in terms of both input size and number of computing nodes, the generated dictionaries take more than an hour to build and are in most cases larger than 1 GB. This poses a challenge when loading the whole dictionary into main memory and imposes an I/O overhead, as the dictionary file needs to be searched on disk. These large dictionaries result from the fact that no special consideration is given to the common parts of the URIs, such as namespaces, which are therefore included numerous times in the dictionary.
2.4.2 Supercomputer-based Dictionary Encoding
Weaver and Hendler [21] use a parallel dictionary encoding approach on the IBM Blue Gene/Q, utilising the IBM General Parallel File System (GPFS). Due to disk quota restrictions, they perform LZO [80] compression on the datasets before the dictionary encoding. LZO is a fast block compression algorithm that enables the compressed data to be split into blocks. This feature is utilised so that processes can operate directly on the compressed data blocks. Subsequently, the compressed file blocks are partitioned equally between the processors, and the processors collectively access the file and start encoding the data. The encoded data is written to separate output files, one for each processor. Additionally, when encoding the data, processors communicate with each other using MPI to resolve the numeric identifier for each of the terms. For the dictionary encoding of 1.3 billion triples of the LUBM dataset [81], they report a total runtime of approximately 28 minutes when utilising 64 processors and 50 seconds when utilising 32,768 processors. Both reported runtimes exclude the time required to perform the LZO compression on the datasets. The reported total size is 23 GB for the dictionary and 29.8 GB for the encoded data.
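The equal partitioning of compressed blocks between processors can be sketched with simple index arithmetic. The exact scheme used in [21] is not described here, so the following is only one plausible reading: the function name `block_range` is hypothetical, blocks are assumed to be assigned in contiguous runs, and any remainder is spread over the lowest-ranked processors. The collective file access and the MPI communication are deliberately left out.

```python
def block_range(num_blocks: int, num_procs: int, rank: int) -> range:
    """Return the contiguous run of block indices handled by processor `rank`."""
    if not 0 <= rank < num_procs:
        raise ValueError("rank must be in [0, num_procs)")
    base, extra = divmod(num_blocks, num_procs)
    # The first `extra` ranks each take one additional block,
    # so no two processors differ by more than one block.
    start = rank * base + min(rank, extra)
    count = base + (1 if rank < extra else 0)
    return range(start, start + count)
```

Each processor can compute its own range from its rank alone, so no communication is needed to decide which blocks to read.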
2.4.3 DHT-based Dictionary Encoding
A dictionary encoding based on a DHT network is presented by Kaoudi et al. [82] to provide efficient encoding for SPARQL queries. In this approach, the dictionary's numerical IDs are composed of two parts: the unique peer identifier and a local numerical counter. When new triples are added to the network, they are encoded by the receiving peers and are resent through the network alongside their dictionary entries. As noted previously, this approach further exacerbates the network congestion issue known with DHTs, due to the traffic of not only the triples but also their encoding.
2.4.4 Centralised Dictionary Encoding
A comparison of RDF compression approaches is provided by Fernández et al. [83]. They compare three approaches, namely Gzip compression, adjacency lists and dictionary encoding. Adjacency lists concentrate the repetition found in RDF statements and achieve high compression rates when the data is further compressed using Gzip. They also show that datasets with a large number of sequentially named URIs can result in a dictionary that is highly compressible. Additionally, they show that dictionary encoding of literals can increase the size of the triple representation, especially when the dataset contains a variety of literals, and hence conclude that literals need finer-grained approaches.
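The layout of the adjacency lists is not detailed here. A minimal sketch of one conventional reading, in which triples are grouped by subject and then by predicate so that repeated terms are stored once, is given below. The function names are hypothetical and the triples are assumed to be already dictionary encoded as integers.

```python
from typing import Dict, List, Tuple

Adjacency = Dict[int, Dict[int, List[int]]]


def to_adjacency_lists(triples: List[Tuple[int, int, int]]) -> Adjacency:
    """Group triples as subject -> predicate -> list of objects."""
    adjacency: Adjacency = {}
    for s, p, o in triples:
        adjacency.setdefault(s, {}).setdefault(p, []).append(o)
    return adjacency


def stored_terms(adjacency: Adjacency) -> int:
    """Count the terms that the grouped form has to store."""
    total = 0
    for predicates in adjacency.values():
        total += 1                      # the subject, stored once
        for objects in predicates.values():
            total += 1 + len(objects)   # the predicate once, plus its objects
    return total
```

For the four triples (1, 2, 3), (1, 2, 4), (1, 5, 6) and (7, 2, 3), the flat form stores 12 terms whereas the grouped form stores 9, which illustrates where the saving comes from before any general-purpose compressor is applied.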
A dictionary approach for the compression of long URIrefs in RDF/XML documents was presented by Lee et al. [84]. The compression is carried out in two stages: first, the namespace URIs in the document are dictionary encoded using numerical IDs; then, each URIref is encoded using the URI ID as a reference. Two dictionaries are thus created, one for the URIs and another for the URIrefs. The encoded data is then compressed further using an XML-specific compressor. Although this approach shows compression rates that are up to 39.5% better than Gzip, it is primarily aimed at compacting RDF/XML documents rather than providing an encoding that both reduces the size of the data and enables it to be processed in compressed format.
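The two-stage idea, in which each namespace is stored once and URIrefs refer to it by ID, can be sketched as follows. This is not the algorithm of [84]: the function names are hypothetical, and the rule of splitting a URIref at its last `#` or `/` is an assumption made purely for illustration.

```python
from typing import Dict, List, Tuple


def split_uriref(uriref: str) -> Tuple[str, str]:
    """Split into (namespace, local name) at the last '#' or '/' (assumed rule)."""
    cut = max(uriref.rfind("#"), uriref.rfind("/"))
    return uriref[: cut + 1], uriref[cut + 1 :]


def encode_urirefs(
    urirefs: List[str],
) -> Tuple[List[int], Dict[str, int], Dict[Tuple[int, str], int]]:
    """Stage 1 assigns namespace IDs; stage 2 keys each URIref by (namespace ID, local name)."""
    namespaces: Dict[str, int] = {}
    refs: Dict[Tuple[int, str], int] = {}
    encoded: List[int] = []
    for uriref in urirefs:
        namespace, local = split_uriref(uriref)
        # setdefault only stores the new ID when the key is not yet present.
        ns_id = namespaces.setdefault(namespace, len(namespaces))
        ref_id = refs.setdefault((ns_id, local), len(refs))
        encoded.append(ref_id)
    return encoded, namespaces, refs
```

With this structure a long namespace string appears once in the namespace dictionary regardless of how many URIrefs share it, which is the redundancy that a flat term dictionary leaves in place.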