• No se han encontrado resultados

ESPECIFICACIONES TÉCNICAS

ESPECIFICACIONES TÉCNICAS OBRA:

J. Medidas de Seguridad y Limpieza:

04.02 PAVIMENTO FLEXIBLE

04.03.02 OBRAS DE CONCRETO

04.03.02.01 ENCOFRADO Y DESENCOFRADO NORMAL EN SARDINEL

Today, we live in the age of big data, where the data volumes we need to work with, demand more and more storage and processing capabilities. Big data come with two initial challenges:

 How to store and work with extremely large data  How to understand data and turn it into advantage.

33 Hadoop helps the market by storing and providing computational capabilities over massive amounts of data. It provides and supports the development of open source software

that supplies a framework for the development of highly scalable distributed computing applications. [39] Furthermore, the Hadoop framework deals with the processing details, leaving the developers free to focus on the development of applications logic.

In a few words, Hadoop is a distributed system made up of a distributed file system and it

offers a way to parallelize and execute programs on a cluster of machines.

In Figure 7, we are able to see how one interacts with a Hadoop cluster. In general, a Hadoop cluster is a set of commodity machines which are networked together in a single location. Data storage and processing take place within this region of machines, while different users are able to submit jobs which require computational processes to Hadoop, using different machines (clients) in remote locations from the Hadoop cluster [40], [41], [42], [48].

Figure 7 -Hadoop cluster

In other words, Hadoop is an open source framework for writing and running distributed applications that process and store massive amounts of data and can run simultaneously. In Figure 8, we can see the two main components of Hadoop that help into the idea of distributed computation and storage (they will be described later). Hadoop, changes the way that data is generally managed and processed, by leveraging

the power of computing resources composed of commodity hardware, while it can automatically recover from failures [50].

34

Figure 8 - Hadoop's components

The differences of Hadoop as for the distributed computing in general, that makes it so different and special, are listed below. In general, as for its computational techniques, Hadoop offers:

Accessibility: Hadoop runs on large clusters of machines, or on cloud computing

services, making it easily accessible by every kind of client.

Robustness: As Hadoop is intended to run on commodity hardware, it is designed

with the idea of having various hardware malfunctions, so it can manipulate failures in a better way.

Scalability: Hadoop scales and can be configured, in order to be able to handle

larger data by adding multiple nodes to the cluster.

Simplicity: Hadoop allows users to quickly write efficient and effective parallel

code, making it even easier to test it, by using its distributed computational techniques and storage.

Hadoop, as shown in Figure 9, is generally a distributed master-slave architecture that consists of the Hadoop distributed file system (HDFS) for storage and the MapReduce framework for computing [38], [42].

35

Figure 9 - Master-slave architecture of Hadoop

HDFS is the storage component of Hadoop. It is a distributed file system which is

developed for having high throughput and works best when it interacts with large files. In order to support this throughput, HDFS creates multiple block sizes and data locality

optimizations, so to reduce network input/output [40]. HDFS, also offers scalability and

availability, which are achieved because data is split into smaller parts and HDFS can manipulate various types of errors. In short, HDFS replicates files for a configured number

of times, it is tolerant of both software and hardware failure and automatically re-replicates data blocks on nodes that have failed [51].In Figure 10, we can see a logical representation of the components in HDFS, the NameNode and the DataNode, which will be described later. It also shows an application that is using the Hadoop file system library, in order to access the HDFS.

36

MapReduce is a distributed computing framework which allows the parallelization of

the work over a large amount of raw data. This type of work, which could take a long time to finish using sequential programming techniques, can be completed in a few minutes using MapReduce, by working on a Hadoop cluster. The MapReduce abstracts away the complexities which someone faces with when working with distributed systems (e.g. computational parallelization, work distribution). In that way, MapReduce allows the programmer to focus on the application development, by splitting work submitted by a client, into small parallelized map - reduce workers (Figure 11).

Figure 11 - MapReduce paradigm

In general, MapReduce provides a model which transforms complex computations into computations over a set of <key, value> pairs. It schedules and monitors the MapReduce jobs and is responsible for executing again the failed tasks.

So, the role of the programmer becomes much easier, as he has only to define the map and the reduce functions, where the map function outputs <key, value> pairs which are then processed by the reduce functions according to the keys of these pairs, in order to produce the final output.

More details about how HDFS and MapReduce work, will be mentioned in the following chapters of the thesis [41], [42], [45], [46], [47].