
In this section, we will look more closely at the Hadoop MapReduce dataflow, its terminology, and the Java classes involved. In the MapReduce dataflow figure in the previous section, multiple nodes are connected across the network to perform distributed processing in a Hadoop setup. The following attributes of the Map and Reduce phases play an important role in producing the final output.

The attributes of the Map phase are as follows:

• The InputFiles term refers to the raw input datasets that have been created or extracted for business analytics and stored in HDFS. These input files are very large and come in several formats.

• The InputFormat is a Java class that defines how the input data files are split and read. For text input, each record consists of the byte offset of a line and the contents of that line. Several input formats can be set for the job, such as TextInputFormat, KeyValueTextInputFormat, and SequenceFileInputFormat.

Chapter 2


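• The InputSplit class represents one split of the input data, that is, the portion handled by a single Map task. The size of each data split is determined through the InputFormat.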

• The RecordReader is a Java class that provides methods to retrieve keys and values by iterating over the records in a data split. It also includes methods that report the current progress.

• The Mapper instance is created for the Map phase. The Mapper class takes input (key, value) pairs (generated by RecordReader) and produces intermediate (key, value) pairs by running user-defined code in a Map() method. The Map() method takes two main input parameters, key and value; the remaining ones are OutputCollector and Reporter. OutputCollector passes the intermediate (key, value) pairs to the Reduce phase of the job. Reporter periodically provides the status of the current job to JobTracker, which aggregates these reports for later retrieval when the job ends.
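The Map-phase attributes above can be illustrated without a Hadoop installation. The following is a minimal plain-Java sketch, not Hadoop's API: the class name MapPhaseSketch and the methods readRecords and map are hypothetical stand-ins for RecordReader and the Map() method. It assumes TextInputFormat-style records, where the key is the byte offset of a line and the value is the text of that line, and it assumes a word-count style Map() that emits (word, 1) pairs.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; none of these names belong to Hadoop.
class MapPhaseSketch {

    // Stand-in for RecordReader: turns one split of text into (offset, line) records.
    static List<Map.Entry<Long, String>> readRecords(String split) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : split.split("\n")) {
            records.add(new SimpleEntry<>(offset, line));
            // Advance by the line's byte length plus one byte for the newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    // Stand-in for Map(): emits one intermediate (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(long key, String value) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : value.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Each record is mapped independently of the others.
        for (Map.Entry<Long, String> record : readRecords("big data\nbig cluster")) {
            System.out.println(record.getKey() + " -> " + map(record.getKey(), record.getValue()));
        }
    }
}
```

In a real job, the collected pairs would be handed to OutputCollector rather than returned as a list; the list is used here only to keep the sketch self-contained.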

The attributes of the Reduce phase are as follows:

• After the Map phase completes, the generated intermediate (key, value) pairs are partitioned by applying a hash function to the key. Each Map task may emit (key, value) pairs to any partition, but all values for the same key are always reduced together, regardless of which Mapper produced them. The MapReduce job performs this partitioning and shuffling automatically after the Map phase completes; there is no need to call them separately. If the job requires it, we can also explicitly override their logic.

• After partitioning and shuffling, and before the Reduce task starts, the Hadoop MapReduce job sorts the intermediate (key, value) pairs by key.

• The Reduce instance is created for the Reduce phase. It is a section of user-provided code that performs the Reduce task. Like the Map() function, the Reduce() method of the Reducer class takes two main parameters along with the OutputCollector and Reporter objects. OutputCollector has the same functionality in both Map and Reduce, but in the Reduce phase it either passes output to the next Map phase (when several Map and Reduce jobs are combined) or reports it as the final output of the job, depending on the requirement. Reporter periodically reports the current status of the running task to JobTracker.

Writing Hadoop MapReduce Programs

• Finally, the generated output (key, value) pairs are passed to the OutputCollector parameter and then written to OutputFiles, as governed by OutputFormat. OutputFormat controls the format of the OutputFiles as defined in the MapReduce Driver. The format is chosen from TextOutputFormat, SequenceFileOutputFormat, or NullOutputFormat.

• RecordWriter is used by OutputFormat to write the output data in the appropriate format.

• The output files are the output data written to HDFS by RecordWriter after the completion of the MapReduce job.
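The partitioning, sorting, and reducing steps listed above can be sketched in plain Java. This is an illustration under stated assumptions, not Hadoop code: ReducePhaseSketch and its methods are hypothetical names, and the reduce step assumes a word-count style sum. The partition formula mirrors the one used by Hadoop's default HashPartitioner, which masks the sign bit of the key's hash code and takes the remainder by the number of Reduce tasks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; class and method names are hypothetical.
class ReducePhaseSketch {

    // Picks a reducer for a key: mask the sign bit, then take the remainder.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Shuffle and sort: group values by key within each partition.
    // A TreeMap keeps the keys of each partition in sorted order.
    static List<TreeMap<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> pairs, int numReduceTasks) {
        List<TreeMap<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            partitions.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> pair : pairs) {
            partitions.get(partition(pair.getKey(), numReduceTasks))
                      .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
        }
        return partitions;
    }

    // Stand-in for Reduce(): sums all values collected for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = List.of(
            Map.entry("big", 1), Map.entry("data", 1),
            Map.entry("big", 1), Map.entry("cluster", 1));
        List<TreeMap<String, List<Integer>>> partitions = shuffle(pairs, 2);
        for (int i = 0; i < partitions.size(); i++) {
            for (Map.Entry<String, List<Integer>> e : partitions.get(i).entrySet()) {
                System.out.println("reducer " + i + ": " + e.getKey() + "\t"
                    + reduce(e.getKey(), e.getValue()));
            }
        }
    }
}
```

Because the partition depends only on the key, both occurrences of "big" reach the same reducer no matter which Mapper emitted them, which is the guarantee described in the first bullet of the Reduce phase.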

To run this MapReduce job efficiently, we need to have some knowledge of Hadoop shell commands to perform administrative tasks. Refer to the following table:

Shell command    Usage and code sample

cat              To copy source paths to stdout:
                 hadoop fs -cat URI [URI …]

chmod            To change the permissions of files:
                 hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI …]

copyFromLocal    To copy a file from local storage to HDFS:
                 hadoop fs -copyFromLocal <localsrc> URI

copyToLocal      To copy a file from HDFS to local storage:
                 hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

cp               To copy a file from the source to the destination in HDFS:
                 hadoop fs -cp URI [URI …] <dest>

du               To display the aggregate length of a file:
                 hadoop fs -du URI [URI …]

dus              To display the summary of file length:
                 hadoop fs -dus <args>

get              To copy files to a local filesystem:
                 hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>

ls               To list all files in the current directory in HDFS:
                 hadoop fs -ls <args>

mkdir            To create a directory in HDFS:

mv               To move files from the source to the destination:
                 hadoop fs -mv URI [URI …] <dest>

rmr              To remove files from the current directory:
                 hadoop fs -rmr URI [URI …]

setrep           To change the replication factor of a file:
                 hadoop fs -setrep [-R] <path>

tail             To display the last kilobyte of a file to stdout:
                 hadoop fs -tail [-f] URI

Writing a Hadoop MapReduce example

Now we will move forward with MapReduce by working through a very common and simple example: word count. The goal of this example is to calculate how many times each word occurs in the provided documents. These documents serve as the input files for the MapReduce job.

In this example, we already have a set of text files, and we want to identify the frequency of all the unique words in those files. We will get this by designing the Hadoop MapReduce phases.

In this section, we will see more on Hadoop MapReduce programming using Hadoop MapReduce's old API. Here we assume that the reader has already set up the Hadoop environment as described in Chapter 1, Getting Ready to Use R and Hadoop. Also, keep in mind that we are not going to use R to count words; only Hadoop will be used here.

Basically, Hadoop MapReduce has three main objects: Mapper, Reducer, and Driver. They can be developed as three Java classes: the Map class, the Reduce class, and the Driver class. The Map class implements the Map phase, the Reduce class implements the Reduce phase, and the Driver class contains the main() method that initializes the Hadoop MapReduce program.

In the previous section of Hadoop MapReduce fundamentals, we already discussed what Mapper, Reducer, and Driver are. Now, we will learn how to define them and program for them in Java. In upcoming chapters, we will be learning to do more with a combination of R and Hadoop.


Many languages and frameworks can be used to build MapReduce programs, and each has different strengths.

There are multiple factors that, when tuned, affect the latency of MapReduce jobs. Refer to the article 10 MapReduce Tips by Cloudera at http://blog.cloudera.com/blog/2009/05/10-mapreduce-tips/.

To make MapReduce development easier, use Eclipse configured with Maven, which supports the old MapReduce API.

Understanding the steps to run a