• No se han encontrado resultados

We have talked about files being broken into splits as part of the job startup and the data in a split being sent to the mapper implementation. However, this overlooks two aspects: how the data is stored in the file and how the individual keys and values are passed to the mapper structure.

InputFormat and RecordReader

Hadoop has the concept of an InputFormat for the first of these responsibilities. The InputFormat abstract class in the org.apache.hadoop.mapreduce

package provides two methods as shown in the following code:

public abstract class InputFormat<K, V> {

public abstract List<InputSplit> getSplits( JobContext context) ; RecordReader<K, V> createRecordReader(InputSplit split,

TaskAttemptContext context) ; }

These methods display the two responsibilities of the InputFormat class:

‹ To provide the details on how to split an input file into the splits required for

map processing

‹ To create a RecordReader class that will generate the series of key/value

pairs from a split

The RecordReader class is also an abstract class within the org.apache.hadoop. mapreduce package:

public abstract class RecordReader<Key, Value> implements Closeable {

public abstract void initialize(InputSplit split, TaskAttemptContext context) ;

public abstract boolean nextKeyValue() throws IOException, InterruptedException ; public abstract Key getCurrentKey() throws IOException, InterruptedException ; public abstract Value getCurrentValue() throws IOException, InterruptedException ; public abstract float getProgress() throws IOException, InterruptedException ; public abstract close() throws IOException ; }

A RecordReader instance is created for each split and calls getNextKeyValue to return a Boolean indicating if another key/value pair is available and if so, the getKey and getValue methods are used to access the key and value respectively.

The combination of the InputFormat and RecordReader classes therefore are all that is required to bridge between any kind of input data and the key/value pairs required by MapReduce.

Hadoop-provided InputFormat

There are some Hadoop-provided InputFormat implementations within the org.apache. hadoop.mapreduce.lib.input package:

‹ FileInputFormat: This is an abstract base class that can be the parent of any

file-based input

‹ SequenceFileInputFormat: This is an efficient binary file format that will be

discussed in an upcoming section

‹ TextInputFormat: This is used for plain text files

The pre-0.20 API has additional InputFormats defined in the org. apache.hadoop.mapred package.

Note that InputFormats are not restricted to reading from files;

FileInputFormat is itself a subclass of InputFormat. It is possible

to have Hadoop use data that is not based on the files as the input to MapReduce jobs; common sources are relational databases or HBase.

Hadoop-provided RecordReader

Similarly, Hadoop provides a few common RecordReader implementations, which are also

present within the org.apache.hadoop.mapreduce.lib.input package:

‹ LineRecordReader: This implementation is the default RecordReader class for text files that present the line number as the key and the line contents as the value

‹ SequenceFileRecordReader: This implementation reads the key/value from the

binary SequenceFile container

Again, the pre-0.20 API has additional RecordReader classes in the org.apache.hadoop. mapred package, such as KeyValueRecordReader, that have not yet been ported to the

OutputFormat and RecordWriter

There is a similar pattern for writing the output of a job coordinated by subclasses of OutputFormat and RecordWriter from the org.apache.hadoop.mapreduce

package. We'll not explore these in any detail here, but the general approach is similar, though OutputFormat does have a more involved API as it has methods for tasks such as validation of the output specification.

It is this step that causes a job to fail if a specified output directory already exists. If you wanted different behavior, it would require a subclass of

OutputFormat that overrides this method.

Hadoop-provided OutputFormat

The following OutputFormats are provided in the org.apache.hadoop.mapreduce. output package:

‹ FileOutputFormat: This is the base class for all file-based OutputFormats

‹ NullOutputFormat: This is a dummy implementation that discards the output and

writes nothing to the file

‹ SequenceFileOutputFormat: This writes to the binary SequenceFile format ‹ TextOutputFormat: This writes a plain text file

Note that these classes define their required RecordWriter implementations as inner

classes so there are no separately provided RecordWriter implementations.

Don't forget Sequence files

The SequenceFile class within the org.apache.hadoop.io package provides an efficient binary file format that is often useful as an output from a MapReduce job. This is especially true if the output from the job is processed as the input of another job. The

Sequence files have several advantages, as follows:

‹ As binary files, they are intrinsically more compact than text files

‹ They additionally support optional compression, which can also be applied at

different levels, that is, compress each record or an entire split

This last characteristic is important, as most binary formats—particularly those that are compressed or encrypted—cannot be split and must be read as a single linear stream of data. Using such files as input to a MapReduce job means that a single mapper will be used to process the entire file, causing a potentially large performance hit. In such a situation, it is preferable to either use a splitable format such as SequenceFile, or, if you cannot avoid receiving the file in the other format, do a preprocessing step that converts it into a splitable format. This will be a trade-off, as the conversion will take time; but in many cases—especially with complex map tasks—this will be outweighed by the time saved.

Summary

We have covered a lot of ground in this chapter and we now have the foundation to explore MapReduce in more detail. Specifically, we learned how key/value pairs is a broadly applicable data model that is well suited to MapReduce processing. We also learned how to write mapper and reducer implementations using the 0.20 and above versions of the Java API.

We then moved on and saw how a MapReduce job is processed and how the map

and reduce methods are tied together by significant coordination and task-scheduling machinery. We also saw how certain MapReduce jobs require specialization in the form of a custom partitioner or combiner.

We also learned how Hadoop reads data to and from the filesystem. It uses the concept of InputFormat and OutputFormat to handle the file as a whole and RecordReader and RecordWriter to translate the format to and from key/value pairs.

With this knowledge, we will now move on to a case study in the next chapter, which demonstrates the ongoing development and enhancement of a MapReduce application that processes a large data set.

4

Developing MapReduce Programs

Now that we have explored the technology of MapReduce, we will spend this chapter looking at how to put it to use. In particular, we will take a more substantial dataset and look at ways to approach its analysis by using the tools provided by MapReduce.

In this chapter we will cover the following topics:

‹ Hadoop Streaming and its uses ‹ The UFO sighting dataset

‹ Using Streaming as a development/debugging tool ‹ Using multiple mappers in a single job

‹ Efficiently sharing utility files and data across the cluster

‹ Reporting job and task status and log information useful for debugging

Throughout this chapter, the goal is to introduce both concrete tools and ideas about how to approach the analysis of a new data set. We shall start by looking at how to use scripting programming languages to aid MapReduce prototyping and initial analysis. Though it may seem strange to learn the Java API in the previous chapter and immediately move to different languages, our goal here is to provide you with an awareness of different ways to approach the problems you face. Just as many jobs make little sense being implemented in anything but the Java API, there are other situations where using another approach is best suited. Consider these techniques as new additions to your tool belt and with that experience you will know more easily which is the best fit for a given scenario.