Imagine that you’re working with large text files that, even when compressed, are many times larger than the HDFS block size. To avoid having one map task process an entire large compressed file, you’ll need to pick a compression codec that can support splitting that file.
LZOP fits the bill, but working with it is more complex than the examples detailed in the previous technique because LZOP is not in and of itself splittable. “Wait,” you may be thinking, “didn’t you state earlier that LZOP is splittable?” LZOP is block-based, but you can’t perform a random seek into an LZOP file and determine the next block’s starting point. This is the challenge we’ll tackle in this technique.
■ Problem
You want to use a compression codec that will allow MapReduce to work in parallel on a single compressed file.
■ Solution
In MapReduce, splitting large LZOP-compressed input files requires the use of LZOP-specific input format classes, such as LzoInputFormat. The same principle applies when working with LZOP-compressed input files in both Pig and Hive.
■ Discussion
The LZOP compression codec is one of only two codecs that allow compressed files to be split, and therefore to be worked on in parallel by multiple map tasks. The other codec, bzip2, suffers from compression times that are so slow they arguably render the codec unusable. LZOP also offers a good compromise between compression ratio and speed.
What’s the difference between LZO and LZOP? Both LZO and LZOP codecs are supplied for use with Hadoop. LZO is a stream-based compression store that doesn’t have the notion of blocks or headers. LZOP has the notion of blocks (that are checksummed), and therefore is the codec you want to use, especially if you want your compressed output to be splittable. Confusingly, the Hadoop codecs by default treat files ending with the .lzo extension as LZOP-encoded, and files ending with the .lzo_deflate extension as LZO-encoded. Also, much of the documentation seems to use LZO and LZOP interchangeably.
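The extension convention is easy to get backwards, so here is a minimal sketch of it in plain Java. It is only an illustration of the mapping described above, not Hadoop's own codec lookup; the class and method names are hypothetical, and the LzoCodec class name is my assumption about the stream-based codec's name in the LZO library (only LzopCodec is named in this chapter).

```java
// Hypothetical helper that mirrors the default extension convention
// described in the text; it does not call into Hadoop.
class LzoExtensionSketch {

    /** Returns the codec class name implied by the file extension, or null if neither applies. */
    static String codecFor(String filename) {
        if (filename.endsWith(".lzo")) {
            // Block-based LZOP, despite the extension.
            return "com.hadoop.compression.lzo.LzopCodec";
        }
        if (filename.endsWith(".lzo_deflate")) {
            // Stream-based LZO (class name assumed, see the lead-in).
            return "com.hadoop.compression.lzo.LzoCodec";
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(codecFor("core-site.xml.lzo"));
        System.out.println(codecFor("core-site.xml.lzo_deflate"));
    }
}
```

The point to take away is that a .lzo file is what the lzop command-line tool understands, whereas a .lzo_deflate file is not.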
TECHNIQUE 32 Splittable LZOP with MapReduce, Hive, and Pig
Preparing your cluster for LZOP
Unfortunately, Hadoop doesn’t bundle LZOP for licensing reasons.20
Getting all the prerequisites compiled and installed on your cluster is laborious, but the appendix contains detailed instructions. You’ll need to follow them before you can compile and run the code in this section.
Reading and writing LZOP files in HDFS
We covered how to read and write compressed files in section 4.2. To perform the same activity with LZOP requires you to specify the LZOP codec in your code. This code is shown in the following listing.21
Listing 4.3 Methods to read and write LZOP files in HDFS

    // Compresses an HDFS file with LZOP and returns the path of the new .lzo file.
    public static Path compress(Path src,
                                Configuration config) throws IOException {
      LzopCodec codec = new LzopCodec();
      codec.setConf(config);
      Path destFile = new Path(
          src.toString() + codec.getDefaultExtension());
      FileSystem hdfs = FileSystem.get(config);
      InputStream is = null;
      OutputStream os = null;
      try {
        is = hdfs.open(src);
        // Wrap the HDFS output stream so that writes are LZOP-compressed.
        os = codec.createOutputStream(hdfs.create(destFile));
        IOUtils.copyBytes(is, os, config);
      } finally {
        IOUtils.closeStream(os);
        IOUtils.closeStream(is);
      }
      return destFile;
    }

    // Decompresses an LZOP-compressed HDFS file into dest.
    public static void decompress(Path src, Path dest,
                                  Configuration config) throws IOException {
      LzopCodec codec = new LzopCodec();
      codec.setConf(config);
      FileSystem hdfs = FileSystem.get(config);
      InputStream is = null;
      OutputStream os = null;
      try {
        // Wrap the HDFS input stream so that reads are decompressed.
        is = codec.createInputStream(hdfs.open(src));
        os = hdfs.create(dest);
        IOUtils.copyBytes(is, os, config);
      } finally {
        IOUtils.closeStream(os);
        IOUtils.closeStream(is);
      }
    }

20 LZOP used to be included with Hadoop, but with the work performed in JIRA ticket https://issues.apache.org/jira/browse/HADOOP-4874, it was removed from Hadoop version 0.20 and newer releases due to LZOP’s GPL licensing limiting its redistribution.
21 GitHub source: https://github.com/alexholmes/hiped2/blob/master/src/main/java/hip/ch4/LzopFileReadWrite.java.
Let’s write and read an LZOP file, and then make sure that LZOP utilities can work with the generated file (replace $HADOOP_CONF_DIR with the location of your Hadoop config directory):
    $ hadoop fs -put $HADOOP_CONF_DIR/core-site.xml core-site.xml
    $ hip hip.ch4.LzopFileReadWrite core-site.xml
The preceding code will generate a core-site.xml.lzo file in HDFS.
Now make sure you can use this LZOP file with the lzop binary. Install an lzop binary on your host.22 Copy the LZOP file from HDFS to local disk, uncompress it with the native lzop binary, and compare it with the original file:
    $ hadoop fs -get core-site.xml.lzo /tmp/core-site.xml.lzo
    $ lzop -l /tmp/core-site.xml.lzo
    method      compressed  uncompr. ratio uncompressed_name
    LZO1X-1            454       954  47.6% core-site.xml
    $ cd /tmp
    $ lzop -d core-site.xml.lzo
    $ ls -ltr
    -rw-r--r-- 1 aholmes aholmes 954 May 5 09:05 core-site.xml
    -rw-r--r-- 1 aholmes aholmes 504 May 5 09:05 core-site.xml.lzo
    $ diff core-site.xml $HADOOP_CONF_DIR/core-site.xml
    $
The diff verified that the file compressed with the LZOP codec could be decompressed with the lzop binary.
Now that you have your LZOP file, you need to index it so that it can be split.
Creating indexes for your LZOP files
Earlier I made the paradoxical statement that LZOP files can be split, but that they’re not natively splittable. Let me clarify what that means: the lack of block-delimiting synchronization markers means you can’t do a random seek into an LZOP file and start reading blocks. But because internally it does use blocks, all you need is a preprocessing step that can generate an index file containing the block offsets.
The LZOP file is read in its entirety, and block offsets are written to the index file as the read is occurring. The index file format, shown in figure 4.6, is a binary file containing a consecutive series of 64-bit numbers that indicate the byte offset for each block in the LZOP file.
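To make the index layout concrete, the following is a minimal sketch that writes and reads a series of 64-bit offsets using only the JDK. It is not the LZO library's indexer: the class and method names are hypothetical, the offsets are made-up values, and big-endian byte order (what DataOutputStream.writeLong produces) is an assumption on my part rather than something stated in this chapter.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of an index made of consecutive 64-bit offsets.
class LzopIndexSketch {

    /** Writes each block offset as one 8-byte value (big-endian, by assumption). */
    static void writeOffsets(long[] offsets, OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        for (long offset : offsets) {
            data.writeLong(offset);
        }
        data.flush();
    }

    /** Reads 8-byte offsets until the stream is exhausted. */
    static List<Long> readOffsets(InputStream in) throws IOException {
        List<Long> offsets = new ArrayList<>();
        DataInputStream data = new DataInputStream(in);
        while (true) {
            try {
                offsets.add(data.readLong());
            } catch (EOFException endOfIndex) {
                // No more complete 8-byte entries.
                return offsets;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up offsets for three blocks; real values come from scanning the LZOP file.
        long[] blockOffsets = {38L, 65_574L, 131_110L};
        ByteArrayOutputStream index = new ByteArrayOutputStream();
        writeOffsets(blockOffsets, index);
        System.out.println("index size in bytes: " + index.size());
        System.out.println(readOffsets(new ByteArrayInputStream(index.toByteArray())));
    }
}
```

Because every entry is a fixed eight bytes, the number of blocks is simply the index size divided by eight, and the Nth block's offset sits at byte 8 × (N − 1) of the index.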
22 For RedHat and CentOS, you can install the rpm from http://pkgs.repoforge.org/lzop/lzop-1.03-1.el5.rf.x86_64.rpm.
You can create index files in one of two ways, as shown in the following two code snippets. If you want to create an index file for a single LZOP file, here is a simple library call that will do this for you:
shell$ hadoop com.hadoop.compression.lzo.LzoIndexer core-site.xml.lzo
The following option works well if you have a large number of LZOP files and you want a more efficient way to generate the index files. The indexer runs a MapReduce job to create the index files. Both files and directories (which are scanned recursively for LZOP files) are supported:
shell$ hadoop \
    com.hadoop.compression.lzo.DistributedLzoIndexer \
    core-site.xml.lzo
Both approaches will generate an index file, in the format depicted in figure 4.6, in the same directory as the LZOP file. The index filename is the original LZOP filename suffixed with .index. Running the previous commands would yield the filename core-site.xml.lzo.index.
Now let’s look at how you can use the LzoIndexer in your Java code. The following code (from the main method of LzoIndexer) will result in the index file being created synchronously:
    LzoIndexer lzoIndexer = new LzoIndexer(new Configuration());
    for (String arg : args) {
      try {
        // Index the file (or directory) named by this argument.
        lzoIndexer.index(new Path(arg));
      } catch (IOException e) {
        LOG.error("Error indexing " + arg, e);
      }
    }
With the DistributedLzoIndexer, the MapReduce job will launch and run with N mappers, one for each .lzo file. No reducers are run, so the (identity) mapper, via the custom LzoSplitInputFormat and LzoIndexOutputFormat, writes the index files directly.
Figure 4.6 An LZOP index file is a binary file containing a consecutive series of 64-bit numbers. (The entry at byte 0 of the index file holds the offset of block 1 in the LZOP file, the entry at byte 8 holds the offset of block 2, the entry at byte 16 holds the offset of block 3, and so on.)
If you want to run the MapReduce job from your own Java code, you can use the DistributedLzoIndexer code as an example.
You need the LZOP index files so that you can split LZOP files in your MapReduce, Pig, and Hive jobs. Now that you have the aforementioned LZOP index files, let’s look at how you can use them with MapReduce.
MapReduce and LZOP
After you’ve created index files for your LZOP files, it’s time to start using your LZOP files with MapReduce. Unfortunately, this brings us to the next challenge: none of the existing, built-in Hadoop-file-based input formats will work with splittable LZOP because they need specialized logic to handle input splits using the LZOP index file. You need specific input format classes to work with splittable LZOP.
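To give a feel for the kind of specialized logic involved, here is a simplified sketch of how a nominal byte-range split could be snapped to block boundaries using the offsets from an index file. This is my own illustration, not the LZOP library's implementation; the class name, method names, and sample offsets are all hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration of aligning a byte-range split to indexed block starts.
class SplitAlignmentSketch {

    /** Returns the first indexed block offset at or after pos, or -1 if there is none. */
    static long nextBlockStart(long[] sortedBlockOffsets, long pos) {
        int i = Arrays.binarySearch(sortedBlockOffsets, pos);
        if (i < 0) {
            i = -(i + 1); // convert "not found" result to the insertion point
        }
        return i < sortedBlockOffsets.length ? sortedBlockOffsets[i] : -1L;
    }

    /**
     * Moves both ends of the nominal split [start, end) forward to block starts.
     * Returns {alignedStart, alignedEnd}, or null if no block begins inside the split.
     */
    static long[] alignSplit(long[] sortedBlockOffsets, long start, long end, long fileLength) {
        long alignedStart = nextBlockStart(sortedBlockOffsets, start);
        if (alignedStart < 0 || alignedStart >= end) {
            return null; // the blocks here belong to a neighboring split
        }
        long alignedEnd = nextBlockStart(sortedBlockOffsets, end);
        if (alignedEnd < 0) {
            alignedEnd = fileLength; // the last split runs to the end of the file
        }
        return new long[] {alignedStart, alignedEnd};
    }

    public static void main(String[] args) {
        long[] offsets = {40L, 1000L, 2000L, 3000L}; // made-up block offsets
        System.out.println(Arrays.toString(alignSplit(offsets, 0L, 1500L, 3500L)));
        System.out.println(Arrays.toString(alignSplit(offsets, 1500L, 3500L, 3500L)));
    }
}
```

With the made-up offsets above, the two nominal halves of the file become [40, 2000) and [2000, 3500), so every block is read by exactly one split. A plain file-based input format has no such index to consult, which is why it can't start reading in the middle of an LZOP file.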
The LZOP library provides an LzoTextInputFormat implementation for line-oriented LZOP-compressed text files with accompanying index files.23
The following code shows the steps required to configure the MapReduce job to work with LZOP. You would perform these steps for a MapReduce job that had text LZOP inputs and outputs:24
    job.setInputFormatClass(LzoTextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.getConfiguration().setBoolean("mapred.output.compress", true);
    job.getConfiguration().setClass("mapred.output.compression.codec",
        LzopCodec.class, CompressionCodec.class);
Compressing intermediary map output will also speed up the overall execution time of your MapReduce jobs:
    conf.setBoolean("mapred.compress.map.output", true);
    conf.setClass("mapred.map.output.compression.codec",
        LzopCodec.class, CompressionCodec.class);
You can easily configure your cluster to always compress your map output by editing mapred-site.xml:

    <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.map.output.compression.codec</name>
      <value>com.hadoop.compression.lzo.LzopCodec</value>
    </property>
The number of splits per LZOP file is a function of the number of LZOP blocks that the file occupies, not the number of HDFS blocks that the file occupies.
23 The LZOP input formats also work with LZOP files that don’t have index files, but such files can’t be split.
24 GitHub source: https://github.com/alexholmes/hiped2/blob/master/src/main/java/hip/ch4/LzopMapReduce.java.
Now that we’ve covered MapReduce, let’s look at how Pig and Hive can work with splittable LZOP.
Pig and Hive
Elephant Bird,25 a Twitter project containing utilities to work with LZOP, provides a number of useful MapReduce and Pig classes. Elephant Bird has an LzoPigStorage class that works with text-based, LZOP-compressed data in Pig.
Hive can work with LZOP-compressed text files by using the com.hadoop.mapred.DeprecatedLzoTextInputFormat input format class found in the LZO library.
■ Summary
Working with splittable compression in Hadoop is tricky. If you’re fortunate enough to be able to store your data in Avro or Parquet, they offer the simplest way to work with files that can be easily compressed and split. If you want to compress other file formats and need them to be split, LZOP is the only real candidate.
As I mentioned earlier, the Elephant Bird project provides some useful LZOP input formats that will work with LZOP-compressed file formats such as XML and plain text. If you need to work with an LZOP-compressed file format that isn’t supported by either Todd Lipcon’s LZO project or Elephant Bird, you’ll have to write your own input format. This is a big hurdle for developers. I hope at some point Hadoop will be able to support compressed files with custom splitting logic so that end users don’t have to write their own input formats for compression.
Compression is likely to be a hard-and-fast requirement for any production environment where resources are always scarce. Compression also allows faster execution times for your computational jobs, so it’s a compelling aspect of storage. In the previous section I showed you how to evaluate and pick the codec best suited for your data. We also covered how to use compression with HDFS, MapReduce, Pig, and Hive. Finally, we tackled the tricky subject of splittable LZOP compression.
4.3 Chapter summary
Big data in the form of large numbers of small files brings to light a limitation in HDFS, and in this chapter we worked around it by looking at how you can package small files into larger Avro containers.
Compression is a key part of any large cluster, and we evaluated and compared the different compression codecs. I recommended codecs based on various criteria and also showed you how to compress and work with these compressed files in MapReduce, Pig, and Hive. We also looked at how to work with LZOP to achieve compression as well as blazing-fast computation with multiple input splits.
This and the previous chapter were dedicated to looking at techniques for picking the right file format and working effectively with big data in MapReduce and HDFS. It’s now time to apply this knowledge and look at how to move data in and out of Hadoop. That’s covered in the next chapter.