The MapReduce framework tries to execute a task four times before it marks the job as failed. It attributes a task failure to hardware, the network, or some other cause beyond the task’s control and assumes that the task will run successfully on a subsequent attempt. This approach works as long as the data itself is not causing the failure. If a bad record is the cause, the task fails at the same place, for the same reason, on every attempt.
Depending on the analysis you are doing (for example, analysis of a large volume of data), skipping or ignoring a few records might be acceptable. In that case, you might write your Mapper or Reducer to handle bad records gracefully during processing, so that no runtime exception is raised and the task does not fail. In some cases, however, such as when you are using a third-party library or don’t have access to the source code, that might not be possible. You can then instruct the MapReduce framework to automatically skip bad records on subsequent task retries.
By default, skipping mode is disabled because of the extra overhead and bookkeeping it requires. You can enable it by using the SkipBadRecords class in the Mapper, the Reducer, or both. When it is enabled, the first two task failures are treated as normal failures. On the third failure, the MapReduce framework captures the data that caused the failure. On the fourth attempt, the MapReduce framework skips the identified data to avoid failing again. The skipped records are saved in sequence file format in the job’s output directory under the _logs/skip subdirectory, for later analysis as needed.
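The retry-then-skip sequence described above can be illustrated with a small simulation in plain Java. This is only a sketch of the behavior, not Hadoop code: the class SkipSimulation, its methods, and the use of Integer.parseInt as the failing "third-party" logic are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy simulation of the retry-then-skip behavior described above.
// Not Hadoop code: SkipSimulation and its methods are hypothetical names.
class SkipSimulation {
    static final int MAX_ATTEMPTS = 4;

    // Stand-in for third-party logic that throws on malformed input.
    static int parse(String record) {
        return Integer.parseInt(record);
    }

    // Runs the "task" up to MAX_ATTEMPTS times and returns the parsed values.
    // Records that end up being skipped are appended to skippedOut.
    static List<Integer> runTask(List<String> records, List<String> skippedOut) {
        Set<Integer> skipIndexes = new HashSet<>();
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            List<Integer> output = new ArrayList<>();
            int failedAt = -1;
            for (int i = 0; i < records.size(); i++) {
                if (skipIndexes.contains(i)) {
                    continue; // identified as bad on an earlier attempt
                }
                try {
                    output.add(parse(records.get(i)));
                } catch (NumberFormatException e) {
                    failedAt = i;
                    break; // the attempt fails as a whole
                }
            }
            if (failedAt < 0) {
                return output; // attempt succeeded
            }
            // Attempts 1 and 2 count as ordinary failures; from the third
            // failure on, remember which record caused the failure.
            if (attempt >= 3) {
                skipIndexes.add(failedAt);
                skippedOut.add(records.get(failedAt));
            }
        }
        throw new IllegalStateException("task failed after " + MAX_ATTEMPTS + " attempts");
    }
}
```

With the input "1", "2", "oops", "4", the first three attempts fail at the third record, and the fourth attempt skips it and succeeds. In this sketch, an input with more than one bad record would exhaust the four attempts before every bad record had been identified.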
Counter
The MapReduce framework provides built-in counters to gather application-level statistics or metrics that can be used for maintaining quality control, diagnosing a problem,
monitoring and tracking the progress of jobs, and more. Some of these built-in counters report various metrics for your job: for example, FILE_BYTES_READ,
FILE_BYTES_WRITTEN, NUM_KILLED_MAPS, NUM_KILLED_REDUCES,
NUM_FAILED_MAPS, and NUM_FAILED_REDUCES. The MapReduce framework increments their values automatically in response to the respective events.
Besides built-in counters, the MapReduce framework enables developers to define custom (user-defined) counters, in case the existing built-in counters do not meet your needs. You can increment (with a positive value as the parameter) or decrement (with a negative value as the parameter) the value of a counter inside the Mapper, Reducer, or Driver; counters are globally available to the job. The increment method is synchronized, to avoid race conditions when parallel code increments and decrements a counter value.
Note: Counter for Communication
Think of counters as lightweight objects that provide a simple mechanism for communication between a Driver, a Mapper, and a Reducer during execution. You can use counters to gather job-related statistics and metrics, and you can access them from anywhere (Mapper, Reducer, or Driver) in your application.
You can create and define the custom counter using a Java enum that contains the names of all the custom counters for your job and acts as a group of user-defined counters:
public enum MyCounters {
    MapperCounter,
    ReducerCounter
}
You can use the increment method of the context object to increment or decrement the value of a counter (in either the Mapper or the Reducer). The MapReduce framework then globally aggregates the value of the counter:
// Incrementing counter value by 1
context.getCounter(MyCounters.MapperCounter).increment(1);
// Decrementing counter value by 1
context.getCounter(MyCounters.MapperCounter).increment(-1);
Alongside the built-in counters, the values for your custom counters are displayed on a summary web page for your job. Alternatively, you can access them programmatically using the following statements:
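// findCounter returns a Counter object; getValue() reads its aggregated value
long mapperCounterValue =
    job.getCounters().findCounter(MyCounters.MapperCounter).getValue();
long reducerCounterValue =
    job.getCounters().findCounter(MyCounters.ReducerCounter).getValue();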
During execution, each task periodically sends its counters and their values to the TaskTracker. The TaskTracker, in turn, sends them to the JobTracker, where they are consolidated to provide a holistic view for the complete job.
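The consolidation step can be pictured with a toy model in plain Java. This is a sketch only and does not use the Hadoop API: TaskCounters and CounterAggregation are hypothetical names, and the aggregate method merely plays the role that the text assigns to the JobTracker.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Toy model of counter aggregation. Not the Hadoop API:
// TaskCounters and CounterAggregation are hypothetical names.
enum MyCounters { MapperCounter, ReducerCounter }

class TaskCounters {
    private final Map<MyCounters, Long> values = new EnumMap<>(MyCounters.class);

    // Synchronized so that concurrent threads in one task cannot lose updates.
    synchronized void increment(MyCounters counter, long delta) {
        values.merge(counter, delta, Long::sum);
    }

    synchronized long get(MyCounters counter) {
        return values.getOrDefault(counter, 0L);
    }
}

class CounterAggregation {
    // Plays the role of the JobTracker: sums each counter over all tasks.
    static Map<MyCounters, Long> aggregate(List<TaskCounters> tasks) {
        Map<MyCounters, Long> totals = new EnumMap<>(MyCounters.class);
        for (MyCounters c : MyCounters.values()) {
            long sum = 0;
            for (TaskCounters t : tasks) {
                sum += t.get(c);
            }
            totals.put(c, sum);
        }
        return totals;
    }
}
```

Each task keeps its own small map of values, and the job-level view is simply the per-counter sum, which is why every additional counter costs memory for every task that reports it.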
Because maintaining many counters incurs overhead (each counter requires additional memory in the JobTracker), the maximum number of counters is limited to 120 by default, so that no single job accidentally uses all the available JobTracker memory. You can change this limit, however, using the mapreduce.job.counters.limit property in the mapred-site.xml configuration file for a cluster-wide change; alternatively, you can set it in your job, which makes it apply to that job only.
<property>
    <name>mapreduce.job.counters.limit</name>
    <value>1200</value>
</property>
Note: Counters Come with an Overhead
Counters are a good means of tracking important global data at the application level. But counters are fairly expensive in terms of the memory needed to store them (counters from each map or reduce task) at the JobTracker level during the entire application execution cycle. Thus, they are not recommended for aggregating very fine-grained application statistics.
YARN
Hadoop 1.0 was designed and developed more than seven years ago. The need then was to process a massive amount of data in parallel, in a batch-oriented manner, distributed over hundreds or thousands of nodes. Times have changed, and hardware has become even cheaper. Companies today demand more horizontal scaling along with high availability, and they want to process data interactively as well as in batch mode. Hadoop 1.0 was not built for these demands; Hadoop 2.0 addresses them with its new architecture and components. Consider some of the limitations of Hadoop 1.0 with respect to the MapReduce processing framework, which the advances in Hadoop 2.0 address:
Single point of failure: The JobTracker, like the name node, is a single point of failure.
Limited to running MapReduce jobs on HDFS, no support for interactive queries, and an underutilized HDFS: The MapReduce component of Hadoop 1.0 was designed for batch-mode processing and had no true support for interactive or streaming queries. Iterative applications implemented using MapReduce were several times slower and thus were not suitable for uses such as graph processing. In Hadoop 1.0, only MapReduce can access data in HDFS. This is restrictive and does not allow multitenancy, forcing other applications (those that do not work in batch mode) to store data somewhere other than HDFS or to use HDFS via MapReduce only.
Low computing resource utilization: Computing resources on each node are divided into Map slots and Reduce slots in a hard partition. Suppose that, at a given time, many Mappers are running and only a few Reducers are running. All the Map slots would be used, but only a few Reduce slots would be in use. No resources are shared for better utilization. In other words, Mappers can use only Map slots; if none are available, the Mappers must wait, even though some Reduce slots might be free.
Overly crowded JobTracker: The JobTracker alone is responsible for managing both resources and the application life cycle. As more jobs run concurrently, the chance of a JobTracker failure increases. After such a failure, users must resubmit all their jobs, because all running and queued jobs are killed upon restart. The JobTracker also suffers from an increased memory requirement (for a larger cluster or complex MapReduce jobs), causing problems related to scalability and reliability, and it has a rigid threading model.
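The utilization problem in the list above can be made concrete with a small calculation in plain Java. This is a sketch under simplifying assumptions: SlotModel and its methods are hypothetical names, every task is assumed to need exactly one slot, and the shared pool is an idealization used only for comparison.

```java
// Toy comparison of hard-partitioned slots with a shared pool.
// SlotModel and its methods are hypothetical names, not Hadoop code.
class SlotModel {
    // Hadoop 1.0 style: mappers may use only map slots, reducers only reduce slots.
    static int runningWithFixedSlots(int mapSlots, int reduceSlots,
                                     int pendingMappers, int pendingReducers) {
        return Math.min(mapSlots, pendingMappers) + Math.min(reduceSlots, pendingReducers);
    }

    // Shared pool: any free unit of capacity can run either kind of task.
    static int runningWithSharedPool(int totalSlots, int pendingMappers, int pendingReducers) {
        return Math.min(totalSlots, pendingMappers + pendingReducers);
    }
}
```

For example, with 8 Map slots, 4 Reduce slots, 12 pending Mappers, and 1 pending Reducer, the fixed partition runs 9 tasks while 3 Reduce slots sit idle, whereas a shared pool of the same 12 slots would run 12 tasks.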
In Hadoop 2.0, MapReduce has undergone a complete overhaul, with a new layer on top of HDFS. This new layer is called YARN (Yet Another Resource Negotiator). In Figure 5.3, the shaded boxes indicate core Hadoop components and the white boxes indicate different technologies in the Hadoop ecosystem. YARN takes care of two major functions, resource management and application life-cycle management, which the JobTracker handled previously.
FIGURE 5.3 Hadoop 2.0 high-level architecture and components.
Now MapReduce is just a batch-mode computational layer sitting on top of YARN. YARN, however, acts like an operating system for the Hadoop cluster, providing resource management and application life-cycle management functionalities and making Hadoop a general-purpose data-processing platform that is not limited to MapReduce. This change brings some important benefits:
Segregation of duties: MapReduce can now focus on what it is good at, batch-mode data processing, while YARN focuses on resource management and application life-cycle management.
access to the data stored in HDFS. This brings a broader array of interaction patterns for the data stored in HDFS, beyond just MapReduce. You now have interactive, streaming, and graph processing support for accessing data in HDFS by leveraging YARN, all sharing common resource management. YARN can handle much more than batch-oriented MapReduce jobs and can assign cluster capacity to different applications so that it meets the service-level demands of each workload while staying within constraints such as queue capacities and user limits.
Scalability and performance: Unlike the JobTracker, which previously handled resource management and application life-cycle management together, YARN has a global Resource Manager and a per-machine Node Manager that manage the resources of the cluster and form its computation fabric. In addition, a per-application Application Master negotiates resources from the Resource Manager and manages the application life cycle. This segregation is the source of the improved scalability and performance of Hadoop 2.0.
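The division of labor among these components can be sketched with a toy model in plain Java. It is not the YARN API: ToyResourceManager and ToyApplicationMaster are hypothetical names, capacity is counted in abstract containers, and each registered node stands in for what a Node Manager would report.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the division of labor described above. Not the YARN API:
// ToyResourceManager and ToyApplicationMaster are hypothetical names.
class ToyResourceManager {
    // Free capacity per node, in abstract "containers", as reported by each node.
    private final Map<String, Integer> freeByNode = new LinkedHashMap<>();

    void registerNode(String node, int capacity) {
        freeByNode.put(node, capacity);
    }

    // Grants up to 'requested' containers and returns the nodes hosting them.
    List<String> allocate(int requested) {
        List<String> granted = new ArrayList<>();
        for (Map.Entry<String, Integer> e : freeByNode.entrySet()) {
            while (e.getValue() > 0 && granted.size() < requested) {
                e.setValue(e.getValue() - 1);
                granted.add(e.getKey());
            }
        }
        return granted;
    }

    void release(String node) {
        freeByNode.merge(node, 1, Integer::sum);
    }
}

class ToyApplicationMaster {
    private final ToyResourceManager rm;
    private final List<String> held = new ArrayList<>();

    ToyApplicationMaster(ToyResourceManager rm) {
        this.rm = rm;
    }

    // Negotiates containers for this application only; returns how many it holds.
    int start(int tasks) {
        held.addAll(rm.allocate(tasks));
        return held.size();
    }

    // Ends the application life cycle and hands every container back.
    void finish() {
        for (String node : held) {
            rm.release(node);
        }
        held.clear();
    }
}
```

The resource manager in this sketch knows nothing about what an application does with its containers, and each application master tracks only its own application, which is the separation the text credits for the improved scalability.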
Note: MapReduce 2.0 Versus YARN
In Hadoop 2.0, MapReduce is referred to as MapReduce 2.0, MRv2, NextGen MapReduce, or YARN. You will find this different nomenclature used in different artifacts. But just to summarize, the resource management and application life-cycle management capabilities have been moved from the classic MapReduce into the YARN layer. MapReduce can now focus on what it does best (batch-mode data processing), and YARN can work as an
operating system for the Hadoop cluster serving requests from multiple applications, not just limited to MapReduce.
Strictly speaking, however, MapReduce 2 and YARN are different; referring to them as the same thing is misleading. YARN is a generic framework, or data operating system, that provides Hadoop with cluster resource management and job scheduling/monitoring capabilities. YARN sits on top of HDFS and provides these services to any application that targets it (see Figure 5.3), whereas MapReduce 2 is just one of many applications that use the services YARN provides.
Note that YARN is in no way tied to MapReduce only and can be leveraged by any application wanting to use it. You can find the latest details on other applications that leverage the power of YARN here:
http://wiki.apache.org/hadoop/PoweredByYarn.
As you saw in Figure 5.3, YARN now takes care of resource management, job scheduling, and monitoring, making MapReduce a pure computational layer on top of YARN for batch processing. Other components can now be written natively for YARN, but that does not happen automatically. Even in a Hadoop 2.0 cluster, many things still translate down into MapReduce jobs: the community is working on YARN versions of most components, and these are in various states of completion.