
ii. Hadoop replication factor:- The replication factor in the Hadoop framework is such that 2/3 of each block (of a whole file) is replicated to different data nodes across racks in a cluster. Because the Application Master (AM) is expected to monitor the execution of a job/application (with its complete number of blocks) in the cluster, it must communicate across racks with the data nodes that hold the input splits of that job/application. This cross-rack communication results in higher latency in job execution.

iii. Job completion time:- Because the Resource Manager alone coordinates the release of resources for job execution, several Application Masters (AMs) polling the Resource Manager with resource requests create a bottleneck in the system. This slows down processing, so the total turnaround time (job completion time) of each job is high.

The new framework has six (6) phases: job submission, job initialization, task assignment, task execution, progress/update and job completion. Figure 3.2 shows MapReduce job execution on our new framework.

Figure 3.2: MapReduce Job Execution on the new framework

a. Job Submission Phase

Step 1: The job is submitted to the job client.

Step 2: The job client requests a new job ID.

Step 3: The job client then checks whether the output directory has been created. After verifying this, it copies the job resources to HDFS.

Step 4: The job client then submits the job to the Resource Manager.

b. Job Initialization Phase

Step 5: The Resource Manager gets the input splits for the job.

Step 6: Using the information from Step 5, the Resource Manager schedules the job on the appropriate Rack Unit Resource Manager (RU_RM), the one with 2/3 of the input splits.

Step 7: The scheduler at the RU_RM picks up the job and contacts the appropriate Node Manager to launch an Application Master for the job.

Step 8: The Application Master creates an object for the job, for bookkeeping and task management. It then creates one map task per input split on the data node.

The Application Master at this point decides how to execute the job. If the job is small, the Application Master runs it in its own JVM to avoid unnecessary overhead.
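The decisions in Steps 6 and 8 can be illustrated with a minimal sketch. It is not the framework's implementation: the function names, the mapping from split to rack, and the `SMALL_JOB_MAX_SPLITS` threshold are assumptions introduced here, since the text does not state how a "small" job is defined.

```python
from collections import Counter

# Hypothetical threshold: the text does not define what counts as a small job.
SMALL_JOB_MAX_SPLITS = 2


def choose_rack_unit(split_racks):
    """Return the rack holding the most input splits and its share of them.

    split_racks maps a split id to the rack that stores it (assumed layout).
    """
    counts = Counter(split_racks.values())
    rack, held = counts.most_common(1)[0]
    return rack, held / len(split_racks)


def plan_map_tasks(split_racks):
    """Create one map task per input split (Step 8)."""
    return [f"map-{split_id}" for split_id in sorted(split_racks)]


def runs_in_am_jvm(split_racks):
    """Decide whether the Application Master runs the job in its own JVM."""
    return len(split_racks) <= SMALL_JOB_MAX_SPLITS


# Example: two of three splits live on rack1, so rack1's RU_RM gets the job.
splits = {"s1": "rack1", "s2": "rack1", "s3": "rack2"}
rack, share = choose_rack_unit(splits)
tasks = plan_map_tasks(splits)
```

In this example `rack1` holds two of the three splits, which corresponds to the 2/3 share mentioned in Step 6.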

c. Task Assignment Phase

Step 9: If the job is large, the Application Master asks the Rack Unit Resource Manager to allocate the computing resources (containers) it needs. The scheduler at this point knows where the resources are located, because it gathers this information from the heartbeats it receives from each worker node in the rack. It uses this information to take data locality into account when assigning a task: it tries as far as possible to assign a task to the node where its data are located.

If this is not possible, it assigns the task to another node within the rack.
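The locality rule of Step 9 can be sketched as follows. This is an illustration under stated assumptions, not the scheduler's actual code: the function `assign_task` and the node names are hypothetical, and the fallback simply takes the first free node in the rack because the text does not say how the alternative node is chosen.

```python
def assign_task(split_hosts, free_nodes):
    """Pick a node for a task, preferring one that stores the task's data.

    split_hosts: nodes that store the task's input split (assumed to be
                 known from heartbeats).
    free_nodes:  nodes in this rack with a free container, in scheduler order.
    Returns the chosen node name, or None if the rack has no free node.
    """
    for node in free_nodes:
        if node in split_hosts:
            return node  # data-local assignment
    # No free node holds the data: fall back to another node in the rack.
    return free_nodes[0] if free_nodes else None


local = assign_task({"node2", "node5"}, ["node1", "node2", "node3"])
fallback = assign_task({"node5"}, ["node1", "node3"])
```

Here `local` is a node that holds the split, while `fallback` is a free node in the same rack that does not.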

d. Task Execution Phase

Step 10: The Rack Unit Resource Manager, through the appropriate Node Manager, launches the YarnChild.

Step 11: YarnChild retrieves all job resources from the HDFS.

Step 12: YarnChild now runs the map and reduce tasks.

e. Progress and Update Phase

In this phase, YarnChild sends a progress report to the Application Master every 3 seconds. The Application Master in turn aggregates the reports and sends updates directly to the job client.
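One possible form of the aggregation step is sketched below. The text only says that the Application Master "aggregates" the reports, so the use of a simple average, the function name `aggregate_progress`, and the task names are all assumptions made for illustration.

```python
REPORT_INTERVAL_SECONDS = 3  # reporting period stated in the text


def aggregate_progress(task_progress):
    """Combine per-task progress (0.0 to 1.0) into one job-level figure.

    Averaging is an assumption; the text does not specify the method.
    """
    if not task_progress:
        return 0.0
    return sum(task_progress.values()) / len(task_progress)


# Latest report received from each YarnChild (hypothetical values).
reports = {"map-1": 1.0, "map-2": 0.5, "reduce-1": 0.0}
job_progress = aggregate_progress(reports)
```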

f. Job Completion Phase

The Application Master sends the output to HDFS.

The Application Master and the task containers clean up their working state with the help of the Container Expirer.

3.2.1 Justification of the New System

To justify the working methodology of our new model, we carry out a hypothetical evaluation that compares the results obtained with the new model to those of the YARN model. Assume that three jobs (applications) of the same size are to be processed and that each step in executing any of these jobs takes 0.01ns. For each job, therefore, the first 5 steps in the existing framework hold as shown in Figure 3.1.

Since there is only one Resource Manager, the last seven steps require the Resource Manager to communicate with the Node Manager to launch containers and with the Application Master to handle resource requests. Since no more than one instruction can be given at a time, the Resource Manager must interleave these instructions among the three jobs (applications).

Assume that the time taken for each job to be attended to is 3ns and that the interleaving of processes follows FIFO order. Job 1 gets resources immediately, so its delay time is zero (0). Job 2 gets access to resources at time 3ns and Job 3 at time 7ns. The overall time to process the three jobs is as follows:

Job 1 = (0.01ns x 5) + 0ns = 0.05ns

Job 2 = (0.01ns x 5) + 3ns = 3.05ns

Job 3 = (0.01ns x 5) + 7ns = 7.05ns

Total instruction time needed to process the three jobs = 0.05ns + 3.05ns + 7.05ns = 10.15ns
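The arithmetic for the existing model can be checked with a short script. It is only a sketch of the hypothetical evaluation: the constant names are introduced here, and the delay values are taken directly from the text.

```python
STEP_TIME_NS = 0.01    # time per step, as assumed in the text
COMMON_STEPS = 5       # steps that hold unchanged for every job
DELAYS_NS = [0, 3, 7]  # waiting time for Job 1, 2 and 3, taken from the text

# Per-job time = time for the common steps + time spent waiting for resources.
job_times = [round(STEP_TIME_NS * COMMON_STEPS + d, 2) for d in DELAYS_NS]
total_ns = round(sum(job_times), 2)

for i, t in enumerate(job_times, start=1):
    print(f"Job {i}: {t:.2f} ns")
print(f"Total: {total_ns:.2f} ns")
```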

With the new model, the first 7 steps hold for all three jobs, as shown in Figure 3.2. Since each RU_RM node executes just one job at a time, the last 5 steps are carried out at the same time on different RU_RM nodes.

Therefore, if it takes 3ns for the three jobs to be attended to in the existing system, it takes 1/3ns for each of the last 5 steps (5 x 1/3ns ≈ 1.67ns per job) in the new model. Hence, the processing time for the three jobs is as follows:

Job 1 = (0.01ns x 7) + 1/3ns of 5 on RU_RM1 = 0.07ns + 1.67ns = 1.74ns

Job 2 = (0.01ns x 7) + 1/3ns of 5 on RU_RM2 = 0.07ns + 1.67ns = 1.74ns

Job 3 = (0.01ns x 7) + 1/3ns of 5 on RU_RM3 = 0.07ns + 1.67ns = 1.74ns

Total instruction time needed to process these three jobs = 1.74ns x 3 = 5.22ns

From our analysis, ignoring bottlenecks associated with network bandwidth and communication overhead, we observe that it takes 10.15ns to pass the instructions that execute three jobs in the YARN model, whereas it takes 5.22ns to do the same with our new model. This shows that our new model promises a better response time and a lower turnaround time than the existing model.
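The figures for the new model can be reproduced in the same way. This is a sketch of the hypothetical evaluation only: the constant names are introduced here, and each per-job time is rounded to two decimal places before summing, as the text does. Without that intermediate rounding the total would come to 5.21ns rather than 5.22ns.

```python
STEP_TIME_NS = 0.01       # time per step, as assumed in the text
COMMON_STEPS = 7          # steps that hold unchanged for every job
PARALLEL_STEPS = 5        # remaining steps, run on separate RU_RM nodes
PER_STEP_WAIT_NS = 1 / 3  # the text's "1/3ns of 5" steps
NUM_JOBS = 3
YARN_TOTAL_NS = 10.15     # total derived for the existing model in the text

# Per-job time on its own RU_RM node, rounded as in the text.
job_ns = round(STEP_TIME_NS * COMMON_STEPS + PER_STEP_WAIT_NS * PARALLEL_STEPS, 2)
new_total_ns = round(job_ns * NUM_JOBS, 2)
saving_ns = round(YARN_TOTAL_NS - new_total_ns, 2)

print(f"Per job: {job_ns:.2f} ns")
print(f"New model total: {new_total_ns:.2f} ns")
print(f"Difference from YARN model: {saving_ns:.2f} ns")
```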
