In this method, two ways of implementing the auxiliary calculations nodes: having one node as standby to take over if one of the nodes fails, and having replicated nodes for each calculation node executing in parallel. In the scenario of one node on standby, only one extra calculation node is needed, and if one of the main calculation nodes fails, then the standby calculation node takes over and completes the failed task. This will cause a certain time delay in the overall processing. In the replicated node scenario, exactly the same number of nodes is required, all the calculation nodes execute the tasks at the same time, and a single task is executed in two calculation nodes at the same time. By doing so, if one of the calculations has failed, it will not affect the overall calculation depending on how it is implemented. The main advantage of this method is that the overall failure rate is considerably lower than when running a single task on a single calculation node. However, for this method, the number of calculation nodes needed is doubled, for a one-to-two ratio; meanwhile, if the ratio is one-to-four, then the number of calculation nodes needed is four times more. This method is highly suited for the calculations that are a combination of time- critical and data-critical. For high reliability, a ratio of one-to-four is better suited, that is, the same task is executed in four separate calculation nodes at the same time. These types of implementations are investigated using CPU-core-level distributed processing, and they are discussed in detail in Chapter 7. Due to the availability of lower cost small form factor (SFF) computers, this method is highly feasible and relatively easy to implement compared with more complex dynamic load balancing algorithms.
127
For testing purposes, a single task executed in two separate calculation nodes at the same time. Six workstations are selected as calculation nodes, and the job is split into three parallel tasks that are executed at the same time. For example, if a task failure is 1 in 1,000 in a single calculation node, then, by executing the same task in two separate calculation nodes at the same time, the failure rate is reduced to 1 in 2 million for both calculation nodes that are executing same task. For reducing the calculation time, calculation nodes are paired as the most powerful with the least powerful calculation nodes within the selected cluster. Table 5.12 lists the calculation time for each calculation node using auxiliary processing, and Figure 5.15 shows the maximum and minimum calculation time for each group.
Table 5.12: Calculation time for auxiliary processing using two groups
Calculation Node(i) T(i) (sec) Task ID Group ID
NHKWST68 92 1 1 NHKWST63 84 2 1 NHKWST70 78 3 1 NHKWST66 76 3 2 NHKWST65 72 2 2 NHKWST62 72 1 2
128
From the collected data, at the normal operating condition, that is, no calculation node failure, calculation time is 76 seconds; meanwhile, for the worst-case failure, assuming that two calculation nodes executing the same task that have not failed at same time, the calculation time is 92 seconds. Hence, the auxiliary processing method proved to be a very resilient and failproof method for data-critical and time-critical calculations.
For single standby auxiliary node testing, another workstation was selected to act as an auxiliary node with the calculation time of 29 seconds (TA). Hence, if any of the main
calculation nodes failed, then it will take another 29 seconds to complete the task. In this case, using Equation (5.23), TD=0, TM=TA, and TN= (T(i) x K +TA). Table 5.13
shows distributed calculation times with and without calculation node failure using the single auxiliary node method. As can be seen from the data, if no node failures, then the calculation completes in 46 seconds; meanwhile, if the most powerful node failed, then it will take 72 seconds to complete, and failure of the least powerful workstation requires 84 seconds to complete.
Table 5.13: Distributed calculation times with and without calculation node failure using the single auxiliary node method
Calculation
Node (i) T(i) (sec) K T(i) x K (sec) TA (sec) TN (sec)
NHKWST68 46 1.2 55 29 84 NHKWST63 42 1.2 50 29 79 NHKWST66 39 1.2 47 29 76 NHKWST65 38 1.2 46 29 75 NHKWST62 36 1.2 43 29 72 NHKWST61 36 1.2 43 29 72
129
5.11 Chapter Summary
This chapter has demonstrated the original contribution to the specific design methods and implementation of load balancing and task allocations techniques for complex applications. Due to the complex and bespoke nature of Northwest’s software applications used in the distributed processing cluster, the load balancing techniques employed have to be highly specific for each application scenario. No single technique is suitable for all possible scenarios. Hence, the general load balancing algorithms widely used in known distributed system configurations are not suitable for Northwest’s distributed processing system. Because of the critical nature of the calculations, the entire distributed processing has to be completed within the allocated time limits without any failures or has to operate with minimum failure rates. Therefore, the static load balancing is a crucial part of the overall load balancing implementations and must be able to predict the approximate total calculation time for given batch calculations; meanwhile, the dynamic load balancing acts as a safeguarding mechanism to improve the distributed processing efficiency.
The applications implemented within the distributed processing systems are mainly based on coarse-grained task distribution techniques, and coarse-grained processes are not suitable for efficient dynamic load balancing due to the longer processing time compared with the fine-grained processes that have shorter processing time. The fine- grained processes are highly suited for dynamic load balancing and perform well with different types of dynamic load balancing algorithms. Meanwhile, coarse-grained processes have very limited performance with dynamic load balancing; hence, very few dynamic load balancing techniques can be applied. Hence, in the case of Northwest, the use of a general type of dynamic load balancing has limited advantages compared with static load balancing. The company’s network configurations and workstations are highly reliable and continuously monitored for hardware and software performance, hence, the failure rate of each calculation node is kept minimal. Therefore, the calculation time for each batch process within the selected workstation cluster can be predicted with a small margin of error and, in most cases, the static load balancing algorithms perform well without activating dynamic load balancing
130
algorithms. However, the dynamic load balancing acts as a safeguarding mechanism to protect the system when an unpredictable disturbance within the system. In addition, it is implemented as a part of the centralised load balancing mechanism and managed by the distributed processing controller, and is only activated by the controller if unpredicted events happen within the system.
Even though many techniques and algorithms can be applied for static and dynamic load balancing for different types of distributed processing systems, these significantly depend on what type of distributed processing systems are used and their implementations. Some of them are highly specific and others are generic in nature, and they depend on what type of applications, the scale of the distributed processing systems, and hardware and software used within the systems. As far as Northwest’s distributed processing system is concerned, the load balancing techniques and algorithms have to be highly specific. It has to be simple to implement, and easy to manage with minimum disruption to the business. Because of the bespoke nature of the applications and software programs used within the distributed processing system, the load balancing algorithms have to be designed to work with the system concerned, and the fuzzy logic type rule-based algorithms are shown to perform effectively with these types of systems. The investigation on load balancing techniques has shown that a number of possible ways of achieving the fine-tuned load balanced condition within any type of distributed processing system to improve the processing efficiency of the calculation cluster. However, at the current stage, only a few of the bespoke techniques are suitable to implement within the company, and going forward, other techniques will be investigated as part of further research and development.
131
6
Dedicated Calculation Grid Design Using Peer-to-Peer
Network for High-Volume Distributed Processing
6.1
Introduction
This chapter describes the original design, development, and testing of a dedicated distributed process calculator that acts as a calculation grid using SFF computers and PCs to build P2P network-based clusters with logically separated networks using Windows workgroups. This is the third phase of the investigation that explained in section 1.4 in Chapter 1. The purpose of building the dedicated calculation grid is to facilitate the research and development team and quantitative researchers to test and simulate various mathematical models and trading scenarios in real time and, in addition, to execute long-running batch processes as separate processes. These calculations require considerable amounts of processing power and cannot be implemented in a single PC or server that take many hours to complete. Hence, a requirement for a dedicated calculation grid that can be used for performing various calculations and can be employed 24x7. Unlike the workstation-based calculation cluster described in Chapter 4 that involves the user workstations, the dedicated calculation grid is isolated from user access and fully dedicated for certain applications and mathematical models that require extensive testing before being used in live- trading environments.
Two separate calculation clusters are built: one with small form factor (SFF) computers and the other with PCs. The PC cluster uses dedicated PCs as calculation nodes and these are spare PCs in good working conditions but not in use. The small form factor computer-based cluster is built using low-cost, single-board computers that are small in size and low in power consumption. Intel’s NUC computers are used for this cluster. This is the main cluster that will be expanded by adding more calculation nodes in future developments. Meanwhile, the PC cluster that consists of spare PCs will only be used for testing purposes, and it will not be used in the production environment due to its high power consumption and requirement to be
132
located in a larger space. The distributed processing controller software and calculation node’s controller software are modified accordingly to include a dedicated calculation grid cluster, and in addition, a number of changes are made to the SQL server database to incorporate the dedicated cluster nodes within the distributed processing control system. The message passing between calculation nodes and distributed processing controller is similar to the technique that is used and described in Chapter 4 with minor modifications to work with the P2P network.
A number of different tests are carried out on both clusters with and without load balancing conditions, and calculation time reduction using dedicated calculation clusters shows considerable improvements for compute-intensive calculations when used with distributed process-based calculation using both dedicated calculation grids. The NUC cluster is most suited for long-term development of dedicated calculation grids for the company due to its compact size, high reliability, lower power consumption, and considerably low cost per calculation node. Meanwhile, the PC cluster is most suited for testing load balancing algorithms and prototype mathematical models for distributed processing due to its nonlinearity. However, the PC cluster is not suitable for continuous operations due to its high power consumption, space requirements, noise, and unreliability of the hardware. Further analysis on collected data is discussed in detail in Chapter 8 where the results and discussions are presented.
133