7.3.1
Speedup with a 1-layer Reduce
In this subsection our objective is to prove that our DAVQ algorithm brings a substantial speedup which corresponds to the gain of wall time execution brought by more computing resources compared to a sequential execution. Consequently the experiments presented in this subsection are similar to those provided in Chap- ter 6. In the previous chapter, we proved that a careful attention should be paid to the parallelization scheme, but without any specifications on the architecture. We should bear in mind that the experiments presented in Chapter 6 simulated a generic distributed architecture. In this subsection we present some experiments made with our true cloud implementation.
The following settings will attempt to create a synthetic but realistic situation for our algorithms. We load the local memory of the workers with data generated using the splines mixtures random generators defined in Section 6.3. Using the notation introduced in this section, each worker is endowed with n = 10000 vector data that are sampled spline functions with d = 1000. The benefit of using synthetic data for our implementations has been discussed in Section 6.3. The number of clusters is set to κ = 1000 while there are G = 1500 centers in the mixture and the number of knots for each spline χ is set to 100. The model made in Chapter 5 has shown that it is not possible to tune the relative amount of time taken by a gradient computation compared to the amount of time taken by prototypes communication. Indeed, both of them linearly depend on κ and d. The chosen settings ensure that the complexity of the clustering job would lead to approximately an hour execution on a sequential machine when using the same learning rate as in the previous chapter: εt= √1t for t > 0.
Similarly to Section 6.4, we investigate the speedup ability of our cloud DAVQ algorithm. Let us keep in mind that the number of samples in the total data set
equals to nM . Thus, this number is varying with M (the number of ProcessSer- vice instances). Consequently, the evaluations are performed with the normalized quantization criterion given by equation (6.5).
Figure 7.4 shows the normalized quantization curves for M = 1, 2, 4, 8, 16. We can notice that our cloud DAVQ algorithm has a good scalability property for M = 1, 2, 4, 8: the algorithm converges more rapidly because it has processed much more points as shown at the end of the section. Therefore, in terms of wall time, the DAVQ algorithm with M = 2, 4, 8 clearly outperforms the single exe- cution of the VQ algorithm (M = 1). However we can remark the bad behavior of our algorithm with M = 16. This behavior is explained in the subsequent subsection and a solution to overcome the problem is presented and evaluated.
4.4E+04 4.9E+04 5.4E+04 5.9E+04 6.4E+04 6.9E+04 0 500 1000 1500 2000 2500 3000 3500 E mp ir ic al d is to rt ion seconds one layer M=1 M=2 M=4 M=8 M=16
Figure 7.4: Normalized quantization curves with M = 1, 2, 4, 8, 16. Our cloud DAVQ algorithm has good scalability properties up to M = 8 instances of the ProcessService. Troubles appear with M = 16 because the ReduceService is overloaded.
7.3.2
Speedup with a 2-layer Reduce
The performance curve with M = 16 of Figure 7.4, shows an unsatisfactory situ- ation where the algorithm implementation does not provide better quantization results than the sequential VQ algorithm. Actually, the troubles arise from the fact that the ReduceService is overloaded3. Indeed, too many ProcessService
3. The Lokad.Cloud framework provides a web application that provides monitoring tools, including a rough profiling system and real time analysis of queues length, which helped us to
instances communicate their results (the displacements terms) in parallel and fill the queue associated to the ReduceService with new tasks resulting in a Reduce- Service overload. Once again the elasticity of the cloud enables us to allocate more resources and to create a better scheme by distributing the reducing tasks workload on multiple workers. As we did for Batch K-Means, we now set an intermediate layer in the reducing process by replacing the ReduceService by two new QueueServices: the PartialReduceService and the FinalReduceService. We deploy√M instances of PartialReduceService that will be responsible for computing the sum of the displacement terms, but only for a certain fraction of ProcessService instances: each PartialReduceService instance is gathering the displacement terms of√M ProcessService instances in the same way as for Batch K-Means. Then, the unique instance of FinalReduceService retrieves all the intermediate sums of displacement terms made by the PartialReduceService instances and computes the new shared version. Figure 7.6 is a synthetic represen- tation of the new layers, which are introduced to remove the difficulty brought by the overload of the initial ReduceService. In Figure 7.5 we plot the performance curves with PartialReduceService instances. We can remark that a good speedup is obtained up to M = 32 ProcessService workers. However, the algorithm does not scale-up to M = 64.
Contrary to the phenomenon observed when using a single-layer Reduce, the queues were not overloaded in the case M = 64. This unsatisfactory result high- lights again the sensitivity of VQ algorithms to the learning rate {εt}t>0and to
the first iterations: we have observed in some of these unsatisfactory experiments that the aggregated displacement term was too strong when M was exceeding a threshold value. Beyond this threshold, the aggregated displacement term leads indeed to an actual shared version of the prototypes that oscillate around some value, while moving away from it. These issues may be solved by adapting the learning rate to the actual value of M for the first iterations, when M exceeds a specific threshold.
7.3.3
Scalability
Let us recall that the scalability is the ability of the distributed algorithm to process growing volumes of data gracefully. The experiments reported in Figure 7.7 investigate this processing ability with more accuracy, by counting the samples that have been processed for the computation of the shared version. The graph shows the results with one or two layers for the reducing task and with different
3.8E+04 4.3E+04 4.8E+04 5.3E+04 5.8E+04 6.3E+04 6.8E+04 7.3E+04 7.8E+04 0 500 1000 1500 2000 2500 3000 3500 E mp ir ical d is to rt ion seconds two layers M=8 M=16 M=32 M=64
Figure 7.5: Normalized quantization curves with M = 8, 16, 32, 64 instances of ProcessService and with an extra layer for the so called “reducing task”. Our cloud DAVQ algorithm has good scalability properties for the quantization performance up to M = 32. However, the algorithm behaves badly when the number of computing instances is raised to M = 64.
displacement terms partial sums PartialReduceService instances FinalReduceService instance shared version
Figure 7.6: Overview of the reducing procedures with two layers: the PartialReduceService and the FinalReduceService.
values of M , namely M = 1, 2, 4, 8 for the first case and M = 8, 16, 32, 64 for the second one. The curves appear linear, which proves that the algorithms behave well. The slopes of the multiple curves, characteristic of the processing ability
of the algorithm, exhibit a good property: the slope is multiplied by 2 when the number of processing instances M is multiplied by 2. This proves that the algorithm has a good processing scalability as we had hoped for. This is confirmed by Figure 7.8, showing a linear curve for the number of points processed in a given time span (3600 seconds) using different computing instances M . We can remark that the scalability in terms of data processed is still good up to M = 64 while the algorithm does not work well in terms of accuracy (i.e. quantization level). Furthermore, the introduction of the second layer does not slow down the execution even for intermediate values of M (M = 8) as proved by the charts displayed in Figure 7.7.