
Model training can also be parallelized using another technique referred to as asynchronous SGD (ASGD) [12, 17, 26]. It was first proposed to run across CPU servers, as shown in Fig. 7.3. In this architecture, the DNN model is stored across several (3 in the figure) computers referred to as the parameter server pool. The parameter server pool is the master. The master sends the model parameters to the slaves, each of which consists of several (4 in the figure) computers. Each slave works on a subset of the training data. It calculates the gradients of each minibatch and sends them to the master. The master updates the parameters and sends the new values back to the slave.

Fig. 7.3 Illustration of asynchronous SGD. Shown in the figure is a master server pool and three slave clusters (Picture courtesy of Erdinc Basci)

Since each computer in the slave contains part of the model, the activations need to be copied across computers. To reduce the communication cost, the models need to have sparse connections between the components stored on different computers. Having several computers on the master helps to reduce the communication cost between the master and slaves since each computer pair only needs to transfer a subset of the parameters. The key to the success of ASGD, however, is the use of asynchronous lock-free updates. In other words, the parameters on the server are not locked during updating. When the master gets gradients from several slaves, it updates the model independently in different threads. When the master sends the new parameters to a slave, part of the parameters may have been updated using gradients sent from several slaves.
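The lock-free update described above can be illustrated with a minimal sketch. It assumes a toy two-parameter least-squares model in place of a DNN, Python threads in place of slave clusters, and a single in-process master in place of a parameter server pool; the names `Master`, `slave`, and `run_asgd` are hypothetical and not taken from [12, 17, 26].

```python
import random
import threading


def make_data(n=200, seed=0):
    """Noise-free samples of y = 2*x0 - 3*x1 (a toy stand-in for training data)."""
    rng = random.Random(seed)
    true_w = [2.0, -3.0]
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        data.append((x, sum(w * xi for w, xi in zip(true_w, x))))
    return data


def minibatch_gradient(w, batch):
    """Gradient of the mean squared error over one minibatch."""
    grad = [0.0] * len(w)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            grad[i] += 2.0 * err * xi / len(batch)
    return grad


def mean_squared_error(w, data):
    return sum((sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2
               for x, y in data) / len(data)


class Master:
    """Holds the parameters; deliberately uses no lock around updates."""

    def __init__(self, dim, lr):
        self.w = [0.0] * dim
        self.lr = lr

    def pull(self):
        # The snapshot may mix values written by different slaves.
        return list(self.w)

    def push(self, grad):
        # Lock-free update: other threads may interleave with this loop.
        for i, g in enumerate(grad):
            self.w[i] -= self.lr * g


def slave(master, shard, steps, batch_size, seed):
    """Each slave repeatedly pulls the model and pushes a minibatch gradient."""
    rng = random.Random(seed)
    for _ in range(steps):
        w = master.pull()
        batch = rng.sample(shard, batch_size)
        master.push(minibatch_gradient(w, batch))


def run_asgd(num_slaves=3, steps=300, batch_size=8, lr=0.1):
    data = make_data()
    shards = [data[i::num_slaves] for i in range(num_slaves)]  # one subset per slave
    master = Master(dim=2, lr=lr)
    threads = [threading.Thread(target=slave,
                                args=(master, shards[i], steps, batch_size, i))
               for i in range(num_slaves)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return master.w, mean_squared_error(master.w, data)


if __name__ == "__main__":
    w, mse = run_asgd()
    print("learned parameters:", w, "mse:", mse)
```

No slave waits for another in this sketch, and a pulled snapshot may already be out of date by the time its gradient is pushed; on this small convex problem the parameters still approach the values used to generate the data.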

At first glance, this may seem to cause convergence problems. In practice, however, the parameters converge fine and the model training time is significantly reduced since each slave does not need to wait for other slaves to finish [12, 17]. The model gradually evolves as it is exposed to random records from the training dataset. A proof of the convergence of ASGD is given in [17].


There are several practical issues to consider in ASGD. First, some slaves may take longer than others to finish an epoch. As a result, the gradients calculated on these slaves might be based on a very old model. The simplest approach to handling this problem is to send a time stamp in all communications. If the difference between the time stamp sent from the slave and that on the master exceeds a threshold, the master simply abandons the outdated gradient and sends the slave the most up-to-date model. If a slave is consistently slower, the data assigned to that slave need to be redistributed to other slaves. This can be easily done by having all slaves fetch minibatches from the same pool. Second, the delayed-update problem that happens in pipelined BP obviously also happens in ASGD. For this reason, we need to either reduce the number of slaves or reduce the learning rate to compensate for the problem. Either of these solutions, however, will slow down the training. Third, the delayed-update problem manifests itself most when the gradients are large, which typically happens at the early stage of model training. This problem can be alleviated by a technique called warm start, which starts the ASGD training from a model trained with one pass of SGD.
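The time-stamp check for outdated gradients might look like the following sketch. It assumes an integer model version counter stands in for the time stamp, and the class name `VersionedMaster` and the `max_staleness` threshold are hypothetical choices made for illustration.

```python
class VersionedMaster:
    """Master that rejects gradients computed on a model that is too old."""

    def __init__(self, params, lr, max_staleness):
        self.params = list(params)
        self.lr = lr
        self.max_staleness = max_staleness  # assumed threshold, in model versions
        self.version = 0                    # stands in for the time stamp
        self.discarded = 0

    def pull(self):
        """Return the current model together with its version."""
        return list(self.params), self.version

    def push(self, grad, model_version):
        """Apply the gradient unless the model it was computed on is too old.

        Returns True if the gradient was applied and False if it was abandoned,
        in which case the slave should pull the latest model before continuing.
        """
        if self.version - model_version > self.max_staleness:
            self.discarded += 1
            return False
        self.params = [p - self.lr * g for p, g in zip(self.params, grad)]
        self.version += 1
        return True


# A slow slave pulls the model at version 0 ...
master = VersionedMaster([1.0, 1.0], lr=0.5, max_staleness=2)
_, slow_version = master.pull()

# ... while faster slaves push three updates in the meantime.
for _ in range(3):
    _, version = master.pull()
    master.push([0.1, 0.1], version)

# The slow slave's gradient is now three versions behind and is abandoned.
accepted = master.push([10.0, 10.0], slow_version)
```

A rejected slave simply calls `pull()` again, which matches the behavior described above of sending it the most up-to-date model.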

Although ASGD works on CPU clusters [12], the communication cost is very high and can become the bottleneck. For example, running ASGD on 1,000 distributed CPU cores performs approximately as fast as 8 GPUs on a single machine. The main reasons to use ASGD on CPUs are to take advantage of existing CPU clusters and to train models that cannot fit in GPU memory.

Alternatively, the ASGD algorithm can be applied to GPUs on a single hosting machine [26]. Since in speech recognition the DNN model can fit into both CPU and GPU memory, we can use the hosting machine (CPU) as the master and each GPU as a slave. Note that with GPU-based ASGD the overall speed is significantly improved. This is because each minibatch takes much less time on GPUs and the communication between the GPUs and the hosting machine (through the PCIe bus) is significantly faster than that between CPU machines. Even on GPUs, communication may still become the bottleneck if the minibatch is too small. This problem can be addressed by reducing the frequency of data transmission between the master and the slaves. Instead of updating the model after every minibatch, the GPU slave can accumulate gradients and send them to the master every three to four minibatches. Since this essentially increases the minibatch size, the learning rate may need to be reduced to compensate for it.
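The reduced transmission frequency can be sketched as follows. This is an illustration only: it assumes gradients are plain lists of floats, that accumulation means summing, and that `send` is a hypothetical callback standing in for the transfer to the master.

```python
def accumulate_and_send(minibatch_grads, send_every, send):
    """Sum gradients locally and call `send` once every `send_every` minibatches."""
    buffer = None
    pending = 0
    for grad in minibatch_grads:
        if buffer is None:
            buffer = list(grad)
        else:
            buffer = [b + g for b, g in zip(buffer, grad)]
        pending += 1
        if pending == send_every:
            send(buffer)
            buffer, pending = None, 0
    if pending:
        # Flush whatever is left at the end of the data.
        send(buffer)


# Eight minibatches, one transfer every four: two messages instead of eight.
sent = []
accumulate_and_send([[1.0, 2.0]] * 8, send_every=4, send=sent.append)
```

Because each message now carries the sum of four minibatch gradients, a single master update is roughly four times larger, which is why the learning rate may need to be reduced.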

Table 7.2, which is extracted from [26], compares the character error rate (CER) and the training time per 10 h of data between SGD and ASGD on a Chinese task. The 42-dimensional feature is formed from 13-dimensional PLP and 1-dimensional pitch with the first- and second-order derivatives appended. The DNN is trained on a 130-h dataset and is tuned on another one hour of data used as a development set. Concatenations of 11 frames are used as input to the DNN, which has 5 hidden layers, each with 2,048 neurons. The output layer has 10,217 senones. The systems are evaluated on two individual test sets, namely clean7k and noise360, which were collected through a mobile microphone in clean and noisy environments, respectively. The NVidia GeForce GTX 690 was used for training. The table indicates that ASGD achieves a 3.2 times speedup on four GPUs compared to SGD running on a single GPU.

Table 7.2 Comparison of character error rate (CER) and training time in minutes per 10 h of data on a Chinese speech recognition task (Summarized from [26])

System              Clean7K CER (%)   Noise360 CER (%)   Time (min)
GMM BMMI            11.30             36.56              –
DNN SGD             9.27              26.99              195.1
DNN ASGD (4 GPU)    9.05              25.98              61.1
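The speedup quoted for Table 7.2 can be checked directly from the two training times. The short calculation below also derives a parallel efficiency figure, which is not stated in the source and is included only as an illustration; the variable names are hypothetical.

```python
sgd_minutes = 195.1   # DNN SGD, single GPU (Table 7.2)
asgd_minutes = 61.1   # DNN ASGD, four GPUs (Table 7.2)
num_gpus = 4

speedup = sgd_minutes / asgd_minutes
efficiency = speedup / num_gpus  # fraction of ideal linear scaling

print(f"speedup: {speedup:.2f}x, parallel efficiency: {efficiency:.0%}")
# prints "speedup: 3.19x, parallel efficiency: 80%"
```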

7.1.3 Augmented Lagrangian Methods and Alternating Directions
