The data used to train the final hand-motion network differed in two respects from that used to create the prototype networks discussed in the previous section. First, the data was gathered from the motion of the hand during the formation of genuine signs, as it was felt this would increase the system's accuracy when required to classify the motion component of actual signs within the final system. Second, a closer examination of the Auslan dictionary revealed that the direction of rotation of the hand during a circling motion was never a distinguishing factor between different signs, so the number of different motions to be classified was reduced to 13. It should be noted that this network was still designed to recognise only those hand motions which Stokoe described as directional or circular; hand-internal and wrist motions were not considered at this stage.
The requirement that the hand motions be measured during the act of signing increased the time required to gather this data. In the interests of efficiency it was therefore decided to gather at the same time the data necessary for testing the full sign classification system and for developing techniques for automatically segmenting continuous sequences of signs. This involved each user producing short sequences of four signs, during which data was recorded from both the CyberGlove and the Polhemus. The start and end points of each sign in the sequence were marked by the signer using a switch held in the non-signing hand. In this manner the continuous signing data required for developing and testing the segmentation techniques discussed in Chapter 11 was gathered. To produce the segmented motions required for training the hand-motion classification network, this continuous data was post-processed to separate the individual signs, as indicated by the manually generated segmentation points.
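The post-processing step described above can be illustrated with a minimal sketch. It assumes that the switch presses have already been converted into (start, end) frame indices; this representation, the function name `split_signs` and the example values are hypothetical and are not taken from the SLARTI implementation.

```python
from typing import List, Sequence, Tuple


def split_signs(frames: Sequence, boundaries: Sequence[Tuple[int, int]]) -> List[list]:
    """Cut a continuous recording into one list of frames per sign.

    `boundaries` holds (start, end) frame indices, with `end` exclusive.
    This is an assumed encoding of the signer's switch presses.
    """
    segments = []
    for start, end in boundaries:
        if not (0 <= start < end <= len(frames)):
            raise ValueError(f"invalid segment ({start}, {end})")
        segments.append(list(frames[start:end]))
    return segments


# Illustrative recording of 100 frames containing a sequence of four signs.
frames = list(range(100))
boundaries = [(5, 25), (30, 48), (55, 70), (78, 95)]
signs = split_signs(frames, boundaries)
```

Frames lying between the marked segments (the transitions between signs) are simply discarded here, which matches the need for isolated motions when training the motion classifier.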
One other issue which had to be considered in gathering this data was the timing of data gathering in the final system. When the SLARTI system is applied to classifying actual signs in real time, the output of the various feature detection networks has to be calculated for each time frame of input data. This results in a slight delay before the CyberGlove and Polhemus can be polled again. Although the recurrent architecture used to perform motion recognition has some immunity to time-warped sequences, it was felt that the performance of the final system would be improved if these delays were also included in the data gathered for training the network. Therefore a forward pass through networks of the appropriate size was performed between each sampling of the sensors during this data gathering process. The final structure of the handshape, location and orientation networks had been determined at this stage. However, an estimate had to be made of the size of the hand motion network. The earlier experiments with the prototype network were used as an indication of the likely number of nodes required for the final network, and so a network with 3 inputs and 30 recurrent nodes was used.
Examples were gathered from the same groups of registered and unregistered signers used in the development of the handshape, orientation and location networks. For each motion four examples were gathered from each signer, yielding a training set and a test set for the registered signers, each containing 364 examples, and a test set of 156 examples for the unregistered signers.
10.2.2 Classification with a recurrent network
The same recurrent network architecture as used in the prototype motion recognition experiments was applied to the new hand motion data. As before the data was pre-processed so that the 3 inputs presented to the network were the difference between the current position and the previous position. Ten networks were trained from different starting weights, with results as summarised in Table 10.2. A step size of 0.05 was used and the networks were trained for a maximum of 50,000 pattern presentations.
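The differencing step can be sketched as follows. The function name `position_deltas` and the use of NumPy are illustrative assumptions; the only detail taken from the text is that each input frame is the current position minus the previous position.

```python
import numpy as np


def position_deltas(positions) -> np.ndarray:
    """Convert a (P, 3) sequence of x, y, z positions into (P - 1, 3) frame-to-frame differences."""
    positions = np.asarray(positions, dtype=float)
    if positions.ndim != 2 or positions.shape[1] != 3:
        raise ValueError("expected an array of shape (P, 3)")
    # Row i of the result is positions[i + 1] - positions[i].
    return np.diff(positions, axis=0)


# Illustrative four-frame sequence.
deltas = position_deltas([[0, 0, 0], [1, 0, 0], [1, 2, 0], [1, 2, 3]])
```

A sequence of P frames yields P - 1 difference vectors, because the first frame has no predecessor.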
Table 10.2 Summary of the classification rate of ten recurrent neural networks trained to distinguish between 13 different hand motions
            Training set    Reg. test set    Unreg. test set
Mean        89.7            78.6             63.4
Minimum     88.2            76.1             57.7
Maximum     91.8            81.9             68.6
A comparison of Tables 10.1 and 10.2 reveals that the recurrent networks did not perform as well on the new hand motion data as they had on the prototype data. This can be explained by the nature of the data gathering processes. The data gathered for the prototype system was from users concentrating only on producing hand motions, whereas in the final hand data the users were performing actual signs. It would therefore be expected that the motions were not performed as accurately in the second set of data gathering. In addition the prototype data was gathered from only three users, whilst the second set of data was derived from seven signers.
10.2.3 Classification with a non-recurrent network
The performance of the recurrent networks described in the previous section was below the level expected to be needed for the overall sign classification system to achieve a suitable level of accuracy. This failure was somewhat surprising, as a reasonably high level of classification accuracy could be obtained from visual inspection of the input sequences. It therefore seemed that the failure was due mainly to deficiencies in the power of the BPTT learning algorithm to find suitable weights for the recurrent network. It was decided to compare the performance of these recurrent networks against a non-recurrent architecture. However, given the length of the input sequences (on the order of 15 to 25 time frames), the tapped-delay line architectures described in Section 6.2.1 would contain an extremely large number of weights and hence require a very large amount of training data in order to ensure high levels of generalisation. Rather than engage in the lengthy process of gathering more data, the decision was made to process the existing data to extract suitable features from the input sequences and then train a standard spatial network on these features.
After some experimentation an input vector of 8 features was found to contain enough information to allow good rates of classification. The features were designed specifically to reflect the characteristics of the motions which were useful in visual classification of the data. Hence they measured characteristics such as the total amount of motion relative to each of the three axes, as this helped to separate circling and back-and-forth motions from each other. The final input vector is:
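\[
I = \left[\; \sum_{t=2}^{P} \Delta x_t,\;\; \sum_{t=2}^{P} \Delta y_t,\;\; \sum_{t=2}^{P} \Delta z_t,\;\; \sum_{t=2}^{P} \lvert \Delta x_t \rvert,\;\; \sum_{t=2}^{P} \lvert \Delta y_t \rvert,\;\; \sum_{t=2}^{P} \lvert \Delta z_t \rvert,\;\; \sum_{t=2}^{P} \lvert V_t - V_{t-1} \rvert,\;\; \frac{P}{25} \;\right]
\]
where $P$ is the length of the original data sequence, $x_t$, $y_t$, $z_t$ are the calibrated Polhemus values at time $t$, and
\[
\Delta x_t = x_t - x_{t-1}, \qquad \Delta y_t = y_t - y_{t-1}, \qquad \Delta z_t = z_t - z_{t-1}, \qquad V_t = \sqrt{\Delta x_t^2 + \Delta y_t^2 + \Delta z_t^2}
\]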
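As an illustration, the following sketch computes an 8-element feature vector of this kind from a sequence of calibrated positions. Several details are assumptions rather than confirmed parts of the original method: that the second group of per-axis sums uses absolute differences, that V_t is the Euclidean magnitude of the frame-to-frame displacement, and that the velocity-change sum starts from the first pair of frames for which both velocities are defined. The function name `motion_features` is hypothetical.

```python
import numpy as np


def motion_features(positions, length_scale: float = 25.0) -> np.ndarray:
    """Summarise a (P, 3) position sequence as 8 features (a sketch under stated assumptions)."""
    positions = np.asarray(positions, dtype=float)
    deltas = np.diff(positions, axis=0)             # (dx_t, dy_t, dz_t) for t = 2..P
    net = deltas.sum(axis=0)                        # net displacement along each axis
    total = np.abs(deltas).sum(axis=0)              # assumed: total motion along each axis
    speed = np.sqrt((deltas ** 2).sum(axis=1))      # assumed: V_t as displacement magnitude
    speed_change = np.abs(np.diff(speed)).sum()     # assumed: summed |V_t - V_(t-1)|
    length = len(positions) / length_scale          # sequence length P scaled by 25
    return np.concatenate([net, total, [speed_change, length]])


# A back-and-forth motion along x: out two steps, back two steps.
back_and_forth = [[0, 0, 0], [1, 0, 0], [2, 0, 0], [1, 0, 0], [0, 0, 0]]
features = motion_features(back_and_forth)
```

For this back-and-forth example the net displacement along x is zero while the total motion along x is four units, which is the kind of contrast between features that allows circling and back-and-forth motions to be separated from simple directional ones.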
Ten networks with an 8:8:13 architecture were trained using this pre-processing method for 750,000 pattern presentations at a learning rate of 0.05. The results of these networks are reported in Table 10.3.
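To make the 8:8:13 notation concrete, the sketch below builds a network with 8 inputs, 8 hidden nodes and 13 outputs and runs a single forward pass. The sigmoid activation, the bias terms, the weight initialisation range and all function names are assumptions made for illustration; the text specifies only the layer sizes.

```python
import numpy as np


def init_network(rng, sizes=(8, 8, 13)):
    """Create one (weights, bias) pair per layer; the initialisation range is an assumption."""
    return [(rng.uniform(-0.5, 0.5, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def forward(layers, features):
    """Propagate an 8-element feature vector through the network."""
    activation = np.asarray(features, dtype=float)
    for weights, bias in layers:
        activation = sigmoid(activation @ weights + bias)  # assumed sigmoid units
    return activation


rng = np.random.default_rng(0)
layers = init_network(rng)
outputs = forward(layers, np.zeros(8))
predicted_motion = int(np.argmax(outputs))  # index of the most active of the 13 outputs
```

Taking the most active output as the predicted class is one common convention; the decision rule actually used is not stated in this section.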
Table 10.3 Summary of the classification rate of ten non-recurrent neural networks trained to distinguish between 13 different hand motions
            Training set    Reg. test set    Unreg. test set
Mean        93.5            91.6             75.7
Minimum     92.9            90.4             74.4
Maximum     94.2            92.3             77.6
A comparison of Tables 10.2 and 10.3 shows that the combination of pre-processing and a non-recurrent network produces much better accuracy than the recurrent network, particularly with regard to generalisation to the test sets. In addition, the recurrent networks take approximately five times as long to train as the non-recurrent networks (see footnote 29). For these reasons a non-recurrent network was selected for incorporation into the final sign classification system discussed in the next chapter.
These results highlight a major limitation of current neural network methodologies, which is the difficulty in training recurrent networks to perform complex tasks. This is an area which should be a focus for future research because although recurrent architectures have several inherent benefits as described in Chapter 6, their use is currently restricted by the difficulties encountered in training. These results also indicate the benefits which can be realised by combining neural network techniques with problem-specific pre-processing of the input data.
29 In terms of pattern presentations (pps) the non-recurrent training is significantly longer, at 750,000 pps compared with only 50,000 pps. However, using connection crossings (ccs) as the measure of training time provides a more valid comparison, as it takes into account the difference in the size of the networks and the amount of calculation involved in the training process. Training a non-recurrent net takes around 140,000,000 ccs, as opposed to approximately 750,000,000 ccs for a recurrent network.
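The non-recurrent figure in footnote 29 can be checked with a short calculation. The sketch assumes that each weight, including one bias weight per node, counts as one connection crossing per pattern presentation; this counting convention and the function name `connection_count` are assumptions, not definitions taken from the text.

```python
def connection_count(sizes, biases=True):
    """Number of weights in a fully connected feed-forward network with the given layer sizes."""
    extra = 1 if biases else 0
    return sum((n_in + extra) * n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))


weights = connection_count((8, 8, 13))       # 9 * 8 + 9 * 13 = 189
crossings = weights * 750_000                # 141,750,000, close to the quoted 140,000,000
ratio = 750_000_000 / 140_000_000            # about 5.36, i.e. roughly five times
```

Under these assumptions the 8:8:13 network gives about 142 million connection crossings, consistent with the rounded figure in the footnote, and the ratio of the two quoted totals agrees with the statement that recurrent training takes approximately five times as long.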
11 Classification of signs
Chapters 9 and 10 of this thesis describe the creation of neural networks which classify input data in terms of the fundamental features of signs – handshape, orientation, location and motion. As depicted in Figure 8.1, the final stage of the SLARTI system consists of classifying the input signing sequence on the basis of these features. This chapter details the development of this final sign classifier, and reports the performance of the SLARTI system as a whole.
Section 11.1 deals with the creation of techniques for classifying signs on the basis of the feature vectors produced by the feature-extraction networks described in Chapters 9 and 10. Sections 11.1.1 to 11.1.3 provide the background and experimental details of this aspect of the research. Sections 11.1.4 to 11.1.6 discuss the nature of the sign classification algorithm, and trial several different techniques for performing this task. Sections 11.1.6 and 11.1.7 then describe possible extensions to this classification algorithm which reduce the number of misclassifications and move the system towards the domain of continuous signing.
Section 11.2 serves as a summary of the final state of the SLARTI system. It details the structure and performance of the final system, compares it to the previously developed systems examined in Chapter 4 and discusses possible applications of the system.