In this experiment, the switching NSSM is used in a partially supervised mode, i.e. it is trained as explained in Section 6.3.3. This modification alone does not, however, lead to good results. The computational complexity of NSSM learning allows using only a couple of thousand data samples. With the preprocessing used in this work, this means that the NSSM can be trained with only a few dozen words. This training material is by no means enough to learn models for all the phonemes, as many of them appear only a few times in such a small training set.
To circumvent this problem, the training of the model was split into several phases. In the first phase, the complete model was trained with a data set of 23 words comprising some 5069 sample vectors. This should be enough for the NSSM part to find a reasonable representation for the data. The training consisted of 10000 sweeps. After this step, no NSSM parameters were adapted any more.
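The phased schedule can be sketched as follows. The dictionary-based "model" and counter-style updates are purely illustrative stand-ins for the actual NSSM and HMM learning rules, and the function name and flags are invented for this example:

```python
def train(model, sweeps, update_nssm=True, update_hmm=True):
    """One illustrative training phase: run `sweeps` iterations,
    updating only the selected parts of the switching NSSM."""
    for _ in range(sweeps):
        model["state_updates"] += 1      # continuous hidden states
        if update_nssm:
            model["nssm_updates"] += 1   # NSSM parameters
        if update_hmm:
            model["hmm_updates"] += 1    # HMM parameters
    return model

model = {"nssm_updates": 0, "hmm_updates": 0, "state_updates": 0}

# Phase 1: the complete model on the small 23-word set (5069 vectors).
train(model, sweeps=10000)

# Subsequent training (the second phase) proceeds with the NSSM
# parameters frozen: only the HMM and the states are adapted.
train(model, sweeps=10, update_nssm=False)
```

The point of the flags is simply that freezing a part of the model amounts to skipping its update step in each sweep, leaving its parameters at the values found in the first phase.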
The number of words in the training set of the first phase is small compared to the data set of the previous experiment. This is because significant segments of silence were included in the samples. It is important for the model to learn a good model for silence as well, since in real life speech is always embedded in silence or plain background noise.
In the second phase, the training data was replaced with a new set for continuing the training of the HMM part. The new data set consisted of 786 words comprising 100064 sample vectors. The continuous hidden states for the new data were initialised with an auxiliary MLP as presented in Section 6.2.4. Then the standard learning algorithm was used for 10 sweeps to adapt only the continuous states.
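As a rough illustration of the state initialisation, the snippet below pushes each observation through a small one-hidden-layer MLP to obtain an initial continuous state. The dimensions, tanh activation, and random weights are invented for the example and do not come from Section 6.2.4, where the auxiliary network and its training are actually defined:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_init(observations, W1, b1, W2, b2):
    """Map each observation vector to an initial continuous hidden
    state with a one-hidden-layer tanh MLP (a stand-in for the
    auxiliary network, whose weights are assumed to be given)."""
    h = np.tanh(observations @ W1 + b1)
    return h @ W2 + b2

# Toy dimensions: 100 observations of 21-dim features, 5-dim states.
obs = rng.standard_normal((100, 21))
W1 = rng.standard_normal((21, 30)) * 0.1
b1 = np.zeros(30)
W2 = rng.standard_normal((30, 5)) * 0.1
b2 = np.zeros(5)

s_init = mlp_init(obs, W1, b1, W2, b2)
```

This gives one initial state vector per sample vector, which the subsequent sweeps of the standard learning algorithm then refine.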
Updating just the HMM part and leaving the NSSM out saved a significant amount of time. As Table 7.1 shows, HMM training is about two orders of magnitude faster than NSSM training. A single sweep over just the HMM part takes about 20 % of the time needed for the complete switching NSSM; the rest of the improvement comes from the HMM training converging in far fewer iterations. These savings allowed using a much larger data set to train the HMM part efficiently.
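The two savings combine multiplicatively. As a back-of-the-envelope check (the 20 % per-sweep figure is from the text; the sweep-count ratio is an assumed round number chosen only to show how the overall two-orders-of-magnitude figure can arise):

```python
# Illustrative arithmetic only; the real figures are in Table 7.1.
per_sweep_ratio = 0.20      # HMM-only sweep vs. full switching NSSM sweep
sweep_count_ratio = 1 / 20  # assumed reduction in the number of sweeps
overall = per_sweep_ratio * sweep_count_ratio  # total cost ratio
```

A cost ratio of 0.01 corresponds to a roughly hundredfold speed-up, i.e. about two orders of magnitude.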