Combination of Multiple Acoustic Models with Multi-scale Features for Myanmar Speech Recognition
Keywords: acoustic modelling, deep convolutional neural networks, multi-scale features, Myanmar speech recognition, ROVER combination.
We propose an approach to building a robust automatic speech recognizer using deep convolutional neural networks (CNNs). Deep CNNs have achieved great success in acoustic modelling for automatic speech recognition because of their ability to reduce spectral variations and model spectral correlations in the input features. In most CNN-based acoustic modelling, a fixed-size windowed feature patch corresponding to a target label (e.g., a senone or phone) is used as input to the CNN. Because different target labels may correspond to different time scales, we trained multiple acoustic models on acoustic features at different temporal scales. Since auxiliary information learned from different temporal scales can help classification, the multiple CNN acoustic models were combined using the Recognizer Output Voting Error Reduction (ROVER) algorithm for the final speech recognition experiments. The experiments were conducted on a Myanmar large vocabulary continuous speech recognition (LVCSR) task. Our results show that integrating temporal multi-scale features in model training achieved a 4.32% relative word error rate (WER) reduction over the best individual system trained on a single temporal scale.
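The two ideas in the abstract, temporal multi-scale input windows and voting across system outputs, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the function names, the context widths, and the feature dimension do not come from the paper. The voting step is also a simplification, since it assumes the hypotheses are already aligned slot by slot, whereas ROVER first builds that alignment by dynamic programming.

```python
from collections import Counter

import numpy as np


def splice_frames(features, context):
    """Stack each frame with `context` neighbours on each side.

    `features` has shape (num_frames, feat_dim); edge frames are repeated
    as padding. A larger `context` gives a longer temporal scale.
    """
    num_frames, _ = features.shape
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    windows = [padded[i:i + num_frames] for i in range(2 * context + 1)]
    return np.concatenate(windows, axis=1)


def vote_aligned(hypotheses):
    """Majority vote per slot over pre-aligned hypotheses (assumed input).

    Each hypothesis is a list of words of equal length; an empty string
    marks a slot where that system produced no word. Ties go to the
    system listed first.
    """
    voted = []
    for slot in zip(*hypotheses):
        word, _ = Counter(slot).most_common(1)[0]
        if word != "":
            voted.append(word)
    return voted


def relative_wer_reduction(baseline_wer, combined_wer):
    """Relative WER reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_wer - combined_wer) / baseline_wer


# Hypothetical example: 4 frames of 3-dimensional features at two scales.
frames = np.arange(12, dtype=float).reshape(4, 3)
short_scale = splice_frames(frames, context=2)   # 5-frame window
long_scale = splice_frames(frames, context=4)    # 9-frame window

# Hypothetical outputs of three systems, already aligned.
combined = vote_aligned([
    ["a", "b", "c"],
    ["a", "x", "c"],
    ["a", "b", ""],
])
```

In this sketch each scale would feed a separately trained acoustic model, and the voting function stands in for the combination stage applied to the resulting recognition outputs.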