Automatic speech signal segmentation based on the innovation adaptive filter
Speech segmentation is an essential stage in designing automatic speech recognition systems, and several algorithms have been proposed in the literature. It is a difficult problem, as speech is immensely variable. The aim of the authors' studies was to design an algorithm that could be employed at the stage of automatic speech recognition. This would make it possible to avoid some problems related to speech signal parametrization. Posing the problem in such a way requires the algorithm to be capable of working in real time. The only such algorithm was proposed by Tyagi et al. (2006), and it is a modified version of Brandt's algorithm. The article presents a new algorithm for unsupervised automatic speech signal segmentation. It performs segmentation without access to information about the phonetic content of the utterances, relying exclusively on second-order statistics of a speech signal. The starting point for the proposed method is the time-varying Schur coefficients of an innovation adaptive filter. The Schur algorithm is known to be fast, precise, stable and capable of rapidly tracking changes in second-order signal statistics. A transition from one phoneme to another in the speech signal always indicates a change in signal statistics caused by vocal tract changes. In order to allow for the properties of human hearing, detection of inter-phoneme boundaries is performed based on statistics defined on the mel spectrum determined from the reflection coefficients. The paper presents the structure of the algorithm, defines its properties, lists parameter values, describes detection efficiency results, and compares them with those for another algorithm. The obtained segmentation results are satisfactory.
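To make the role of the reflection (Schur) coefficients more concrete, the following is a minimal sketch of the classical block-wise Schur recursion, which computes reflection coefficients from the autocorrelation sequence of a single frame. It is not the time-varying innovation adaptive filter used in the paper, and it omits the mel-spectrum statistics entirely. The function names `autocorr` and `schur_reflection`, and the sign convention (forward error e(n) = x(n) + k x(n-1) at first order), are assumptions made for illustration only.

```python
def autocorr(frame, order):
    """Biased (unnormalized) autocorrelation r[0..order] of one signal frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(order + 1)]


def schur_reflection(r, order):
    """Reflection coefficients k_1..k_order from autocorrelation r[0..order].

    Block-wise Schur (Le Roux-Gueguen) recursion; an illustrative
    simplification, not the adaptive filter described in the abstract.
    """
    u = list(r[:order + 1])  # cross-correlations of the forward error with x
    v = list(r[:order + 1])  # cross-correlations of the backward error with x
    ks = []
    for m in range(1, order + 1):
        k = -u[m] / v[m - 1]
        ks.append(k)
        # Iterate downwards so that v[i - 1] still holds its previous-stage value.
        for i in range(order, m - 1, -1):
            u_new = u[i] + k * v[i - 1]
            v_new = v[i - 1] + k * u[i]
            u[i], v[i] = u_new, v_new
    return ks


if __name__ == "__main__":
    # Autocorrelation of a first-order autoregressive process with pole 0.5.
    print(schur_reflection([1.0, 0.5, 0.25], 2))
```

For a first-order autoregressive process, only the first coefficient is non-zero, so a change in the pole position between two frames shows up directly as a change in the coefficient vector. A segmentation scheme of the kind described above would track such changes over time rather than frame by frame.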
- Almpanidis, G. and Kotropoulos, C. (2007). Phonetic segmentation using the generalized Gamma distribution and small sample Bayesian information criterion, Speech Communication 50(1): 38-55.
- Almpanidis, G., Kotti, M. and Kotropoulos, C. (2009). Robust detection of phone boundaries using model selection criteria with few observations, IEEE Transactions on Audio, Speech, and Language Processing 17(2): 287-298.
- Barkat, M. (1991). Signal Detection and Estimation, Artech House, Boston, MA.
- Brandt, A.V. (1983). Detecting and estimating the parameters jumps using ladder algorithms and likelihood ratio test, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Boston, MA, USA, pp. 1017-1020.
- Brugnara, F., Falavigna, D. and Omologo, M. (1993). Automatic segmentation and labeling of speech based on hidden Markov models, Speech Communication 12(4): 357-370.
- Delacourt, P. and Wellekens, C.J. (2000). DISTBIC: A speaker-based segmentation for audio data indexing, Speech Communication 32(1-2): 111-126.
- Gomez, J.A. and Calvo, M. (2011). Improvements on automatic speech segmentation at the phonetic level, in C. San Martin and S.-W. Kim (Eds.), CIARP 2011, Lecture Notes in Computer Science, Vol. 7042, Springer-Verlag, Berlin/Heidelberg, pp. 557-564.
- Haykin, S. (1996). Adaptive Filter Theory, Prentice-Hall, Englewood Cliffs, NJ.
- Jamouli, H., Al Hail, M.A. and Sauter, D. (2012). A mixed active and passive GLR test for a fault tolerant control system, International Journal of Applied Mathematics and Computer Science 22(1): 9-23, DOI: 10.2478/v10006-012-0001-1.
- Kay, S.M. (1988). Modern Spectral Estimation, Prentice-Hall, Englewood Cliffs, NJ.
- Kay, S.M. (1998). Fundamentals of Statistical Signal Processing, Vol. II: Detection Theory, Prentice-Hall, Englewood Cliffs, NJ.
- Kroon, P. and Deprettere, E.F. (1988). A class of analysis-by-synthesis predictive coders for high quality speech coding at rates between 4.8 and 16 kbits/s, IEEE Journal on Selected Areas in Communications 6(2): 353-363.
- Lee, D.T.L., Morf, M. and Friedlander, B. (1981). Recursive least squares ladder estimation algorithms, IEEE Transactions on Circuits and Systems 28(6): 627-641.
- Lopatka, M., Adam, O., Laplanche, C., Zarzycki, J. and Motsch, J-F. (2005). Effective analysis of non-stationary short-time signals based on the adaptive Schur filter, IEEE/SP 13th Workshop on Statistical Signal Processing, Bordeaux, France, pp. 251-256.
- Lopatka, M., Adam, O., Laplanche, C., Motsch, J-F. and Zarzycki, J. (2006). Sperm whale click analysis using a recursive time-variant lattice filter, Applied Acoustics 67(11-12): 1118-1133.
- Makowski, R. and Zimroz, R. (2013). A procedure for weighted summation of the derivatives of reflection coefficients in adaptive Schur filter with application to fault detection in rolling element bearings, Mechanical Systems and Signal Processing 38(1): 65-77.
- Mporas, I., Ganchev, T. and Fakotakis, N. (2008). Phonetic segmentation using multiple speech features, International Journal of Speech Technology 11(1): 73-85.
- Park, S.S. and Kim, N.S. (2007). On using multiple models for automatic speech segmentation, IEEE Transactions on Audio, Speech, and Language Processing 15(8): 2202-2212.
- Prasad, V.K., Nagarajan, T. and Murthy, H.A. (2004). Automatic segmentation of continuous speech using minimum phase delay functions, Speech Communication 42(3-4): 429-446.
- Puig, V. (2010). Fault diagnosis and fault tolerant control using set-membership approaches: Application to real case studies, International Journal of Applied Mathematics and Computer Science 20(4): 619-635, DOI: 10.2478/v10006-010-0046-y.
- Rabiner, L. and Gold, B. (1975). Theory and Application of Digital Signal Processing, Prentice-Hall, Englewood Cliffs, NJ.
- Rabiner, L. and Juang, B-H. (1993). Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ.
- Rudoy, D., Quatieri, T.F. and Wolfe, P.J. (2011). Time-varying autoregressions in speech: Detection theory and applications, IEEE Transactions on Audio, Speech, and Language Processing 19(4): 977-989.
- Scharenborg, O., Wan, V. and Ernestus, M. (2010). Unsupervised speech segmentation: An analysis of the hypothesized phone boundaries, Journal of the Acoustical Society of America 127(2): 1084-1095.
- Schwarz, P., Matejka, P. and Cernocky, J. (2006). Hierarchical structures of neural networks for phoneme recognition, IEEE International Conference on Acoustics, Speech, and Signal Processing, Toulouse, France, Vol. 1, pp. 325-328.
- Sharma, M. and Mammone, R. (1996). Blind speech segmentation: Automatic segmentation of speech without linguistic knowledge, Proceedings of the International Conference on Spoken Language Processing, Philadelphia, PA, USA, pp. 1237-1240.
- Toledano, D.T., Hernandez Gomez, L.A. and Villarrubia Grande, L. (2003). Automatic phonetic segmentation, IEEE Transactions on Speech and Audio Processing 11(6): 617-625.
- Tyagi, V., Bourlard, H. and Wellekens, C. (2006). On variable-scale piecewise stationary analysis of speech signals for ASR, Speech Communication 48(9): 1182-1191.