
Automatic Speech Segmentation Through Forward and Inverse Characteristics of The Vocal Tract (PhD Thesis)

Material type: Text
Language: English
Publication details: Karachi, NED University of Engineering and Technology, Department of Electrical Engineering, 2021
Description: XI, 129 p. : ill
DDC classification: 006.454378242 JAV
Summary:

Speech segmentation refers to the splitting of a continuous speech signal into syllable, word and phoneme segments. Time-aligned, segmented and labeled speech at the phoneme level is used to develop large corpora. A precisely time-aligned corpus at the phoneme level is significant for linguistic research and for building automatic speech recognition (ASR), speaker verification and speech synthesis systems. Manual segmentation is considered more accurate than automatic segmentation because humans are better at locating the boundaries of distinct events based on inherent temporal and spectral cues in the speech signal. For the development of a large corpus, however, it is very time-consuming, painstaking, laborious, costly and human-resource intensive. For this reason, the use of automatic means of phonetic time-alignment is currently inevitable. The accuracy of automatic phonetic alignment methods is based on the hypothesis that they can reach a level close to human performance by utilizing the evidence in the way human experts do.

This thesis presents an unsupervised, or implicit, method of automatic speech segmentation to identify the phoneme boundaries in utterances. A framework is designed that employs short-time spectral and temporal speech characteristics and is based on cosine distance scores (CDS). This framework informs the user about the performance of various speech processing techniques, thus simplifying the selection of the appropriate technique from among the many available options. Furthermore, a systematic step-by-step study of the phonetic spectral characteristics of the TIMIT dataset [1] is made. Based on the spectral characteristics of each phoneme, the phonemes are grouped into the 0-2 kHz, 0-4 kHz and 4-8 kHz frequency bands. To model the vocal tract dynamics in these three distinct frequency ranges, 'Selective Linear Predictive' (SLP) analysis has been employed. It helps in extracting the spectral information present in each of these three frequency ranges individually.

In this thesis, we use SLP-based forward and inverse characteristics of the vocal tract to develop the automatic segmentation technique at the phoneme level. Based on the developed framework, a novel feature is formed. The proposed feature combines the best scores of each frequency range and is named "Extended Forward and Inverse Characteristic of Vocal Tract using selective Linear predictive analysis (EFICV)".

The performance of EFICV is evaluated against the manually marked phoneme boundaries of the TIMIT dataset. The accuracy of the proposed EFICV system is found to be 61.13 %, 81.4 %, 85.82 % and 88.85 % in 10 msec, 20 msec, 30 msec and 105 msec respectively, in agreement with TIMIT boundaries. The error rate is found to be 17.96 %. The percent improvement in boundaries with respect to the state-of-the-art in the 5 msec, 10 msec, 15 msec, 20 msec, 25 msec and 30 msec accuracy ranges is 4.27 %, 14.04 %, 12.59 %, 9.6 % and 6.60 % respectively, whereas the error rate in 30 msec is reduced to 30.7 %. The results show that the developed EFICV technique outperforms the current state-of-the-art schemes in terms of both accuracy and error rate.
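As a rough illustration of the two ideas the abstract relies on (cosine distance scores between short-time feature vectors, and accuracy measured within a time tolerance of reference boundaries), the following minimal Python sketch may help. It is not the thesis's EFICV method: the feature vectors, the hop size, the threshold, the peak-picking rule and all function names are assumptions made only for this example.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat an all-zero frame as "no change"
    return 1.0 - dot / (norm_u * norm_v)


def candidate_boundaries(frames, hop_ms, threshold):
    """Return times (ms) where the adjacent-frame distance is a local peak
    at or above `threshold`. The peak rule is a hypothetical choice."""
    scores = [cosine_distance(frames[i], frames[i + 1])
              for i in range(len(frames) - 1)]
    times = []
    for i, score in enumerate(scores):
        left = scores[i - 1] if i > 0 else -1.0
        right = scores[i + 1] if i + 1 < len(scores) else -1.0
        if score >= threshold and score > left and score >= right:
            # Place the boundary at the start of frame i + 1.
            times.append((i + 1) * hop_ms)
    return times


def hit_rate(detected, reference, tolerance_ms):
    """Fraction of reference boundaries matched by a distinct detected
    boundary lying within `tolerance_ms` (greedy nearest match)."""
    if not reference:
        return 0.0
    used = set()
    hits = 0
    for ref in reference:
        best = None
        for j, det in enumerate(detected):
            if j in used or abs(det - ref) > tolerance_ms:
                continue
            if best is None or abs(det - ref) < abs(detected[best] - ref):
                best = j
        if best is not None:
            used.add(best)
            hits += 1
    return hits / len(reference)


# Toy two-dimensional "spectral" frames with one abrupt change (made-up data).
toy_frames = [[1, 0], [1, 0], [1, 0.1], [0, 1], [0, 1], [0.1, 1]]
detected = candidate_boundaries(toy_frames, hop_ms=10, threshold=0.5)
score = hit_rate(detected, reference=[25, 60], tolerance_ms=10)
```

In practice the frames would be short-time spectral or linear-prediction features computed from the speech signal, and the reference boundaries would come from manual annotations such as those in TIMIT; the toy values here only exercise the logic.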
Holdings
Item type Current library Shelving location Call number Status Date due Barcode
Reference Collection Reference Collection Government Document Section Govt Publication Section 006.454378242 JAV Available 97710
Reference Collection Reference Collection Government Document Section Govt Publication Section 006.454378242 JAV Available 97711
