Speech Recognition and Speaker Recognition - Coursework Example

Speech recognition and speaker identification are important for verification and authentication in security applications; however, they are hard to achieve (Kamruzzaman et al.). Speaker verification is treated as a subclass of automatic speaker recognition (ASR) and can be used to establish a person's identity. As a result, speaker verification is an accept-reject (true-false) problem (Chen & Luo, 2009). Background noise significantly reduces the overall effectiveness of a voice recognition system and remains a major challenge for speaker recognition. Advances in communication and security technologies have made the robustness of embedded biometric systems increasingly important (Khan, Farhan & Ali, 2011). This chapter reviews the methods of speaker recognition used in this field and summarizes the main strengths and weaknesses of each technique.

2.1 Introduction

Speaker identification has been an active research topic for many years and has many potential applications where the propriety of information is a concern. Speaker identification is the process of automatically recognizing a speaker by machine from the speaker's voice. The most common applications of speaker identification systems are in access control, for example access to privileged information over the telephone or entry to a room. It is also valuable for speaker adaptation in automatic speech recognition systems.

Speaker recognition can be classified into two categories: verification and identification. Speaker identification is the process of determining which registered speaker produced a given utterance. Speaker verification is the process of accepting or rejecting the identity claim of a speaker.
Methods of speaker recognition can also be divided into text-dependent and text-independent methods (Kamruzzaman et al.).

2.2 Description of Technique

2.2.1 MFCC

Mel-frequency cepstral coefficients (MFCCs) are perhaps the most popular and best-known representation of the speech signal for speaker recognition. MFCCs are based on the known variation of the human ear's critical bandwidths with frequency.

2.2.2 The MFCC processor

Figure 2.1.1: Block diagram of the MFCC processor (Kamruzzaman et al.).

Figure 2.1.1 shows the structure of the MFCC processor. The speech input is recorded at a sampling rate of 22050 Hz. This sampling frequency is chosen to reduce the effects of aliasing in the analog-to-digital conversion (Kamruzzaman et al.).

2.2.3 Mel-frequency wrapping

A speech signal usually contains tones at different frequencies. For each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on the mel scale (Kamruzzaman et al.). The mel is a unit of pitch as perceived by the human ear. The mel scale has approximately linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz (Nijhawan & Soni, 2014). The following formula gives the mels for a given frequency f in Hz (Kamruzzaman et al.):

$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (1)$$

2.2.4 Cepstrum

In the final step, the log mel spectrum is converted back to the time domain; the result is the set of mel-frequency cepstral coefficients (MFCCs). Because the mel spectrum coefficients are real numbers, they can be converted to the time domain using the discrete cosine transform (DCT) (Kamruzzaman et al.). The following equation is used to calculate the MFCCs:

$$\tilde{c}_n = \sum_{k=1}^{K} \left(\log \tilde{S}_k\right) \cos\left[n\left(k - \frac{1}{2}\right)\frac{\pi}{K}\right] \qquad (2)$$

where $\tilde{S}_k$, k = 1, 2, …, K, are the mel spectrum coefficients and n = 1, 2, …, K. The number of mel cepstrum coefficients, K, is typically chosen as 20.
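The mel conversion and the DCT step described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the cited authors' implementation. The function names are hypothetical, and the DCT is written in a commonly used form, in which coefficient n is the sum over k of log S_k times cos(n(k - 1/2)π/K); that form is an assumption of this sketch.

```python
import math


def hz_to_mel(f_hz):
    """Mel value for a frequency in Hz: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel, useful for placing filter-bank centres."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_cepstrum(log_mel_spectrum):
    """DCT of a log mel spectrum (hypothetical helper).

    Returns the coefficients for n = 1..K, where K is the number of
    log mel spectrum values. The n = 0 term is not computed.
    """
    K = len(log_mel_spectrum)
    return [
        sum(
            s * math.cos(n * (k - 0.5) * math.pi / K)
            for k, s in enumerate(log_mel_spectrum, start=1)
        )
        for n in range(1, K + 1)
    ]
```

For example, `hz_to_mel(1000.0)` evaluates to roughly 1000 mels, which is consistent with the scale being approximately linear up to 1000 Hz. A perfectly flat log mel spectrum yields cepstral coefficients that are all close to zero, because only the omitted n = 0 term responds to a constant input.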
The first component, $\tilde{c}_0$, is excluded from the DCT because it represents the mean value of the input signal, which carries little speaker-specific information (Kamruzzaman et al.). One way to simulate the subjective spectrum is to use a filter bank spaced uniformly on the mel scale, as shown in Figure 2.1.2 (Khan, Farhan & Ali, 2011).

Figure 2.1.2: An example of a mel-spaced filter bank (Khan, Farhan & Ali, 2011).

LINEAR PREDICTIVE CODING (LPC)

LPC is one of the most powerful speech analysis techniques and a useful method for encoding good-quality speech at a low bit rate. The basic idea behind linear predictive analysis is that a given speech sample at the current time can be approximated as a linear combination of past speech samples (Shrawankar). The LP model is based on human speech production. It uses a conventional source-filter model, in which the glottal, vocal tract, and lip radiation transfer functions are combined into one all-pole filter that simulates the acoustics of the vocal tract. The principle behind LPC is to minimize the sum of the squared differences between the original speech signal and the estimated speech signal over a finite duration (Shrawankar).

POWER SPECTRAL ANALYSIS (FFT)

One of the most common techniques for studying a speech signal is the power spectrum. The power spectrum of a speech signal describes the frequency content of the signal over time. The first step in computing the power spectrum is to perform a discrete Fourier transform (DFT), which computes the frequency information of the equivalent time-domain signal. Because a speech signal contains only real values, a real-valued fast Fourier transform (FFT) can be used for increased efficiency. The resulting output contains both the magnitude and the phase information of the original time-domain signal (Shrawankar).
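The two analyses above can be illustrated with a short sketch that uses only the Python standard library. The function names `lpc_coefficients` and `power_spectrum` are hypothetical and do not come from the cited sources. The LPC part assumes the autocorrelation method with the Levinson-Durbin recursion, which is one common way to minimize the squared prediction error. The spectrum part uses a direct DFT instead of an FFT to keep the code short; an FFT would return the same values faster.

```python
import cmath
import math


def lpc_coefficients(frame, order):
    """Predictor coefficients a[1..order] such that
    x[t] is approximated by sum_j a[j] * x[t - j].

    Autocorrelation method solved with the Levinson-Durbin recursion.
    """
    n = len(frame)
    # Autocorrelation of the frame for lags 0..order.
    r = [
        sum(frame[t] * frame[t - lag] for t in range(lag, n))
        for lag in range(order + 1)
    ]
    a = [0.0] * (order + 1)
    err = r[0]  # prediction error energy so far
    if err == 0.0:
        return a[1:]  # silent frame: nothing to predict
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for this order
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
        if err <= 0.0:
            break  # signal is already predicted exactly
    return a[1:]


def power_spectrum(frame):
    """Squared DFT magnitude for bins 0..N//2 of a real-valued frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        x_k = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(frame)
        )
        spectrum.append(abs(x_k) ** 2)
    return spectrum
```

Only bins 0 to N/2 are returned because the spectrum of a real-valued signal is symmetric, which is the same property that a real-valued FFT exploits. Taking the squared magnitude discards the phase, so the phase would have to be kept separately if it is needed.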
Support vector machine

The support vector machine (SVM) was developed by Vapnik (1998). It is among the most important advances in pattern recognition of the last decade. Other techniques used in feature matching, such as Gaussian mixture models (GMM) and hidden Markov models (HMM), are prone to overfitting and do not directly optimize discrimination (Chen & Luo, 2009).

2.1.1.1 Three alternative methods for training SVMs

Chunking: The chunking algorithm uses the fact that the value of the quadratic form is unchanged if the rows and columns of the matrix that correspond to zero Lagrange multipliers are removed. The large quadratic optimization problem can therefore be broken down into a series of smaller quadratic problems, whose ultimate goal is to identify all of the non-zero Lagrange multipliers and discard all of the zero Lagrange multipliers (Kamruzzaman et al.).

Osuna’s algorithm: In 1997, Osuna proved a theorem that suggests a whole new set of quadratic optimization algorithms for SVMs. The theorem shows that the large quadratic optimization problem can be broken down into a series of smaller quadratic sub-problems (Kamruzzaman et al.).

SMO: Sequential minimal optimization (SMO) is a simple algorithm that can quickly solve the SVM problem without any extra matrix storage and without using numerical quadratic optimization steps at all. SMO decomposes the overall quadratic problem into quadratic sub-problems, using Osuna’s theorem to ensure convergence (Kamruzzaman et al.).

Three steps are illustrated for each method in Figure 2.1.3. The thin horizontal line at each step represents the training set, while the thick boxes represent the Lagrange multipliers being optimized at that step. For chunking, a fixed number of examples are added at each step, while the zero Lagrange multipliers are discarded at each step. The number of examples trained at each step therefore tends to grow.
For Osuna’s algorithm, a fixed number of examples are optimized at each step: the same number of examples is added to and removed from the problem at each step. For SMO, only two examples are optimized analytically at each step, so each step is very fast (Kamruzzaman et al.).

Figure 2.1.3: Three alternative methods for training SVMs: chunking, Osuna’s algorithm, and SMO (Kamruzzaman et al.).

LDA: Linear discriminant analysis is a statistical technique that reduces the dimension of the features while maximizing the information preserved in the reduced feature space. Applying LDA after MFCC drastically reduces the dimension of the features, since LDA finds an optimal transformation matrix that preserves most of the information and can be used to discriminate between the different classes (Khan, Farhan & Ali, 2011).

Vector Quantization (VQ): VQ is the process of mapping vectors from a large vector space to a finite number of regions in that space (Manjunath & PB, 2012). The purpose of VQ is to compress the data and select the more effective features rather than using the whole set of feature vectors. The speaker models are formed by clustering each speaker's feature vectors into a known number of clusters. Each cluster is represented by its centroid, called a code vector, and the code vectors together constitute a codebook. Each input feature vector is then compared with all of the codebooks, and the codebook that gives the minimum distance is selected as the best match (Nijhawan & Soni, 2014).

2.3 Comparative Analysis

| Method | Strengths | Weaknesses |
|---|---|---|
| SMO | Has a faster training time than the other methods. SMO outperforms chunking and Osuna’s algorithm in computation time (Kamruzzaman et al.). | - |
| Chunking | Can solve a quadratic optimization problem. | Cannot solve large-scale training problems, since even the reduced matrix cannot fit in memory. |
| Osuna’s algorithm | Suggests keeping a constant-size matrix for every quadratic sub-problem, which involves adding and deleting the same number of examples at every step. | It is inefficient. |
| Mel frequency cepstral coefficients (MFCC) | The best-known and most widely used feature representation for both speaker recognition and speech recognition. Cost-effective and robust computation. | - |
| Linear predictive coding (LPC) | Minimizes the sum of the squared differences between the original speech signal and the estimated speech signal over a finite duration (Shrawankar). | - |
| Power spectral analysis (FFT) | The FFT-based approach is valued for its linearity in the frequency domain and its speed of computation. The FFT does not distort or discard information in any anticipatory way (Shrawankar). | - |
| SVM | Highly accurate. Directly optimizes discrimination. Easier to train and to scale to complex high-dimensional data than neural networks (Khan, Farhan & Ali, 2011). | - |
| LDA | Reduces the dimension of the features while maximizing the information preserved in the reduced feature space. Finds an optimal transformation matrix that preserves most of the information and can be used to distinguish between the different classes (Khan, Farhan & Ali, 2011). | - |
| Neural networks | Models are noticeably trained. | Not generalizable. |
| Hidden Markov models | HMM systems consistently deliver excellent performance. | - |
| VQ | Highly accurate and easy to implement (Manjunath & PB, 2012). | - |

Table 2.1.1: Comparisons between the techniques.

2.4 Conclusions

This paper describes several methods used in speaker recognition and speech recognition systems to enhance security and deliver good results. A comparative analysis of the techniques has been presented.

2.5 References

Chen, S., & Luo, Y. (2009). Speaker Verification Using MFCC and Support Vector Machine. Proceedings of the International MultiConference of Engineers and Computer Scientists. Retrieved from http://www.iaeng.org/publication/IMECS2009/IMECS2009_pp532-535.pdf

Kamruzzaman, S. M., Rezaul Karim, A. N. M., Saiful Islam, Md., & Emdadul Haque, Md. Speaker Identification using MFCC-Domain Support Vector Machine. Retrieved from http://arxiv.org/ftp/arxiv/papers/1009/1009.4972.pdf

Khan, A., Farhan, M., & Ali, A. (2011). Speech Recognition: Increasing Efficiency of Support Vector Machines. International Journal of Computer Applications, 35. Retrieved from http://arxiv.org/ftp/arxiv/papers/1204/1204.4257.pdf

Manjunath, N., & PB, M. (2012). Isolated Word Speech Recognition Using Vector Quantization (VQ). International Journal of Advanced Research in Computer Science and Software Engineering, 2(5). Retrieved from http://www.ijarcsse.com/docs/papers/May2012/Volum2_issue5/V2I500451.pdf

Nijhawan, G., & Soni, M. K. (2014). Speaker Recognition using Support Vector Machine. International Journal of Computer Applications, 87(2). Retrieved from http://research.ijcaonline.org/volume87/number2/pxc3893379.pdf

Shrawankar, U. Techniques for Feature Extraction in Speech Recognition System: A Comparative Study. Retrieved from http://arxiv.org/ftp/arxiv/papers/1305/1305.1145.pdf
