Computer-based processing and identification of human voices is known as speech recognition. It can be used to authenticate users in certain systems, as well as to give instructions to voice assistants such as Google Assistant, Siri, or Cortana.
Essentially, it works by storing a human voice and training an automatic speech recognition system to recognize vocabulary and speech patterns in that voice. In this article, we’ll look at a couple of papers aimed at solving this problem with machine and deep learning.
The authors of the first paper, which introduced Deep Speech 1, are from Baidu Research's Silicon Valley AI Lab. Deep Speech 1 doesn't require a phoneme dictionary; instead, it relies on a well-optimized RNN training system that employs multiple GPUs. The model achieves a 16% error rate on the Switchboard 2000 Hub5 dataset. GPUs are used because the model is trained on thousands of hours of data. The model has also been built to handle noisy environments effectively.
The major building block of Deep Speech is a recurrent neural network that has been trained to ingest speech spectrograms and generate English text transcriptions. The purpose of the RNN is to convert an input sequence into a sequence of character probabilities for the transcription.
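To make the "sequence of character probabilities" idea concrete, here is a minimal pure-Python sketch. It scores a single spectrogram frame against a hypothetical alphabet and normalizes the scores with a softmax; a single random linear layer stands in for the full RNN, and the alphabet (letters, space, apostrophe, plus a CTC blank symbol) is an assumption for illustration, not taken from the paper.

```python
import math
import random

# Hypothetical output alphabet: 26 letters, space, apostrophe, and a CTC blank.
ALPHABET = list("abcdefghijklmnopqrstuvwxyz '") + ["<blank>"]

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def char_probabilities(frame, weights):
    """Score one spectrogram frame against each character.
    A single linear layer stands in for the five-layer RNN."""
    logits = [sum(w * x for w, x in zip(row, frame)) for row in weights]
    return softmax(logits)

random.seed(0)
n_features = 8  # e.g. filterbank energies for one time slice (toy size)
weights = [[random.gauss(0, 0.1) for _ in range(n_features)]
           for _ in ALPHABET]
frame = [random.random() for _ in range(n_features)]

probs = char_probabilities(frame, weights)
print(len(probs))            # one probability per symbol in the alphabet
print(round(sum(probs), 6))  # softmax output sums to 1
```

In the real model this distribution is produced at every time step of the input, giving a matrix of per-frame character probabilities that the decoder turns into text.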
The RNN has five layers of hidden units, with the first three layers not being recurrent. At each time step, the non-recurrent layers operate on each frame of data independently. The fourth layer is a bi-directional recurrent layer with two sets of hidden units: one set has forward recurrence, while the other has backward recurrence. After prediction, Connectionist Temporal Classification (CTC) loss is computed to measure the prediction error. Training is done using Nesterov's accelerated gradient method.
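At decoding time, CTC turns the per-frame predictions into a transcription by merging consecutive repeated labels and then removing the blank symbol. The following sketch shows that collapse rule on a hand-written frame sequence (the underscore stands in for the CTC blank; the example string is illustrative, not from the paper).

```python
def ctc_collapse(frame_labels, blank="_"):
    """Collapse a frame-wise label sequence the way CTC defines it:
    first merge consecutive repeated labels, then drop blanks."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:  # keep only label changes, skip blanks
            out.append(lab)
        prev = lab
    return "".join(out)

# Greedy decoding: pick the most likely character at each frame,
# then collapse the resulting sequence into the final transcription.
print(ctc_collapse(list("hh_e_ll_lo")))  # -> "hello"
```

Note that the blank between the two "l" runs is what allows a genuine double letter to survive the repeat-merging step; without it, "ll" would collapse to a single "l".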