A 2019 Guide for Automatic Speech Recognition, Hacker News

Derrick Mwiti

Computer-based processing and identification of human voices is known as speech recognition. It can be used to authenticate users in certain systems, as well as provide instructions to smart devices like the Google Assistant, Siri or Cortana.

Essentially, it works by storing a human voice and training an automatic speech recognition system to recognize vocabulary and speech patterns in that voice. In this article, we’ll look at a couple of papers aimed at solving this problem with machine and deep learning.

The authors of this paper are from Baidu Research’s Silicon Valley AI Lab. Deep Speech 1 doesn’t require a phoneme dictionary; instead, it relies on a well-optimized RNN training system that employs multiple GPUs. The model achieves a 16% error rate on the Switchboard 2000 Hub5 dataset. GPUs are used because the model is trained on thousands of hours of data. The model has also been built to handle noisy environments effectively.

The major building block of Deep Speech is a recurrent neural network that has been trained to ingest speech spectrograms and generate English text transcriptions. The purpose of the RNN is to convert an input sequence into a sequence of character probabilities for the transcription.

The RNN has five layers of hidden units, with the first three layers not being recurrent. At each time step, the non-recurrent layers work on independent data. The fourth layer is a bi-directional recurrent layer with two sets of hidden units. One set has forward recurrence while the other has backward recurrence. After prediction, Connectionist Temporal Classification (CTC) loss is computed to measure the prediction error. Training is done using Nesterov’s Accelerated Gradient method.
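To make the link between per-frame character probabilities and a transcription concrete, here’s a minimal sketch of greedy (best-path) CTC decoding. It’s an illustration under assumptions rather than the authors’ decoder: the `_` blank symbol, the tiny alphabet, and the probabilities are all made up for the example.

```python
from itertools import groupby

BLANK = "_"  # hypothetical symbol standing in for the CTC blank label


def ctc_greedy_decode(frame_probs, alphabet):
    """Pick the most likely symbol per frame, merge repeats, then drop blanks."""
    best_path = [
        alphabet[max(range(len(probs)), key=probs.__getitem__)]
        for probs in frame_probs
    ]
    collapsed = [symbol for symbol, _ in groupby(best_path)]
    return "".join(symbol for symbol in collapsed if symbol != BLANK)


alphabet = [BLANK, "a", "c", "t"]
frame_probs = [
    [0.10, 0.10, 0.70, 0.10],  # c
    [0.10, 0.10, 0.70, 0.10],  # c again (merged with the previous frame)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.70, 0.10, 0.10],  # a
    [0.10, 0.10, 0.10, 0.70],  # t
    [0.60, 0.10, 0.10, 0.20],  # blank
]
print(ctc_greedy_decode(frame_probs, alphabet))  # prints "cat"
```

The blank label is what lets a CTC model emit genuine double letters: a path such as `a _ a` decodes to "aa", while `a a` collapses to a single "a".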


In order to reduce variance during training, the authors add a dropout of between 5% and 10% in the feedforward layers. However, this isn’t applied to the recurrent hidden activations. They also integrate an N-gram language model in their system because N-gram models are easily trained from huge unlabeled text corpora. The figure below shows an example of transcriptions from the RNN.
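As a rough idea of why an N-gram model is cheap to train from raw text, here’s a small bigram model with add-one smoothing. This is a sketch under assumptions: the toy corpus, the whitespace tokenization, and the smoothing choice are ours, not details from the paper.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed log probability of a sentence under the bigram model."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total


# Toy corpus, made up for the example.
unigrams, bigrams = train_bigram(["the cat sat", "the cat ran", "a dog sat"])
for candidate in ["the cat sat", "sat cat the"]:
    print(candidate, round(log_prob(candidate, unigrams, bigrams), 3))
```

A score like this can be combined with the acoustic model’s score so that, between two transcriptions that sound alike, the one with the more plausible word order wins.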


Here’s how this model performs in comparison to other models:


In the second iteration of Deep Speech, the authors use an end-to-end deep learning method to recognize Mandarin Chinese and English speech. The proposed model is able to handle different languages and accents, as well as noisy environments. The authors use high-performance computing (HPC) techniques to achieve a 7x speedup over their previous model. In their data center, they implement Batch Dispatch with GPUs.

The English speech system is trained on 11,940 hours of speech, while the Mandarin system is trained on 9,400 hours. During training, the authors use data synthesis to augment the data.

The architecture used in this model has up to 11 layers made up of bidirectional recurrent layers and convolutional layers. The model uses nearly 8x as much computation per data example as Deep Speech 1. The authors use Batch Normalization for optimization.

For the activation function, they use the clipped rectified linear (ReLU) function. At its core, this architecture is similar to Deep Speech 1. The architecture is a recurrent neural network trained to ingest speech spectrograms and output text transcriptions. The model is trained using the CTC loss function.
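A clipped ReLU is simply a ReLU with a ceiling. Here’s a one-function sketch; the ceiling of 20 is an assumption on our part (it isn’t stated in this article), so treat it as a placeholder.

```python
def clipped_relu(x, ceiling=20.0):
    """ReLU with an upper bound: min(max(0, x), ceiling).

    The default ceiling of 20 is an assumed value for illustration.
    """
    return min(max(0.0, x), ceiling)


for value in (-3.0, 5.0, 50.0):
    print(value, "->", clipped_relu(value))
```

Bounding the activations keeps a few very large values from dominating, which helps keep training stable in deep recurrent stacks.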


Below is a comparison of the Word Error Rate for various arrangements of convolutional layers.


The comparison of the Word Error Rate of Deep Speech 1 and Deep Speech 2 is shown below. Deep Speech 2 has a much lower Word Error Rate.
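Since Word Error Rate comes up in every paper in this guide, here’s a minimal sketch of how it’s usually computed: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference. The example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.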


The authors benchmark the system on two test datasets from the Wall Street Journal corpus of news articles. The model outperforms humans in Word Error Rate in three out of four cases. The LibriSpeech corpus is also used.

The authors of this paper are from Stanford University. In this paper, they present a technique that performs first-pass large vocabulary speech recognition using a language model and a neural network.

The neural network is trained using the connectionist temporal classification (CTC) loss function. CTC enabled the authors to train a neural network that predicts the character sequence of test utterances on the Wall Street Journal LVCSR corpus with a character error rate (CER) below 10%.
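To give a feel for what the CTC loss actually measures, here’s a small sketch of the CTC forward algorithm: it sums the probability of every frame-level path that collapses to the target labels and returns the negative log of that sum. It’s an illustration under assumptions, not the authors’ code. It works in plain probability space for readability (practical implementations work in log space), and the blank index of 0 is our choice.

```python
import math


def ctc_neg_log_likelihood(probs, labels, blank=0):
    """CTC loss for one utterance.

    probs[t][k] is the probability of symbol k at frame t;
    labels is the target sequence as a list of symbol indices.
    """
    # Interleave blanks around the labels: [blank, l1, blank, l2, ..., blank].
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    size = len(ext)

    # alpha[s] = total probability of all path prefixes ending in ext[s].
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping the blank between two labels is allowed only if they differ.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood)


# Two frames, alphabet {blank, "a"}, uniform predictions, target "a".
# The valid paths are "aa", "a_" and "_a", each with probability 0.25.
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(ctc_neg_log_likelihood(uniform, labels=[1]))  # equals -log(0.75)
```

Minimizing this quantity pushes probability mass toward any alignment that yields the right character sequence, which is why no frame-level alignment is needed in the training data.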

They integrate an n-gram language model with the CTC-trained neural networks. The architecture of this model is a recurrent deep neural network (RDNN). It uses a modified version of the rectifier nonlinearity that clips large activations to prevent divergence during network training. Here are the character error rate results obtained by the RDNN.


Authors from IBM Research present this paper aimed at verifying whether speech recognition techniques have achieved human performance. They also present a set of acoustic and language modeling techniques.

The acoustic side has three models: one LSTM with multiple feature inputs, a second LSTM trained with speaker-adversarial multitask learning, and a third model, a residual net with 25 convolutional layers.

The language model uses character LSTMs and convolutional WaveNet-style language models. The authors’ English conversational telephone LVCSR system has a Word Error Rate of 5.5%/10.3% on the Switchboard/CallHome subsets (SWB/CH).

The architecture used in this paper consists of 4–6 bidirectional layers with 1024 cells per layer, one linear bottleneck layer with 256 units, and an output layer with 32K units. Training consists of 14 passes of cross-entropy training followed by 1 pass of Stochastic Gradient Descent (SGD) sequence training using the boosted MMI (Maximum Mutual Information) criterion.

This process is smoothed by adding the scaled gradient of cross-entropy loss. The LSTM was implemented in Torch with the cuDNN version 5.0 backend. Cross-entropy training for each model was done on a single Nvidia K80 GPU device and took about two weeks for 700M samples per epoch.


For the convolutional network acoustic modeling, the authors trained residual networks. The next table shows several ResNet architectures and their performance on the test data.


The figure below shows how the residual networks were adapted for acoustic modeling. The network has 12 residual blocks, 30 weight layers, and 67.1M parameters. Training was done using Nesterov accelerated gradient with a learning rate of 0.03 and a momentum of 0.99. The CNN was also implemented in Torch using the cuDNN v5.0 backend. The cross-entropy training took 80 days for 1.5 billion samples using an Nvidia K80 GPU with a batch size of 64 per GPU.
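For readers who haven’t met Nesterov accelerated gradient before, here’s a minimal scalar sketch of the update rule, using the learning rate and momentum quoted above. The quadratic objective and the step count are assumptions chosen only to keep the example self-contained; real training applies the same rule to every network weight.

```python
def nesterov_minimize(grad, theta, lr=0.03, momentum=0.99, steps=2000):
    """Minimize a scalar function given its gradient, using Nesterov momentum."""
    velocity = 0.0
    for _ in range(steps):
        # Evaluate the gradient at the look-ahead point, not at the current one.
        lookahead = theta + momentum * velocity
        velocity = momentum * velocity - lr * grad(lookahead)
        theta = theta + velocity
    return theta


# Toy objective f(x) = 0.5 * x**2, whose gradient is simply x.
result = nesterov_minimize(lambda x: x, theta=5.0)
print(abs(result) < 1e-6)  # prints True: the iterate has settled at the minimum
```

The only difference from classical momentum is the look-ahead: the gradient is taken where the momentum is about to carry the parameters, which tends to damp overshooting.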


Let’s now look at the word error rates for the LSTMs and ResNets:


The authors also experimented with four LSTM language models, namely Word-LSTM, Char-LSTM, Word-LSTM-MTL, and Char-LSTM-MTL. The figure below shows their architecture.

The Word-LSTM has one word-embedding layer, two LSTM layers, one fully-connected layer, and one softmax layer. The Char-LSTM has an LSTM layer to estimate word embeddings from character sequences. Both Word-LSTM and Char-LSTM used cross-entropy loss for predicting the next word. Multi-task learning (MTL) is introduced in Word-LSTM-MTL and Char-LSTM-MTL.

WordDCC consists of a word embeddings layer, causal convolution layers with dilation, convolution layers, fully-connected layers, a softmax layer, and residual connections.
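If causal convolution with dilation is new to you, the sketch below shows the idea on a one-dimensional signal: each output depends only on the current and earlier inputs, and the dilation controls how far apart those inputs are. The function name and the toy signal are ours, and this is a single channel rather than the full WordDCC layer.

```python
def causal_dilated_conv1d(signal, kernel, dilation=1):
    """y[t] = sum over k of kernel[k] * signal[t - k * dilation].

    Positions before the start of the signal are treated as zeros,
    so no output ever looks at a future input.
    """
    output = []
    for t in range(len(signal)):
        total = 0.0
        for k, weight in enumerate(kernel):
            index = t - k * dilation
            if index >= 0:
                total += weight * signal[index]
        output.append(total)
    return output


# With kernel [1, 1] and dilation 2, each output is x[t] + x[t - 2].
print(causal_dilated_conv1d([1, 2, 3, 4, 5], kernel=[1, 1], dilation=2))
# prints [1.0, 2.0, 4.0, 6.0, 8.0]
```

Stacking such layers with growing dilation lets a language model see a long history of words without using recurrence.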


Authors from Facebook AI Research propose Wav2Letter, an open-source deep learning speech recognition framework. It’s written in C++ and uses the ArrayFire tensor library.

The ArrayFire tensor library is used because it can execute on multiple back-ends such as a CUDA GPU back-end and a CPU back-end, which results in faster execution. Constructing and working with arrays is also much easier in ArrayFire compared to other C++ tensor libraries. The accompanying figure shows how to build and train a one-layer MLP (Multi-Layer Perceptron) with the binary cross-entropy loss.
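As a rough stand-in for that kind of example, here’s a one-layer MLP trained with binary cross-entropy in plain Python. To be clear, this is not the ArrayFire or wav2letter API; the function names, the OR-gate data, and the hyperparameters are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, weights, bias):
    """Forward pass of a one-layer MLP: a linear layer followed by a sigmoid."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_one_layer_mlp(inputs, targets, lr=0.5, epochs=2000):
    """Full-batch gradient descent on the binary cross-entropy loss."""
    n_features = len(inputs[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(inputs, targets):
            # For sigmoid + binary cross-entropy, the gradient with respect
            # to the pre-activation is simply (prediction - target).
            error = predict(x, weights, bias) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - lr * g / len(inputs) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(inputs)
    return weights, bias


# Toy data: the logical OR of two inputs.
inputs = [[0, 0], [0, 1], [1, 0], [1, 1]]
targets = [0, 1, 1, 1]
weights, bias = train_one_layer_mlp(inputs, targets)
print([round(predict(x, weights, bias)) for x in inputs])  # prints [0, 1, 1, 1]
```

A tensor library expresses the same forward pass, loss, and update as a handful of array operations that can run on a GPU.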


The model was evaluated on the Wall Street Journal (WSJ) dataset. Training time was evaluated using 2 types of neural network architectures: recurrent, with 30 million parameters, and purely convolutional, with 100 million parameters. The figure below shows the Word Error Rate obtained on LibriSpeech.


Google Brain authors present a simple data augmentation method for speech recognition known as SpecAugment. The method operates on the log mel spectrogram of the input audio.

On the LibriSpeech test-other set, the authors achieve a 6.8% WER (Word Error Rate) without a language model, and 5.8% WER with a language model. For Switchboard, they achieve 7.2%/14.6% on the Switchboard/CallHome subsets.

Using this method, the authors are able to train end-to-end ASR (automatic speech recognition) networks known as Listen, Attend and Spell (LAS). The data augmentation policy used involves time warping, frequency masking, and time masking.
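Here’s a small sketch of the two masking steps applied to a spectrogram stored as a list of frames. It’s an illustration under assumptions: time warping is left out, only one mask of each kind is applied, masked values are set to zero, and the parameter names `max_freq_width` and `max_time_width` are ours.

```python
import random


def mask_spectrogram(spec, max_freq_width, max_time_width, rng):
    """Return a copy of `spec` (frames x mel channels) with one frequency
    mask and one time mask applied. Both widths are drawn uniformly and
    must not exceed the corresponding spectrogram dimension."""
    n_time, n_freq = len(spec), len(spec[0])
    masked = [row[:] for row in spec]

    # Frequency masking: blank out `width` consecutive mel channels.
    width = rng.randint(0, max_freq_width)
    start = rng.randint(0, n_freq - width)
    for row in masked:
        for channel in range(start, start + width):
            row[channel] = 0.0

    # Time masking: blank out `width` consecutive frames.
    width = rng.randint(0, max_time_width)
    start = rng.randint(0, n_time - width)
    for frame in range(start, start + width):
        masked[frame] = [0.0] * n_freq

    return masked


rng = random.Random(0)  # seeded only to make the example repeatable
spectrogram = [[1.0] * 8 for _ in range(10)]  # 10 frames, 8 mel channels
augmented = mask_spectrogram(spectrogram, max_freq_width=3, max_time_width=4, rng=rng)
for row in augmented:
    print(row)
```

Because the masks are drawn fresh on every call, the network sees a slightly different version of each utterance in every epoch.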


In the LAS networks, the input log mel spectrogram is passed into a 2-layer convolutional neural network (CNN) with a stride of 2. The output of this CNN is passed through an encoder that has D stacked bi-directional LSTMs with cell size w to produce a series of attention vectors.

The attention vectors are fed into a 2-layer RNN decoder of cell dimension w. This outputs the tokens for the transcript. Tokenization of the text is done using a Word Piece Model with a vocabulary of 16k for LibriSpeech and 1k for Switchboard. The final transcripts are obtained by a beam search with beam size 8.

Here’s the Word Error Rate performance of LAS SpecAugment.


Authors from Facebook AI Research explore unsupervised pre-training for speech recognition by learning representations of raw audio. The result is Wav2Vec, a model that’s trained on a huge unlabeled audio dataset.

The representations obtained from this are then used to improve acoustic model training. A simple multi-layer convolutional neural network is pre-trained and optimized through a noise contrastive binary classification task. Wav2Vec achieves a 2.43% WER on the nov92 test set.

The approach used in pre-training is optimizing the model to predict future samples from a single context. The model takes a raw audio signal as input and then applies an encoder network and a context network.

The encoder network embeds the audio signal in a latent space, and the context network combines multiple time-steps of the encoder to obtain representations that have been contextualized. The objective function is then computed from both networks.
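To illustrate the noise contrastive binary classification idea, here’s a stripped-down sketch of such an objective for a single time step: the context vector should score high against the true future encoder output and low against distractor samples. This is a sketch under assumptions, not the Wav2Vec implementation. The per-step projection of the context is omitted, the vectors are tiny hand-written lists, and all names are ours.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(context, future, distractors):
    """Binary classification loss: the true future sample is the positive,
    the distractors are the negatives."""
    positive = log_sigmoid(dot(context, future))
    negatives = sum(log_sigmoid(-dot(context, d)) for d in distractors)
    return -(positive + negatives)


context = [1.0, 0.0]
good = contrastive_loss(context, future=[2.0, 0.0], distractors=[[-2.0, 0.0], [0.0, 1.0]])
bad = contrastive_loss(context, future=[-2.0, 0.0], distractors=[[2.0, 0.0], [0.0, 1.0]])
print(good < bad)  # prints True: the loss is lower when the true future matches the context
```

Training on this kind of signal needs no transcripts at all, which is what makes pre-training on large amounts of unlabeled audio possible.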


Layers in the encoder and context networks are made up of a causal convolution with 512 channels, a group normalization layer, and a ReLU nonlinearity activation function. The representations produced by the context network during training are fed to the acoustic model. Training and evaluation of acoustic models are done using the wav2letter toolkit. For decoding, a lexicon and a separate language model trained on the WSJ language modeling dataset are used.

Here’s the Word Error Rate for this model compared to other speech recognition models.


Scalable Multi Corpora Neural Language Models for ASR

In this paper, authors from Amazon Alexa offer solutions to some of the challenges encountered when using Neural Language Models for large scale ASR systems.

The challenges the authors seek to address are:

  • Training the NLM on multiple heterogeneous corpora
  • Personalizing the Neural Language Model (NLM) by passing biases for classes such as contact names from the first-pass model through the NLM
  • Incorporating the NLM into the ASR system, while limiting the latency impact

For the task of learning from heterogeneous corpora, the parameters of the neural network are estimated using a variant of stochastic gradient descent. This method depends on each minibatch being an independent and identically distributed (i.i.d.) sample of the distribution being learned. Minibatches are constructed stochastically by drawing samples from each corpus with a probability based on its relevance. The relevance weights are obtained by constructing an n-gram model from each data source and optimizing the linear interpolation weights of these models on a development set.
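The minibatch construction described above can be sketched in a few lines. This is only an illustration under assumptions: the corpus names, sentences, and relevance weights are invented, and in the paper’s setting the weights would come from the interpolation step rather than being written by hand.

```python
import random


def sample_minibatch(corpora, weights, batch_size, rng):
    """Draw each example by first picking a corpus with probability
    proportional to its relevance weight, then picking a sentence from it."""
    names = list(corpora)
    probabilities = [weights[name] for name in names]
    batch = []
    for _ in range(batch_size):
        name = rng.choices(names, weights=probabilities, k=1)[0]
        batch.append((name, rng.choice(corpora[name])))
    return batch


# Hypothetical corpora and relevance weights, made up for the example.
corpora = {
    "voice_requests": ["play some jazz", "set a timer for ten minutes"],
    "news_text": ["the markets closed higher today", "rain is expected tomorrow"],
}
weights = {"voice_requests": 0.8, "news_text": 0.2}

rng = random.Random(0)  # seeded only to make the example repeatable
for corpus_name, sentence in sample_minibatch(corpora, weights, batch_size=5, rng=rng):
    print(corpus_name, "|", sentence)
```

Sampling this way means a small but highly relevant corpus can contribute more examples per batch than a huge but less relevant one.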

Synthetic data for the first-pass LM is generated by sampling a large text corpus from the NLM and estimating an n-gram model from that corpus, which yields an n-gram approximation of the NLM. A subword NLM is used to generate the synthetic data to ensure that the generated corpus is not limited to the vocabulary of the current version of the ASR system. The written text corpora used in the model contain over 50 billion words in total. The NLM architecture is made up of two Long Short-Term Memory Projection Recurrent Neural Network (LSTMP) layers, each comprising 1024 hidden units projected down to a dimension of 512. There are residual connections between the layers.

Here are some of the results obtained from this model. It obtains a 1.6% relative WERR (word error rate reduction) by generating synthetic data from the NLM.


We should now be up to speed on some of the most common – and a couple of very recent – techniques for performing automatic speech recognition in a variety of contexts.

The papers / abstracts mentioned and linked to above also contain links to their code implementations. We’d be happy to see the results you obtain after testing them.

