
Online speech recognition with wav2letter@anywhere

The process of transcribing speech in real time from an input audio stream is known as online speech recognition. Most automatic speech recognition (ASR) research focuses on improving accuracy without the constraint of performing the task in real time. For applications like live video captioning or on-device transcriptions, however, it is important to reduce the latency between the audio and the corresponding transcription. In these cases, online speech recognition with limited time delay is needed to provide a good user experience. To solve for this need, we have developed and open-sourced wav2letter@anywhere, an inference framework that can be used to perform online speech recognition. Wav2letter@anywhere builds upon Facebook AI’s previous releases of wav2letter and wav2letter++.

Most existing online speech recognition solutions support only recurrent neural networks (RNNs). For wav2letter@anywhere, we use a fully convolutional acoustic model instead, which results in a 3x throughput improvement on certain inference models and state-of-the-art performance on LibriSpeech. For a system to run at production scale (on server CPUs or on-device in a low-power environment), one needs to ensure that the system is computationally efficient. Taking an ASR system from a research environment to a low-latency, computationally efficient system that is also highly accurate involves nontrivial changes to both the implementation and the algorithms. This post explains how we created wav2letter@anywhere.

This diagram shows how our online system processes speech. Each chunk of speech is first fed into an acoustic model, which computes word-piece scores. These scores are then combined with a language model via a lightweight beam search decoder, which outputs the most likely sequence of words based on the input sequence and the selected language model.
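
To make the flow concrete, here is a minimal Python sketch of that chunked pipeline. The acoustic_model and decoder objects and their methods are hypothetical stand-ins for illustration only; the actual wav2letter@anywhere API is a C++ streaming interface.

```python
# Hypothetical sketch of chunk-by-chunk online transcription.
def transcribe_stream(audio_chunks, acoustic_model, decoder):
    """Emit a running transcription, one chunk of audio at a time."""
    decoder.start()                              # reset beam-search state
    for chunk in audio_chunks:                   # e.g. a few hundred ms of samples
        scores = acoustic_model.forward(chunk)   # word-piece scores per frame
        decoder.consume(scores)                  # beam search with the language model
        yield decoder.best_hypothesis()          # most likely words so far
    decoder.finish()                             # flush any remaining frames
    yield decoder.best_hypothesis()              # final transcription
```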

Wav2letter@anywhere inference platform

Part of the wav2letter repository, wav2letter@anywhere can be used to perform online speech recognition. The framework was built with the following objectives:


  • The streaming inference API should be efficient yet modular enough to handle various types of speech recognition models.

  • The framework should support concurrent audio streams, which are necessary for high throughput when performing tasks at production scale.

  • The API should be flexible enough that it can be easily used on different platforms (personal computers, iOS, Android).

Our modular streaming API allows the framework to support various models, including RNNs and convolutional neural networks (which are faster). Written in C++, wav2letter@anywhere is stand-alone and as efficient as possible, and it can be embedded anywhere. We use efficient back ends, such as FBGEMM, and specific routines for iOS and Android. From the beginning, it was developed with streaming in mind (unlike some alternatives that rely on a generic inference pipeline), allowing us to implement an efficient memory allocation design.

Much of the recent work in latency-controlled ASR uses latency-controlled bidirectional LSTM (LC-BLSTM) RNNs, RNN Transducers (RNN-T), or variants of these methods. Departing from these previous works, we propose a fully convolutional acoustic model with the connectionist temporal classification (CTC) criterion. Our paper shows that such a system is significantly more efficient to deploy while also achieving a better word error rate (WER) and lower latency.
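
As a rough illustration of the CTC criterion itself, the sketch below scores per-frame word-piece outputs against a target sequence using PyTorch's built-in CTC loss. This is for exposition only, with made-up shapes and vocabulary size; wav2letter++ ships its own C++ criterion implementation.

```python
# Sketch: CTC loss over per-frame word-piece scores (blank token at index 0).
import torch
import torch.nn as nn

T, N, C = 200, 4, 32   # frames, batch size, word-piece vocabulary size (incl. blank)
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)  # acoustic model output
targets = torch.randint(1, C, (N, 20))           # word-piece targets, 20 tokens per utterance
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

criterion = nn.CTCLoss(blank=0)
loss = criterion(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients flow back into the (convolutional) acoustic model
```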

Low-latency acoustic modeling

An important building block of wav2letter@anywhere is the time-depth separable (TDS) convolution, which yields dramatic reductions in model size and computational FLOPs while maintaining accuracy. We use asymmetric padding for all the convolutions, adding more padding toward the beginning of the input. This reduces the future context seen by the acoustic model, thus reducing the latency.
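
The snippet below sketches the asymmetric-padding idea on a single 1-D convolution over time, assuming PyTorch and made-up sizes; the actual model stacks 2-D time-depth separable blocks, but the latency argument is the same: shifting padding toward the past bounds how many future frames each output depends on.

```python
# Sketch: pad mostly on the left (the past) so each output frame needs little lookahead.
import torch
import torch.nn as nn
import torch.nn.functional as F

kernel_size = 9                  # symmetric padding would look 4 frames into the future
future_context = 1               # frames of lookahead we allow for this layer
past_context = kernel_size - 1 - future_context   # 7 frames of history instead

conv = nn.Conv1d(in_channels=80, out_channels=80, kernel_size=kernel_size)

def forward_low_latency(x):
    # x: (batch, channels, time); output length equals input length.
    x = F.pad(x, (past_context, future_context))
    return conv(x)

out = forward_low_latency(torch.randn(1, 80, 100))   # -> shape (1, 80, 100)
```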


In comparing our system with two strong baselines (an LC-BLSTM lattice-free MMI hybrid system and an LC-BLSTM RNN-T end-to-end system) on the same benchmark, we were able to achieve better WER, throughput, and latency. Most notably, our models are 3x faster even when their inference is run in FP16, while inference for the baselines is run in INT8.

Experimental results comparing our TDS CTC system with other systems.

In a recent work, we leveraged wav2letter with modern acoustic and language model architectures in both supervised and semi-supervised settings. We revisited a standard semi-supervised technique, generating pseudo-labels on tens of thousands of hours of unlabeled audio using an acoustic model trained on roughly 1,000 hours of labeled data. We then trained a new acoustic model on the full pseudo-labeled dataset, which established a new state of the art on LibriSpeech. We saw a relative improvement of more than 24 percent in comparison with state-of-the-art models trained in a supervised setting. We are releasing models related to this paper as well as latency-constrained models for fast real-time inference suitable for wav2letter@anywhere.
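
The recipe itself is short; in the sketch below, train and transcribe are hypothetical helpers standing in for full acoustic-model training and beam-search decoding with a language model.

```python
# Sketch of the pseudo-labeling recipe (helper functions are hypothetical).
def pseudo_label_and_retrain(labeled_audio, labeled_transcripts, unlabeled_audio,
                             train, transcribe):
    # 1. Train a supervised acoustic model on the labeled data.
    teacher = train(labeled_audio, labeled_transcripts)
    # 2. Generate pseudo-labels for the unlabeled audio using the trained model
    #    together with the beam-search decoder and language model.
    pseudo_transcripts = [transcribe(teacher, audio) for audio in unlabeled_audio]
    # 3. Train a new acoustic model on the combined labeled + pseudo-labeled data.
    student = train(labeled_audio + unlabeled_audio,
                    labeled_transcripts + pseudo_transcripts)
    return student
```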

We have made extensive improvements since open-sourcing wav2letter a year ago, including beefing up decoder performance (a 15x speedup on seq2seq decoding); adding Python bindings for features, the decoder, criterions, and more; and better documentation. We believe wav2letter@anywhere represents another leap forward by enabling online speech recognition and significantly reducing the latency between audio and transcription. We are excited to share the open source framework with the community. For more information about wav2letter@anywhere, read the full paper and visit the wiki.

We’d like to thank Qiantong Xu, Jacob Kahn, Gilad Avidov, Tatiana Likhomanenko, Awni Hannun, Vitaliy Liptchinsky, and Gabriel Synnaeve for their work on wav2letter@anywhere.

Written by

Vineel Pratap

Research Engineer

Ronan Collobert

Research Scientist

