Introduction

This page shows you how we implement streaming X-former transducer models for ASR.

Hint

X-former transducer here means that the encoder of the transducer model uses multi-head self-attention, e.g. Conformer or Emformer.

Currently we have implemented two types of streaming models: one uses Conformer as the encoder, and the other uses Emformer as the encoder.

Streaming Conformer

The main idea of training a streaming model is to let the model see only limited context at training time. We can achieve this by applying a mask in self-attention, so that each frame can only attend to a restricted range of frames. In icefall, we implement the streaming conformer in the same way as WeNet does.
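
Below is a minimal sketch of such a mask. The function chunk_attention_mask and its arguments are illustrative, not the actual icefall API; it only shows the core idea that each frame may attend to every frame in its own chunk and in a limited number of left chunks. (WeNet-style dynamic chunk training additionally samples the chunk size at random during training, so that a single model works well at many chunk sizes.)

import torch

def chunk_attention_mask(size: int, chunk_size: int,
                         num_left_chunks: int = -1) -> torch.Tensor:
    """Return a (size, size) boolean mask for streaming self-attention.

    Args:
      size: number of frames in the sequence.
      chunk_size: number of frames per chunk.
      num_left_chunks: how many previous chunks each chunk may attend to;
        -1 means unlimited left context.

    True marks positions that may be attended to.
    """
    mask = torch.zeros(size, size, dtype=torch.bool)
    for i in range(size):
        chunk_idx = i // chunk_size
        if num_left_chunks < 0:
            start = 0
        else:
            start = max(0, (chunk_idx - num_left_chunks) * chunk_size)
        end = min(size, (chunk_idx + 1) * chunk_size)
        mask[i, start:end] = True
    return mask

# Example: 8 frames, chunks of 2 frames, one left chunk of context.
print(chunk_attention_mask(8, chunk_size=2, num_left_chunks=1))

Passing such a mask to the self-attention layers forbids attention outside the allowed range, which is what simulates streaming behavior during training.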

Note

The conformer-transducer recipes for the LibriSpeech dataset, namely pruned_transducer_stateless, pruned_transducer_stateless2, pruned_transducer_stateless3, pruned_transducer_stateless4, and pruned_transducer_stateless5, all support streaming.

Note

Training a streaming conformer model in icefall is almost the same as training a non-streaming model; all you need to do is pass several extra arguments. See Pruned transducer statelessX for more details.

Hint

If you want to modify a non-streaming conformer recipe to support both streaming and non-streaming, please refer to this pull request. After adding the code needed for streaming training, you have to re-train the model with the extra arguments mentioned in the docs above to get a streaming model.

Streaming Emformer

The Emformer model proposed here uses more sophisticated techniques. It has a memory bank to store history information; what's more, it also introduces right context at training time by hard-copying part of the input features.
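
The following is a minimal, illustrative sketch of these two ideas, not the actual Emformer implementation. The helper make_emformer_style_chunks is an assumption made for illustration, and the memory vectors are approximated here by a simple per-chunk mean (Emformer summarizes each processed segment into one memory vector). Each chunk is paired with the bank of summary vectors from earlier chunks, and the first few frames of the next chunk are hard-copied so the current chunk can attend to a little future context:

import torch

def make_emformer_style_chunks(x: torch.Tensor, chunk_size: int,
                               right_context: int):
    """Split x of shape (T, D) into training chunks.

    Returns a list of (chunk_with_right_context, memory_bank) pairs,
    where memory_bank holds one mean-pooled vector per preceding chunk.
    """
    T, D = x.shape
    memory = []  # one summary vector per processed chunk
    out = []
    for start in range(0, T, chunk_size):
        end = min(start + chunk_size, T)
        rc_end = min(end + right_context, T)
        # Frames x[end:rc_end] also belong to the next chunk; they are
        # duplicated ("hard-copied") so this chunk can see future context.
        chunk = torch.cat([x[start:end], x[end:rc_end]], dim=0)
        bank = torch.stack(memory) if memory else torch.empty(0, D)
        out.append((chunk, bank))
        memory.append(x[start:end].mean(dim=0))
    return out

# Example: 10 frames of 4-dim features, chunks of 4, 2 right-context frames.
for chunk, bank in make_emformer_style_chunks(torch.randn(10, 4), 4, 2):
    print(chunk.shape, bank.shape)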

We have three variants of Emformer models in icefall.

  • pruned_stateless_emformer_rnnt2, which uses the Emformer from torchaudio; see the LibriSpeech recipe.

  • conv_emformer_transducer_stateless, which uses a ConvEmformer implemented by ourselves. Unlike the Emformer in torchaudio, ConvEmformer adds a convolution module to each layer and uses the mechanisms from our reworked conformer model. See the LibriSpeech recipe.

  • conv_emformer_transducer_stateless2, which also uses our ConvEmformer; the only difference from the variant above is that it uses a simplified memory bank. See the LibriSpeech recipe.