Introduction

This page shows you how we implement streaming X-former transducer models for ASR.

Hint

X-former transducer here means that the encoder of the transducer model uses multi-head self-attention, e.g. Conformer or Emformer.

Currently we have implemented two types of streaming models: one uses Conformer as the encoder, and the other uses Emformer as the encoder.

Streaming Conformer

The main idea of training a streaming model is to let the model see only limited context at training time. We can achieve this by applying a mask in self-attention, so that each frame can only attend to a restricted range of frames. In icefall, we implement the streaming conformer in the same way as WeNet does.
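
Below is a minimal sketch of such a mask. The function chunk_attention_mask and its arguments are illustrative, not the actual icefall API; it only shows the core idea that each frame may attend to every frame in its own chunk and in a limited number of left chunks. (WeNet-style dynamic chunk training additionally samples the chunk size at random during training, so that a single model works well at many chunk sizes.)

import torch

def chunk_attention_mask(size: int, chunk_size: int,
                         num_left_chunks: int = -1) -> torch.Tensor:
    """Return a (size, size) boolean mask for streaming self-attention.

    Args:
      size: number of frames in the sequence.
      chunk_size: number of frames per chunk.
      num_left_chunks: how many previous chunks each chunk may attend to;
        -1 means unlimited left context.

    True marks positions that may be attended to.
    """
    mask = torch.zeros(size, size, dtype=torch.bool)
    for i in range(size):
        chunk_idx = i // chunk_size
        if num_left_chunks < 0:
            start = 0
        else:
            start = max(0, (chunk_idx - num_left_chunks) * chunk_size)
        end = min(size, (chunk_idx + 1) * chunk_size)
        mask[i, start:end] = True
    return mask

# Example: 8 frames, chunks of 2 frames, one left chunk of context.
print(chunk_attention_mask(8, chunk_size=2, num_left_chunks=1))

Passing such a mask to the self-attention layers forbids attention outside the allowed range, which is what simulates streaming behavior during training.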

Note

The conformer-transducer recipes for the LibriSpeech dataset, namely pruned_transducer_stateless, pruned_transducer_stateless2, pruned_transducer_stateless3, pruned_transducer_stateless4, and pruned_transducer_stateless5, all support streaming.

Note

Training a streaming conformer model in icefall is almost the same as training a non-streaming model; all you need to do is pass several extra arguments. See Pruned transducer statelessX for more details.

Hint

If you want to modify a non-streaming conformer recipe to support both streaming and non-streaming, please refer to this pull request. After adding the code needed for streaming training, you have to re-train the model with the extra arguments mentioned in the docs above to get a streaming model.

Streaming Emformer

The Emformer model proposed here uses more sophisticated techniques. It has a memory bank to store history information; what's more, it also introduces right context at training time by hard-copying part of the input features.
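
The following is a minimal, illustrative sketch of these two ideas, not the actual Emformer implementation. The helper make_emformer_style_chunks is an assumption made for illustration, and the memory vectors are approximated here by a simple per-chunk mean (Emformer summarizes each processed segment into one memory vector). Each chunk is paired with the bank of summary vectors from earlier chunks, and the first few frames of the next chunk are hard-copied so the current chunk can attend to a little future context:

import torch

def make_emformer_style_chunks(x: torch.Tensor, chunk_size: int,
                               right_context: int):
    """Split x of shape (T, D) into training chunks.

    Returns a list of (chunk_with_right_context, memory_bank) pairs,
    where memory_bank holds one mean-pooled vector per preceding chunk.
    """
    T, D = x.shape
    memory = []  # one summary vector per processed chunk
    out = []
    for start in range(0, T, chunk_size):
        end = min(start + chunk_size, T)
        rc_end = min(end + right_context, T)
        # Frames x[end:rc_end] also belong to the next chunk; they are
        # duplicated ("hard-copied") so this chunk can see future context.
        chunk = torch.cat([x[start:end], x[end:rc_end]], dim=0)
        bank = torch.stack(memory) if memory else torch.empty(0, D)
        out.append((chunk, bank))
        memory.append(x[start:end].mean(dim=0))
    return out

# Example: 10 frames of 4-dim features, chunks of 4, 2 right-context frames.
for chunk, bank in make_emformer_style_chunks(torch.randn(10, 4), 4, 2):
    print(chunk.shape, bank.shape)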

We have three variants of Emformer models in icefall.

  • pruned_stateless_emformer_rnnt2, which uses the Emformer from torchaudio; see the LibriSpeech recipe.

  • conv_emformer_transducer_stateless, which uses a ConvEmformer implemented by ourselves. Unlike the Emformer in torchaudio, ConvEmformer adds a convolution module to each layer and uses the mechanisms from our reworked conformer model. See the LibriSpeech recipe.

  • conv_emformer_transducer_stateless2, which also uses our ConvEmformer; the only difference from the variant above is that it uses a simplified memory bank. See the LibriSpeech recipe.