Shallow fusion for Transducer

External language models (LMs) are commonly used to improve word error rates (WERs) of end-to-end (E2E) ASR models. This tutorial shows you how to perform shallow fusion with an external LM to improve the WER of a transducer model.
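
During beam search, shallow fusion simply adds the external LM's log-probability, scaled by a weight, to the transducer's log-probability for each candidate token. Below is a minimal sketch of this per-step score combination; the function name and signature are illustrative, not icefall's actual implementation:

import torch

def fuse_scores(asr_logprobs: torch.Tensor,
                lm_logprobs: torch.Tensor,
                lm_scale: float = 0.29) -> torch.Tensor:
    """Combine per-token log-probabilities for one beam-search step.

    asr_logprobs: log-probabilities over the BPE vocab from the transducer.
    lm_logprobs:  log-probabilities over the BPE vocab from the external LM.
    """
    # Shallow fusion: score(y) = log p_ASR(y | x) + lm_scale * log p_LM(y)
    return asr_logprobs + lm_scale * lm_logprobs

The beam then keeps the hypotheses with the highest fused scores; the external LM is only used at decoding time, not during training.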

Note

This tutorial is based on the recipe pruned_transducer_stateless7_streaming, which is a streaming transducer model trained on LibriSpeech. However, you can easily apply shallow fusion to other recipes. If you encounter any problems, please open an issue in icefall.

Note

For simplicity, the training and testing corpora in this tutorial are the same (LibriSpeech). However, you can change the testing set to any other domain (e.g. GigaSpeech) and use an external LM trained on that domain.

Hint

We recommend using a GPU for decoding.

For illustration purposes, we will use a pre-trained ASR model from this link. If you want to train your model from scratch, please have a look at Pruned transducer statelessX.

As the initial step, let’s download the pre-trained model.

$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
$ cd icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$ git lfs pull --include "pretrained.pt"
$ ln -s pretrained.pt epoch-99.pt # create a symbolic link so that the checkpoint can be loaded
$ cd ../data/lang_bpe_500
$ git lfs pull --include bpe.model
$ cd ../../..

To test the model, let's have a look at the decoding results without using an LM. This can be done via the following command:

$ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/
$ ./pruned_transducer_stateless7_streaming/decode.py \
    --epoch 99 \
    --avg 1 \
    --use-averaged-model False \
    --exp-dir $exp_dir \
    --bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
    --max-duration 600 \
    --decode-chunk-len 32 \
    --decoding-method modified_beam_search

The following WERs are achieved on test-clean and test-other:

$ For test-clean, WER of different settings are:
$ beam_size_4       3.11    best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4       7.93    best for test-other

These are already good numbers! But we can further improve them by using shallow fusion with an external LM. Since training a language model usually takes a long time, we can download a pre-trained LM from this link.

$ # download the external LM
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm
$ # create a symbolic link so that the checkpoint can be loaded
$ pushd icefall-librispeech-rnn-lm/exp
$ git lfs pull --include "pretrained.pt"
$ ln -s pretrained.pt epoch-99.pt
$ popd

Note

This is an RNN LM trained on the LibriSpeech text corpus, so it might not be ideal for other corpora. You may also train an RNN LM from scratch. Please refer to this script for training an RNN LM and this script to train a transformer LM.

To use shallow fusion for decoding, we can execute the following command:

$ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$ lm_dir=./icefall-librispeech-rnn-lm/exp
$ lm_scale=0.29
$ ./pruned_transducer_stateless7_streaming/decode.py \
    --epoch 99 \
    --avg 1 \
    --use-averaged-model False \
    --beam-size 4 \
    --exp-dir $exp_dir \
    --max-duration 600 \
    --decode-chunk-len 32 \
    --decoding-method modified_beam_search_lm_shallow_fusion \
    --bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
    --use-shallow-fusion 1 \
    --lm-type rnn \
    --lm-exp-dir $lm_dir \
    --lm-epoch 99 \
    --lm-scale $lm_scale \
    --lm-avg 1 \
    --rnn-lm-embedding-dim 2048 \
    --rnn-lm-hidden-dim 2048 \
    --rnn-lm-num-layers 3 \
    --lm-vocab-size 500

Note that we set --decoding-method modified_beam_search_lm_shallow_fusion and --use-shallow-fusion 1 to enable shallow fusion. --lm-type specifies the type of neural LM we are going to use; you can choose between rnn and transformer. The following three arguments configure the RNN LM (a minimal sketch follows the list):

  • --rnn-lm-embedding-dim

    The embedding dimension of the RNN LM.

  • --rnn-lm-hidden-dim

    The hidden dimension of the RNN LM.

  • --rnn-lm-num-layers

    The number of RNN layers in the RNN LM.
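
These values must match the architecture of the LM checkpoint being loaded. For intuition, here is a minimal PyTorch sketch of an RNN LM with the dimensions used above; it is illustrative only and not icefall's actual model definition:

import torch
import torch.nn as nn

class RnnLm(nn.Module):
    """Illustrative RNN LM matching the command-line arguments above."""

    def __init__(self,
                 vocab_size=500,      # --lm-vocab-size
                 embedding_dim=2048,  # --rnn-lm-embedding-dim
                 hidden_dim=2048,     # --rnn-lm-hidden-dim
                 num_layers=3):       # --rnn-lm-num-layers
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim,
                           num_layers=num_layers, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, states=None):
        emb = self.embedding(tokens)         # (batch, seq, embedding_dim)
        out, states = self.rnn(emb, states)  # (batch, seq, hidden_dim)
        return self.output(out), states      # logits over the BPE vocab

lm = RnnLm()
logits, _ = lm(torch.zeros(1, 10, dtype=torch.long))  # logits: (1, 10, 500)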

The decoding results obtained with the above command are shown below.

$ For test-clean, WER of different settings are:
$ beam_size_4       2.77    best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4       7.08    best for test-other

The improvement from shallow fusion is very clear! The relative WER reduction on test-other is around 10.7%. A few parameters can be tuned to further boost the performance of shallow fusion:

  • --lm-scale

    Controls the scale of the LM. If it is too small, the external LM may not be fully utilized; if it is too large, the LM score may dominate during decoding, leading to a worse WER. A typical value is around 0.3; it is worth sweeping a few values around it, as sketched after this list.

  • --beam-size

    The number of active paths in the search beam. It controls the trade-off between decoding efficiency and accuracy.
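
Since the best --lm-scale depends on the LM and the test domain, a small grid search is often worthwhile. Below is a hypothetical sweep in Python that reuses the decoding command from above; the scale values are just examples:

import subprocess

exp_dir = "./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp"
lm_dir = "./icefall-librispeech-rnn-lm/exp"
bpe_model = "./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model"

# Try a few LM scales around the typical value of 0.3.
for lm_scale in (0.2, 0.25, 0.3, 0.35):
    subprocess.run(
        ["./pruned_transducer_stateless7_streaming/decode.py",
         "--epoch", "99", "--avg", "1", "--use-averaged-model", "False",
         "--beam-size", "4", "--exp-dir", exp_dir,
         "--max-duration", "600", "--decode-chunk-len", "32",
         "--decoding-method", "modified_beam_search_lm_shallow_fusion",
         "--bpe-model", bpe_model, "--use-shallow-fusion", "1",
         "--lm-type", "rnn", "--lm-exp-dir", lm_dir,
         "--lm-epoch", "99", "--lm-scale", str(lm_scale),
         "--lm-avg", "1", "--rnn-lm-embedding-dim", "2048",
         "--rnn-lm-hidden-dim", "2048", "--rnn-lm-num-layers", "3",
         "--lm-vocab-size", "500"],
        check=True,  # stop the sweep if a run fails
    )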

Here, we also show how --beam-size affects the WER and decoding time:

Table 1 WERs and decoding time (on test-clean) of shallow fusion with different beam sizes

Beam size    test-clean    test-other    Decoding time on test-clean (s)
4            2.77          7.08          262
8            2.62          6.65          352
12           2.58          6.65          488

As we can see, a larger beam size during shallow fusion improves the WER, but it also slows down decoding.