Data Preparation

After setting up the environment, we can start preparing the data for training and decoding.

The first step is to prepare the data for training. We have already provided a script that prepares everything required for training.

cd /tmp/icefall
export PYTHONPATH=/tmp/icefall:$PYTHONPATH
cd egs/yesno/ASR


Note that each recipe in icefall contains a file that you should run before you run anything else.

That is all you need for data preparation.

For the more curious

If you are wondering how to prepare your own dataset, please refer to the following URLs for more details:

If you already have a Kaldi dataset directory, which contains files like wav.scp, feats.scp, then you can refer to

A quick look at the generated files

The data preparation script puts generated files into two directories:

  • download

  • data


The download directory contains downloaded dataset files:

tree -L 1 ./download/

|-- waves_yesno
`-- waves_yesno.tar.gz


Please refer to for how the data is downloaded and extracted.


tree ./data/

|-- fbank
|   |-- yesno_cuts_test.jsonl.gz
|   |-- yesno_cuts_train.jsonl.gz
|   |-- yesno_feats_test.lca
|   `-- yesno_feats_train.lca
|-- lang_phone
|   |--
|   |--
|   |--
|   |--
|   |-- lexicon.txt
|   |-- lexicon_disambig.txt
|   |-- tokens.txt
|   `-- words.txt
|-- lm
|   |--
|   `-- G.fst.txt
`-- manifests
    |-- yesno_recordings_test.jsonl.gz
    |-- yesno_recordings_train.jsonl.gz
    |-- yesno_supervisions_test.jsonl.gz
    `-- yesno_supervisions_train.jsonl.gz

4 directories, 18 files


The data/manifests directory contains manifests. They are used to generate the files in data/fbank.

To give you an idea of what it contains, we examine the first few lines of the manifests related to the train dataset.

cd data/manifests
gunzip -c  yesno_recordings_train.jsonl.gz  | head -n 3

The output is given below:

{"id": "0_0_0_0_1_1_1_1", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_0_1_1_1_1.wav"}], "sampling_rate": 8000, "num_samples": 50800, "duration": 6.35, "channel_ids": [0]}
{"id": "0_0_0_1_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_1_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48880, "duration": 6.11, "channel_ids": [0]}
{"id": "0_0_1_0_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_1_0_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48160, "duration": 6.02, "channel_ids": [0]}

Please refer to for the meaning of each field per line.
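If you prefer to inspect a manifest from Python, the sketch below mirrors the gunzip and head command above using only the standard library. It also checks that duration equals num_samples divided by sampling_rate, which holds for the three records shown. The helper names (head_jsonl_gz, duration_is_consistent) and the temporary demo file are hypothetical and exist only for illustration; the record is trimmed to the fields the check needs.

```python
import gzip
import json
import math
import os
import tempfile


def head_jsonl_gz(path, n=3):
    """Return the first n records of a gzip-compressed JSON Lines file."""
    records = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if len(records) >= n:
                break
            records.append(json.loads(line))
    return records


def duration_is_consistent(rec):
    """Check that duration (seconds) matches num_samples / sampling_rate."""
    expected = rec["num_samples"] / rec["sampling_rate"]
    return math.isclose(rec["duration"], expected)


# Demo record: a trimmed copy of the first line shown above.
sample = {
    "id": "0_0_0_0_1_1_1_1",
    "sampling_rate": 8000,
    "num_samples": 50800,
    "duration": 6.35,
}

# Write the sample to a temporary file so that the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_recordings.jsonl.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as f:
        f.write(json.dumps(sample) + "\n")
    records = head_jsonl_gz(demo_path)

for rec in records:
    print(rec["id"], duration_is_consistent(rec))
```

To run it against the real data, pass the path of yesno_recordings_train.jsonl.gz to head_jsonl_gz instead of the temporary file.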

gunzip -c  yesno_supervisions_train.jsonl.gz  | head -n 3

The output is given below:

{"id": "0_0_0_0_1_1_1_1", "recording_id": "0_0_0_0_1_1_1_1", "start": 0.0, "duration": 6.35, "channel": 0, "text": "NO NO NO NO YES YES YES YES", "language": "Hebrew"}
{"id": "0_0_0_1_0_1_1_0", "recording_id": "0_0_0_1_0_1_1_0", "start": 0.0, "duration": 6.11, "channel": 0, "text": "NO NO NO YES NO YES YES NO", "language": "Hebrew"}
{"id": "0_0_1_0_0_1_1_0", "recording_id": "0_0_1_0_0_1_1_0", "start": 0.0, "duration": 6.02, "channel": 0, "text": "NO NO YES NO NO YES YES NO", "language": "Hebrew"}

Please refer to for the meaning of each field per line.
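In the three supervisions shown above, the utterance id encodes the transcript: each 0 corresponds to NO and each 1 to YES. The sketch below makes that relation explicit. That the mapping holds for every utterance in the dataset is an assumption based on these samples, and transcript_from_id is a hypothetical helper, not part of icefall or lhotse.

```python
import json

# Assumed mapping, inferred from the three sample lines above.
LABELS = {"0": "NO", "1": "YES"}


def transcript_from_id(utt_id):
    """Convert an id such as '0_0_1_0' into 'NO NO YES NO'."""
    return " ".join(LABELS[digit] for digit in utt_id.split("_"))


# The second supervision line from the output above.
line = (
    '{"id": "0_0_0_1_0_1_1_0", "recording_id": "0_0_0_1_0_1_1_0", '
    '"start": 0.0, "duration": 6.11, "channel": 0, '
    '"text": "NO NO NO YES NO YES YES NO", "language": "Hebrew"}'
)
supervision = json.loads(line)

# True when the transcript derived from the id equals the stored text.
matches = transcript_from_id(supervision["id"]) == supervision["text"]
print(matches)
```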


The data/fbank directory contains everything from data/manifests. In addition, it contains the features for training.

data/fbank/yesno_feats_train.lca contains the features for the train dataset. Features are compressed using lilcom.

data/fbank/yesno_cuts_train.jsonl.gz stores the CutSet, which combines the RecordingSet, SupervisionSet, and FeatureSet.

To give you an idea of what it looks like, we can run the following command:

cd data/fbank

gunzip -c yesno_cuts_train.jsonl.gz | head -n 3

The output is given below:

{"id": "0_0_0_0_1_1_1_1-0", "start": 0, "duration": 6.35, "channel": 0, "supervisions": [{"id": "0_0_0_0_1_1_1_1", "recording_id": "0_0_0_0_1_1_1_1", "start": 0.0, "duration": 6.35, "channel": 0, "text": "NO NO NO NO YES YES YES YES", "language": "Hebrew"}], "features": {"type": "kaldi-fbank", "num_frames": 635, "num_features": 23, "frame_shift": 0.01, "sampling_rate": 8000, "start": 0, "duration": 6.35, "storage_type": "lilcom_chunky", "storage_path": "data/fbank/yesno_feats_train.lca", "storage_key": "0,13000,3570", "channels": 0}, "recording": {"id": "0_0_0_0_1_1_1_1", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_0_1_1_1_1.wav"}], "sampling_rate": 8000, "num_samples": 50800, "duration": 6.35, "channel_ids": [0]}, "type": "MonoCut"}
{"id": "0_0_0_1_0_1_1_0-1", "start": 0, "duration": 6.11, "channel": 0, "supervisions": [{"id": "0_0_0_1_0_1_1_0", "recording_id": "0_0_0_1_0_1_1_0", "start": 0.0, "duration": 6.11, "channel": 0, "text": "NO NO NO YES NO YES YES NO", "language": "Hebrew"}], "features": {"type": "kaldi-fbank", "num_frames": 611, "num_features": 23, "frame_shift": 0.01, "sampling_rate": 8000, "start": 0, "duration": 6.11, "storage_type": "lilcom_chunky", "storage_path": "data/fbank/yesno_feats_train.lca", "storage_key": "16570,12964,2929", "channels": 0}, "recording": {"id": "0_0_0_1_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_1_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48880, "duration": 6.11, "channel_ids": [0]}, "type": "MonoCut"}
{"id": "0_0_1_0_0_1_1_0-2", "start": 0, "duration": 6.02, "channel": 0, "supervisions": [{"id": "0_0_1_0_0_1_1_0", "recording_id": "0_0_1_0_0_1_1_0", "start": 0.0, "duration": 6.02, "channel": 0, "text": "NO NO YES NO NO YES YES NO", "language": "Hebrew"}], "features": {"type": "kaldi-fbank", "num_frames": 602, "num_features": 23, "frame_shift": 0.01, "sampling_rate": 8000, "start": 0, "duration": 6.02, "storage_type": "lilcom_chunky", "storage_path": "data/fbank/yesno_feats_train.lca", "storage_key": "32463,12936,2696", "channels": 0}, "recording": {"id": "0_0_1_0_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_1_0_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48160, "duration": 6.02, "channel_ids": [0]}, "type": "MonoCut"}

Note that yesno_cuts_train.jsonl.gz only stores the information about how to read the features. The actual features are stored separately in data/fbank/yesno_feats_train.lca.
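To make the separation between metadata and features concrete, the sketch below parses the features entry of a cut and reads off where the feature matrix is stored. It also checks that num_frames equals duration divided by frame_shift, which holds for the three cuts shown above; treating this as a general rule is an assumption. The JSON here is a trimmed copy of the first cut that keeps only the fields used, and expected_num_frames is a hypothetical helper.

```python
import json

# Trimmed copy of the first cut shown above (only the fields used here).
cut_line = (
    '{"id": "0_0_0_0_1_1_1_1-0", "duration": 6.35, '
    '"features": {"type": "kaldi-fbank", "num_frames": 635, '
    '"num_features": 23, "frame_shift": 0.01, '
    '"storage_type": "lilcom_chunky", '
    '"storage_path": "data/fbank/yesno_feats_train.lca"}}'
)
cut = json.loads(cut_line)
features = cut["features"]


def expected_num_frames(duration, frame_shift):
    """Number of frames implied by the duration and the frame shift."""
    return round(duration / frame_shift)


# The cut holds only a pointer to the features, not the features themselves.
print("features are stored in:", features["storage_path"])
print("feature matrix shape:", (features["num_frames"], features["num_features"]))

frames_match = features["num_frames"] == expected_num_frames(
    cut["duration"], features["frame_shift"]
)
print("num_frames consistent with duration:", frames_match)
```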


The data/lang_phone directory contains the lexicon.


The data/lm directory contains language models.