Simplified memory bank for Emformer #440

Merged: 30 commits, Jul 12, 2022
Commits
9c39d8b
Merge remote-tracking branch 'k2-fsa/master'
yaozengwei Apr 29, 2022
70634d5
Merge remote-tracking branch 'k2-fsa/master'
yaozengwei May 6, 2022
ecfb3e9
Merge remote-tracking branch 'k2-fsa/master'
yaozengwei May 7, 2022
bcef517
Merge remote-tracking branch 'k2-fsa/master'
yaozengwei May 12, 2022
c9d84ae
Merge remote-tracking branch 'k2-fsa/master'
yaozengwei May 15, 2022
fbbc24f
Merge remote-tracking branch 'k2-fsa/master'
yaozengwei May 26, 2022
5453166
Merge remote-tracking branch 'origin/master'
yaozengwei May 26, 2022
bb7ea31
Merge remote-tracking branch 'k2-fsa/master'
yaozengwei May 31, 2022
2a5a70e
Merge remote-tracking branch 'k2-fsa/master'
yaozengwei Jun 13, 2022
ec8646d
Merge remote-tracking branch 'k2-fsa/master'
yaozengwei Jun 13, 2022
1c067e7
init files
yaozengwei Jun 13, 2022
193b44e
use average value as memory vector for each chunk
yaozengwei Jun 13, 2022
5d877ef
change tail padding length from right_context_length to chunk_length
yaozengwei Jun 17, 2022
c27bb1c
correct the files, ln -> cp
yaozengwei Jun 17, 2022
208bbb6
fix bug in conv_emformer_transducer_stateless2/emformer.py
yaozengwei Jun 17, 2022
5b19011
fix doc in conv_emformer_transducer_stateless/emformer.py
yaozengwei Jun 21, 2022
42e3e88
refactor init states for stream
yaozengwei Jun 21, 2022
9c37c16
modify .flake8
yaozengwei Jun 22, 2022
10662c5
fix bug about memory mask when memory_size==0
yaozengwei Jul 4, 2022
dbea9a9
Merge remote-tracking branch 'k2-fsa/master' into emformer_conv_simpl…
yaozengwei Jul 5, 2022
1f6c822
add @torch.jit.export for init_states function
yaozengwei Jul 6, 2022
61794d8
update RESULTS.md
yaozengwei Jul 6, 2022
69a3ef3
minor change
yaozengwei Jul 6, 2022
f9c6014
update README.md
yaozengwei Jul 7, 2022
12c176c
modify doc
yaozengwei Jul 7, 2022
5cfdbd3
replace torch.div() with <<
yaozengwei Jul 8, 2022
2057124
fix bug, >> -> <<
yaozengwei Jul 8, 2022
e3e8b19
use i&i-1 to judge if it is a power of 2
yaozengwei Jul 8, 2022
ad68987
minor fix
yaozengwei Jul 8, 2022
1a44724
fix error in RESULTS.md
yaozengwei Jul 12, 2022
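
Two of the commits above ("replace torch.div() with <<", "use i&i-1 to judge if it is a power of 2") refer to standard bit tricks used for chunk-length arithmetic. The sketch below is illustrative only, not the PR's code: for a positive integer `i`, `i & (i - 1) == 0` exactly when `i` is a power of two, and multiplying or dividing by such an `i` can be written as a shift.

```python
def is_power_of_two(i: int) -> bool:
    # A positive power of two has a single set bit, so clearing its lowest
    # set bit with i & (i - 1) leaves zero.
    return i > 0 and (i & (i - 1)) == 0


k = 32  # e.g. a chunk length restricted to powers of two
assert is_power_of_two(k)
shift = k.bit_length() - 1  # log2(k) when k is a power of two
assert (100 * k) == (100 << shift)   # multiply via left shift
assert (100 // k) == (100 >> shift)  # floor-divide via right shift
```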
2 changes: 1 addition & 1 deletion .flake8
@@ -9,7 +9,7 @@ per-file-ignores =
     egs/*/ASR/pruned_transducer_stateless*/*.py: E501,
     egs/*/ASR/*/optim.py: E501,
     egs/*/ASR/*/scaling.py: E501,
-    egs/librispeech/ASR/conv_emformer_transducer_stateless/*.py: E501, E203
+    egs/librispeech/ASR/conv_emformer_transducer_stateless*/*.py: E501, E203
 
     # invalid escape sequence (cause by tex formular), W605
     icefall/utils.py: E501, W605
4 changes: 2 additions & 2 deletions egs/librispeech/ASR/README.md
@@ -23,8 +23,8 @@ The following table lists the differences among them.
 | `pruned_transducer_stateless5` | Conformer(modified) | Embedding + Conv1d | same as pruned_transducer_stateless4 + more layers + random combiner|
 | `pruned_transducer_stateless6` | Conformer(modified) | Embedding + Conv1d | same as pruned_transducer_stateless4 + distillation with hubert|
 | `pruned_stateless_emformer_rnnt2` | Emformer(from torchaudio) | Embedding + Conv1d | Using Emformer from torchaudio for streaming ASR|
-| `conv_emformer_transducer_stateless` | Emformer | Embedding + Conv1d | Using Emformer augmented with convolution for streaming ASR + mechanisms in reworked model |
-
+| `conv_emformer_transducer_stateless` | ConvEmformer | Embedding + Conv1d | Using ConvEmformer for streaming ASR + mechanisms in reworked model |
+| `conv_emformer_transducer_stateless2` | ConvEmformer | Embedding + Conv1d | Using ConvEmformer with simplified memory for streaming ASR + mechanisms in reworked model |
 
 The decoder in `transducer_stateless` is modified from the paper
 [Rnn-Transducer with Stateless Prediction Network](https://ieeexplore.ieee.org/document/9054419/).
312 changes: 312 additions & 0 deletions egs/librispeech/ASR/RESULTS.md
@@ -1,5 +1,317 @@
## Results

### LibriSpeech BPE training results (Pruned Stateless Conv-Emformer RNN-T 2)

[conv_emformer_transducer_stateless2](./conv_emformer_transducer_stateless2)

It implements [Emformer](https://arxiv.org/abs/2010.10759), augmented with a convolution module and a simplified memory bank, for streaming ASR.
The implementation is modified from [torchaudio](https://github.com/pytorch/audio).

See <https://github.com/k2-fsa/icefall/pull/440> for more details.
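
The "simplified memory bank" in this PR's title refers to how the per-chunk memory vectors are formed: each chunk of encoder frames is summarized by its average frame (see the commit "use average value as memory vector for each chunk"), and the most recent `--memory-size` such vectors are kept for later chunks to attend to. Below is a rough sketch of that idea, with hypothetical names and shapes rather than the recipe's actual code:

```python
import torch


def chunk_average_memory(
    x: torch.Tensor, chunk_length: int, memory_size: int
) -> torch.Tensor:
    """Summarize each chunk of frames by its mean vector (hypothetical helper).

    x: (T, D) encoder frames of one utterance.
    Returns up to `memory_size` memory vectors, one per chunk, newest last.
    """
    T, D = x.shape
    num_chunks = (T + chunk_length - 1) // chunk_length
    pad = num_chunks * chunk_length - T
    if pad > 0:
        # Zero-pad the tail so frames split evenly into chunks; for brevity the
        # padded zeros are included in the last chunk's mean.
        x = torch.nn.functional.pad(x, (0, 0, 0, pad))
    memory = x.reshape(num_chunks, chunk_length, D).mean(dim=1)
    return memory[-memory_size:] if memory_size > 0 else memory[:0]


frames = torch.randn(100, 512)  # 100 frames, 512-dim encoder output
mem = chunk_average_memory(frames, chunk_length=32, memory_size=32)
print(mem.shape)  # torch.Size([4, 512])
```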

#### With lower-latency setup, training on full LibriSpeech

In this model, the chunk length and right-context length are 32 frames (i.e., 0.32 s) and 8 frames (i.e., 0.08 s), respectively.

The WERs are:

| | test-clean | test-other | comment | decoding mode |
|-------------------------------------|------------|------------|----------------------|----------------------|
| greedy search (max sym per frame 1) | 3.5 | 9.09 | --epoch 30 --avg 10 | simulated streaming |
| greedy search (max sym per frame 1) | 3.57 | 9.1 | --epoch 30 --avg 10 | streaming |
| fast beam search | 3.5 | 8.91 | --epoch 30 --avg 10 | simulated streaming |
| fast beam search | 3.54 | 8.91 | --epoch 30 --avg 10 | streaming |
| modified beam search | 3.43 | 8.86 | --epoch 30 --avg 10 | simulated streaming |
| modified beam search | 3.48 | 8.88 | --epoch 30 --avg 10 | streaming |

The training command is:

```bash
./conv_emformer_transducer_stateless2/train.py \
--world-size 6 \
--num-epochs 30 \
--start-epoch 1 \
--exp-dir conv_emformer_transducer_stateless2/exp \
--full-libri 1 \
--max-duration 280 \
--master-port 12321 \
--num-encoder-layers 12 \
--chunk-length 32 \
--cnn-module-kernel 31 \
--left-context-length 32 \
--right-context-length 8 \
--memory-size 32
```
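
As an informal sanity check on these flags (a rough calculation, not something stated by the recipe): at 10 ms per input frame, which is consistent with "32 frames = 0.32 s" above, one decoding step needs the 32-frame chunk plus the 8-frame right context before it can be processed.

```python
# Rough algorithmic-latency arithmetic for the lower-latency setup
# (assumes a 10 ms frame shift, matching 32 frames == 0.32 s above).
FRAME_MS = 10
chunk_length, right_context_length = 32, 8
lookahead_s = (chunk_length + right_context_length) * FRAME_MS / 1000
print(lookahead_s)  # 0.4 seconds of audio buffered before each chunk is processed
```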

The tensorboard log can be found at
<https://tensorboard.dev/experiment/W5MpxekiQLSPyM4fe5hbKg/>

The simulated streaming decoding command using greedy search is:
```bash
./conv_emformer_transducer_stateless2/decode.py \
--epoch 30 \
--avg 10 \
--exp-dir conv_emformer_transducer_stateless2/exp \
--max-duration 300 \
--num-encoder-layers 12 \
--chunk-length 32 \
--cnn-module-kernel 31 \
--left-context-length 32 \
--right-context-length 8 \
--memory-size 32 \
--decoding-method greedy_search \
--use-averaged-model True
```

The simulated streaming decoding command using fast beam search is:
```bash
./conv_emformer_transducer_stateless2/decode.py \
--epoch 30 \
--avg 10 \
--exp-dir conv_emformer_transducer_stateless2/exp \
--max-duration 300 \
--num-encoder-layers 12 \
--chunk-length 32 \
--cnn-module-kernel 31 \
--left-context-length 32 \
--right-context-length 8 \
--memory-size 32 \
--decoding-method fast_beam_search \
--use-averaged-model True \
--beam 4 \
--max-contexts 4 \
--max-states 8
```

The simulated streaming decoding command using modified beam search is:
```bash
./conv_emformer_transducer_stateless2/decode.py \
--epoch 30 \
--avg 10 \
--exp-dir conv_emformer_transducer_stateless2/exp \
--max-duration 300 \
--num-encoder-layers 12 \
--chunk-length 32 \
--cnn-module-kernel 31 \
--left-context-length 32 \
--right-context-length 8 \
--memory-size 32 \
--decoding-method modified_beam_search \
--use-averaged-model True \
--beam-size 4
```

The streaming decoding command using greedy search is:
```bash
./conv_emformer_transducer_stateless2/streaming_decode.py \
--epoch 30 \
--avg 10 \
--exp-dir conv_emformer_transducer_stateless2/exp \
--num-decode-streams 2000 \
--num-encoder-layers 12 \
--chunk-length 32 \
--cnn-module-kernel 31 \
--left-context-length 32 \
--right-context-length 8 \
--memory-size 32 \
--decoding-method greedy_search \
--use-averaged-model True
```

The streaming decoding command using fast beam search is:
```bash
./conv_emformer_transducer_stateless2/streaming_decode.py \
--epoch 30 \
--avg 10 \
--exp-dir conv_emformer_transducer_stateless2/exp \
--num-decode-streams 2000 \
--num-encoder-layers 12 \
--chunk-length 32 \
--cnn-module-kernel 31 \
--left-context-length 32 \
--right-context-length 8 \
--memory-size 32 \
--decoding-method fast_beam_search \
--use-averaged-model True \
--beam 4 \
--max-contexts 4 \
--max-states 8
```

The streaming decoding command using modified beam search is:
```bash
./conv_emformer_transducer_stateless2/streaming_decode.py \
--epoch 30 \
--avg 10 \
--exp-dir conv_emformer_transducer_stateless2/exp \
--num-decode-streams 2000 \
--num-encoder-layers 12 \
--chunk-length 32 \
--cnn-module-kernel 31 \
--left-context-length 32 \
--right-context-length 8 \
--memory-size 32 \
--decoding-method modified_beam_search \
--use-averaged-model True \
--beam-size 4
```

Pretrained models, training logs, decoding logs, and decoding results
are available at
<https://huggingface.co/Zengwei/icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05>

#### With higher-latency setup, training on full LibriSpeech

In this model, the chunk length and right-context length are 64 frames (i.e., 0.64 s) and 16 frames (i.e., 0.16 s), respectively.

The WERs are:

| | test-clean | test-other | comment | decoding mode |
|-------------------------------------|------------|------------|----------------------|----------------------|
| greedy search (max sym per frame 1) | 3.3 | 8.71 | --epoch 30 --avg 10 | simulated streaming |
| greedy search (max sym per frame 1) | 3.35 | 8.65 | --epoch 30 --avg 10 | streaming |
| fast beam search | 3.27 | 8.58 | --epoch 30 --avg 10 | simulated streaming |
| fast beam search | 3.31 | 8.48 | --epoch 30 --avg 10 | streaming |
| modified beam search | 3.26 | 8.56 | --epoch 30 --avg 10 | simulated streaming |
| modified beam search | 3.29 | 8.47 | --epoch 30 --avg 10 | streaming |

The training command is:

```bash
./conv_emformer_transducer_stateless2/train.py \
--world-size 4 \
--num-epochs 30 \
--start-epoch 1 \
--exp-dir conv_emformer_transducer_stateless2/exp \
--full-libri 1 \
--max-duration 280 \
--master-port 12321 \
--num-encoder-layers 12 \
--chunk-length 64 \
--cnn-module-kernel 31 \
--left-context-length 64 \
--right-context-length 16 \
--memory-size 32
```

The tensorboard log can be found at
<https://tensorboard.dev/experiment/eRx6XwbOQhGlywgD8lWBjw/>

The simulated streaming decoding command using greedy search is:
```bash
./conv_emformer_transducer_stateless2/decode.py \
--epoch 30 \
--avg 10 \
--exp-dir conv_emformer_transducer_stateless2/exp \
--max-duration 300 \
--num-encoder-layers 12 \
--chunk-length 64 \
--cnn-module-kernel 31 \
--left-context-length 64 \
--right-context-length 16 \
--memory-size 32 \
--decoding-method greedy_search \
--use-averaged-model True
```

The simulated streaming decoding command using fast beam search is:
```bash
./conv_emformer_transducer_stateless2/decode.py \
--epoch 30 \
--avg 10 \
--exp-dir conv_emformer_transducer_stateless2/exp \
--max-duration 300 \
--num-encoder-layers 12 \
--chunk-length 64 \
--cnn-module-kernel 31 \
--left-context-length 64 \
--right-context-length 16 \
--memory-size 32 \
--decoding-method fast_beam_search \
--use-averaged-model True \
--beam 4 \
--max-contexts 4 \
--max-states 8
```

The simulated streaming decoding command using modified beam search is:
```bash
./conv_emformer_transducer_stateless2/decode.py \
--epoch 30 \
--avg 10 \
--exp-dir conv_emformer_transducer_stateless2/exp \
--max-duration 300 \
--num-encoder-layers 12 \
--chunk-length 64 \
--cnn-module-kernel 31 \
--left-context-length 64 \
--right-context-length 16 \
--memory-size 32 \
--decoding-method modified_beam_search \
--use-averaged-model True \
--beam-size 4
```

The streaming decoding command using greedy search is:
```bash
./conv_emformer_transducer_stateless2/streaming_decode.py \
--epoch 30 \
--avg 10 \
--exp-dir conv_emformer_transducer_stateless2/exp \
--num-decode-streams 2000 \
--num-encoder-layers 12 \
--chunk-length 64 \
--cnn-module-kernel 31 \
--left-context-length 64 \
--right-context-length 16 \
--memory-size 32 \
--decoding-method greedy_search \
--use-averaged-model True
```

The streaming decoding command using fast beam search is:
```bash
./conv_emformer_transducer_stateless2/streaming_decode.py \
--epoch 30 \
--avg 10 \
--exp-dir conv_emformer_transducer_stateless2/exp \
--num-decode-streams 2000 \
--num-encoder-layers 12 \
--chunk-length 64 \
--cnn-module-kernel 31 \
--left-context-length 64 \
--right-context-length 16 \
--memory-size 32 \
--decoding-method fast_beam_search \
--use-averaged-model True \
--beam 4 \
--max-contexts 4 \
--max-states 8
```

The streaming decoding command using modified beam search is:
```bash
./conv_emformer_transducer_stateless2/streaming_decode.py \
--epoch 30 \
--avg 10 \
--exp-dir conv_emformer_transducer_stateless2/exp \
--num-decode-streams 2000 \
--num-encoder-layers 12 \
--chunk-length 64 \
--cnn-module-kernel 31 \
--left-context-length 64 \
--right-context-length 16 \
--memory-size 32 \
--decoding-method modified_beam_search \
--use-averaged-model True \
--beam-size 4
```

Pretrained models, training logs, decoding logs, and decoding results
are available at
<https://huggingface.co/Zengwei/icefall-asr-librispeech-conv-emformer-transducer-stateless2-larger-latency-2022-07-06>


### LibriSpeech BPE training results (Pruned Stateless Streaming Conformer RNN-T)

#### [pruned_transducer_stateless](./pruned_transducer_stateless)
@@ -277,10 +277,10 @@ def decode_one_batch(
     supervisions = batch["supervisions"]
     feature_lens = supervisions["num_frames"].to(device)
 
-    feature_lens += params.right_context_length
+    feature_lens += params.chunk_length
     feature = torch.nn.functional.pad(
         feature,
-        pad=(0, 0, 0, params.right_context_length),
+        pad=(0, 0, 0, params.chunk_length),
         value=LOG_EPS,
     )
 
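The change above matches the commit "change tail padding length from right_context_length to chunk_length": in simulated streaming decoding, the features are now padded at the tail by one chunk instead of by the right context. Below is a minimal self-contained sketch of this padding, assuming (batch, frames, feature) inputs and a log-epsilon fill value like the one used in these recipes:

```python
import math

import torch

LOG_EPS = math.log(1e-10)  # assumed fill value; the recipe defines its own constant


def pad_tail(feature, feature_lens, chunk_length):
    """Append `chunk_length` padding frames so the last partial chunk is processed."""
    feature = torch.nn.functional.pad(
        feature,
        # pad spec is (feat_left, feat_right, time_left, time_right):
        # pad only the tail of the time axis.
        pad=(0, 0, 0, chunk_length),
        value=LOG_EPS,
    )
    return feature, feature_lens + chunk_length


feature = torch.randn(2, 100, 80)  # (batch, frames, feature dim)
feature_lens = torch.tensor([100, 73])
feature, feature_lens = pad_tail(feature, feature_lens, chunk_length=32)
print(feature.shape, feature_lens)  # torch.Size([2, 132, 80]) tensor([132, 105])
```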