Add the upgraded Zipformer model #1058
Conversation
The training commands:

Normal-scaled model:

```bash
./zipformer/train.py \
--world-size 4 \
--num-epochs 40 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp \
--causal 0 \
--full-libri 1 \
--max-duration 1000
```

Small-scaled model:

```bash
./zipformer/train.py \
--world-size 2 \
--num-epochs 40 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp-small \
--causal 0 \
--num-encoder-layers 2,2,2,2,2,2 \
--feedforward-dim 512,768,768,768,768,768 \
--encoder-dim 192,256,256,256,256,256 \
--encoder-unmasked-dim 192,192,192,192,192,192 \
--base-lr 0.04 \
--full-libri 1 \
--max-duration 1500
```

Large-scaled model:

```bash
./zipformer/train.py \
--world-size 4 \
--num-epochs 40 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp-large \
--causal 0 \
--num-encoder-layers 2,2,4,5,4,2 \
--feedforward-dim 512,768,1536,2048,1536,768 \
--encoder-dim 192,256,512,768,512,256 \
--encoder-unmasked-dim 192,192,256,320,256,192 \
--full-libri 1 \
--max-duration 1000
```

Streaming (causal) model:

```bash
./zipformer/train.py \
--world-size 4 \
--num-epochs 30 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp-causal \
--causal 1 \
--full-libri 1 \
--max-duration 1000
```
I use the nvtx profiler to visualize the event timeline for our four encoder models: the regular Conformer, the reworked Conformer, the old Zipformer, and the upgraded Zipformer, on a V100-32GB GPU. The models are in inference mode. The input tensor shape of each batch is (20, 3000, 80).
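For context on the workload size, a quick back-of-the-envelope on that batch shape (assuming 80-dim fbank features at 100 frames per second, i.e., a 10 ms frame shift, which is the usual icefall setup; the numbers below are illustrative, not from the PR):

```python
# Batch shape used in the profiling runs: (batch, frames, feature_dim).
batch, frames, feat_dim = 20, 3000, 80

# fp32 input size in bytes: 4 bytes per element.
input_bytes = batch * frames * feat_dim * 4

# At 100 feature frames per second (10 ms frame shift), 3000 frames
# corresponds to 30 seconds of audio per utterance.
seconds_per_utt = frames / 100

print(input_bytes)      # 19200000 bytes (~18.3 MiB)
print(seconds_per_utt)  # 30.0
```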
```python
            causal=causal)

        # TODO: remove it
        self.bypass_scale = nn.Parameter(torch.full((embed_dim,), 0.5))
```
Shall we remove this? I think it's unused.
Yes. But the trained models already have the `bypass_scale` parameter... If we remove it, loading the saved model and optimizer state_dicts would fail.
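One hedged way around that kind of state_dict mismatch (a sketch of a general PyTorch technique, not necessarily what this PR should do): load with `strict=False`, which loads the matching keys and reports the stale one instead of raising. The toy modules below are illustrative stand-ins.

```python
import torch
import torch.nn as nn

class OldModule(nn.Module):
    # Toy stand-in for the saved model, which still has bypass_scale.
    def __init__(self, dim: int = 4):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.bypass_scale = nn.Parameter(torch.full((dim,), 0.5))

class NewModule(nn.Module):
    # Toy stand-in for the model with bypass_scale removed.
    def __init__(self, dim: int = 4):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

saved = OldModule().state_dict()
model = NewModule()

# strict=True (the default) raises a RuntimeError on the unexpected
# key; strict=False loads the matching keys and reports the rest.
result = model.load_state_dict(saved, strict=False)
print(result.unexpected_keys)  # ['bypass_scale']
```

Note that the optimizer state_dict would still need separate handling, since its entries are keyed by parameter position rather than name.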
```
@@ -0,0 +1,123 @@
```

```python
# Copyright 2021 Xiaomi Corp. (authors: Fangjun Kuang)
```

Suggested change:

```diff
-# Copyright 2021 Xiaomi Corp. (authors: Fangjun Kuang)
+# Copyright 2021-2023 Xiaomi Corp. (authors: Zengwei Yao)
```
```python
        value_head_dim (int or Tuple[int]): dimension of value in each attention head
        pos_head_dim (int or Tuple[int]): dimension of positional-encoding projection per
```

Please switch the order of the docs for `value_head_dim` and `pos_head_dim` to match the actual argument order.
```python
    respectively.
    """
    def __init__(self, *args):
        assert len(args) >= 1
```

Suggested change:

```diff
-        assert len(args) >= 1
+        assert len(args) >= 1, len(args)
```
```python
        else:
            self.pairs = [ (float(x), float(y)) for x,y in args ]
            for (x,y) in self.pairs:
                assert isinstance(x, float) or isinstance(x, int)
```

Suggested change:

```diff
-                assert isinstance(x, float) or isinstance(x, int)
+                assert isinstance(x, (float, int)), type(x)
```
```python
            self.pairs = [ (float(x), float(y)) for x,y in args ]
            for (x,y) in self.pairs:
                assert isinstance(x, float) or isinstance(x, int)
                assert isinstance(y, float) or isinstance(y, int)
```

Suggested change:

```diff
-                assert isinstance(y, float) or isinstance(y, int)
+                assert isinstance(y, (float, int)), type(y)
```
```python
                * [(sp[0], sp[1] + xp[1]) for sp, xp in zip(s.pairs, x.pairs)])

    def max(self, x):
        if isinstance(x, float) or isinstance(x, int):
```

Suggested change:

```diff
-        if isinstance(x, float) or isinstance(x, int):
+        if isinstance(x, (float, int)):
```
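The pattern these suggestions keep applying can be shown in a standalone sketch (a hypothetical helper, not code from the PR): pass a tuple of types to a single `isinstance` call, and put the offending type in the assert message so failures are self-explanatory.

```python
def to_float(x):
    # One isinstance call with a tuple of accepted types; the second
    # argument to assert is displayed when the check fails.
    assert isinstance(x, (float, int)), type(x)
    return float(x)

print(to_float(3))  # 3.0

try:
    to_float("3")
except AssertionError as e:
    print(e)        # <class 'str'>
```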
```python
        include_crossings: if true, include in the x values positions
                           where the functions indicated by this and p cross.
        """
        assert isinstance(p, PiecewiseLinear)
```

Suggested change:

```diff
-        assert isinstance(p, PiecewiseLinear)
+        assert isinstance(p, PiecewiseLinear), type(p)
```
```python
        assert isinstance(p, PiecewiseLinear)

        # get sorted x-values without repetition.
        x_vals = sorted(set([ x for x, y in self.pairs ] + [ x for x, y in p.pairs ]))
```

Suggested change:

```diff
-        x_vals = sorted(set([ x for x, y in self.pairs ] + [ x for x, y in p.pairs ]))
+        x_vals = sorted(set([ x for x, _ in self.pairs ] + [ x for x, _ in p.pairs ]))
```
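The intent of that line can be illustrated in isolation (a minimal sketch assuming `pairs` holds the `(x, y)` knots of a piecewise-linear function, as in the class under review):

```python
def merged_knots(pairs_a, pairs_b):
    # Sorted union of the x-positions (knots) from two piecewise-linear
    # functions; the y-values are irrelevant here, which is why the
    # suggestion renames them to `_`.
    return sorted(set([x for x, _ in pairs_a] + [x for x, _ in pairs_b]))

print(merged_knots([(0.0, 1.0), (2.0, 3.0)], [(1.0, 0.0), (2.0, 5.0)]))
# [0.0, 1.0, 2.0]
```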
```python
    Example:
       self.dropout = ScheduledFloat((0.0, 0.2), (4000.0, 0.0), default=0.0)

    `default` is used when self.batch_count is not set or in training or mode or in
```

> in training or mode

Please fix the typo.
```python
    def __float__(self):
        batch_count = self.batch_count
        if batch_count is None or not self.training or torch.jit.is_scripting():
```

Should it be `not torch.jit.is_scripting()`? Also, should we add `not torch.jit.is_tracing()`?
Ok. Thanks.
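A sketch of the resulting condition, as I read the thread (with a `torch.jit.is_tracing()` check added next to the scripting check; the class below is a toy stand-in, not the actual ScheduledFloat):

```python
import torch

class ScheduledFloatSketch:
    """Toy stand-in: returns a scheduled value during normal training,
    and a fixed default otherwise."""

    def __init__(self, default: float = 0.0):
        self.default = default
        self.batch_count = None  # set by the training loop elsewhere
        self.training = True

    def _scheduled_value(self, batch_count: float) -> float:
        # Placeholder for the piecewise-linear schedule lookup.
        return self.default

    def __float__(self):
        # Fall back to `default` when there is no batch count, when not
        # training, or when the module is being scripted or traced
        # (the schedule depends on Python-side state that TorchScript
        # cannot follow).
        if (self.batch_count is None
                or not self.training
                or torch.jit.is_scripting()
                or torch.jit.is_tracing()):
            return float(self.default)
        return self._scheduled_value(self.batch_count)

s = ScheduledFloatSketch(default=0.2)
print(float(s))  # 0.2 (batch_count is None, so the default is used)
```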
May I ask when we could use the latest Zipformer? It seems that I cannot find the related code yet.
Whenever you want. Please see k2-fsa/sherpa#379
Yeah, that's great. But I would like to find the training recipe to train the latest Zipformer on my own dataset. Should I just use the
Yes, you can find the usage in RESULTS.md
Hello developers, I have some questions about simulated streaming (time masking). How exactly is simulated streaming implemented? Is there a paper or code for this? Best wishes. @yaozengwei
Hi, you can find the code for simulated streaming decoding in the librispeech zipformer recipe. Simulated streaming is enabled by setting `causal` to 1; also remember to set proper values for `left-context-frames` and `chunk-size`.
Best,
Jin
This PR adds a new recipe for the upgraded Zipformer-Transducer model from @danpovey (see #1057 for the detailed commit history). Compared to the old recipe (pruned_transducer_stateless7), the new model achieves better accuracy with lower memory usage and faster computation.
We will mainly maintain this recipe in the coming days. Other features (e.g., the CTC & attention-decoder model, multi-dataset training, language model rescoring, delay penalty, etc.) will be added to this recipe.
Our models are trained with the pruned transducer loss on the full LibriSpeech dataset (with 0.9 and 1.1 speed perturbation), using automatic mixed-precision training.
The normal-scaled model (66.1 M parameters), trained with max-duration=1000, takes about 1h9m per epoch on 4 V100-32GB GPUs.
greedy_search decoding at epoch-30-avg-8.
The chunk size is measured at the 50 Hz encoder frame rate.
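Since the chunk size is counted at the 50 Hz encoder frame rate (after subsampling), the theoretical per-chunk latency follows directly; the chunk sizes below are illustrative values, not necessarily the ones used in the results tables.

```python
ENCODER_FRAME_RATE_HZ = 50  # encoder frames per second after subsampling

def chunk_latency_ms(chunk_frames: int) -> float:
    # Each encoder frame covers 1/50 s = 20 ms of audio, so a chunk of
    # N frames spans N * 20 ms.
    return chunk_frames / ENCODER_FRAME_RATE_HZ * 1000.0

for n in (16, 32, 64):  # hypothetical chunk sizes
    print(n, chunk_latency_ms(n))  # 320.0, 640.0, 1280.0 ms
```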