
Replicate NST fine-tuning on Alvenir machine to sniff out training or data issues #32

Closed
sorenmulli opened this issue Aug 29, 2023 · 28 comments
Labels: experimentation (Experiment, resulting in new learnings)

@sorenmulli
Collaborator

The loss curve has looked wrong when training on NST, and we want to see whether it also happens on other compute.

sorenmulli added the experimentation label on Aug 29, 2023
sorenmulli self-assigned this on Aug 29, 2023
@sorenmulli
Collaborator Author

Please let me know when this is good to go, @saattrupdan - should I wait for #29 to be merged?

Should I use finetune_model.py with the wav2vec2.yaml model config and the nst_da.yaml dataset config? Or is there somewhere I can find the exact run command? :)

@saattrupdan
Collaborator

You should probably wait for #29 to be merged first, yeah.

As for the command, you can run:

$ python src/scripts/finetune_model.py model=wav2vec2 datasets=[nst_da,common_voice_da] dataset_probabilities=[0.95,0.05]

Note that NST-da doesn't have any evaluation split, so only the Common Voice evaluation split is used. You can also simply try to overfit the NST-da dataset completely by running the following, which disables all regularisation and trains solely on NST-da (without any evaluation set). If I replace nst_da with common_voice_da, I can make it work (training loss ≈ 0), but I haven't succeeded with NST-da, which might be either (a) a fault in the dataset, or (b) simply that it's so large that it takes much longer to get to 0.

$ python src/scripts/finetune_model.py model=wav2vec2 datasets=nst_da model.activation_dropout=0 model.attention_dropout=0 model.hidden_dropout=0 model.feat_proj_dropout=0 model.final_dropout=0 model.mask_time_prob=0 model.mask_feature_prob=0 model.layerdrop=0 early_stopping=false

@sorenmulli
Collaborator Author

Currently running the NST overfitting training, will report back with progress :)

@sorenmulli
Collaborator Author

Initial progress after 1600 steps looks fine: the training loss started around 25 and is currently below 0.4.

@saattrupdan
Collaborator

saattrupdan commented Sep 12, 2023

Initial progress after 1600 steps looks fine: the training loss started around 25 and is currently below 0.4.

Nice! What are your hyperparameters? Default?

@sorenmulli
Collaborator Author

I ran it with the defaults, using your command:

$ python src/scripts/finetune_model.py model=wav2vec2 datasets=nst_da model.activation_dropout=0 model.attention_dropout=0 model.hidden_dropout=0 model.feat_proj_dropout=0 model.final_dropout=0 model.mask_time_prob=0 model.mask_feature_prob=0 model.layerdrop=0 early_stopping=false

Now, after 27K steps (~4 epochs), I get a sudden drop where the loss successfully hits 0, so maybe there is no fault with the dataset and you are correct that it is:

simply because it's so large that it takes way longer to get to 0.

The curve looks very funky, but I guess that might be fine for this overfitting task.
Should I try running the other command to test that? (Though the larger GPU is occupied for the next couple of days.)

[image: training loss curve]

@saattrupdan
Collaborator

saattrupdan commented Sep 13, 2023

Aha, that's good to hear. I only finetuned for 10k steps, so must've just been that.

And yep, please run the first command that I sent, to see if you can reproduce the results of your finetuned model (https://huggingface.co/chcaa/alvenir-wav2vec2-base-da-nst-cv9). You might have to tweak early_stopping_patience in case it plateaus for a long time, as in your plot above. Currently it's set to 50 steps. Alternatively, you can just disable early stopping completely by setting early_stopping=false.
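For example, something along these lines (the exact override path for the patience may differ):

$ python src/scripts/finetune_model.py model=wav2vec2 datasets=[nst_da,common_voice_da] dataset_probabilities=[0.95,0.05] early_stopping_patience=500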

@sorenmulli
Collaborator Author

I ran

$ python src/scripts/finetune_model.py model=wav2vec2 datasets='[nst_da,common_voice_da]' dataset_probabilities.train='[0.95,0.05]' early_stopping=false

and sadly, this looks very strange :(
[image: loss curves]

Furthermore, the WER is 100% during the entire training - something is certainly wrong.
I note that the sudden fall to 0 loss happens at the same point, after 27K steps - this time preceded by a spike in the loss.

There is some debugging to do, I suppose.
I can look at it later this week; any tips on how to approach this?

The trainer state of the training: trainer_state.json. Let me know if I can share any more details that can help debugging.

@saattrupdan
Collaborator

I see in your logs that the 0-loss is NaN rather than 0, so there might be some overflow/underflow issue going on here. But it's super weird that the WER stays constant.

Also, the validation loss seems "stuck" after 4600 steps, which is also super strange. What happens if you crank up the gradient accumulation, which makes the effective batch size larger and hopefully more stable as well?

$ python src/scripts/finetune_model.py model=wav2vec2 datasets='[nst_da,common_voice_da]' dataset_probabilities.train='[0.95,0.05]' early_stopping=false model.gradient_accumulation=16

This won't fix the NaN validation loss issue though. Some things to try out:

  1. Manually check what comes out of the data collator during both training and validation. We replace padding tokens with -100; is the entire batch converted to -100 for some reason, say?
  2. Found a bug in the wav2vec2 module: word_delimiter_token is set to |, where this should now be " " (i.e., a space).
  3. Check whether self.processor.tokenizer.get_vocab actually fetches the correct vocab, in the wav2vec2 module.
  4. Check whether do_eval is set to True in the wav2vec2 module. It also doesn't seem right that we're calling values() on self.cfg.datasets, since it's a list?
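For point (1), a rough sketch of the kind of check I mean (data_collator, train_dataset and processor are placeholders for whatever the training script builds, not the repo's actual names):

# Pull one batch through the collator and sanity-check the labels.
batch = data_collator([train_dataset[i] for i in range(4)])
labels = batch["labels"]
print("fraction of -100 labels:", (labels == -100).float().mean().item())

# Decode the labels back to text (replacing -100 with the pad id first)
# to check that the targets look like real transcriptions.
label_ids = labels.clone()
label_ids[label_ids == -100] = processor.tokenizer.pad_token_id
print(processor.batch_decode(label_ids))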

@sorenmulli
Collaborator Author

I see in your logs that the 0-loss is NaN rather than 0, so there might be some overflow/underflow issue going on here. But it's super weird that the WER stays constant.

Sure? Isn't it just the eval loss that turns into NaN after step 27440?
But yes, also seems suspiciously overflowy with the high jump.

Here is the plotted result of running with higher grad accum. Behaviour looks similar.
[image: loss curves]

I might have time to look at your debugging suggestions tomorrow, ended up not having time today :/

@saattrupdan
Collaborator

I see in your logs that the 0-loss is NaN rather than 0, so there might be some overflow/underflow issue going on here. But it's super weird that the WER stays constant.

Sure? Isn't it just the eval loss that turns into NaN after step 27440? But yes, also seems suspiciously overflowy with the high jump.

Here is the plotted result of running with higher grad accum. Behaviour looks similar. [image: loss curves]

I might have time to look at your debugging suggestions tomorrow, ended up not having time today :/

Ah yeah, just the evaluation loss, my bad.

And fair enough. I'm guessing that the typo in point (2) will affect the WER at least, since that's presumably where it gets its notion of a "word" from, so that could be the first thing to check when you've got the time 🙂

@sorenmulli
Collaborator Author

  1. Manually check what comes out of the data collator during both training and validation. We replace padding tokens with -100; is the entire batch converted to -100 for some reason, say?

From superficial inspection, it seems to work.
Some examples do have a lot of padding - I even see some examples that are all padding.
I have a version training now that logs the number of -100's in each batch, to see whether the fateful step happens because of a batch where every example is all padding. Maybe that's a pretty good explanation? If so, we should probably remove empty audio segments.
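For reference, the logging I added is roughly along these lines (a sketch - the real version wraps the repo's own collator, and the names here are placeholders):

class LoggingCollator:
    """Wraps an existing data collator and logs how much of each batch is -100 padding."""

    def __init__(self, collator):
        self.collator = collator

    def __call__(self, features):
        batch = self.collator(features)
        labels = batch["labels"]
        pad_mask = labels == -100
        n_all_padding = pad_mask.all(dim=1).sum().item()
        print(
            f"-100 fraction: {pad_mask.float().mean().item():.2%}, "
            f"all-padding examples: {n_all_padding}/{labels.shape[0]}"
        )
        return batch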

  2. Found a bug in the wav2vec2 module: word_delimiter_token is set to |, where this should now be " " (i.e., a space).

The version currently training has this fixed, so we should hopefully be able to get some insight from eval.

  3. Check whether self.processor.tokenizer.get_vocab actually fetches the correct vocab, in the wav2vec2 module.

It does produce a nice-looking vocab.
It is certainly ordered differently compared to previous models (e.g. chcaa/alvenir-wav2vec2-base-da-nst-cv9). (Would it help if I just sorted the characters before dumping the vocabulary?)

{'j': 0, 'é': 1, 'a': 2, 'æ': 3, '2': 4, 'k': 5, 'f': 6, '8': 7, '7': 8, 'n': 9, 'q': 10, '9': 11, 'u': 12, 'v': 13, 's': 14, 'g': 15, 'z': 16, 'b': 17, 'ø': 18, '5': 19, '6': 20, 'm': 21, 'x': 22, '3': 23, 'l': 24, 'o': 25, '4': 26, 'c': 27, 'w': 28, 'd': 29, 'r': 30, 'i': 31, 'h': 32, '1': 33, 'å': 34, 't': 35, ' ': 36, 'ü': 37, '0': 38, 'p': 39, 'y': 40, 'e': 41, '<unk>': 42, '<pad>': 43, '<s>': 44, '</s>': 45}
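If it helps, sorting before dumping could be as simple as something like this (a sketch - the file name and special-token handling are assumptions):

import json

# Sort the ordinary characters and keep the special tokens last, so the
# vocab file is stable across runs. `vocab` is the dict shown above.
special_tokens = ["<unk>", "<pad>", "<s>", "</s>"]
chars = sorted(tok for tok in vocab if tok not in special_tokens)
sorted_vocab = {tok: idx for idx, tok in enumerate(chars + special_tokens)}
with open("vocab.json", "w") as f:
    json.dump(sorted_vocab, f, ensure_ascii=False)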
  4. Check whether do_eval is set to True in the wav2vec2 module. It also doesn't seem right that we're calling values() on self.cfg.datasets, since it's a list?

It is. self.cfg.datasets is apparently a DictConfig. For me, it is:

{'nst_da': {'id': 'alexandrainst/nst-da', 'subset': None, 'train_name': 'train', 'val_name': None, 'test_name': 'test', 'text_column': 'text'}, 'common_voice_da': {'id': 'mozilla-foundation/common_voice_13_0', 'subset': 'da', 'train_name': 'train', 'val_name': 'validation', 'test_name': 'test', 'text_column': 'sentence'}}

@saattrupdan
Collaborator

  1. Manually check what comes out of the data collator during both training and validation. We replace padding tokens with -100; is the entire batch converted to -100 for some reason, say?

From superficial inspection, it seems to work. Some examples do have a lot of padding - I even see some examples that are all padding. I have a version training now that logs the number of -100's in each batch, to see whether the fateful step happens because of a batch where every example is all padding. Maybe that's a pretty good explanation? If so, we should probably remove empty audio segments.

Interesting regarding the all -100's. Makes sense to remove them I guess, yep.

It does produce a nice-looking vocab. It is certainly ordered differently compared to previous models (e.g. chcaa/alvenir-wav2vec2-base-da-nst-cv9). (Would it help if I just sorted the characters before dumping the vocabulary?)

Sure, feel free to sort them before dumping them 🙂

Excited to see if the "|" replacement made any difference!

@sorenmulli
Collaborator Author

Maaan, this didn't help anything. Still getting 0 train loss and NaN eval loss from step 27560, and a WER of 1 all the way through. I didn't find a batch where every example was all -100's :(
I guess more in-depth debugging of what happens in the evaluation, plus the significance of this specific step, is required. Did you also get the problem at this step?

@saattrupdan
Collaborator

Hmm that's really annoying. I think I'll have some time to look at it next week. Maybe try saving checkpoints so that we have the checkpoint and trainer state just before it gets to 0 loss? That would help with the debugging at least.

@sorenmulli
Collaborator Author

That sounds nice, I will try to get some debugging material later this week :)

@sorenmulli
Collaborator Author

Sorry, my debug run failed before hitting step 27K because I accidentally saved outputs to a disk that ran out of space. Setting it up again now to get checkpoints before/after step 27K, but it will probably take a couple of days on the GPU that is free atm.

@sorenmulli
Collaborator Author

Now I have some checkpoints around the fateful step: https://drive.google.com/drive/folders/1TO9kHtZrf27ZgNS3RpPJRXP2XoUtM3j2?usp=sharing (just request access)
Checkpoint-27400 is just before and checkpoint-27500 just after (see log excerpt below).

(And wtf, gzip seems to be able to compress the step 27500 model into almost nothing even though both checkpoint dirs originally were 3.6GB??? Model must be all zeros or something crazy)

{'loss': 3.4909, 'learning_rate': 2.8414528699927665e-08, 'epoch': 4.12}                                               
{'loss': 3.4932, 'learning_rate': 2.7358640664088553e-08, 'epoch': 4.12}                                               
{'loss': 3.5171, 'learning_rate': 2.6322726410502595e-08, 'epoch': 4.12}                                               
{'loss': 3.481, 'learning_rate': 2.530678732109937e-08, 'epoch': 4.12}                                                 
{'loss': 3.5319, 'learning_rate': 2.431082475116142e-08, 'epoch': 4.12}                                                
{'loss': 3.4564, 'learning_rate': 2.3334840029319293e-08, 'epoch': 4.12}                                               
{'loss': 3.5391, 'learning_rate': 2.2378834457558173e-08, 'epoch': 4.12}                                               
{'loss': 3.5177, 'learning_rate': 2.144280931120457e-08, 'epoch': 4.12}                                                
{'loss': 3.4667, 'learning_rate': 2.052676583893298e-08, 'epoch': 4.12}                                                
{'loss': 3.4484, 'learning_rate': 1.963070526276589e-08, 'epoch': 4.12}                                                
***** Running Evaluation *****                                                                                         
  Num examples: Unknown                                                                                                
  Batch size = 8                                                                                                       
{'eval_loss': 3.5697641372680664, 'eval_wer': 1.0, 'eval_runtime': 80.0147, 'eval_samples_per_second': 27.77, 'eval_steps_per_second': 3.474, 'epoch': 4.12}
Saving model checkpoint to /mnt/data/asr-training/coral-models/models/wav2vec2-finetuned/checkpoint-27400              
Configuration saved in /mnt/data/asr-training/coral-models/models/wav2vec2-finetuned/checkpoint-27400/config.json      
Model weights saved in /mnt/data/asr-training/coral-models/models/wav2vec2-finetuned/checkpoint-27400/pytorch_model.bin
tokenizer config file saved in /mnt/data/asr-training/coral-models/models/wav2vec2-finetuned/checkpoint-27400/tokenizer_config.json
Special tokens file saved in /mnt/data/asr-training/coral-models/models/wav2vec2-finetuned/checkpoint-27400/special_tokens_map.json
Deleting older checkpoint [/mnt/data/asr-training/coral-models/models/wav2vec2-finetuned/checkpoint-27000] due to args.save_total_limit
{'loss': 3.5564, 'learning_rate': 1.8754628778060446e-08, 'epoch': 4.12}                                               
{'loss': 3.4582, 'learning_rate': 1.7898537553521775e-08, 'epoch': 4.12}                                               
{'loss': 3.4657, 'learning_rate': 1.706243273118968e-08, 'epoch': 4.12}                                                
{'loss': 140.4032, 'learning_rate': 1.6979921544251764e-08, 'epoch': 4.12}                                             
{'loss': 0.0, 'learning_rate': 1.6979921544251764e-08, 'epoch': 4.12}                                                  
{'loss': 0.0, 'learning_rate': 1.6979921544251764e-08, 'epoch': 4.12}                                                  
{'loss': 0.0, 'learning_rate': 1.6979921544251764e-08, 'epoch': 4.12}
{'loss': 0.0, 'learning_rate': 1.6979921544251764e-08, 'epoch': 4.12}                                                  
{'loss': 0.0, 'learning_rate': 1.6979921544251764e-08, 'epoch': 4.12}
{'loss': 0.0, 'learning_rate': 1.6979921544251764e-08, 'epoch': 4.12}
***** Running Evaluation *****
  Num examples: Unknown                                    
  Batch size = 8                                           
{'eval_loss': nan, 'eval_wer': 1.0, 'eval_runtime': 81.174, 'eval_samples_per_second': 27.373, 'eval_steps_per_second': 3.425, 'epoch': 4.12}
Saving model checkpoint to /mnt/data/asr-training/coral-models/models/wav2vec2-finetuned/checkpoint-27500
Configuration saved in /mnt/data/asr-training/coral-models/models/wav2vec2-finetuned/checkpoint-27500/config.json
Model weights saved in /mnt/data/asr-training/coral-models/models/wav2vec2-finetuned/checkpoint-27500/pytorch_model.bin
tokenizer config file saved in /mnt/data/asr-training/coral-models/models/wav2vec2-finetuned/checkpoint-27500/tokenizer_config.json
Special tokens file saved in /mnt/data/asr-training/coral-models/models/wav2vec2-finetuned/checkpoint-27500/special_tokens_map.json
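A quick way to check the all-zeros suspicion would be something like this (a sketch, using the checkpoint paths from the log above):

import torch

# Compare the weights just before and just after the loss collapse.
base = "/mnt/data/asr-training/coral-models/models/wav2vec2-finetuned"
for step in (27400, 27500):
    state = torch.load(f"{base}/checkpoint-{step}/pytorch_model.bin", map_location="cpu")
    tensors = [t for t in state.values() if t.is_floating_point()]
    n_params = sum(t.numel() for t in tensors)
    n_zeros = sum((t == 0).sum().item() for t in tensors)
    n_nans = sum(torch.isnan(t).sum().item() for t in tensors)
    print(f"checkpoint-{step}: {n_zeros / n_params:.1%} zeros, {n_nans / n_params:.1%} NaNs")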

@saattrupdan
Collaborator

Now I have some checkpoints around the fateful step: https://drive.google.com/drive/folders/1TO9kHtZrf27ZgNS3RpPJRXP2XoUtM3j2?usp=sharing (just request access) Checkpoint-27400 is just before and checkpoint-27500 just after (see log excerpt below).

(And wtf, gzip seems to be able to compress the step 27500 model into almost nothing even though both checkpoint dirs originally were 3.6GB??? Model must be all zeros or something crazy)


Really interesting with the high compression - it sounds like some kind of under/overflow happening somewhere. I've requested access now 🙂

@saattrupdan
Collaborator

Also did some more experiments related to overfitting. I tried disabling all regularisation again, and then finetuned the model with varying dataset_probabilities. Here are the findings:

  1. At 95% NST, the flatline + mysterious NaNs happen.
  2. At 50% NST the same flatline happens, and I presume that the NaNs happen at some point as well, but just way later than 30k steps.
  3. At 10% NST the loss starts going down, albeit very slowly (see plots below).
  4. Changing all_exhausted to first_exhausted doesn't make any difference.
  5. If finetuning on both NST and CV, but setting dataset probabilities to 100% CV, then the results are similar to only finetuning on CV - so there's nothing wrong with the dataset_probabilities argument, at least.

Points (1)-(3) seem to indicate that there is a real tension between NST and CV: we can overfit them separately, but when we start interleaving them, the model struggles to overfit.

Charts related to (3):
[images: two W&B charts, 04/10/2023]

@saattrupdan
Collaborator

saattrupdan commented Oct 4, 2023

Another curious observation:

The Alvenir model used this fairseq config, which sets dataset.max_tokens=1_000_000. I just ran the associated tokeniser on all of NST-da, and the entire dataset has about 10.5M tokens. Using only 1M tokens would thus mean using less than 10% of NST-da. Can you confirm that that's true? Because that changes the total number of training steps quite a bit in our case.

@saattrupdan
Collaborator

And yet another one:

I can't seem to find the batch size used in fairseq, but the XLSR paper, which is what the Alvenir model seems to follow, mentions that the effective batch size is 0.44 hours of audio, roughly corresponding to a batch size of 256 - way bigger than our current 32. A larger batch might stabilise training, and the model would of course also see way more samples per update.
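As a rough check of how those numbers fit together (the implied average utterance length is my own back-of-the-envelope figure, not from the paper):

# 0.44 hours of audio per effective batch
batch_audio_seconds = 0.44 * 3600                    # = 1584 s
implied_avg_utterance = batch_audio_seconds / 256    # ≈ 6.2 s per sample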

@sorenmulli
Collaborator Author

Also did some more experiments related to overfitting

Looks interesting, especially if the cause is inter-dataset heterogeneity/tension. Could it also be some error in the NST data?

dataset.max_tokens=1_000_000
batch size

I must admit that the trainings are before my time and I haven't worked much with fairseq ASR (I can try to get more info from Martin and Lasse), but my fairseq understanding tells me:

  • dataset.max_tokens is the number of tokens in a per-device mini-batch, as fairseq does not operate on a fixed number of examples but on length-normalised batches measured in either sentences or tokens. This means that the per-device mini-batch size in terms of examples varies. See the plot below of this value (called bsz in fairseq), which should be from a run with the same config. I sadly cannot figure out right now why this number is a maximum, or when fairseq decides to go lower than it.

  • update_freq then acts as the number of gradient accumulation steps.

  • So one parameter update would include data from at most max_tokens * update_freq * number of devices tokens. Here I get 32M tokens for the Alvenir model. I must admit I am a little unsure whether this number is in terms of raw audio samples; if so, it would give a max effective batch size of 33 minutes at 16 kHz (quick check at the end of this comment).

Number of examples in each per-device mini-batch:
[image: per-device mini-batch sizes (bsz) over the run]
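Quick numeric check of the 33-minute figure (treating each 16 kHz audio sample as one token, which is exactly the part I am unsure about):

# 32M "tokens" per update, read as raw 16 kHz audio samples
effective_tokens = 32_000_000
minutes_of_audio = effective_tokens / 16_000 / 60    # ≈ 33.3 minutes per parameter update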

@sorenmulli
Collaborator Author

Dangit, I also found this plot which contradicts my understanding a bit :((

I'll try to get these things answered tomorrow!

[image]

@saattrupdan
Collaborator

Dangit, I also found this plot which contradicts my understanding a bit :((

I'll try to get these things answered tomorrow!

[image]

I think this plot might be showing number of transcription tokens, since with ~50 transcription tokens per sample it would correspond to roughly the same batch size as your first plot.

Regarding max_tokens: I see! I never thought of each audio segment as a token, but it does seem like that's what's happening here, probably because fairseq is the same framework for both text models and audio models. And then the batch size makes sense too.

In any case, the previous effective batch size in this repo was 32, which is roughly 10x smaller than the Alvenir config, so that might explain a bit. I've tried replicating the Alvenir config more closely now and have started a run. However, I've reduced the number of optimisation steps from 120k to 12k, since the number of steps they mention in the XLS-R paper is closer to that.
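Concretely, if the per-device batch size is 8 (which I haven't double-checked), an effective batch size of 256 would mean a gradient accumulation of 32, along the lines of:

$ python src/scripts/finetune_model.py model=wav2vec2 datasets='[nst_da,common_voice_da]' dataset_probabilities.train='[0.95,0.05]' model.gradient_accumulation=32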

@sorenmulli
Collaborator Author

Hoping that a larger batch size can make it work 🤞
Otherwise, I can try to debug the NST data format next week and set up a debug training with our preprocessed NST, if that makes sense.

@saattrupdan
Collaborator

Hoping that a larger batch size can make it work 🤞 Otherwise, I can try to debug the NST data format next week and set up a debug training with our preprocessed NST, if that makes sense.

Sounds good. One thing I noticed is that there are a decent number of samples in the NST dataset with no transcription (because nothing is being said). I've removed those now and recompiled the dataset. I'm in the process of uploading it to the HF Hub, so a new version should be up today (with a larger shard size as well).
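For reference, the filtering amounts to something along these lines (a sketch - the real preprocessing script may differ; the dataset id and text column are the ones from the config above):

from datasets import load_dataset

# Drop NST samples whose transcription is empty (nothing is being said).
nst = load_dataset("alexandrainst/nst-da", split="train")
nst = nst.filter(lambda example: len(example["text"].strip()) > 0)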

@saattrupdan
Collaborator

Looks like the larger batch size fixed things - loss and WER curves are looking much better now:

[images: WER and loss curves]

They are clearly underfitting, so we should be able to push the WER down way more 🤞

Closing here for now.
