Illegal memory access during zipformer training #1764
Here is another, different error I get by running the same script again. These errors seem to happen at random times during training, but within a few checkpoints. 2024-09-30 15:06:46,412 INFO [train.py:1122] (0/2) Epoch 1, batch 400, loss[loss=1.346, simple_loss=0.8059, pruned_loss=0.9432, over 1061.00 frames. ], tot_loss[loss=1.241, simple_loss=0.7526, pruned_loss=0.8645, over 191645.04 frames. ], batch size: 4, lr: 1.79e-02, grad_scale: 0.125 Exception raised from c10_cuda_check_implementation at /mnt/dsk1/home/ngoel/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first): terminate called after throwing an instance of 'c10::DistBackendError'
I can now confirm that the error is specific to the zipformer recipe. For example, pruned-transducer-stateless7 runs fine. If someone guides me towards steps equivalent to what's described here, I will happily do those and report.
I believe this has something to do with the torch CUDA not being compatible with the CUDA installed in /usr/local/cuda-XX.X. But it is surprising that the error is specific to the zipformer recipe. Perhaps there is improper synchronization when doing distributed training. Have you tried the zipformer recipe on a single GPU?
I think the possible mismatches would be the torch CUDA not being compatible with the driver (although this should refuse to run), or not being compatible with the CUDA version that was used to compile k2 (although this should be detected some other way, I think).
Thanks for the various suggestions. I have multiple experiments to run here to narrow down the cause. I'll get back to you. In the meantime, I would like to know about these CUDA versions. My understanding is that if I have (for example) a CUDA 12.xx driver (the one that comes with that version of CUDA), it should also be compatible with an earlier version of CUDA such as 11.xx. Is that your understanding also, or is there a compatibility matrix I can find somewhere?
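For reference, a quick way to see which CUDA runtime a torch build was compiled against, and whether the installed driver can run it, is a short check like the sketch below (not from the thread, just the standard torch APIs). The driver version itself shows up in nvidia-smi or python -m torch.utils.collect_env, and k2's own build info is already printed in the env_info block of the training log.

import torch

# CUDA runtime this torch build was compiled against
print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)

# Whether the installed driver can actually run this build
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))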
On a single GPU I got this assert error. 2024-10-02 11:19:36,387 INFO [train.py:1122] Epoch 1, batch 6000, loss[loss=0.4327, simple_loss=0.4287, pruned_loss=0.2184, over 8488.00 frames. ], tot_loss[loss=0.4947, simple_loss=0.4712, pruned_loss=0.2591, over 1707838.56 frames. ], batch size: 35, lr: 2.00e-02, grad_scale: 64.0
After enabling it: During handling of the above exception, another exception occurred: Traceback (most recent call last):
After adding .contiguous() to q and k before the matmul mentioned above, that error did not happen in the last 2 runs. Now the error that happens every time is this: During handling of the above exception, another exception occurred: Traceback (most recent call last):
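For context, this is roughly the kind of change being described; the function and the tensor names q and k are placeholders for illustration, not the actual zipformer.py source.

import torch

def attn_scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    # Workaround sketch: force contiguous memory layouts before the batched
    # matmul. Non-contiguous views produced by permute()/slicing are normally
    # handled fine, but making them contiguous here helps rule out a
    # stride-related kernel problem.
    q = q.contiguous()
    k = k.contiguous()
    return torch.matmul(q, k.transpose(-2, -1))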
I have downgraded the PyTorch version to 2.4. This time I got the error as follows.... I am putting .contiguous() around x and s. It did more batches than before.....
I think the reason for this error is that there is a sample that has a transcript but has no features. But to confirm, could you try this: #1733 (comment)
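One way to look for such samples is to scan the training cuts with lhotse, as in the sketch below; this assumes the manifests are ordinary lhotse CutSets, and the path is a hypothetical placeholder for whichever manifest the failing recipe uses.

from lhotse import CutSet

# Hypothetical path -- point this at the cuts used by the failing training run.
cuts = CutSet.from_file("data/fbank/cuts_train.jsonl.gz")

for cut in cuts:
    has_text = any(s.text for s in cut.supervisions)
    if has_text and not cut.has_features:
        print("transcript but no features:", cut.id)
    elif cut.has_features and cut.num_frames == 0:
        print("zero-length features:", cut.id)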
Could you please share your CUDA versions? I will try to replicate the recipe and see if the error pops up.
I haven't changed the CUDA version, only PyTorch; same as in the starting message. I think minor versions don't change the API much. It is 12.6. I had CUDA 11 earlier, with similar errors. I don't think the zipformer recipe is that new, so I suspect something is wrong in my setup or hardware, but I am not sure. I had used a cloud-compiled PyTorch earlier; later I compiled it myself, matching the CUDA version and GPU architecture, to avoid any conflicts. The latest error was around the backward pass, where it was complaining about a misaligned address for loss(), as if CUDA is no longer ensuring that the outputs remain aligned just like the inputs.
I am now mostly getting this error (last 3 times): 2024-10-02 17:24:50,762 INFO [scaling.py:1024] Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=20.01 vs. limit=7.75875
Hi @ngoel17, I replicated the librispeech zipformer recipe on my own small data, and it ran fine. Here is how I installed torch and k2:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install k2==1.24.4.dev20240905+cuda12.4.torch2.4.1 -f https://k2-fsa.github.io/k2/cuda.html
I am including the env details for better readability. Here are the experiment logs:
python zipformer/train.py --world-size 2 --num-epochs 1 --start-epoch 1 --use-fp16 1 --exp-dir zipformer/exp --full-libri 1 --max-duration 200 --manifest-dir data/En_CV/all_data/fbank/ --bpe-model data/En_CV/all_data/lang_bpe_500/bpe.model --master-port 12344
[W1003 01:59:54.225740073 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[W1003 01:59:54.230960963 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
2024-10-03 01:59:54,816 INFO [train.py:1194] (0/2) Training started
2024-10-03 01:59:54,817 INFO [train.py:1204] (0/2) Device: cuda:0
2024-10-03 01:59:54,820 INFO [train.py:1235] (0/2) Using dtype=torch.float16
2024-10-03 01:59:54,820 INFO [train.py:1236] (0/2) Use AMP=True
2024-10-03 01:59:54,820 INFO [train.py:1238] (0/2) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'ignore_id': -1, 'label_smoothing': 0.1, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'cf664841c6d93e21e59b40aade84869b76c919c1', 'k2-git-date': 'Thu Sep 5 19:25:17 2024', 'lhotse-version': '1.28.0.dev+git.c8ba6d01.clean', 'torch-version': '2.4.1+cu124', 'torch-cuda-available': True, 'torch-cuda-version': '12.4', 'python-version': '3.10', 'icefall-git-branch': 'master', 'icefall-git-sha1': '5c04c312-dirty', 'icefall-git-date': 'Fri Sep 20 06:38:52 2024', 'icefall-path': '/mnt/local/sangeet/workncode/k2-fsa/icefall', 'k2-path': '/tmp/test/test/lib/python3.10/site-packages/k2/__init__.py', 'lhotse-path': '/tmp/test/lhotse/lhotse/__init__.py', 'hostname': 'emlgpu04', 'IP address': '127.0.1.1'}, 'world_size': 2, 'master_port': 12344, 'tensorboard': True, 'num_epochs': 1, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp'), 'bpe_model': 'data/En_CV/all_data/lang_bpe_500/bpe.model', 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'attention_decoder_loss_scale': 0.8, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'use_bf16': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'attention_decoder_dim': 512, 'attention_decoder_num_layers': 6, 'attention_decoder_attention_dim': 512, 'attention_decoder_num_heads': 8, 'attention_decoder_feedforward_dim': 2048, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'use_attention_decoder': False, 'full_libri': True, 'mini_libri': False, 'manifest_dir': PosixPath('data/En_CV/all_data/fbank'), 'max_duration': 200, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'sos_id': 1, 'eos_id': 1, 'vocab_size': 500, 'dtype': torch.float16, 'use_autocast': True}
2024-10-03 01:59:54,820 INFO [train.py:1240] (0/2) About to create model
2024-10-03 01:59:54,914 INFO [train.py:1194] (1/2) Training started
2024-10-03 01:59:54,915 INFO [train.py:1204] (1/2) Device: cuda:1
2024-10-03 01:59:54,916 INFO [train.py:1235] (1/2) Using dtype=torch.float16
2024-10-03 01:59:54,917 INFO [train.py:1236] (1/2) Use AMP=True
2024-10-03 01:59:54,917 INFO [train.py:1238] (1/2) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'ignore_id': -1, 'label_smoothing': 0.1, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'cf664841c6d93e21e59b40aade84869b76c919c1', 'k2-git-date': 'Thu Sep 5 19:25:17 2024', 'lhotse-version': '1.28.0.dev+git.c8ba6d01.clean', 'torch-version': '2.4.1+cu124', 'torch-cuda-available': True, 'torch-cuda-version': '12.4', 'python-version': '3.10', 'icefall-git-branch': 'master', 'icefall-git-sha1': '5c04c312-dirty', 'icefall-git-date': 'Fri Sep 20 06:38:52 2024', 'icefall-path': '/mnt/local/sangeet/workncode/k2-fsa/icefall', 'k2-path': '/tmp/test/test/lib/python3.10/site-packages/k2/__init__.py', 'lhotse-path': '/tmp/test/lhotse/lhotse/__init__.py', 'hostname': 'emlgpu04', 'IP address': '127.0.1.1'}, 'world_size': 2, 'master_port': 12344, 'tensorboard': True, 'num_epochs': 1, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp'), 'bpe_model': 'data/En_CV/all_data/lang_bpe_500/bpe.model', 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'attention_decoder_loss_scale': 0.8, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'use_bf16': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'attention_decoder_dim': 512, 'attention_decoder_num_layers': 6, 'attention_decoder_attention_dim': 512, 'attention_decoder_num_heads': 8, 'attention_decoder_feedforward_dim': 2048, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'use_attention_decoder': False, 'full_libri': True, 'mini_libri': False, 'manifest_dir': PosixPath('data/En_CV/all_data/fbank'), 'max_duration': 200, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'sos_id': 1, 'eos_id': 1, 'vocab_size': 500, 'dtype': torch.float16, 'use_autocast': True}
2024-10-03 01:59:54,917 INFO [train.py:1240] (1/2) About to create model
2024-10-03 01:59:55,339 INFO [train.py:1244] (0/2) Number of model parameters: 65549011
2024-10-03 01:59:55,391 INFO [train.py:1244] (1/2) Number of model parameters: 65549011
2024-10-03 01:59:55,500 INFO [train.py:1259] (1/2) Using DDP
2024-10-03 01:59:56,316 INFO [train.py:1259] (0/2) Using DDP
2024-10-03 01:59:57,341 INFO [asr_datamodule.py:436] (0/2) About to get the shuffled train-clean-100, train-clean-360 and train-other-500 cuts
2024-10-03 01:59:57,450 INFO [asr_datamodule.py:232] (0/2) Enable MUSAN
2024-10-03 01:59:57,450 INFO [asr_datamodule.py:233] (0/2) About to get Musan cuts
2024-10-03 01:59:57,468 INFO [asr_datamodule.py:436] (1/2) About to get the shuffled train-clean-100, train-clean-360 and train-other-500 cuts
2024-10-03 01:59:57,581 INFO [asr_datamodule.py:232] (1/2) Enable MUSAN
2024-10-03 01:59:57,581 INFO [asr_datamodule.py:233] (1/2) About to get Musan cuts
2024-10-03 01:59:59,222 INFO [asr_datamodule.py:257] (0/2) Enable SpecAugment
2024-10-03 01:59:59,222 INFO [asr_datamodule.py:258] (0/2) Time warp factor: 80
2024-10-03 01:59:59,223 INFO [asr_datamodule.py:268] (0/2) Num frame mask: 10
2024-10-03 01:59:59,223 INFO [asr_datamodule.py:281] (0/2) About to create train dataset
2024-10-03 01:59:59,223 INFO [asr_datamodule.py:308] (0/2) Using DynamicBucketingSampler.
2024-10-03 01:59:59,322 INFO [asr_datamodule.py:257] (1/2) Enable SpecAugment
2024-10-03 01:59:59,323 INFO [asr_datamodule.py:258] (1/2) Time warp factor: 80
2024-10-03 01:59:59,323 INFO [asr_datamodule.py:268] (1/2) Num frame mask: 10
2024-10-03 01:59:59,323 INFO [asr_datamodule.py:281] (1/2) About to create train dataset
2024-10-03 01:59:59,323 INFO [asr_datamodule.py:308] (1/2) Using DynamicBucketingSampler.
2024-10-03 02:00:00,146 INFO [asr_datamodule.py:325] (0/2) About to create train dataloader
2024-10-03 02:00:00,147 INFO [asr_datamodule.py:453] (0/2) About to get dev-clean cuts
2024-10-03 02:00:00,148 INFO [asr_datamodule.py:460] (0/2) About to get dev-other cuts
2024-10-03 02:00:00,148 INFO [asr_datamodule.py:356] (0/2) About to create dev dataset
2024-10-03 02:00:00,281 INFO [asr_datamodule.py:325] (1/2) About to create train dataloader
2024-10-03 02:00:00,281 INFO [asr_datamodule.py:453] (1/2) About to get dev-clean cuts
2024-10-03 02:00:00,282 INFO [asr_datamodule.py:460] (1/2) About to get dev-other cuts
2024-10-03 02:00:00,283 INFO [asr_datamodule.py:356] (1/2) About to create dev dataset
2024-10-03 02:00:00,707 INFO [asr_datamodule.py:373] (0/2) About to create dev dataloader
2024-10-03 02:00:00,707 INFO [train.py:1463] (0/2) Sanity check -- see if any of the batches in epoch 1 would cause OOM.
2024-10-03 02:00:00,844 INFO [asr_datamodule.py:373] (1/2) About to create dev dataloader
2024-10-03 02:00:00,845 INFO [train.py:1463] (1/2) Sanity check -- see if any of the batches in epoch 1 would cause OOM.
2024-10-03 02:00:07,072 INFO [scaling.py:1025] (1/2) Whitening: name=None, num_groups=1, num_channels=192, metric=45.78 vs. limit=7.5
2024-10-03 02:00:07,224 INFO [train.py:1493] (0/2) Maximum memory allocated so far is 3315MB
2024-10-03 02:00:07,226 INFO [train.py:1493] (1/2) Maximum memory allocated so far is 3329MB
2024-10-03 02:00:07,557 INFO [scaling.py:1025] (0/2) Whitening: name=None, num_groups=1, num_channels=256, metric=87.59 vs. limit=4.0
2024-10-03 02:00:08,065 INFO [train.py:1493] (1/2) Maximum memory allocated so far is 3329MB
2024-10-03 02:00:08,065 INFO [train.py:1493] (0/2) Maximum memory allocated so far is 3315MB
2024-10-03 02:00:08,950 INFO [train.py:1493] (0/2) Maximum memory allocated so far is 3315MB
2024-10-03 02:00:08,950 INFO [train.py:1493] (1/2) Maximum memory allocated so far is 3329MB
2024-10-03 02:00:09,709 INFO [scaling.py:1025] (1/2) Whitening: name=None, num_groups=4, num_channels=128, metric=9.13 vs. limit=3.0
2024-10-03 02:00:09,828 INFO [train.py:1493] (1/2) Maximum memory allocated so far is 3329MB
2024-10-03 02:00:09,829 INFO [train.py:1493] (0/2) Maximum memory allocated so far is 3315MB
2024-10-03 02:00:10,738 INFO [train.py:1493] (0/2) Maximum memory allocated so far is 3315MB
2024-10-03 02:00:10,738 INFO [train.py:1493] (1/2) Maximum memory allocated so far is 3329MB
2024-10-03 02:00:11,141 INFO [scaling.py:1025] (1/2) Whitening: name=None, num_groups=1, num_channels=256, metric=38.92 vs. limit=7.5
2024-10-03 02:00:11,362 INFO [scaling.py:1025] (0/2) Whitening: name=None, num_groups=4, num_channels=128, metric=9.64 vs. limit=3.0
2024-10-03 02:00:11,587 INFO [train.py:1493] (0/2) Maximum memory allocated so far is 3315MB
2024-10-03 02:00:11,588 INFO [train.py:1493] (1/2) Maximum memory allocated so far is 3329MB
/tmp/test/icefall/egs/librispeech/ASR/zipformer/train.py:1370: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
scaler = GradScaler(enabled=params.use_autocast, init_scale=1.0)
/tmp/test/icefall/egs/librispeech/ASR/zipformer/train.py:1370: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
scaler = GradScaler(enabled=params.use_autocast, init_scale=1.0)
2024-10-03 02:00:22,446 INFO [scaling.py:1025] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=16.03 vs. limit=7.5
2024-10-03 02:00:22,798 INFO [train.py:1126] (0/2) Epoch 1, batch 0, loss[loss=7.436, simple_loss=6.775, pruned_loss=6.596, over 4731.00 frames. ], tot_loss[loss=7.436, simple_loss=6.775, pruned_loss=6.596, over 4731.00 frames. ], batch size: 23, lr: 2.25e-02, grad_scale: 2.0
2024-10-03 02:00:22,799 INFO [train.py:1149] (0/2) Computing validation loss
2024-10-03 02:00:22,802 INFO [train.py:1126] (1/2) Epoch 1, batch 0, loss[loss=7.43, simple_loss=6.767, pruned_loss=6.622, over 4733.00 frames. ], tot_loss[loss=7.43, simple_loss=6.767, pruned_loss=6.622, over 4733.00 frames. ], batch size: 23, lr: 2.25e-02, grad_scale: 2.0
2024-10-03 02:00:22,803 INFO [train.py:1149] (1/2) Computing validation loss
2024-10-03 02:00:39,860 INFO [zipformer.py:1883] (0/2) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.3075, 5.4333, 5.3807, 5.4160], device='cuda:0')
2024-10-03 02:00:40,104 INFO [zipformer.py:1883] (1/2) name=encoder.encoders.3.encoder.layers.3.self_attn_weights, attn_weights_entropy = tensor([3.3353, 3.4979, 3.4713, 3.5922, 3.4526, 3.5167, 3.4543, 3.5035],
device='cuda:1')
2024-10-03 02:00:43,322 INFO [zipformer.py:1883] (1/2) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.4515, 3.5168, 3.4991, 3.5555, 3.4853, 3.5217, 3.4923, 3.5199],
device='cuda:1')
2024-10-03 02:00:43,428 INFO [zipformer.py:1883] (0/2) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.2810, 5.4639, 5.4801, 5.3358], device='cuda:0')
2024-10-03 02:00:48,377 INFO [zipformer.py:1883] (1/2) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([3.5395, 3.6418, 3.6039, 3.6943, 3.5954, 3.6408, 3.6137, 3.6393],
device='cuda:1')
2024-10-03 02:00:48,502 INFO [zipformer.py:1883] (0/2) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.0970, 5.1419, 5.1674, 5.2175], device='cuda:0')
2024-10-03 02:00:53,323 INFO [zipformer.py:1883] (1/2) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.4807, 5.6428, 5.6575, 5.5311], device='cuda:1')
2024-10-03 02:00:53,392 INFO [zipformer.py:1883] (0/2) name=encoder.encoders.3.encoder.layers.3.self_attn_weights, attn_weights_entropy = tensor([3.3341, 3.5283, 3.4634, 3.6020, 3.4913, 3.5387, 3.4652, 3.5156],
device='cuda:0')
2024-10-03 02:00:55,341 INFO [zipformer.py:1883] (1/2) name=encoder.encoders.3.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([3.4074, 3.5270, 3.4807, 3.6096, 3.4510, 3.5521, 3.4889, 3.5356],
device='cuda:1')
2024-10-03 02:00:55,601 INFO [zipformer.py:1883] (0/2) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.9412, 4.9748, 5.0364, 5.0812], device='cuda:0')
2024-10-03 02:00:56,504 INFO [zipformer.py:1883] (1/2) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.3655, 4.2595, 4.2836, 4.3864], device='cuda:1')
2024-10-03 02:00:56,776 INFO [zipformer.py:1883] (0/2) name=encoder.encoders.2.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([4.2437, 4.3846, 4.2819, 4.2082], device='cuda:0')
2024-10-03 02:01:02,899 INFO [zipformer.py:1883] (0/2) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.2188, 5.4193, 5.4720, 5.2883], device='cuda:0')
2024-10-03 02:01:03,419 INFO [zipformer.py:1883] (1/2) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.8552, 3.9198, 3.8987, 3.9552, 3.8833, 3.9226, 3.9035, 3.9127],
device='cuda:1')
2024-10-03 02:01:19,152 INFO [zipformer.py:1883] (0/2) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.9902, 5.2923, 5.3694, 5.1818], device='cuda:0')
2024-10-03 02:01:20,277 INFO [zipformer.py:1883] (1/2) name=encoder.encoders.2.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([4.0022, 4.1509, 4.0052, 3.9495], device='cuda:1')
2024-10-03 02:01:29,367 INFO [zipformer.py:1883] (0/2) name=encoder.encoders.3.encoder.layers.2.self_attn_weights, attn_weights_entropy = tensor([3.7156, 3.8265, 3.8094, 3.9157, 3.7977, 3.8325, 3.8000, 3.8370],
device='cuda:0')
2024-10-03 02:01:29,570 INFO [zipformer.py:1883] (1/2) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.4350, 4.7387, 4.5909, 4.2373], device='cuda:1')
2024-10-03 02:01:41,441 INFO [train.py:1158] (1/2) Epoch 1, validation: loss=7.282, simple_loss=6.63, pruned_loss=6.509, over 4897180.00 frames.
2024-10-03 02:01:41,442 INFO [train.py:1159] (1/2) Maximum memory allocated so far is 14153MB
2024-10-03 02:01:41,444 INFO [train.py:1158] (0/2) Epoch 1, validation: loss=7.282, simple_loss=6.63, pruned_loss=6.509, over 4897180.00 frames.
2024-10-03 02:01:41,445 INFO [train.py:1159] (0/2) Maximum memory allocated so far is 14152MB
2024-10-03 02:01:45,218 INFO [scaling.py:215] (0/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.attention_skip_rate, batch_count=0.0, ans=0.2
2024-10-03 02:01:45,222 INFO [scaling.py:215] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.conv_skip_rate, batch_count=0.0, ans=0.2
2024-10-03 02:01:45,440 INFO [scaling.py:1025] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=512, metric=34.05 vs. limit=4.0
2024-10-03 02:01:45,499 INFO [scaling.py:1025] (1/2) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=384, metric=15.76 vs. limit=7.5
2024-10-03 02:01:46,157 INFO [scaling.py:1025] (0/2) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=512, metric=32.18 vs. limit=5.0
2024-10-03 02:01:46,662 INFO [scaling.py:215] (1/2) ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.conv_module1.balancer1.prob, batch_count=0.0, ans=0.5
2024-10-03 02:01:46,792 INFO [scaling.py:1025] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=4.0
2024-10-03 02:01:47,392 INFO [scaling.py:215] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=0.0, ans=0.5
I second you on this one. Maybe you could try installing torch and k2 using the setup that worked for me. Thanks.
Sure thing, I'll try this as well. Currently the training hasn't crashed and has progressed to epoch 2. I believe it's due to all the ideas proposed by you/Dan, but fingers crossed. I will need to go back and hash out what helped if we determine that everything is resolved.
Nagendra, I suspect the kernel syncing was not working, i.e. the export CUDA_LAUNCH_BLOCKING=1. Also, about the error that you got in validation when you used 1 GPU: I think this was a totally different error, where in the validation code you were passing in the wrong thing. Probably you never reached batch 6000 with 4 GPUs, so it never hit the validation code on batch 1.
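As a side note, the usual way to make the failing kernel surface at the right call site is to set this variable before any CUDA work happens; a minimal sketch (setting it in Python before torch initializes CUDA is equivalent to exporting it in the shell before launching train.py, since spawned worker processes inherit the environment).

import os

# Must be set before the first CUDA call in the process; it makes kernel
# launches synchronous, so the illegal access is reported where it occurs.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after setting the env var on purpose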
@danpovey export CUDA_LAUNCH_BLOCKING=1 at least made the message about the option to use CUDA_LAUNCH_BLOCKING go away from the error output, so the only message left was about TORCH_USE_CUDA_DSA. I tried this and also tried exporting USE_GPU=1 and TORCH_USE_CUDA_DSA=1 before compiling PyTorch, but that did not affect the TORCH_USE_CUDA_DSA message. Interestingly, editing ./torch/include/c10/cuda/CUDADeviceAssertionHost.h to explicitly #define it led to a "previously defined" warning, and I could not figure out where it is previously defined; this header has #pragma once. @sangeet2020 I did move to an installed CUDA 12.4 and then a pip install of the pre-compiled version. Unfortunately, that has not helped either. The latest error message isn't even about the memory alignment....
Interestingly, gpu_burn runs for hours without any problem, and the CPU hasn't thrown any errors. In the coming days I will try to use the system for an NLP training task, just to rule out general hardware issues.
You said:
@danpovey - I may have exported that before opening the ticket because it was an easier thing to do. There is probably some progress. In the past 4 runs, I have had error messages only from one "primary backtrace" --- File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1053, in train_one_epoch. The message this time is slightly different, but then I had edited the compute_loss function to add .contiguous() to loss (not the metrics info). Here is the complete log. 2024-10-03 22:19:39,729 INFO [scaling.py:214] ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=36600.0, ans=0.125 [ Stack-Trace: ] 2024-10-03 22:21:23,924 INFO [train.py:1060] Caught exception:
2024-10-03 22:21:23,925 INFO [checkpoint.py:75] Saving checkpoint to exp/zipformer/v6/bad-model-0.pt
During handling of the above exception, another exception occurred: Traceback (most recent call last):
Can you try adding in k2's swoosh.cu, ...
It turned out to be bad hardware. Hopefully this thread will remind others that not all problems are software problems. The error was not frequent, and I was probably in too much of a rush to draw conclusions.
Oh. How did you determine that it was bad hardware?
I tried a different GPU (training on a single GPU) and it didn't crash. The machine had a recent water-cooler and power-supply failure, so I couldn't have imagined the GPU would also go bad.... But it turned out the recipe ran flawlessly on the other GPU. I tried again on this GPU and got random errors.
OK, but sometimes on different GPU types it might run a different kernel for some reason. In the case where the gradient was in fp16, I think there might actually be a bug in that code that it would treat it as fp32; and you would get misaligned address errors, potentially, at least in principle. I just don't know whether that would actually happen in practice. It would be good if you could apply that change, recompile k2, and test.
OK, but I have already sent the faulty card away as the warranty was expiring in a month or so, and the other two cards are exactly the same model and are not throwing errors. I don't have any other card powerful enough to run these trainings, so testing on my architecture will not help. Let me ask around if someone else can test.
According to the PyTorch autocast documentation, swoosh will be fp32 because it is not in the list. That should be very deterministic, so nothing to worry about.
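For anyone who wants to see the dtype behaviour under autocast directly, here is a small sketch; it only illustrates PyTorch's general autocast policy (ops on the fp16 cast list, ops on the fp32 cast list, and ops on neither list), not anything specific to k2's SwooshR, and it assumes a CUDA device is available.

import torch

assert torch.cuda.is_available()
x = torch.randn(8, 16, device="cuda")
w = torch.randn(16, 16, device="cuda")

with torch.autocast("cuda", dtype=torch.float16):
    y = torch.nn.functional.linear(x, w)  # on autocast's fp16 list
    print(y.dtype)                        # torch.float16
    z = torch.log(y)                      # on autocast's fp32 list, upcast
    print(z.dtype)                        # torch.float32
    u = torch.relu(y)                     # on neither list: keeps input dtype
    print(u.dtype)                        # torch.float16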
I am getting the following error during Zipformer training...
Initially, I was getting exactly the same error with an older version of CUDA/drivers and an older version of k2-fsa and icefall. I re-ran after upgrading everything (including PyTorch) and still got the same error. The error does not happen consistently at the same point. Any pointers will be greatly appreciated.
2024-09-30 11:09:40,620 INFO [train.py:1190] (0/2) Training started
2024-09-30 11:09:40,620 INFO [train.py:1200] (0/2) Device: cuda:0
2024-09-30 11:09:40,621 INFO [train.py:1231] (0/2) Using dtype=torch.float16
2024-09-30 11:09:40,621 INFO [train.py:1232] (0/2) Use AMP=True
2024-09-30 11:09:40,621 INFO [train.py:1234] (0/2) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'ignore_id': -1, 'label_smoothing': 0.1, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '21302dae6cdbaa25c5b851f35329e592f5bf12d5', 'k2-git-date': 'Sat Sep 7 05:29:18 2024', 'lhotse-version': '1.28.0.dev+git.bc2c0a29.clean', 'torch-version': '2.6.0a0+gitc9653bf', 'torch-cuda-available': True, 'torch-cuda-version': '12.6', 'python-version': '3.10', 'icefall-git-branch': 'master', 'icefall-git-sha1': '5c04c312-clean', 'icefall-git-date': 'Fri Sep 20 00:38:52 2024', 'icefall-path': '/mnt/dsk1/home/ngoel/icefall', 'k2-path': '/mnt/dsk1/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/k2-1.24.4.dev20240930+cpu.torch2.6.0a0-py3.10-linux-x86_64.egg/k2/init.py', 'lhotse-path': '/mnt/dsk1/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/lhotse-1.28.0.dev0+git.bc2c0a29.clean-py3.10.egg/lhotse/init.py', 'hostname': 'rahim', 'IP address': '127.0.1.1'}, 'world_size': 2, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 1, 'start_batch': 8000, 'exp_dir': PosixPath('exp/zipformer/v6'), 'bpe_model': 'data/lang_bpe_500/bpe.model', 'base_lr': 0.025, 'lr_batches': 5000.0, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'attention_decoder_loss_scale': 0.8, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 200, 'average_period': 200, 'use_fp16': True, 'use_bf16': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'attention_decoder_dim': 512, 'attention_decoder_num_layers': 6, 'attention_decoder_attention_dim': 512, 'attention_decoder_num_heads': 8, 'attention_decoder_feedforward_dim': 2048, 'causal': True, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'use_attention_decoder': False, 'full_libri': True, 'mini_libri': False, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 200, 'bucketing_sampler': True, 'num_buckets': 200, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'sos_id': 1, 'eos_id': 1, 'vocab_size': 500, 'dtype': torch.float16, 'use_autocast': True}
2024-09-30 11:09:40,621 INFO [train.py:1236] (0/2) About to create model
....
2024-09-30 11:15:42,969 INFO [scaling.py:214] (0/2) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=5666.666666666667, ans=0.234375
2024-09-30 11:15:43,063 INFO [scaling.py:214] (1/2) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.ff3_skip_rate, batch_count=5666.666666666667, ans=0.00963768115942029
2024-09-30 11:15:43,257 INFO [scaling.py:1024] (0/2) Whitening: name=encoder.encoders.2.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=384, metric=17.37 vs. limit=9.625
2024-09-30 11:15:44,083 INFO [scaling.py:1024] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.85 vs. limit=9.625
2024-09-30 11:15:47,973 INFO [scaling.py:1024] (1/2) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=24.01 vs. limit=11.754999999999999
[F] /home/ngoel/k2/k2/csrc/eval.h:147:void k2::EvalDevice(cudaStream_t, int32_t, LambdaT&) [with LambdaT = __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<at::Tensor ()(torch::autograd::AutogradContext, at::Tensor, float), k2::SwooshFunctionk2::SwooshRConstants::forward, void, 1>, const float*, float, float, float, float, const float*, float*, const float*, unsigned char*>; cudaStream_t = CUstream_st*; int32_t = int] Check failed: e == cudaSuccess (700 vs. 0) Error: an illegal memory access was encountered.
[rank1]:[E930 11:15:51.984368626 ProcessGroupNCCL.cpp:1598] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with
TORCH_USE_CUDA_DSA
to enable device-side assertions.Exception raised from c10_cuda_check_implementation at /mnt/dsk1/home/ngoel/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xac (0x7fb2efd9778c in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xf3 (0x7fb2efd3ba79 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3f2 (0x7fb30bdaeec2 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7fb2f0f41d1e in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xb0 (0x7fb2f0f46db0 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1ca (0x7fb2f0f512da in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x166 (0x7fb2f0f52e76 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdc253 (0x7fb30b8b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fb313e9eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126850 (0x7fb313f30850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with
TORCH_USE_CUDA_DSA
to enable device-side assertions.Exception raised from c10_cuda_check_implementation at /mnt/dsk1/home/ngoel/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xac (0x7fb2efd9778c in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xf3 (0x7fb2efd3ba79 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3f2 (0x7fb30bdaeec2 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7fb2f0f41d1e in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xb0 (0x7fb2f0f46db0 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1ca (0x7fb2f0f512da in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x166 (0x7fb2f0f52e76 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdc253 (0x7fb30b8b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fb313e9eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126850 (0x7fb313f30850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at /mnt/dsk1/home/ngoel/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1604 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xac (0x7fb2efd9778c in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0x1125d62 (0x7fb2f0f2fd62 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdaffe4 (0x7fb2f0bb9fe4 in /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xdc253 (0x7fb30b8b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: + 0x94ac3 (0x7fb313e9eac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #5: + 0x126850 (0x7fb313f30850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
W0930 11:15:52.358000 1791209 /mnt/dsk1/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:160] Terminating process 1791247 via signal SIGTERM
Traceback (most recent call last):
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1651, in
main()
File "/mnt/dsk1/home/ngoel/icefall/egs/multien/ASR13/./zipformer/train.py", line 1642, in main
mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
while not context.join():
File "/home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 184, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGABRT
(icefall-sep-24) ngoel@rahim:~/icefall/egs/multien/ASR13$ /home/ngoel/miniconda3/envs/icefall-sep-24/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 28 leaked semaphore objects to clean up at shutdown