RuntimeError for class values on conformer_ctc. #240

Closed · mesut92 opened this issue Mar 6, 2022 · 22 comments

@mesut92 commented Mar 6, 2022

Hi k2 team,
I am trying to run "librispeech/ASR/conformer_ctc/". I built a Docker image from your Dockerfile and I am using the LibriSpeech dataset. However, it gives an error that I could not fix.

I use this command:
conformer_ctc/train.py --max-duration 140

Traceback (most recent call last):
  File "conformer_ctc/train.py", line 799, in <module>
    main()
  File "conformer_ctc/train.py", line 792, in main
    run(rank=0, world_size=1, args=args)
  File "conformer_ctc/train.py", line 685, in run
    scan_pessimistic_batches_for_oom(
  File "conformer_ctc/train.py", line 758, in scan_pessimistic_batches_for_oom
    loss, _ = compute_loss(
  File "conformer_ctc/train.py", line 416, in compute_loss
    att_loss = mmodel.decoder_forward(
  File "/media/mesut/Tosun/ses/010322k2/icefall/egs/librispeech/ASR/conformer_ctc/transformer.py", line 313, in decoder_forward
    decoder_loss = self.decoder_criterion(pred_pad, ys_out_pad)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/mesut/Tosun/ses/010322k2/icefall/egs/librispeech/ASR/conformer_ctc/label_smoothing.py", line 81, in forward
    true_dist = torch.nn.functional.one_hot(
RuntimeError: Class values must be non-negative.

Thanks,
Mesut

@csukuangfj (Collaborator) commented Mar 6, 2022

Could you change

unsorted_token_ids = graph_compiler.texts_to_ids(
    supervisions["text"]
)
att_loss = mmodel.decoder_forward(

to

unsorted_token_ids = graph_compiler.texts_to_ids(
    supervisions["text"]
)
import pdb
pdb.set_trace()
att_loss = mmodel.decoder_forward(

When it enters pdb, you can print out the value of unsorted_token_ids to see whether it contains negative entries.
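
Background: torch.nn.functional.one_hot requires non-negative class indices, so a single -1 surviving into the targets reproduces exactly this error. A minimal illustration:

import torch

# one_hot accepts only non-negative class indices
print(torch.nn.functional.one_hot(torch.tensor([0, 2, 1]), num_classes=3))  # fine
torch.nn.functional.one_hot(torch.tensor([0, -1, 1]), num_classes=3)
# RuntimeError: Class values must be non-negative.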

@mesut92 (Author) commented Mar 7, 2022

My output:

(Pdb) print(unsorted_token_ids)
[[28, 172, 37, 39, 18, 8, 10, 74, 85, 119, 22, 4, 5, 163, 76, 200, 16, 9, 28, 3, 190, 178, 50, 142, 11, 11, 43, 47, 92, 3, 3, 15, 4, 110, 187, 357, 43, 3, 29, 172, 45, 77, 33, 4, 82, 332, 88, 17, 391, 3, 148, 3, 3, 35, 256, 7, 28, 172, 37, 152, 12, 50, 30, 68, 361, 13, 55, 50, 78, 33, 310, 21, 183, 185, 69, 47, 95, 24, 50, 57, 83, 19, 180, 140, 21, 183, 185, 148, 3, 3, 35, 40, 30, 216, 5, 64, 34, 10, 369, 217, 3, 111, 67, 22, 178, 3, 5, 47, 35, 49, 71, 444, 11, 45, 22, 25, 40, 26, 93, 140, 217, 17, 152, 77, 14, 28, 3, 47, 43, 36, 204, 137, 22, 288, 153, 15, 4, 70, 27, 12, 86, 126, 96, 25, 228, 29, 172, 37, 57, 59, 20, 40, 26, 214, 8, 269, 5, 53, 71, 14, 25, 50, 250, 3, 217, 367, 19, 163, 101, 3, 80, 12, 57, 83, 19, 180, 7, 15, 18, 27, 185, 14, 30, 255, 11, 200, 34, 193, 254, 4, 121, 43, 57, 83, 19, 180, 140, 159, 34, 92, 77, 7, 4, 87, 11, 79, 33, 91, 89, 140, 250, 3, 34, 56, 4, 81, 61, 95, 281, 65, 47, 95, 83, 16, 47, 119, 82, 3], [7, 144, 74, 22, 55, 49, 8, 9, 169, 170, 79, 251, 3, 156, 131, 229, 177, 27, 6, 86, 8, 15, 183, 8, 371, 4, 82, 47, 99, 18, 323, 59, 207, 3, 112, 172, 38, 12, 83, 82, 7, 141, 114, 43, 493, 253, 47, 27, 133, 202, 125, 6, 3, 45, 36, 14, 102, 52, 3, 5, 125, 6, 9, 142, 89, 5, 156, 61, 456, 5, 52, 19, 141, 114, 324, 7, 45, 36, 8, 5, 37, 20, 163, 35, 247, 148, 17, 33, 140, 366, 102, 52], [69, 23, 241, 12, 41, 6, 327, 100, 371, 59, 20, 77, 29, 197, 35, 145, 274, 23, 483, 9, 468, 191, 47, 119, 82, 10, 420, 134, 121, 31, 141, 146, 19, 3, 8, 10, 105, 60, 4, 37, 59, 9, 28, 188, 24, 11, 318, 232, 44, 11, 121, 241, 12, 41, 6, 483, 328, 9, 4, 82, 113, 14, 3, 327, 100, 371, 59, 19, 19, 11, 29, 23, 60, 45, 67, 208, 11, 489, 346, 26, 25, 13, 39, 42, 53, 71, 43, 6, 7, 104, 227, 15, 4, 361, 88, 49, 8, 319, 91, 18, 80, 9, 42, 276, 13], [7, 261, 127, 160, 22, 43, 185, 193, 271, 27, 21, 172, 496, 107, 181, 313, 84, 72, 67, 72, 4, 111, 6, 91, 139, 354, 194, 31, 62, 57, 43, 18, 6, 55, 51, 35, 8, 106, 42, 155, 24, 55, 11, 5, 61, 348, 75, 14, 29, 23, 332, 43, 100, 30, 188, 59, 42, 255, 3, 209, 9, 142, 77, 15, 4, 229, 15, 59, 193, 9, 42, 345, 57, 16, 13, 41, 3, 428, 14], [39, 4, 321, 18, 185, 3, 9, 5, 280, 136, 6, 143, 71, 11, 109, 56, 21, 105, 117, 172, 350, 70, 294, 205, 330, 3, 9, 137, 67, 58, 95, 38, 12, 163, 55, 6, 208, 87, 133, 144, 23, 132, 65, 279, 206, 322, 93, 184, 36, 411, 120, 42, 349, 136, 17, 35, 30, 10, 272, 89, 54, 98, 113, 164, 4, 32, 422, 27, 172, 189, 36, 12, 10, 37, 5, 280, 126, 77, 26, 44, 11], [5, 119, 89, 73, 31, 113, 262, 12, 222, 5, 344, 12, 3, 434, 106, 4, 393, 11, 14, 3, 329, 188, 24, 139, 275, 85, 63, 9, 63, 286, 3, 40, 20, 77, 7, 47, 92, 3, 22, 8, 4, 270, 29, 74, 128, 122, 8, 42, 68, 13, 238, 325, 14, 157, 28, 60, 191, 102, 14, 197, 36, 98, 15, 4, 392, 7, 358, 69, 39, 5, 143, 207, 105, 341, 13, 10, 98]]

@csukuangfj (Collaborator)

Is this batch causing the above error?

@mesut92 (Author) commented Mar 7, 2022

I guess so. The error is raised at the scan_pessimistic_batches_for_oom( line.

@csukuangfj (Collaborator)

(Pdb) print(unsorted_token_ids)

You can continue running by

(Pdb) continue
(Pdb) print(unsorted_token_ids)
(Pdb) continue

... ...

If it throws again, the last output of unsorted_token_ids is causing the exception.

Also, you can use a try ... except to catch the exception and print out the problematic batch.
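
For example, a minimal self-contained sketch of that pattern (toy values, just to illustrate the idea):

import torch

target = torch.tensor([3, 1, -1])  # a deliberately bad batch
try:
    torch.nn.functional.one_hot(target, num_classes=5)
except RuntimeError:
    # Print the offending batch before re-raising
    print("problematic target:", target)
    raise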

@mesut92 (Author) commented Mar 7, 2022

When I typed (Pdb) continue, it gave the same error. I then added pdb.set_trace() and ran print(target) in label_smoothing.py at line 81.
Result:

tensor([  0, 172,  37,  39,  18,   8,  10,  74,  85, 119,  22,   4,   5, 163,
         76, 200,  16,   9,  28,   3, 190, 178,  50, 142,  11,  11,  43,  47,
         92,   3,   3,  15,   4, 110, 187, 357,  43,   3,  29, 172,  45,  77,
         33,   4,  82, 332,  88,  17, 391,   3, 148,   3,   3,  35, 256,   7,
         28, 172,  37, 152,  12,  50,  30,  68, 361,  13,  55,  50,  78,  33,
        310,  21, 183, 185,  69,  47,  95,  24,  50,  57,  83,  19, 180, 140,
         21, 183, 185, 148,   3,   3,  35,  40,  30, 216,   5,  64,  34,  10,
        369, 217,   3, 111,  67,  22, 178,   3,   5,  47,  35,  49,  71, 444,
         11,  45,  22,  25,  40,  26,  93, 140, 217,  17, 152,  77,  14,  28,
          3,  47,  43,  36, 204, 137,  22, 288, 153,  15,   4,  70,  27,  12,
         86, 126,  96,  25, 228,  29, 172,  37,  57,  59,  20,  40,  26, 214,
          8, 269,   5,  53,  71,  14,  25,  50, 250,   3, 217, 367,  19, 163,
        101,   3,  80,  12,  57,  83,  19, 180,   7,  15,  18,  27, 185,  14,
         30, 255,  11, 200,  34, 193, 254,   4, 121,  43,  57,  83,  19, 180,
        140, 159,  34,  92,  77,   7,   4,  87,  11,  79,  33,  91,  89, 140,
        250,   3,  34,  56,   4,  81,  61,  95, 281,  65,  47,  95,  83,  16,
         47, 119,  82,   3,   1,   7, 144,  74,  22,  55,  49,   8,   9, 169,
        170,  79, 251,   3, 156, 131, 229, 177,  27,   6,  86,   8,  15, 183,
          8, 371,   4,  82,  47,  99,  18, 323,  59, 207,   3, 112, 172,  38,
         12,  83,  82,   7, 141, 114,  43, 493, 253,  47,  27, 133, 202, 125,
          6,   3,  45,  36,  14, 102,  52,   3,   5, 125,   6,   9, 142,  89,
          5, 156,  61, 456,   5,  52,  19, 141, 114, 324,   7,  45,  36,   8,
          5,  37,  20, 163,  35, 247, 148,  17,  33, 140, 366, 102,  52,   1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  69,  23, 241,  12,
         41,   6, 327, 100, 371,  59,  20,  77,  29, 197,  35, 145, 274,  23,
        483,   9, 468, 191,  47, 119,  82,  10, 420, 134, 121,  31, 141, 146,
         19,   3,   8,  10, 105,  60,   4,  37,  59,   9,  28, 188,  24,  11,
        318, 232,  44,  11, 121, 241,  12,  41,   6, 483, 328,   9,   4,  82,
        113,  14,   3, 327, 100, 371,  59,  19,  19,  11,  29,  23,  60,  45,
         67, 208,  11, 489, 346,  26,  25,  13,  39,  42,  53,  71,  43,   6,
          7, 104, 227,  15,   4, 361,  88,  49,   8, 319,  91,  18,  80,   9,
         42, 276,  13,   1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,   7, 261, 127, 160,  22,  43, 185, 193, 271,  27,  21, 172, 496,
        107, 181, 313,  84,  72,  67,  72,   4, 111,   6,  91, 139, 354, 194,
         31,  62,  57,  43,  18,   6,  55,  51,  35,   8, 106,  42, 155,  24,
         55,  11,   5,  61, 348,  75,  14,  29,  23, 332,  43, 100,  30, 188,
         59,  42, 255,   3, 209,   9, 142,  77,  15,   4, 229,  15,  59, 193,
          9,  42, 345,  57,  16,  13,  41,   3, 428,  14,   1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  39,   4, 321,  18, 185,   3,   9,   5,
        280, 136,   6, 143,  71,  11, 109,  56,  21, 105, 117, 172, 350,  70,
        294, 205, 330,   3,   9, 137,  67,  58,  95,  38,  12, 163,  55,   6,
        208,  87, 133, 144,  23, 132,  65, 279, 206, 322,  93, 184,  36, 411,
        120,  42, 349, 136,  17,  35,  30,  10, 272,  89,  54,  98, 113, 164,
          4,  32, 422,  27, 172, 189,  36,  12,  10,  37,   5, 280, 126,  77,
         26,  44,  11,   1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,   5, 119,  89,
         73,  31, 113, 262,  12, 222,   5, 344,  12,   3, 434, 106,   4, 393,
         11,  14,   3, 329, 188,  24, 139, 275,  85,  63,   9,  63, 286,   3,
         40,  20,  77,   7,  47,  92,   3,  22,   8,   4, 270,  29,  74, 128,
        122,   8,  42,  68,  13, 238, 325,  14, 157,  28,  60, 191, 102,  14,
        197,  36,  98,  15,   4, 392,   7, 358,  69,  39,   5, 143, 207, 105,
        341,  13,  10,  98,   1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,  -1,
         -1,  -1], device='cuda:0')

@csukuangfj (Collaborator)

ignored = target == self.ignore_index
target[ignored] = 0                       # line 79
true_dist = torch.nn.functional.one_hot(  # line 81
    target, num_classes=num_classes
).to(x)

At line 79, entries of -1 in target are set to 0, so at line 81, target should not contain -1.

Did you make any changes to the code?

@mesut92 (Author) commented Mar 7, 2022

Nope, I did not change anything in label_smoothing.py. The ignored variable looks correct, but the assignment to target is not working, I guess.

        ignored = target == self.ignore_index
        target[ignored] = 0
        import pdb
        pdb.set_trace()

(Pdb) print(ignored)

tensor([False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True], device='cuda:0')

@csukuangfj (Collaborator)

but the assignment to target is not working, I guess.

What are the values of target before and after line 79?

@mesut92 (Author) commented Mar 7, 2022

        import pdb
        target = target.clone().reshape(-1)
        ignored = target == self.ignore_index
        pdb.set_trace()
        target[ignored] = 0
        pdb.set_trace()

Result:

-> att_loss = mmodel.decoder_forward(
(Pdb) print(target)
*** NameError: name 'target' is not defined
(Pdb) continue
> /media/mesut/Tosun/ses/010322k2/icefall/egs/librispeech/ASR/conformer_ctc/label_smoothing.py(80)forward()
-> target[ignored] = 0
(Pdb) print(target)
tensor([ 28, 172,  37,  ...,  -1,  -1,  -1], device='cuda:0')
(Pdb) 

@csukuangfj (Collaborator)

(Pdb) continue
> /media/mesut/Tosun/ses/010322k2/icefall/egs/librispeech/ASR/conformer_ctc/label_smoothing.py(80)forward()
-> target[ignored] = 0
(Pdb) print(target)
tensor([ 28, 172,  37,  ...,  -1,  -1,  -1], device='cuda:0')
(Pdb) 

When you printed the value of target, the line target[ignored] = 0 had not been executed yet.
Please type next, press Enter, and then enter print(target).

@mesut92 (Author) commented Mar 7, 2022

      import pdb
      target = target.clone().reshape(-1)
      ignored = target == self.ignore_index
      pdb.set_trace()
      target[ignored] = 0
      pdb.set_trace()

Result:

> /media/mesut/Tosun/ses/010322k2/icefall/egs/librispeech/ASR/conformer_ctc/label_smoothing.py(80)forward()
-> target[ignored] = 0
(Pdb) print(target)
tensor([ 28, 172,  37,  ...,  -1,  -1,  -1], device='cuda:0')
(Pdb) next
> /media/mesut/Tosun/ses/010322k2/icefall/egs/librispeech/ASR/conformer_ctc/label_smoothing.py(81)forward()
-> pdb.set_trace()
(Pdb)  print(target)
tensor([  0, 172,  37,  ...,  -1,  -1,  -1], device='cuda:0')
(Pdb) print(target)
tensor([  0, 172,  37,  ...,  -1,  -1,  -1], device='cuda:0')
(Pdb) target[ignored] = 0
(Pdb) print(target)
tensor([  0, 172,  37,  ...,  -1,  -1,  -1], device='cuda:0')

@csukuangfj (Collaborator)

OK, this is strange. What is the version of your PyTorch?

What is the output of the following code for your PyTorch?

#!/usr/bin/env python3
import torch

target = torch.tensor([1, 3, -1, -1, 2])
ignored = target == -1
print(ignored)
target[ignored] = 0
print(target)

It outputs

tensor([False, False,  True,  True, False])
tensor([1, 3, 0, 0, 2])

on my computer.

@mesut92 (Author) commented Mar 7, 2022

torch==1.7.1

Python 3.8.5 (default, Sep  4 2020, 07:30:14) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> #!/usr/bin/env python3
>>> import torch
>>> 
>>> target = torch.tensor([1, 3, -1, -1, 2])
>>> ignored = target == -1
>>> print(ignored)
tensor([False, False,  True,  True, False])
>>> target[ignored] = 0
>>> print(target)
tensor([1, 3, 0, 0, 2])
>>> print(torch.__version__)
1.7.1

@csukuangfj (Collaborator)

It is very odd that target[ignored] = 0 works in the demo code but not in label_smoothing.py.

@desh2608 (Collaborator)

FWIW, I was getting the same issue where indexing did not seem to be changing the -1 to 0. I made the following changes in label_smoothing.py to make it work:

# target[ignored] = 0
target = torch.where(ignored, torch.zeros_like(target), target)

and

# true_dist[ignored] = 0
true_dist = torch.where(
    ignored.unsqueeze(1).repeat(1, true_dist.shape[1]),
    torch.zeros_like(true_dist),
    true_dist,
)

and then it worked. It seems related to PyTorch issues other people have encountered, where boolean-mask indexing fails with CUDA tensors.
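
Put together, a minimal self-contained version of that masking (toy shapes and values, my reconstruction rather than the exact recipe code):

import torch

ignore_index = -1
num_classes = 5
target = torch.tensor([3, 1, -1, -1, 2])

ignored = target == ignore_index
# Out-of-place replacement for `target[ignored] = 0`
target = torch.where(ignored, torch.zeros_like(target), target)
true_dist = torch.nn.functional.one_hot(target, num_classes=num_classes).float()
# Out-of-place replacement for `true_dist[ignored] = 0`
true_dist = torch.where(
    ignored.unsqueeze(1).repeat(1, true_dist.shape[1]),
    torch.zeros_like(true_dist),
    true_dist,
)
print(true_dist)  # rows 2 and 3 are all zeros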

@danpovey (Collaborator)

Thanks!
If you can't find a matching issue on PyTorch's GitHub, it might be worthwhile creating one.

@csukuangfj (Collaborator)

Fixed by #300

Feel free to re-open it if the issue still exists.

@teowenshen (Contributor) commented May 29, 2022

Hi, I was migrating my environment over to a newly set up Docker container and faced this exact issue too, because I was using an older version of label_smoothing.py in my program.

Collecting environment information...

k2 version: 1.15.1
Build type: Release
Git SHA1: 350e449fe5d7b4231f77eb0add764782eed9f5d2
Git date: Thu May 26 14:56:24 2022
Cuda used to build k2: 11.0
cuDNN used to build k2: 8.0.4
Python version used to build k2: 3.8
OS used to build k2: 
CMake version: 3.18.0
GCC version: 7.5.0
CMAKE_CUDA_FLAGS:  --expt-extended-lambda -gencode arch=compute_35,code=sm_35 --expt-extended-lambda -gencode arch=compute_50,code=sm_50 --expt-extended-lambda -gencode arch=compute_60,code=sm_60 --expt-extended-lambda -gencode arch=compute_61,code=sm_61 --expt-extended-lambda -gencode arch=compute_70,code=sm_70 --expt-extended-lambda -gencode arch=compute_75,code=sm_75 -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall  --compiler-options -Wno-strict-overflow  --compiler-options -Wno-unknown-pragmas 
CMAKE_CXX_FLAGS:  -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-unused-variable  -Wno-strict-overflow 
PyTorch version used to build k2: 1.7.1
PyTorch is using Cuda: 11.0
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False
Max cpu memory allocate: 214748364800
k2 abort: False

When I updated my code, this was the new error, presumably because true_dist had been overwritten by the torch.where. A solution I found online was to create a dummy autograd Function for the masking, with its own backward implementation (a sketch follows the traceback below).

Traceback (most recent call last):
  File "conformer_ctc/train.py", line 822, in <module>
    main()
  File "conformer_ctc/train.py", line 815, in main
    run(rank=0, world_size=1, args=args)
  File "conformer_ctc/train.py", line 712, in run
    scan_pessimistic_batches_for_oom(
  File "conformer_ctc/train.py", line 788, in scan_pessimistic_batches_for_oom
    loss.backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
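
A rough, self-contained sketch of such a Function, assuming the mask is applied to a 2-D tensor of per-position distributions (my reading of that suggestion, not the code I actually ran):

import torch

class MaskedZero(torch.autograd.Function):
    """Zero out masked rows without boolean-mask assignment."""

    @staticmethod
    def forward(ctx, x, mask):
        ctx.save_for_backward(mask)
        return x.masked_fill(mask.unsqueeze(1), 0.0)

    @staticmethod
    def backward(ctx, grad_output):
        (mask,) = ctx.saved_tensors
        # Zero the gradient at masked rows; the mask itself needs no gradient
        return grad_output.masked_fill(mask.unsqueeze(1), 0.0), None

x = torch.randn(4, 3, requires_grad=True)
mask = torch.tensor([False, True, False, True])
y = MaskedZero.apply(x, mask)
y.sum().backward()
print(x.grad)  # rows 1 and 3 are zero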

Anyhow, in the absence of a more elegant solution, I built a new icefall image based on pytorch/pytorch:1.9.0-cuda11.1-cudnn8-devel and successfully started training. So far, no issues.

I noticed that this peculiar error seemed to happen for torch=1.7.x and torch=1.8.x, which the current Icefall Dockerfile is based on.

Are there any plans to update the Dockerfile?

@csukuangfj (Collaborator)

Are there any plans to update the Dockerfile?

If you have time, would you mind updating the Dockerfile? To be honest, I am not using icefall in a Docker container.

@teowenshen (Contributor)

Yes, sure. However, there was a recent NVIDIA issue where the apt keys are being "rotated". The base images don't seem to have caught up with the change yet, so perhaps this is not a good time to update the Dockerfile anyway.

I plan to have a go at updating the Dockerfile after my current training is done. Is there any recommended environment that has been proven to be stable at this point in time?

@csukuangfj (Collaborator)

Is there any recommended environment that has been proven to be stable at this point in time?

Sorry, I don't have much experience with Docker. I think your current working version is fine.
