CUDA illegal memory access error when running distributed mixed precision #40814
Provide the exact sequence of commands / steps that you executed before running into the problem. Thanks! |
I can't give you all the code, but I use the basic approach below:

policy = tf.keras.mixed_precision.experimental.Policy("mixed_float16")
tf.keras.mixed_precision.experimental.set_policy(policy)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    logits = get_logits()
    model = tf.keras.Model(inputs, logits)

model.fit(X, y)  # X and y are datasets read from tfrecords

I run this from the official tensorflow docker container. |
Is there a flag I can use to get a more detailed stack trace? |
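For reference, a minimal sketch of general-purpose switches that can make such a failure more verbose; these are standard CUDA/TensorFlow knobs, not something suggested in this thread, and none are specific to this bug:

import os

# Make CUDA kernel launches synchronous, so the failing kernel is reported
# closer to the Python call that launched it (CUDA-level environment variable).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Show all TensorFlow C++ log messages, including INFO.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"

import tensorflow as tf

# Log which device each op is placed on.
tf.debugging.set_log_device_placement(True)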
I ran this with cuda-memcheck and the error occurred at an earlier point:
|
@lminer Can you please provide the full code needed to reproduce this issue? We currently can't reproduce it, as the code you provided is not enough. Thanks! |
@lminer did you figure out what causes this error? |
@ben0it8 no, and I'm having trouble creating a reproducible example that isn't just my entire code base. Do you have one? |
@gowthamkpr, I have a reproducible example. It crashes when I run the following script:

import numpy as np
import tensorflow as tf


class Conv2dLayer(tf.keras.layers.Layer):
    def __init__(self, filters, kernel_size, strides=1, **kwargs):
        super().__init__(**kwargs)
        self.activation = tf.keras.layers.LeakyReLU()
        self.conv = tf.keras.layers.Conv2D(
            filters, kernel_size, strides=strides, padding="same", kernel_initializer="he_normal",
        )
        self.batch_norm = tf.keras.layers.BatchNormalization()
        self.filters = filters
        self.kernel_size = kernel_size
        self.strides = strides

    def call(self, inputs, **kwargs):
        x = self.conv(inputs)
        x = self.activation(x)
        x = self.batch_norm(x)
        return x

    def get_config(self):
        config = super().get_config()
        config["filters"] = self.filters
        config["kernel_size"] = self.kernel_size
        config["strides"] = self.strides
        return config

    def compute_output_shape(self, input_shape):
        return self.conv.compute_output_shape(input_shape)


class UpSampleLayer(tf.keras.layers.Layer):
    def __init__(self, filters, strides=2, **kwargs):
        super().__init__(**kwargs)
        self.dropout = tf.keras.layers.Dropout(0.5)
        self.activation = tf.keras.layers.LeakyReLU()
        self.upconv = tf.keras.layers.Conv2DTranspose(
            filters, 4, strides=strides, padding="same", kernel_initializer="he_normal"
        )
        self.batch_norm = tf.keras.layers.BatchNormalization()
        self.filters = filters
        self.strides = strides

    def call(self, inputs, **kwargs):
        x = self.upconv(inputs)
        x = self.batch_norm(x)
        x = self.dropout(x)
        return self.activation(x)

    def get_config(self):
        config = super().get_config()
        config["filters"] = self.filters
        config["strides"] = self.strides
        return config


class DownsampleBlock(tf.keras.layers.Layer):
    def __init__(self, filters, **kwargs):
        super().__init__(**kwargs)
        self.filters = filters
        self.conv1 = Conv2dLayer(filters, 4)
        self.conv2 = Conv2dLayer(filters, 4)
        self.downsample_conv = Conv2dLayer(filters, 4, strides=2)
        self.dropout = tf.keras.layers.Dropout(0.5)

    def call(self, inputs, **kwargs):
        x = self.conv1(inputs)
        x = self.conv2(x)
        x = self.downsample_conv(x)
        x = self.dropout(x)
        return x

    def get_config(self):
        config = super().get_config()
        config["filters"] = self.filters
        return config


class Unet(tf.keras.models.Model):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.mask = tf.keras.layers.Activation("relu")
        self.axis = -1
        self.downsample_blocks = []
        self.upsample_blocks = []
        n_maps_list = []
        for i in range(6):
            n_maps = 16 * 2 ** i
            n_maps_list.insert(0, n_maps)
            self.downsample_blocks.append(DownsampleBlock(n_maps))
        for i, n_maps in enumerate(n_maps_list[1:]):
            self.upsample_blocks.append(UpSampleLayer(n_maps, strides=2))
        self.upsample_blocks.append(UpSampleLayer(2, strides=2))

    def call(self, inputs, training=None, mask=None):
        skip_connections = []
        x = inputs
        for downsample_block in self.downsample_blocks:
            x = downsample_block(x)
            skip_connections.insert(0, x)
        x = self.upsample_blocks[0](x)  # no skip connection used for first block
        for upsample_block, h in zip(self.upsample_blocks[1:], skip_connections[1:]):
            x = upsample_block(tf.keras.layers.concatenate([x, h], axis=self.axis))
        return self.mask(x)


def train():
    BATCH_SIZE = 16
    WIDTH = 256
    HEIGHT = 512
    CHANNELS = 2

    policy = tf.keras.mixed_precision.experimental.Policy("mixed_float16")
    tf.keras.mixed_precision.experimental.set_policy(policy)

    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        model = Unet()
        model.build(input_shape=(None, WIDTH, HEIGHT, CHANNELS))
        model.compile(optimizer="adam", loss="mean_absolute_error")

    examples = np.random.rand(BATCH_SIZE * 20, WIDTH, HEIGHT, CHANNELS)
    target = np.random.rand(BATCH_SIZE * 20, WIDTH, HEIGHT, CHANNELS)
    ds = tf.data.Dataset.from_tensor_slices((examples, target))
    ds = ds.repeat()
    ds = ds.batch(BATCH_SIZE)

    model.fit(ds, steps_per_epoch=1875, epochs=10)
train()

The error is as follows:
|
Unfortunately I was not able to reproduce this on P100 or Titan-V. Can you try running with |
@sanjoy when I run it with that option, the model is loaded into the memory of both GPUs, but only one GPU actually sees any utilization and there is no crash. |
@dubey Have you seen similar issues before? |
I get the same error when running multi-gpu training with 2 or 3 RTX 2080Tis. My code is very similar to yours, with the exception that I do not use mixed precision. |
@sanjoy No I haven't seen this issue before. |
Ok guys, I think I've found a solution that seems to work for me. I followed the instructions here: https://github.com/NVIDIA/framework-determinism - I enabled the determinism setting described there, and then I fixed all the random seeds:
The model has been running without a hitch for many epochs now. It seems that the non-determinism of some operations might cause these multi-gpu issues. Keep in mind, I don't fully understand WHY this works; I just know that it does work for a similar problem. Do let me know if this helps. Also, keep in mind that the instructions here: https://github.com/NVIDIA/framework-determinism are a bit different from the ones I originally used (here: https://stackoverflow.com/questions/50744565/how-to-handle-non-determinism-when-training-on-a-gpu/62712389#62712389). Might be worth trying both sets. |
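For reference, a minimal sketch of what the determinism-plus-seeding setup above can look like, assuming the TF_DETERMINISTIC_OPS switch documented in the NVIDIA framework-determinism repo and an arbitrary seed value (the commenter's exact code is not shown in the thread):

import os
import random

import numpy as np
import tensorflow as tf

# Ask TensorFlow to select deterministic cuDNN/op implementations
# (documented in NVIDIA/framework-determinism for TF 2.1+).
os.environ["TF_DETERMINISTIC_OPS"] = "1"

# Fix all the random seeds; 42 is an arbitrary choice for this sketch.
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)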
I get the same error when running multi-GPU mixed precision training with 2 RTX 2080 Tis. Any solutions? |
@dolhasz can you confirm that enforcing determinism indeed solved your issue? Also, which layer/op do you suspect causes the error? |
This works for me. Thanks @dolhasz! |
I faced a very similar issue. Loading the model just once solved it. |
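One hedged reading of "loading the model just once" (the commenter's code is not shown): construct or load the model a single time inside the strategy scope and reuse that object for every subsequent call, rather than re-loading it repeatedly. The path below is a hypothetical placeholder for illustration:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # Done exactly once; "/path/to/saved_model" is a placeholder.
    model = tf.keras.models.load_model("/path/to/saved_model")

# Reuse the same `model` object for all later fit()/evaluate()/predict() calls.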
This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you. |
Closing as stale. Please reopen if you'd like to work on this further. |
Whenever I try to train a model using MirroredStrategy and mixed precision, at an indeterminate time, I get the following error:
Unfortunately, I don't have a simple example to reproduce this and can't include my entire code. But maybe other people are having similar issues and can produce a better example.
I'm running TensorFlow 2.2.0 on Ubuntu 18.04 with CUDA 10.1.243 and cuDNN 7.6.5, using two RTX 2080 Ti cards. I get the same error on a V100.