segmentation fault when open "/dev/shm/nccl-shm-recv-5840-0-0" #167

Closed
RAMBOO1990 opened this issue Dec 11, 2018 · 3 comments
Comments

@RAMBOO1990

I am trying to fine-tune BERT with https://github.com/huggingface/pytorch-pretrained-BERT by running:
python run_classifier.py

12/11/2018 11:07:55 - INFO - main - guid: train-4
12/11/2018 11:07:55 - INFO - main - tokens: [CLS] [UNK] � [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] � [UNK] [UNK] [UNK] [UNK] � [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] � ##� [UNK] [UNK] | | [UNK] � [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] � [UNK] [UNK] � [UNK] [UNK] [UNK] � [UNK] [UNK] [UNK] � [UNK] [UNK] [UNK] [UNK] [UNK] � ##� [UNK] [UNK] | | da ##4 ji ##a1 ji ##4 zh ##u ##4 le yi ##4 ba ##n1 kai ##1 ha ##o ##3 ch ##e ##1 de n ##v ##3 re ##n2 b ##u ##2 shi ##4 ca ##o ##1 ta ##1 de you ##3 qi ##an ##2 ji ##u ##4 shi ##4 ca ##o ##1 ta ##1 ma [SEP]
12/11/2018 11:07:55 - INFO - main - input_ids: 101 100 176 100 100 100 100 100 100 100 100 100 100 192 100 100 100 100 192 100 100 100 100 100 100 100 192 11699 100 100 170 170 100 176 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 192 100 100 192 100 100 100 192 100 100 100 192 100 100 100 100 100 192 11699 100 100 170 170 10005 8159 12095 11414 12095 8159 9998 8207 8159 8983 11242 8159 10322 11310 13072 8148 11643 8167 8152 9537 8154 8148 8363 156 8225 8152 8847 12750 144 8207 8144 11772 8159 8850 8167 8148 8346 8148 8363 8357 8152 11566 8244 8144 12095 8207 8159 11772 8159 8850 8167 8148 8346 8148 9622 102
12/11/2018 11:07:55 - INFO - main - input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
12/11/2018 11:07:55 - INFO - main - segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
12/11/2018 11:07:55 - INFO - main - label: 0 (id = 0)
12/11/2018 11:08:23 - INFO - main - ***** Running training *****
12/11/2018 11:08:23 - INFO - main - Num examples = 10000
12/11/2018 11:08:23 - INFO - main - Batch size = 32
12/11/2018 11:08:23 - INFO - main - Num steps = 937
Segmentation fault

Then I found that the segmentation fault occurs when running:

loss = model(input_ids, segment_ids, input_mask, label_ids)

strace -o strace.log python run_classifier.py
tail strace.log -n 50

mmap(0x1085b800000, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x1085b800000
mmap(0x1085b800000, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x1085b800000
open("/proc/driver/nvidia/params", O_RDONLY) = 79
fstat(79, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f53fed00000
read(79, "Mobile: 4294967295\nResmanDebugLe"..., 1024) = 491
close(79) = 0
munmap(0x7f53fed00000, 4096) = 0
stat("/dev/nvidiactl", {st_mode=S_IFCHR|0666, st_rdev=makedev(195, 255), ...}) = 0
open("/dev/nvidiactl", O_RDWR) = 79
fcntl(79, F_SETFD, FD_CLOEXEC) = 0
ioctl(10, 0xc0384627, 0x7fff7f413850) = 0
close(79) = 0
ioctl(4, 0x21, 0x7fff7f413490) = 0
ioctl(5, 0xc01c4634, 0x7fff7f4136c0) = 0
ioctl(4, 0x21, 0x7fff7f4132a0) = 0
ioctl(5, 0xc01c4634, 0x7fff7f4136c0) = 0
ioctl(4, 0x21, 0x7fff7f4132a0) = 0
ioctl(5, 0xc01c4634, 0x7fff7f4136c0) = 0
ioctl(4, 0x21, 0x7fff7f4132a0) = 0
ioctl(5, 0xc020462a, 0x7fff7f4148b0) = 0
open("/dev/shm/nccl-shm-recv-5840-0-0", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0600) = -1 EACCES (Permission denied)
ioctl(5, 0xc020462a, 0x7fff7f4146e0) = 0
ioctl(4, 0x42, 0x7fff7f414670) = 0
ioctl(4, 0x22, 0x7fff7f414630) = 0
ioctl(5, 0xc0104629, 0x7fff7f414770) = 0
munmap(0x1085ae00000, 2097152) = 0
ioctl(5, 0xc020462a, 0x7fff7f4146e0) = 0
ioctl(4, 0x42, 0x7fff7f414670) = 0
ioctl(4, 0x22, 0x7fff7f414630) = 0
ioctl(5, 0xc0104629, 0x7fff7f414770) = 0
munmap(0x1085b000000, 6291456) = 0
ioctl(5, 0xc020462a, 0x7fff7f4146e0) = 0
ioctl(4, 0x42, 0x7fff7f414670) = 0
ioctl(4, 0x22, 0x7fff7f414630) = 0
ioctl(5, 0xc0104629, 0x7fff7f414770) = 0
munmap(0x1085b600000, 2097152) = 0
ioctl(4, 0x42, 0x7fff7f4145e0) = 0
ioctl(5, 0xc0104629, 0x7fff7f4146e0) = 0
ioctl(4, 0x42, 0x7fff7f4145e0) = 0
ioctl(5, 0xc0104629, 0x7fff7f4146e0) = 0
ioctl(4, 0x42, 0x7fff7f4145e0) = 0
ioctl(5, 0xc0104629, 0x7fff7f4146e0) = 0
ioctl(4, 0x42, 0x7fff7f414670) = 0
ioctl(4, 0x22, 0x7fff7f414630) = 0
ioctl(5, 0xc0104629, 0x7fff7f414770) = 0
mmap(0x1085b800000, 2097152, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x1085b800000
munmap(0x1085b800000, 2097152) = 0

I got this error:

open("/dev/shm/nccl-shm-recv-5840-0-0", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0600) = -1 EACCES (Permission denied)
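The failing call can be approximated outside NCCL with a small probe. This is only a sketch: the helper name `can_create_in` and the `probe-` file prefix are made up for illustration, and it simply tries to create a file in the given directory with the same open flags and mode that appear in the strace line above.

```python
import errno
import os
import uuid


def can_create_in(directory="/dev/shm"):
    """Try to create and remove a scratch file in `directory`.

    Returns (True, None) on success, or (False, errno_name) on failure.
    Uses the same flags and mode as the open() call seen in the strace log.
    """
    # Hypothetical file name; NCCL uses its own naming scheme.
    name = os.path.join(directory, "probe-" + uuid.uuid4().hex)
    flags = os.O_RDWR | os.O_CREAT | os.O_NOFOLLOW | os.O_CLOEXEC
    try:
        fd = os.open(name, flags, 0o600)
    except OSError as exc:
        # Map the numeric errno (e.g. 13) to its symbolic name (e.g. "EACCES").
        return False, errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    os.unlink(name)
    return True, None


if __name__ == "__main__":
    print(can_create_in("/dev/shm"))
```

If this reports `EACCES` for the user that runs the training script, the problem is the directory's permissions rather than NCCL itself.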

Then I found this code in NCCL:

sprintf(shmName, "nccl-shm-recv-%lx-%d-%d", info->pidHash, info->id, info->rank);

https://github.com/NVIDIA/nccl/blob/master/src/transport/shm.cu

Does anyone know how to fix it?

@RAMBOO1990
Author

Fixed: it works when logged in as root. /dev/shm is writable only by root on this machine:

ll /dev
drwxr-xr-x 2 root root 80 Dec 11 13:35 shm

@AddyLaddy
Collaborator

On our test Linux nodes, the /dev/shm directory has these permissions:

drwxrwxrwt 2 root root 40 Dec 10 22:06 shm
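To compare a machine against the two listings in this thread (`drwxr-xr-x` versus `drwxrwxrwt`), the mode bits can be read directly. A minimal sketch follows; the function name `describe_dir_access` and the returned dictionary keys are illustrative choices, not part of any NCCL or PyTorch API.

```python
import os
import stat


def describe_dir_access(path="/dev/shm"):
    """Summarize the permission bits of `path` for the current user."""
    st = os.stat(path)
    mode = stat.S_IMODE(st.st_mode)  # permission bits only, e.g. 0o1777
    return {
        "mode": oct(mode),
        # The "w" in the last triplet of drwxrwxrwt.
        "world_writable": bool(mode & stat.S_IWOTH),
        # The trailing "t" in drwxrwxrwt.
        "sticky": bool(mode & stat.S_ISVTX),
        # Whether the calling user can create entries in the directory.
        "writable_by_me": os.access(path, os.W_OK | os.X_OK),
    }


if __name__ == "__main__":
    print(describe_dir_access("/dev/shm"))
```

A directory listed as `drwxrwxrwt` corresponds to mode `0o1777`, with both `world_writable` and `sticky` true; `drwxr-xr-x` corresponds to `0o755`, where only the owner can create files.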

@AddyLaddy
Collaborator

Another tip: run with NCCL_DEBUG=WARN set in your environment if you see failures when using NCCL.
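For a Python training script, one way to apply this tip is to set the variable at the very top of the script, before any NCCL communicator is created. The helper below is a hypothetical sketch (`enable_nccl_warnings` is not a library function); it leaves any value already set in the environment untouched.

```python
import os


def enable_nccl_warnings(env=None):
    """Set NCCL_DEBUG=WARN unless the caller has already chosen a level.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    Returns the effective NCCL_DEBUG value.
    """
    env = os.environ if env is None else env
    env.setdefault("NCCL_DEBUG", "WARN")
    return env["NCCL_DEBUG"]


if __name__ == "__main__":
    # Call this before the first multi-GPU operation in the script.
    print(enable_nccl_warnings())
```

Setting it on the command line has the same effect, for example `NCCL_DEBUG=WARN python run_classifier.py`.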
