I am trying to fine-tune BERT with https://github.com/huggingface/pytorch-pretrained-BERT:
python run_classifier.py
12/11/2018 11:07:55 - INFO - main - guid: train-4
12/11/2018 11:07:55 - INFO - main - tokens: [CLS] [UNK] � [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] � [UNK] [UNK] [UNK] [UNK] � [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] � ##� [UNK] [UNK] | | [UNK] � [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] � [UNK] [UNK] � [UNK] [UNK] [UNK] � [UNK] [UNK] [UNK] � [UNK] [UNK] [UNK] [UNK] [UNK] � ##� [UNK] [UNK] | | da ##4 ji ##a1 ji ##4 zh ##u ##4 le yi ##4 ba ##n1 kai ##1 ha ##o ##3 ch ##e ##1 de n ##v ##3 re ##n2 b ##u ##2 shi ##4 ca ##o ##1 ta ##1 de you ##3 qi ##an ##2 ji ##u ##4 shi ##4 ca ##o ##1 ta ##1 ma [SEP]
12/11/2018 11:07:55 - INFO - main - input_ids: 101 100 176 100 100 100 100 100 100 100 100 100 100 192 100 100 100 100 192 100 100 100 100 100 100 100 192 11699 100 100 170 170 100 176 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 192 100 100 192 100 100 100 192 100 100 100 192 100 100 100 100 100 192 11699 100 100 170 170 10005 8159 12095 11414 12095 8159 9998 8207 8159 8983 11242 8159 10322 11310 13072 8148 11643 8167 8152 9537 8154 8148 8363 156 8225 8152 8847 12750 144 8207 8144 11772 8159 8850 8167 8148 8346 8148 8363 8357 8152 11566 8244 8144 12095 8207 8159 11772 8159 8850 8167 8148 8346 8148 9622 102
12/11/2018 11:07:55 - INFO - main - input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
12/11/2018 11:07:55 - INFO - main - segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
12/11/2018 11:07:55 - INFO - main - label: 0 (id = 0)
12/11/2018 11:08:23 - INFO - main - ***** Running training *****
12/11/2018 11:08:23 - INFO - main - Num examples = 10000
12/11/2018 11:08:23 - INFO - main - Batch size = 32
12/11/2018 11:08:23 - INFO - main - Num steps = 937
Segmentation fault
Then I found that the segmentation fault occurs when running:
loss = model(input_ids, segment_ids, input_mask, label_ids)
I then traced the system calls with strace:
strace -o strace.log python run_classifier.py
tail strace.log -n 50
mmap(0x1085b800000, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x1085b800000
mmap(0x1085b800000, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x1085b800000
open("/proc/driver/nvidia/params", O_RDONLY) = 79
fstat(79, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f53fed00000
read(79, "Mobile: 4294967295\nResmanDebugLe"..., 1024) = 491
close(79) = 0
munmap(0x7f53fed00000, 4096) = 0
stat("/dev/nvidiactl", {st_mode=S_IFCHR|0666, st_rdev=makedev(195, 255), ...}) = 0
open("/dev/nvidiactl", O_RDWR) = 79
fcntl(79, F_SETFD, FD_CLOEXEC) = 0
ioctl(10, 0xc0384627, 0x7fff7f413850) = 0
close(79) = 0
ioctl(4, 0x21, 0x7fff7f413490) = 0
ioctl(5, 0xc01c4634, 0x7fff7f4136c0) = 0
ioctl(4, 0x21, 0x7fff7f4132a0) = 0
ioctl(5, 0xc01c4634, 0x7fff7f4136c0) = 0
ioctl(4, 0x21, 0x7fff7f4132a0) = 0
ioctl(5, 0xc01c4634, 0x7fff7f4136c0) = 0
ioctl(4, 0x21, 0x7fff7f4132a0) = 0
ioctl(5, 0xc020462a, 0x7fff7f4148b0) = 0
open("/dev/shm/nccl-shm-recv-5840-0-0", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0600) = -1 EACCES (Permission denied)
ioctl(5, 0xc020462a, 0x7fff7f4146e0) = 0
ioctl(4, 0x42, 0x7fff7f414670) = 0
ioctl(4, 0x22, 0x7fff7f414630) = 0
ioctl(5, 0xc0104629, 0x7fff7f414770) = 0
munmap(0x1085ae00000, 2097152) = 0
ioctl(5, 0xc020462a, 0x7fff7f4146e0) = 0
ioctl(4, 0x42, 0x7fff7f414670) = 0
ioctl(4, 0x22, 0x7fff7f414630) = 0
ioctl(5, 0xc0104629, 0x7fff7f414770) = 0
munmap(0x1085b000000, 6291456) = 0
ioctl(5, 0xc020462a, 0x7fff7f4146e0) = 0
ioctl(4, 0x42, 0x7fff7f414670) = 0
ioctl(4, 0x22, 0x7fff7f414630) = 0
ioctl(5, 0xc0104629, 0x7fff7f414770) = 0
munmap(0x1085b600000, 2097152) = 0
ioctl(4, 0x42, 0x7fff7f4145e0) = 0
ioctl(5, 0xc0104629, 0x7fff7f4146e0) = 0
ioctl(4, 0x42, 0x7fff7f4145e0) = 0
ioctl(5, 0xc0104629, 0x7fff7f4146e0) = 0
ioctl(4, 0x42, 0x7fff7f4145e0) = 0
ioctl(5, 0xc0104629, 0x7fff7f4146e0) = 0
ioctl(4, 0x42, 0x7fff7f414670) = 0
ioctl(4, 0x22, 0x7fff7f414630) = 0
ioctl(5, 0xc0104629, 0x7fff7f414770) = 0
mmap(0x1085b800000, 2097152, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x1085b800000
munmap(0x1085b800000, 2097152) = 0
The trace shows this error:
open("/dev/shm/nccl-shm-recv-5840-0-0", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0600) = -1 EACCES (Permission denied)
Then I found the code in NCCL that builds this file name:
sprintf(shmName, "nccl-shm-recv-%lx-%d-%d", info->pidHash, info->id, info->rank);
https://github.com/NVIDIA/nccl/blob/master/src/transport/shm.cu
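To check whether the failure is independent of NCCL and PyTorch, the failing call can be repeated by hand. The following is only a rough sketch: `nccl_shm_recv_name` and `try_create_file` are hypothetical helper names, not part of NCCL. The first mirrors the format string quoted above, and the second opens a file with the same flags that appear in the strace output, plus `O_EXCL` (an addition for safety, so that an existing file is never touched or deleted).

```python
import errno
import os


def nccl_shm_recv_name(pid_hash, peer_id, rank):
    """Mirror the quoted format string "nccl-shm-recv-%lx-%d-%d"."""
    # %lx in C prints the hash in lowercase hexadecimal.
    return "nccl-shm-recv-%x-%d-%d" % (pid_hash, peer_id, rank)


def try_create_file(directory, name):
    """Try to create `name` inside `directory` and then remove it again.

    Returns None on success, or the symbolic errno name (for example
    "EACCES") when the open call fails.
    """
    path = os.path.join(directory, name)
    # Same flags as in the strace output, plus O_EXCL (assumption: added
    # here so the probe fails instead of reusing a file that already exists).
    flags = (os.O_RDWR | os.O_CREAT | os.O_EXCL
             | os.O_NOFOLLOW | os.O_CLOEXEC)
    try:
        fd = os.open(path, flags, 0o600)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    os.unlink(path)
    return None
```

Calling `try_create_file("/dev/shm", "probe-" + nccl_shm_recv_name(0x5840, 0, 0))` as the same user that runs the training script should return `"EACCES"` if the directory permissions are the cause, and `None` otherwise.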
Does anyone know how to fix it?
Fix: log in as root. On my machine, /dev/shm is writable only by root:
ll /dev
drwxr-xr-x 2 root root 80 Dec 11 13:35 shm
On our test Linux nodes, the /dev/shm directory has these permissions:
drwxrwxrwt 2 root root 40 Dec 10 22:06 shm
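The two listings in this thread differ only in the mode bits: drwxr-xr-x (0755) on the failing machine versus drwxrwxrwt (1777) here. A minimal sketch for checking this from Python, assuming the shared-memory directory is /dev/shm; the helper name `describe_shm_permissions` is hypothetical and not part of NCCL or PyTorch:

```python
import os
import stat


def describe_shm_permissions(path="/dev/shm"):
    """Return a small report on the permissions of a directory."""
    st = os.stat(path)
    mode = stat.S_IMODE(st.st_mode)  # permission bits incl. the sticky bit
    return {
        "path": path,
        # Same notation as `ls -l`, e.g. "drwxrwxrwt".
        "mode_string": stat.filemode(st.st_mode),
        "mode_octal": oct(mode),
        # True when the mode includes rwx for everyone plus the sticky bit.
        "world_writable_sticky": (mode & 0o1777) == 0o1777,
        # True when the current user can create entries in the directory.
        "writable_by_me": os.access(path, os.W_OK | os.X_OK),
    }


if __name__ == "__main__" and os.path.isdir("/dev/shm"):
    for key, value in describe_shm_permissions().items():
        print(f"{key}: {value}")
```

If `writable_by_me` is false for the user running the training script, an administrator would need to restore the usual mode on the directory (typically 1777); running everything as root only works around the problem.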
Another tip: if you see failures when using NCCL, run with NCCL_DEBUG=WARN set in your environment.
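One way to apply this without changing the shell configuration is to launch the script from a small Python wrapper that sets the variable only for the child process. This is just a sketch: `nccl_debug_env` and `run_with_nccl_debug` are hypothetical helper names, and the script name is taken from the report above.

```python
import os
import subprocess
import sys


def nccl_debug_env(level="WARN", base_env=None):
    """Return a copy of the environment with NCCL_DEBUG set to `level`."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = level
    return env


def run_with_nccl_debug(script="run_classifier.py", level="WARN"):
    """Run `script` with the current interpreter and NCCL_DEBUG set.

    Returns the exit code of the child process.
    """
    result = subprocess.run([sys.executable, script],
                            env=nccl_debug_env(level))
    return result.returncode
```

The variable has to be in place before NCCL initializes, which is why the sketch sets it for the whole child process rather than somewhere inside the training loop.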