
require 'nccl' doesn't work #5

Closed
loeweX opened this issue Apr 12, 2016 · 11 comments · Fixed by NVIDIA/nccl#18


@loeweX

loeweX commented Apr 12, 2016

Hello!

I was trying to use nccl for Torch. Installing worked fine for both the NVIDIA library and the Torch bindings. However, when I run require 'nccl', I get this error message:

nccl/ffi.lua:195: 'libnccl.so not found in library path.
Please install nccl.
Then make sure all the files named as libnccl.so* are placed in your library load path (for example /usr/local/lib , or manually add a path to LD_LIBRARY_PATH)

The problem is that I already added libnccl.so to my LD_LIBRARY_PATH and double-checked that it's really there. So I tried to manually load the two files that ffi.lua loads, but then I get these errors:

res = ffi.load('libnccl.1.dylib')
[string "res = ffi.load('libnccl.1.dylib')"]:1: libnccl.1.dylib: cannot open shared object file: No such file or directory

Indeed, this file doesn't exist in my nccl/lib directory. Do I have to create it manually, since it wasn't created when I installed nccl? How would I do that?
For the second file, I get this error:

res = ffi.load('libnccl.so.1')
[string "res = ffi.load('libnccl.so.1')"]:1: /usr/local/nccl/lib/libnccl.so.1: undefined symbol: __cudaRegisterFatBinary

How can I solve this problem?
Thanks!
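One point worth checking here: LD_LIBRARY_PATH is a colon-separated list of directories, not file names, so the loader only finds libnccl.so if the directory that contains it is listed. A minimal sketch for checking this from a POSIX shell; the helper name find_nccl_in_ld_path is hypothetical and not part of nccl or Torch:

```shell
#!/bin/sh
# find_nccl_in_ld_path: hypothetical helper that prints every file matching
# libnccl.so* inside the directories listed in LD_LIBRARY_PATH.
# LD_LIBRARY_PATH holds directories separated by ':', not library files.
find_nccl_in_ld_path() {
    old_ifs=$IFS
    IFS=:                     # split LD_LIBRARY_PATH on colons
    for dir in $LD_LIBRARY_PATH; do
        # skip empty entries and anything that is not a directory
        if [ ! -d "$dir" ]; then
            continue
        fi
        for lib in "$dir"/libnccl.so*; do
            # an unmatched glob stays literal, so test that the file exists
            if [ -e "$lib" ]; then
                echo "$lib"
            fi
        done
    done
    IFS=$old_ifs              # restore the caller's field separator
}
```

Run it as find_nccl_in_ld_path in the same shell that starts Torch; empty output means no listed directory contains the library. Note that finding the file is not enough on its own: as the rest of this thread shows, loading can still fail if the libraries it depends on are not loaded.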

@ngimel
Owner

ngimel commented Apr 12, 2016

You only need one of those files: libnccl.1.dylib (macOS) or libnccl.so.1 (Linux). The error message you got from nccl ("'libnccl.so not found in library path") is confusing, sorry; your actual problem is that libnccl.so.1 cannot be loaded successfully. "undefined symbol: __cudaRegisterFatBinary" indicates a problem with your CUDA install, or Torch compiled without CUDA support. Does require 'cutorch' work?

@apaszke

apaszke commented Apr 12, 2016

I have the same issue (library loading fails with an undefined symbol). cutorch and other CUDA packages work just fine.

@apaszke

apaszke commented Apr 12, 2016

These symbols are linked into the library as undefined:

~/nccl/build/lib [master*]$ nm libnccl.so | grep FatBinary
                 U __cudaRegisterFatBinary
                 U __cudaUnregisterFatBinary

@ngimel
Owner

ngimel commented Apr 12, 2016

Are you using nccl built from source, or the deb package? I can't reproduce this with nccl built from source.

@apaszke

apaszke commented Apr 13, 2016

I built it from source 😕

@apaszke

apaszke commented Apr 13, 2016

I'm using CUDA 7.5.18 on Ubuntu 14.04.1 btw

@apaszke

apaszke commented Apr 13, 2016

I managed to solve this issue. It seems that the gcc linker didn't record libcudart.so as a dependency (a NEEDED entry) in the dynamic section of libnccl.so, so libcudart.so wasn't loaded along with libnccl.so when Torch required it. I solved it by adding linker flags to nccl's Makefile. I will submit a PR today.
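One way to confirm this diagnosis on your own build is to inspect the dynamic section: `readelf -d` prints one `(NEEDED)` entry for each shared library the linker recorded as a dependency. If libcudart is missing there, __cudaRegisterFatBinary and __cudaUnregisterFatBinary stay unresolved when Torch loads libnccl.so. A minimal sketch, assuming binutils' readelf is installed and the library is in ~/nccl/build/lib as in the nm output earlier in the thread; the helper name needs_cudart is hypothetical:

```shell
#!/bin/sh
# needs_cudart: hypothetical helper that reads `readelf -d` output on stdin
# and prints "yes" if libcudart is listed among the (NEEDED) dependencies.
needs_cudart() {
    # keep only dependency lines, then look for the CUDA runtime among them
    if grep '(NEEDED)' | grep -q 'libcudart'; then
        echo "yes"
    else
        echo "no"
    fi
}

# Example (path assumed from the build tree shown in this thread):
#   readelf -d ~/nccl/build/lib/libnccl.so | needs_cudart
```

If it prints "no", the dependency has to be recorded when libnccl.so is linked, which is what the Makefile change described here does; the exact flags from the PR are not reproduced in this thread.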

@loeweX
Author

loeweX commented Apr 14, 2016

Your solution worked for me, too. Thanks a lot!

@ngimel
Owner

ngimel commented Apr 26, 2016

Thanks, guys. Closing.

@ngimel ngimel closed this as completed Apr 26, 2016
@PratsBhatt

I am still facing the issue. Can you please explain the steps in detail? I am new to both Torch and nccl, so I don't understand exactly what I need to do to solve this issue.
I appreciate your help. Thanks in advance.

@onepiece666

Quoting @apaszke: "I managed to solve this issue. It seems that the gcc linker didn't add libcudart.so to the dynamic section, so it wasn't loaded along with libnccl.so when required from Torch. I solved it by adding linker flags to nccl's Makefile. I will submit a PR today."

Can you please explain the steps in detail? Thanks.
