Error message
Multi-GPU training with PaddlePaddle failed with the following error:
W0111 17:25:32.685145 56257 dynamic_loader.cc:207] You may need to install 'nccl2' from NVIDIA official website: https://developer.nvidia.com/nccl/nccl-download before install PaddlePaddle.
Traceback (most recent call last):
  File "tools/train.py", line 114, in <module>
    main(config, device, logger, vdl_writer)
  File "tools/train.py", line 47, in main
    dist.init_parallel_env()
  File "/home/disk0/zw/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/distributed/parallel.py", line 181, in init_parallel_env
    parallel_helper._init_parallel_ctx()
  File "/home/disk0/zw/anaconda3/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/parallel_helper.py", line 42, in _init_parallel_ctx
    __parallel_ctx__clz__.init()
RuntimeError: (PreconditionNotMet) The third-party dynamic library (libnccl.so) that Paddle depends on is not configured correctly. (error code is libnccl.so: cannot open shared object file: No such file or directory)
  Suggestions:
  - Check if the third-party dynamic library (e.g. CUDA, CUDNN) is installed correctly and its version is matched with paddlepaddle you installed.
  - Configure third-party dynamic library environment variables as follows:
    - Linux: set LD_LIBRARY_PATH by export LD_LIBRARY_PATH=...
    - Windows: set PATH by set PATH=XXX; (at /paddle/paddle/fluid/platform/dynload/dynamic_loader.cc:234) [Hint: If you need C++ stacktraces for debugging, please set `FLAGS_call_stack_level=2`.]
Root cause analysis
Environment
- Python: 3.7
- CUDA: 10.0
- cuDNN: 7.6
- paddlepaddle-gpu: 2.0.0rc1
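The error above makes the cause easy to pin down: Paddle could not find libnccl.so. There are two possible reasons:
- NCCL is not installed
- The directory containing libnccl.so has not been added to the LD_LIBRARY_PATH environment variable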
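To tell the two causes apart, a small diagnostic can help. The following is a minimal sketch, not part of the original setup: the helper names `can_load` and `dirs_containing_nccl` are made up for illustration. It tries to open the library the same way a process would at runtime, and lists which LD_LIBRARY_PATH entries actually contain a libnccl.so* file. Note that the dynamic loader also consults the system linker cache, so the library may load even when no LD_LIBRARY_PATH entry contains it.

```python
import ctypes
import glob
import os


def can_load(library_name="libnccl.so"):
    """Return True if the dynamic loader can open the library by name."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        return False


def dirs_containing_nccl(ld_library_path):
    """List the directories of a colon-separated path that hold libnccl.so*."""
    found = []
    for directory in ld_library_path.split(":"):
        # Skip empty entries produced by leading, trailing, or doubled colons.
        if directory and glob.glob(os.path.join(directory, "libnccl.so*")):
            found.append(directory)
    return found


if __name__ == "__main__":
    search_path = os.environ.get("LD_LIBRARY_PATH", "")
    print("libnccl.so loadable:", can_load())
    print("LD_LIBRARY_PATH entries with libnccl.so*:", dirs_containing_nccl(search_path))
```

If the library is not loadable and no directory is listed, NCCL is most likely not installed; if a libnccl.so* file exists somewhere on disk but none of the listed directories contains it, the search path is the more likely culprit.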
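Solution
Install NCCL
Choose the NCCL release that matches your CUDA version. Packages can be downloaded from NVIDIA's website: https://developer.nvidia.com/nccl/nccl-legacy-downloads
The steps below use CUDA 10.0 as an example.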
1. Download nccl-repo-ubuntu1604-2.6.4-ga-cuda10.0_1-1_amd64.deb
2. Install the local repository package
sudo dpkg -i nccl-repo-ubuntu1604-2.6.4-ga-cuda10.0_1-1_amd64.deb
3. Update the apt package index
sudo apt update
4. Install NCCL
sudo apt install libnccl2=2.6.4-1+cuda10.0 libnccl-dev=2.6.4-1+cuda10.0
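After the installation, the version that is actually picked up can be checked from Python. The sketch below is an illustration under assumptions, not an official tool: it assumes the runtime library is named libnccl.so.2 and exports `ncclGetVersion`, and that the integer version code follows the encoding of the `NCCL_VERSION` macro in nccl.h. The function names `decode_nccl_version` and `installed_nccl_version` are hypothetical.

```python
import ctypes


def decode_nccl_version(code):
    """Convert NCCL's integer version code into (major, minor, patch)."""
    if code < 10000:
        # Encoding assumed for NCCL up to 2.8: major*1000 + minor*100 + patch.
        return code // 1000, (code % 1000) // 100, code % 100
    # Encoding assumed from NCCL 2.9 on: major*10000 + minor*100 + patch.
    return code // 10000, (code % 10000) // 100, code % 100


def installed_nccl_version(library_name="libnccl.so.2"):
    """Return the version tuple of the NCCL library, or None if unavailable."""
    try:
        nccl = ctypes.CDLL(library_name)
        code = ctypes.c_int(0)
        # ncclGetVersion writes the version code into the int we pass in;
        # a return value of 0 corresponds to ncclSuccess.
        if nccl.ncclGetVersion(ctypes.byref(code)) != 0:
            return None
    except (OSError, AttributeError):
        # Library not found, or the symbol is missing in this build.
        return None
    return decode_nccl_version(code.value)


if __name__ == "__main__":
    print("Installed NCCL version:", installed_nccl_version())
```

With the packages from the steps above installed, the expectation is a tuple matching 2.6.4; a result of `None` suggests the library still cannot be opened.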
Add NCCL to the environment variables
NCCL is installed to /usr/lib/x86_64-linux-gnu by default. Edit ~/.bashrc and add the following lines to the file:
# Set the CUDA library directory (this overwrites any existing LD_LIBRARY_PATH value)
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64
# Append the NCCL directory to LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/x86_64-linux-gnu
Save the file, run source ~/.bashrc to apply the configuration, then run echo $LD_LIBRARY_PATH to check that the variable was set. If the configuration succeeded, the output is:
/usr/local/cuda-10.0/lib64:/usr/lib/x86_64-linux-gnu
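If LD_LIBRARY_PATH already holds other directories that should be kept, the new value can be computed instead of hard-coded. This is a minimal sketch of that idea, assuming the same two directories as above; `merge_library_path` is a hypothetical helper, and the script only prints an export line to paste into ~/.bashrc rather than modifying anything itself.

```python
import os


def merge_library_path(current, new_dirs):
    """Append each directory to a colon-separated path unless already present."""
    entries = [entry for entry in current.split(":") if entry]
    for directory in new_dirs:
        if directory not in entries:
            entries.append(directory)
    return ":".join(entries)


if __name__ == "__main__":
    # Directories taken from the article; adjust them to your own install.
    wanted = ["/usr/local/cuda-10.0/lib64", "/usr/lib/x86_64-linux-gnu"]
    merged = merge_library_path(os.environ.get("LD_LIBRARY_PATH", ""), wanted)
    print(f'export LD_LIBRARY_PATH="{merged}"')
```

Starting from an empty variable, this yields the same value as the two export lines; starting from an existing value, earlier entries are preserved and nothing is duplicated.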
References:
- https://forums.developer.nvidia.com/t/have-strange-problem-on-installing-nccl/60654
- https://zhuanlan.zhihu.com/p/174710896
- https://github.com/PaddlePaddle/PaddleDetection/issues/1444
- https://developer.nvidia.com/nccl/nccl-legacy-downloads