mpirun topics

MPIRUN 31280 segmentation fault (core dumped)

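When running multi-node NCCL jobs with mpirun, the job sometimes hangs, even when mpi_host is specified. NCCL itself works and the nodes can communicate with each other, but as soon as the job is launched through mpirun it hangs and then dumps core. It exits early with:

[worker0:38156] *** Process received signal ***
[worker0:38156] Signal: Segmentation fault (11)
[worker0:3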
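
A signal report like the one above can end up fused onto a single line when it is copied out of a terminal or log collector, which makes it hard to see which host and PID crashed first. The following is a minimal sketch, not part of the original note, that splits such a string back into one record per message. The function name `split_signal_report` is hypothetical, and the `[host:pid] ` prefix format is an assumption inferred only from the two complete entries in the log; the truncated trailing fragment is left out of the sample input.

```python
import re

# Assumed prefix format "[host:pid] ", inferred from the log in this note.
PREFIX = re.compile(r"\[([^\[\]:]+):(\d+)\] ")


def split_signal_report(raw):
    """Split a fused signal report into (host, pid, message) tuples."""
    matches = list(PREFIX.finditer(raw))
    records = []
    for i, m in enumerate(matches):
        # A message runs from the end of its prefix to the next prefix.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(raw)
        records.append((m.group(1), int(m.group(2)), raw[m.end():end].strip()))
    return records


# The two complete entries from the report, fused as in the original paste.
raw = (
    "[worker0:38156] *** Process received signal ***"
    "[worker0:38156] Signal: Segmentation fault (11)"
)

for host, pid, message in split_signal_report(raw):
    print(host, pid, message)
```

Grouping the records by host and PID may help tell whether a single rank crashed or every rank on a node did, which is a useful first distinction when NCCL works on its own but the job fails under mpirun.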