MindSpore 8-card distributed training: davinci_model : load task fail, return ret
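8-card distributed training fails with an HCCL error.
[Steps & symptoms]
1. The model is a 3D convolution model, with the distributed-training settings configured.
2. An HCCL error is raised: Distribute Task Failed. Note: data processing takes 40 minutes.
1. Error message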
[ERROR] GE(1530736,ffff4809f1e0,python):2021-10-10-19:36:29.319.926 [mindspore/ccsrc/runtime/device/ascend/ge_runtime/task/hccl_task.cc:100] Distribute] davinci_model : load task fail, return ret: 1343225860
[ERROR] DEVICE(1530736,ffff4809f1e0,python):2021-10-10-19:36:29.320.484 [mindspore/ccsrc/runtime/device/ascend/ascend_kernel_runtime.cc:469] LoadTask] Distribute Task Failed, error: mindspore/ccsrc/runtime/device/ascend/ge_runtime/task/hccl_task.cc:100 Distribute] davinci_model : load task fail, return ret: 1343225860
# In file /root/archiconda3/envs/zjc/lib/python3.7/site-packages/mindspore/ops/_grad/grad_nn_ops.py(83)
dx = input_grad(w, dout, get_shape(x))
^
[ERROR] MD(1530736,ffff38fff1e0,python):2021-10-10-19:36:34.542.081 [mindspore/ccsrc/minddata/dataset/util/task_manager.cc:217] InterruptMaster] Task is terminated with err msg(more detail in info level log):Exception thrown from PyFunc. The actual amount of data read from generator 444 is different from generator.len 8400, you should adjust generator.len to make them match.
Line of code : 198
File : /home/jenkins/agent-working-dir/workspace/Compile_Ascend_ARM_CentOS/mindspore/mindspore/ccsrc/minddata/dataset/engine/datasetops/source/generator_op.cc
[WARNING] MD(1530736,ffffa0a34740,python):2021-10-10-19:36:37.914.259 [mindspore/ccsrc/minddata/dataset/engine/datasetops/device_queue_op.cc:73] ~DeviceQueueOp] preprocess_batch: 140; batch_queue: 0, 0, 0, 0, 0, 0, 0, 0, 0, 64; push_start_time: 2021-10-10-19:32:07.652.799, 2021-10-10-19:32:07.695.438, 2021-10-10-19:32:07.732.808, 2021-10-10-19:32:07.779.769, 2021-10-10-19:32:07.817.956, 2021-10-10-19:32:07.866.328, 2021-10-10-19:32:07.905.937, 2021-10-10-19:32:07.931.153, 2021-10-10-19:32:07.936.371, 2021-10-10-19:32:07.945.894; push_end_time: 2021-10-10-19:32:07.653.279, 2021-10-10-19:32:07.695.918, 2021-10-10-19:32:07.733.354, 2021-10-10-19:32:07.780.238, 2021-10-10-19:32:07.818.448, 2021-10-10-19:32:07.866.782, 2021-10-10-19:32:07.906.422, 2021-10-10-19:32:07.931.613, 2021-10-10-19:32:07.936.843, 2021-10-10-19:36:36.347.214.
Traceback (most recent call last):
  File "train.py", line 139, in <module>
    model.train(config.epoch_size, train_dataset, callbacks=callbacks_list) # , dataset_sink_mode=False
  File "/root/archiconda3/envs/zjc/lib/python3.7/site-packages/mindspore/train/model.py", line 649, in train
    sink_size=sink_size)
  File "/root/archiconda3/envs/zjc/lib/python3.7/site-packages/mindspore/train/model.py", line 439, in _train
    self._train_dataset_sink_process(epoch, train_dataset, list_callback, cb_params, sink_size)
  File "/root/archiconda3/envs/zjc/lib/python3.7/site-packages/mindspore/train/model.py", line 499, in _train_dataset_sink_process
    outputs = self._train_network(*inputs)
  File "/root/archiconda3/envs/zjc/lib/python3.7/site-packages/mindspore/nn/cell.py", line 386, in __call__
    out = self.compile_and_run(*inputs)
  File "/root/archiconda3/envs/zjc/lib/python3.7/site-packages/mindspore/nn/cell.py", line 644, in compile_and_run
    self.compile(*inputs)
  File "/root/archiconda3/envs/zjc/lib/python3.7/site-packages/mindspore/nn/cell.py", line 631, in compile
    _executor.compile(self, *inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode)
  File "/root/archiconda3/envs/zjc/lib/python3.7/site-packages/mindspore/common/api.py", line 531, in compile
    result = self._executor.compile(obj, args_list, phase, use_vm, self.queue_name)
RuntimeError: mindspore/ccsrc/runtime/device/ascend/ascend_kernel_runtime.cc:469 LoadTask] Distribute Task Failed, error: mindspore/ccsrc/runtime/device/ascend/ge_runtime/task/hccl_task.cc:100 Distribute] davinci_model : load task fail, return ret: 1343225860
# In file /root/archiconda3/envs/zjc/lib/python3.7/site-packages/mindspore/ops/_grad/grad_nn_ops.py(83)
dx = input_grad(w, dout, get_shape(x))
^
# In file /root/archiconda3/envs/zjc/lib/python3.7/site-packages/mindspore/ops/_grad/grad_nn_ops.py(83)
dx = input_grad(w, dout, get_shape(x))
^
2. Info-level log
[ERROR] HCCL(167557,python):2021-10-09-17:53:10.542.719 [p2p_mgmt.cc:218][167557][218555][Wait][P2PConnected]connected p2p timeout, timeout:120 s. local logicDevid:0, remote physic id:4.
[ERROR] HCCL(167557,python):2021-10-09-17:53:10.542.778 [p2p_mgmt.cc:185][167557][218555]call trace: ret -> 16
[ERROR] HCCL(167557,python):2021-10-09-17:53:10.542.788 [comm_factory.cc:1087][167557][218555][Get][ExchangerNetwork]Enable P2P Failed, ret[16]
[ERROR] HCCL(167557,python):2021-10-09-17:53:10.542.796 [comm_factory.cc:240][167557][218555][Create][CommOuter]exchangerNetwork create failed
[ERROR] HCCL(167557,python):2021-10-09-17:53:10.542.805 [hccl_impl.cc:1958][167557][218555][Create][OuterComm]errNo[0x0000000005000006] tag[HcomAllReduce_6629421139219749105_0], created commOuter fail. commOuter[0] is null
[ERROR] HCCL(167557,python):2021-10-09-17:53:10.542.885 [hccl_impl.cc:1734][167557][213677][Create][CommByAlg]CreateInnerComm [0] or CreateOuterComm[6] failed. commType[2]
[ERROR] HCCL(167557,python):2021-10-09-17:53:10.542.921 [hccl_impl.cc:1831][167557][213677]call trace: ret -> 4
[ERROR] HCCL(167557,python):2021-10-09-17:53:10.542.938 [hccl_impl.cc:893][167557][213677][HcclImpl][AllReduce]errNo[0x0000000005000004] tag[HcomAllReduce_6629421139219749105_0],all reduce create comm failed
[ERROR] HCCL(167557,python):2021-10-09-17:53:10.542.946 [hccl_comm.cc:232][167557][213677]call trace: ret -> 4
[ERROR] HCCL(167557,python):2021-10-09-17:53:10.542.957 [hcom.cc:246][167557][213677][AllReduce][Result]errNo[0x0000000005010004] hcclComm all reduce error, tag[HcomAllReduce_6629421139219749105_0],input_ptr[0x1088eaa69200], output_ptr[0x108800000200], count[1132288], data_type[4], op[0], stream[0xfffdb566b530]
[ERROR] HCCL(167557,python):2021-10-09-17:53:10.542.968 [hcom_ops_kernel_info_store.cc:309][167557][213677]call trace: ret -> 4
[ERROR] HCCL(167557,python):2021-10-09-17:53:10.542.983 [hcom_ops_kernel_info_store.cc:191][167557][213677]call trace: ret -> 4
[ERROR] HCCL(167557,python):2021-10-09-17:53:10.542.992 [hcom_ops_kernel_info_store.cc:806][167557][213677][Load][Task]errNo[0x0000000005010004] load task failed. (load op[HcomAllReduce] fail)
[ERROR] GE(167557,ffff2b2ef1e0,python):2021-10-09-17:53:10.543.045 [mindspore/ccsrc/runtime/device/ascend/ge_runtime/task/hccl_task.cc:100] Distribute] davinci_model : load task fail, return ret: 1343225860
[ERROR] DEVICE(167557,ffff2b2ef1e0,python):2021-10-09-17:53:10.543.320 [mindspore/ccsrc/runtime/device/ascend/ascend_kernel_runtime.cc:469] LoadTask] Distribute Task Failed, error: mindspore/ccsrc/runtime/device/ascend/ge_runtime/task/hccl_task.cc:100 Distribute] davinci_model : load task fail, return ret: 1343225860
# In file /root/archiconda3/envs/zjc/lib/python3.7/site-packages/mindspore/ops/_grad/grad_nn_ops.py(83)
dx = input_grad(w, dout, get_shape(x))
[ERROR] MD(167557,ffff6edbf1e0,python):2021-10-09-17:53:15.902.036 [mindspore/ccsrc/minddata/dataset/util/task_manager.cc:217] InterruptMaster] Task is terminated with err msg(more detail in info level log):Exception thrown from PyFunc. The actual amount of data read from generator 547 is different from generator.len 8400, you should adjust generator.len to make them match.
Line of code : 198
File : /home/jenkins/agent-working-dir/workspace/Compile_Ascend_ARM_CentOS/mindspore/mindspore/ccsrc/minddata/dataset/engine/datasetops/source/generator_op.cc
[ERROR] HCCL(167557,python):2021-10-09-17:53:17.392.389 [allreducepadding.cc:334][167557][213677][Compare][OutputPaddingMem]Tag[HcomAllReduce_6629421139219749105_0] All Reduce Padding Head Check Failed, TargetValue[41], ActualValue[00]
Answer:
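The log indicates that HCCL timed out while establishing the connection between cards (see the "connected p2p timeout, timeout:120 s" entry in the info-level log).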
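To confirm this diagnosis on a long log, it can help to pull out just the P2P timeout entries. The following is a minimal sketch, assuming the log lines keep the same wording as the info-level log above; the function name `find_p2p_timeouts` and the dictionary keys are hypothetical and not part of MindSpore or HCCL.

```python
import re

# Pattern modelled on the HCCL entry seen in the info-level log:
# "connected p2p timeout, timeout:120 s. local logicDevid:0, remote physic id:4."
P2P_TIMEOUT = re.compile(
    r"connected p2p timeout, timeout:(?P<timeout_s>\d+) s\. "
    r"local logicDevid:(?P<local_dev>\d+), remote physic id:(?P<remote_dev>\d+)"
)


def find_p2p_timeouts(log_text):
    """Return one dict per P2P connection timeout found in the log text."""
    return [
        {key: int(value) for key, value in match.groupdict().items()}
        for match in P2P_TIMEOUT.finditer(log_text)
    ]


# Sample line copied from the info-level log in this post.
sample = (
    "[ERROR] HCCL(167557,python):2021-10-09-17:53:10.542.719 "
    "[p2p_mgmt.cc:218][167557][218555][Wait][P2PConnected]"
    "connected p2p timeout, timeout:120 s. "
    "local logicDevid:0, remote physic id:4."
)
timeouts = find_p2p_timeouts(sample)
print(timeouts)
```

Each result gives the timeout that was in effect and the pair of devices that failed to connect, which shows which remote card to inspect first.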
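One possibility is that another card has already crashed. Another is that the cards run at different speeds, so some reach the connection step much later than others. In the second case, increasing the connection timeout can help.
Export the environment variable HCCL_CONNECT_TIMEOUT=6000 (the value is in seconds; the log shows the timeout in effect was 120 s).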
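If exporting the variable in the launch shell is inconvenient, it can also be set from the training script itself. This is a minimal sketch under two assumptions: the variable is set in every rank's process, and it is set before the HCCL communication is initialized. The helper name `set_hccl_connect_timeout` is hypothetical, and the commented-out `init()` call only marks where communication setup would normally follow.

```python
import os


def set_hccl_connect_timeout(seconds=6000):
    """Set HCCL_CONNECT_TIMEOUT (in seconds) for the current process."""
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    # Environment variable values must be strings.
    os.environ["HCCL_CONNECT_TIMEOUT"] = str(seconds)
    return os.environ["HCCL_CONNECT_TIMEOUT"]


# Call this at the very top of train.py, before any communication setup.
applied = set_hccl_connect_timeout(6000)
print("HCCL_CONNECT_TIMEOUT =", applied)

# Assumption: distributed initialization would follow here, for example:
# from mindspore.communication.management import init
# init()
```

A larger timeout only helps when the cards are slow to reach the connection step; if one card has actually crashed, the job will still fail, just later, so it is worth checking each card's own log as well.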