Megatron-LM: Verifying the Effect of the 1F1B Interleaved Schedule
- 1. Create the container
- 2. Install Megatron-LM and prepare the dataset
- 3. Prepare the parsing scripts
- 4. PP4 test
- 5. PP4 VP2 test
- 6. NCCL bandwidth test
This article tests whether the 1F1B interleaved schedule can actually squeeze out the pipeline bubble. The server used does not support P2P, and its PCIe link is Gen1 x16, so NCCL all_reduce_perf measures only 1.166 GB/s. Under these conditions, enabling interleaved mode noticeably increases the fraction of time spent in communication kernels (share of the profiled window occupied by SendRecv kernels, PP4 -> PP4 VP2):
- rank0: 0.15 -> 0.50
- rank1: 0.26 -> 0.70
- rank2: 0.24 -> 0.73
- rank3: 0.13 -> 0.12
In this situation there is therefore no performance gain.
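On paper, interleaving should shrink the bubble substantially. A quick sanity check using the common approximation from the Megatron-LM interleaved-scheduling paper (a sketch only; `v` is the number of model chunks per rank, which for this config is an inference from `--num-layers 32`, PP4, and 2 layers per virtual stage):

```python
# Approximate pipeline-bubble fraction: bubble ~= (p - 1) / (v * m), where
# p = pipeline stages, m = microbatches per step, v = interleaving factor
# (model chunks per rank). v = 1 is plain 1F1B.
def bubble_fraction(p: int, m: int, v: int = 1) -> float:
    return (p - 1) / (v * m)

# This experiment: p = 4, global batch 16 / micro batch 1 -> m = 16.
print(bubble_fraction(4, 16, v=1))  # plain 1F1B: 0.1875
# With 32 layers, PP4, and 2 layers per virtual stage, each rank holds
# 32 / (4 * 2) = 4 chunks, i.e. v = 4.
print(bubble_fraction(4, 16, v=4))  # interleaved: 0.046875
```

The measurements below show why this theoretical gain does not materialize here: the extra SendRecv traffic that interleaving introduces dominates on the slow link.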
1. Create the container
docker run --gpus all --shm-size=32g -ti -e NVIDIA_VISIBLE_DEVICES=all --privileged \
    --net=host -v $PWD:/home \
    -w /home --rm nvcr.io/nvidia/pytorch:23.07-py3 /bin/bash
2. Install Megatron-LM and prepare the dataset
cd /home
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
python3 setup.py install
mkdir gpt2-data
cd gpt2-data
wget https://huggingface.co/bigscience/misc-test-data/resolve/main/stas/oscar-1GB.jsonl.xz
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
xz -d oscar-1GB.jsonl.xz
pip install nltk
python3 ../tools/preprocess_data.py \
    --input oscar-1GB.jsonl \
    --output-prefix gpt2 \
    --vocab-file gpt2-vocab.json \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file gpt2-merges.txt \
    --append-eod \
    --workers 8
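preprocess_data.py consumes JSON-lines input: one JSON object per line with the document under a "text" key, which is the layout of oscar-1GB.jsonl. A minimal sketch of the expected format (the sample sentences are made up):

```python
import json
import os
import tempfile

# One JSON object per line, each with a "text" field -- the default key
# that preprocess_data.py tokenizes.
docs = [
    {"text": "Pipeline parallelism splits layers across GPUs."},
    {"text": "1F1B interleaving assigns several model chunks per GPU."},
]

path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "w") as f:
    for d in docs:
        f.write(json.dumps(d) + "\n")

# Sanity-check the layout the tokenizer workers will see.
with open(path) as f:
    lines = [json.loads(l) for l in f]
assert all("text" in d for d in lines)
print(len(lines), "documents")  # 2 documents
```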
3. Prepare the parsing scripts
cd /home/Megatron-LM
cat > nsys2json.py << EOF
import sqlite3
import argparse
import json
from pathlib import Path
import re
from collections import defaultdict

_PID_TO_DEVICE = None

# Code adapted from https://raw.githubusercontent.com/chenyu-jiang/nsys2json/main/nsys2json.py

def parse_args():
    parser = argparse.ArgumentParser(description='Convert nsight systems sqlite output to Google Event Trace compatible JSON.')
    parser.add_argument("-f", '--filename', help="Path to the input sqlite file.", required=True)
    parser.add_argument("-o", "--output", help="Output file name, default to same as input with .json extension.")
    parser.add_argument("-t", "--activity-type", help="Type of activities shown. Default to all.", default=["kernel", "nvtx-kernel"], choices=['kernel', 'nvtx', "nvtx-kernel", "cuda-api"], nargs="+")
    parser.add_argument("--nvtx-event-prefix", help="Filter NVTX events by their names' prefix.", type=str, nargs="*")
    parser.add_argument("--nvtx-color-scheme", help="""Color scheme for NVTX events.
Accepts a dict mapping a string to one of chrome tracing colors.
Events with names containing the string will be colored.
E.g. {"send": "thread_state_iowait", "recv": "thread_state_iowait", "compute": "thread_state_running"}
For details of the color scheme, see links in https://github.com/google/perfetto/issues/208""", type=json.loads, default={})
    args = parser.parse_args()
    if args.output is None:
        args.output = Path(args.filename).with_suffix(".json")
    return args

class ActivityType:
    KERNEL = "kernel"
    NVTX_CPU = "nvtx"
    NVTX_KERNEL = "nvtx-kernel"
    CUDA_API = "cuda-api"

def munge_time(t):
    """Take a timestamp from nsys (ns) and convert it into us (the default for chrome://tracing)."""
    # For strict correctness, divide by 1000, but this reduces accuracy.
    return t / 1000.

# For reference of the schema, see
# https://docs.nvidia.com/nsight-systems/UserGuide/index.html#exporter-sqlite-schema
def parse_cupti_kernel_events(conn: sqlite3.Connection, strings: dict):
    per_device_kernel_rows = defaultdict(list)
    per_device_kernel_events = defaultdict(list)
    for row in conn.execute("SELECT * FROM CUPTI_ACTIVITY_KIND_KERNEL"):
        per_device_kernel_rows[row["deviceId"]].append(row)
        event = {
            "name": strings[row["shortName"]],
            "ph": "X",  # Complete Event (Begin + End event)
            "cat": "cuda",
            "ts": munge_time(row["start"]),
            "dur": munge_time(row["end"] - row["start"]),
            "tid": "Stream {}".format(row["streamId"]),
            "pid": "Device {}".format(row["deviceId"]),
            "args": {
                # TODO: More
            },
        }
        per_device_kernel_events[row["deviceId"]].append(event)
    return per_device_kernel_rows, per_device_kernel_events

def link_pid_with_devices(conn: sqlite3.Connection):
    # map each pid to a device. assumes each pid is associated with a single device
    global _PID_TO_DEVICE
    if _PID_TO_DEVICE is None:
        pid_to_device = {}
        for row in conn.execute("SELECT DISTINCT deviceId, globalPid / 0x1000000 % 0x1000000 AS PID FROM CUPTI_ACTIVITY_KIND_KERNEL"):
            assert row["PID"] not in pid_to_device, \
                f"A single PID ({row['PID']}) is associated with multiple devices ({pid_to_device[row['PID']]} and {row['deviceId']})."
            pid_to_device[row["PID"]] = row["deviceId"]
        _PID_TO_DEVICE = pid_to_device
    return _PID_TO_DEVICE

def parse_nvtx_events(conn: sqlite3.Connection, event_prefix=None, color_scheme={}):
    if event_prefix is None:
        match_text = ''
    else:
        match_text = " AND "
        if len(event_prefix) == 1:
            match_text += f"NVTX_EVENTS.text LIKE '{event_prefix[0]}%'"
        else:
            match_text += "("
            for idx, prefix in enumerate(event_prefix):
                match_text += f"NVTX_EVENTS.text LIKE '{prefix}%'"
                if idx == len(event_prefix) - 1:
                    match_text += ")"
                else:
                    match_text += " OR "
    per_device_nvtx_rows = defaultdict(list)
    per_device_nvtx_events = defaultdict(list)
    pid_to_device = link_pid_with_devices(conn)
    # eventType 59 is NvtxPushPopRange, which corresponds to torch.cuda.nvtx.range apis
    for row in conn.execute(f"SELECT start, end, text, globalTid / 0x1000000 % 0x1000000 AS PID, globalTid % 0x1000000 AS TID FROM NVTX_EVENTS WHERE NVTX_EVENTS.eventType == 59{match_text};"):
        text = row['text']
        pid = row['PID']
        tid = row['TID']
        # check the pid is known before using it as a lookup key
        assert pid in pid_to_device, f"PID {pid} not found in the pid to device map."
        per_device_nvtx_rows[pid_to_device[pid]].append(row)
        event = {
            "name": text,
            "ph": "X",  # Complete Event (Begin + End event)
            "cat": "nvtx",
            "ts": munge_time(row["start"]),
            "dur": munge_time(row["end"] - row["start"]),
            "tid": "NVTX Thread {}".format(tid),
            "pid": "Device {}".format(pid_to_device[pid]),
            "args": {
                # TODO: More
            },
        }
        if color_scheme:
            for key, color in color_scheme.items():
                if re.search(key, text):
                    event["cname"] = color
                    break
        per_device_nvtx_events[pid_to_device[pid]].append(event)
    return per_device_nvtx_rows, per_device_nvtx_events

def parse_cuda_api_events(conn: sqlite3.Connection, strings: dict):
    pid_to_devices = link_pid_with_devices(conn)
    per_device_api_rows = defaultdict(list)
    per_device_api_events = defaultdict(list)
    # event type 0 is TRACE_PROCESS_EVENT_CUDA_RUNTIME
    for row in conn.execute(f"SELECT start, end, globalTid / 0x1000000 % 0x1000000 AS PID, globalTid % 0x1000000 AS TID, correlationId, nameId FROM CUPTI_ACTIVITY_KIND_RUNTIME;"):
        text = strings[row['nameId']]
        pid = row['PID']
        tid = row['TID']
        correlationId = row['correlationId']
        per_device_api_rows[pid_to_devices[pid]].append(row)
        event = {
            "name": text,
            "ph": "X",  # Complete Event (Begin + End event)
            "cat": "cuda_api",
            "ts": munge_time(row["start"]),
            "dur": munge_time(row["end"] - row["start"]),
            "tid": "CUDA API Thread {}".format(tid),
            "pid": "Device {}".format(pid_to_devices[pid]),
            "args": {
                "correlationId": correlationId,
            },
        }
        per_device_api_events[pid_to_devices[pid]].append(event)
    return per_device_api_rows, per_device_api_events

def _find_overlapping_intervals(nvtx_rows, cuda_api_rows):
    mixed_rows = []
    for nvtx_row in nvtx_rows:
        start = nvtx_row["start"]
        end = nvtx_row["end"]
        mixed_rows.append((start, 1, "nvtx", nvtx_row))
        mixed_rows.append((end, -1, "nvtx", nvtx_row))
    for cuda_api_row in cuda_api_rows:
        start = cuda_api_row["start"]
        end = cuda_api_row["end"]
        mixed_rows.append((start, 1, "cuda_api", cuda_api_row))
        mixed_rows.append((end, -1, "cuda_api", cuda_api_row))
    mixed_rows.sort(key=lambda x: (x[0], x[1], x[2]))
    active_intervals = []
    result = defaultdict(list)
    for _, event_type, event_origin, orig_event in mixed_rows:
        if event_type == 1:
            # start
            if event_origin == "nvtx":
                active_intervals.append(orig_event)
            else:
                for event in active_intervals:
                    result[event].append(orig_event)
        else:
            # end
            if event_origin == "nvtx":
                active_intervals.remove(orig_event)
    return result

def link_nvtx_events_to_kernel_events(strings: dict,
                                      pid_to_device: dict[int, int],
                                      per_device_nvtx_rows: dict[int, list],
                                      per_device_cuda_api_rows: dict[int, list],
                                      per_device_cuda_kernel_rows: dict[int, list],
                                      per_device_kernel_events: dict[int, list]):
    result = {}
    for device in pid_to_device.values():
        event_map = _find_overlapping_intervals(per_device_nvtx_rows[device], per_device_cuda_api_rows[device])
        correlation_id_map = defaultdict(dict)
        for cuda_api_row in per_device_cuda_api_rows[device]:
            correlation_id_map[cuda_api_row["correlationId"]]["cuda_api"] = cuda_api_row
        for kernel_row, kernel_trace_event in zip(per_device_cuda_kernel_rows[device], per_device_kernel_events[device]):
            correlation_id_map[kernel_row["correlationId"]]["kernel"] = kernel_row
            correlation_id_map[kernel_row["correlationId"]]["kernel_trace_event"] = kernel_trace_event
        for nvtx_row, cuda_api_rows in event_map.items():
            kernel_start_time = None
            kernel_end_time = None
            for cuda_api_row in cuda_api_rows:
                if "kernel" not in correlation_id_map[cuda_api_row["correlationId"]]:
                    # other cuda api event, ignore
                    continue
                kernel_row = correlation_id_map[cuda_api_row["correlationId"]]["kernel"]
                kernel_trace_event = correlation_id_map[cuda_api_row["correlationId"]]["kernel_trace_event"]
                if "NVTXRegions" not in kernel_trace_event["args"]:
                    kernel_trace_event["args"]["NVTXRegions"] = []
                kernel_trace_event["args"]["NVTXRegions"].append(nvtx_row["text"])
                if kernel_start_time is None or kernel_start_time > kernel_row["start"]:
                    kernel_start_time = kernel_row["start"]
                if kernel_end_time is None or kernel_end_time < kernel_row["end"]:
                    kernel_end_time = kernel_row["end"]
            if kernel_start_time is not None and kernel_end_time is not None:
                result[nvtx_row] = (kernel_start_time, kernel_end_time)
    return result

def parse_all_events(conn: sqlite3.Connection, strings: dict, activities=None, event_prefix=None, color_scheme={}):
    if activities is None:
        activities = [ActivityType.KERNEL, ActivityType.NVTX_CPU, ActivityType.NVTX_KERNEL]
    if ActivityType.KERNEL in activities or ActivityType.NVTX_KERNEL in activities:
        per_device_kernel_rows, per_device_kernel_events = parse_cupti_kernel_events(conn, strings)
    if ActivityType.NVTX_CPU in activities or ActivityType.NVTX_KERNEL in activities:
        per_device_nvtx_rows, per_device_nvtx_events = parse_nvtx_events(conn, event_prefix=event_prefix, color_scheme=color_scheme)
    if ActivityType.CUDA_API in activities or ActivityType.NVTX_KERNEL in activities:
        per_device_cuda_api_rows, per_device_cuda_api_events = parse_cuda_api_events(conn, strings)
    if ActivityType.NVTX_KERNEL in activities:
        pid_to_device = link_pid_with_devices(conn)
        nvtx_kernel_event_map = link_nvtx_events_to_kernel_events(strings, pid_to_device, per_device_nvtx_rows, per_device_cuda_api_rows, per_device_kernel_rows, per_device_kernel_events)
    traceEvents = []
    if ActivityType.KERNEL in activities:
        for k, v in per_device_kernel_events.items():
            traceEvents.extend(v)
    if ActivityType.NVTX_CPU in activities:
        for k, v in per_device_nvtx_events.items():
            traceEvents.extend(v)
    if ActivityType.CUDA_API in activities:
        for k, v in per_device_cuda_api_events.items():
            traceEvents.extend(v)
    if ActivityType.NVTX_KERNEL in activities:
        for nvtx_event, (kernel_start_time, kernel_end_time) in nvtx_kernel_event_map.items():
            event = {
                "name": nvtx_event["text"],
                "ph": "X",  # Complete Event (Begin + End event)
                "cat": "nvtx-kernel",
                "ts": munge_time(kernel_start_time),
                "dur": munge_time(kernel_end_time - kernel_start_time),
                "tid": "NVTX Kernel Thread {}".format(nvtx_event["tid"]),
                "pid": "Device {}".format(pid_to_device[nvtx_event["pid"]]),
                "args": {
                    # TODO: More
                },
            }
            traceEvents.append(event)
    return traceEvents

def nsys2json():
    args = parse_args()
    conn = sqlite3.connect(args.filename)
    conn.row_factory = sqlite3.Row
    strings = {}
    for r in conn.execute("SELECT id, value FROM StringIds"):
        strings[r["id"]] = r["value"]
    traceEvents = parse_all_events(conn, strings, activities=args.activity_type, event_prefix=args.nvtx_event_prefix, color_scheme=args.nvtx_color_scheme)
    # make the timelines appear in pid and tid order
    traceEvents.sort(key=lambda x: (x["pid"], x["tid"]))
    for i in traceEvents:
        if i["name"] is None:
            i["name"] = "null"
    with open(args.output, 'w') as f:
        json.dump(traceEvents, f, indent=4)

if __name__ == "__main__":
    nsys2json()
EOF
cat > parser_prof.py << EOF
import json
import re
import os
import sys
import numpy as np

with open(sys.argv[1], "r") as f:
    traceEvents = json.load(f)
traceEventsPerDevice = {}
for event in traceEvents:
    pid = event["pid"]
    if pid not in traceEventsPerDevice:
        traceEventsPerDevice[pid] = []
    if event["cat"] == "cuda":
        epoch_str = event["args"]['NVTXRegions'][0]
        if event["name"] == "Kernel":
            event["name"] = epoch_str.split(",")[0]
        traceEventsPerDevice[pid].append((event["name"] + "_" + event["tid"], event["ts"], event["dur"]))
for k, v in traceEventsPerDevice.items():
    v.sort(key=lambda x: x[1], reverse=False)
for device, v in traceEventsPerDevice.items():
    print(f"-----------------------------{device}-----------------------------")
    totalDurPerKernel = {}
    durPerKernel = {}
    marginPerKernel = {"beg": float('inf'), "end": 0}
    for ev in v:
        name, ts, dur = ev
        if name not in totalDurPerKernel:
            totalDurPerKernel[name] = 0
            durPerKernel[name] = []
        totalDurPerKernel[name] += dur
        durPerKernel[name].append(dur)
        if ts < marginPerKernel["beg"]:
            marginPerKernel["beg"] = ts
        if ts > marginPerKernel["end"]:
            marginPerKernel["end"] = ts
    total_percent = 0
    total_dur = marginPerKernel["end"] - marginPerKernel["beg"]
    for name, dur in sorted(totalDurPerKernel.items(), key=lambda d: d[1], reverse=True):
        total_percent += (dur / total_dur)
        print("{:7.2f} min:{:10.2f} max:{:10.2f} avg:{:10.2f} {}".format(dur / total_dur, np.min(durPerKernel[name]), np.max(durPerKernel[name]), np.mean(durPerKernel[name]), name))
    print("{:7.2f}".format(total_percent))
EOF
cat > parser_nccl.py << EOF
import json
import re
import os
import sys
import numpy as np

with open(sys.argv[1], "r") as f:
    traceEvents = json.load(f)
traceEventsPerDevice = {}
for event in traceEvents:
    pid = event["pid"]
    if pid not in traceEventsPerDevice:
        traceEventsPerDevice[pid] = []
    name = event['name']
    if event["cat"] == "nvtx-kernel" and name.startswith("nccl"):
        op_name = name.split(",")[0]
        match = re.match(".*sizes = (.*?), input_op_ids", name)
        data = eval("np.ones({})".format(match.group(1).replace('[', "(").replace("]", ")")))
        # print(op_name, data.shape, data.size)
        traceEventsPerDevice[pid].append((op_name, data.size))
for k, v in traceEventsPerDevice.items():
    v.sort(key=lambda x: x[1], reverse=False)
for device, v in traceEventsPerDevice.items():
    print(f"-----------------------------{device}-----------------------------")
    totalDurPerKernel = {}
    for ev in v:
        name, size = ev
        if name not in totalDurPerKernel:
            totalDurPerKernel[name] = 0
        totalDurPerKernel[name] += size
    for name, size in sorted(totalDurPerKernel.items(), key=lambda d: d[1], reverse=True):
        print("{:14.2f} {}".format(size, name))
EOF
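The scripts above convert the nsys sqlite export into Chrome trace format (viewable in chrome://tracing or Perfetto). Two details worth calling out, sketched below: the ns-to-us timestamp conversion, and how nsys packs PID and TID into a single `globalTid` integer, which the SQL queries unpack with `/ 0x1000000 % 0x1000000` arithmetic (the concrete ids in this sketch are made up):

```python
def munge_time(t_ns):
    # ns -> us, the unit chrome://tracing expects (same as nsys2json.py)
    return t_ns / 1000.

def unpack_global_tid(global_tid):
    # Python equivalent of the SQL: globalTid / 0x1000000 % 0x1000000 AS PID,
    # globalTid % 0x1000000 AS TID (integer division).
    pid = global_tid // 0x1000000 % 0x1000000
    tid = global_tid % 0x1000000
    return pid, tid

# A synthetic packed id (hypothetical pid=1234, tid=56):
packed = (1234 * 0x1000000) + 56
print(unpack_global_tid(packed))  # (1234, 56)

# A minimal Chrome-trace "complete" event ("ph": "X") like the ones emitted:
event = {
    "name": "ncclKernel_SendRecv",  # example kernel name
    "ph": "X", "cat": "cuda",
    "ts": munge_time(1_000_000),    # 1 ms after trace start -> 1000.0 us
    "dur": munge_time(500_000),     # 0.5 ms -> 500.0 us
    "tid": "Stream 7", "pid": "Device 0",
}
print(event["ts"], event["dur"])  # 1000.0 500.0
```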
4. PP4 test
cd /home/Megatron-LM
export MAX_JOBS=8
export NCCL_DEBUG=error
export NCCL_SOCKET_IFNAME=ens8
export NCCL_IB_DISABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NVTE_FLASH_ATTN=0
export NVTE_FUSED_ATTN=0
nsys profile --stats=true -o cuda_profiling_report.nsys-rep -f true -t cuda,nvtx --gpu-metrics-device=0,1,2,3 \
    --capture-range=cudaProfilerApi --capture-range-end=stop \
    torchrun --nproc_per_node 4 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 7000 pretrain_gpt.py \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 4 \
    --distributed-backend nccl \
    --num-layers 32 \
    --hidden-size 4096 \
    --ffn-hidden-size 11008 \
    --num-attention-heads 32 \
    --seq-length 32 \
    --max-position-embeddings 32 \
    --micro-batch-size 1 \
    --global-batch-size 16 \
    --train-iters 6 \
    --eval-iters 0 \
    --log-interval 3 \
    --profile \
    --profile-step-start 2 \
    --profile-step-end 3 \
    --profile-ranks 0 1 2 3 \
    --weight-decay 0.1 \
    --adam-beta1 0.9 \
    --adam-beta2 0.95 \
    --init-method-std 0.006 \
    --clip-grad 1.0 \
    --fp16 \
    --lr 6.0e-5 \
    --lr-decay-style cosine \
    --min-lr 6.0e-6 \
    --lr-warmup-fraction .001 \
    --lr-decay-iters 430000 \
    --use-mcore-models \
    --transformer-impl local \
    --seed 42 \
    --vocab-file ./gpt2-data/gpt2-vocab.json \
    --merge-file ./gpt2-data/gpt2-merges.txt \
    --data-path ./gpt2-data/gpt2_text_document
python3 nsys2json.py -f cuda_profiling_report.sqlite -o prof.json
python3 parser_prof.py prof.json
python3 parser_nccl.py prof.json
- Output:
-----------------------------Device 0-----------------------------
0.15 min: 44.51 max: 87362.91 avg: 8969.15 ncclKernel_SendRecv_RING_SIMPLE_Sum_int8_t_Stream 32
0.08 min: 215.58 max: 243.39 avg: 227.19 ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_stages_32x1_nt_Stream 7
0.07 min: 76897.51 max: 76897.51 avg: 76897.51 ncclKernel_AllReduce_RING_LL_Sum_half_Stream 24
0.05 min: 45.06 max: 162.56 avg: 107.15 autograd::engine::evaluate_function: LinearWithGradAccumulationAndAsyncCommunicationBackward_Stream 7
0.04 min: 170.98 max: 192.96 avg: 179.05 ampere_fp16_s16816gemm_fp16_128x64_ldg8_f2f_stages_64x3_tn_Stream 7
0.04 min: 1.02 max: 3488.22 avg: 24.10 vectorized_elementwise_kernel_Stream 7
0.02 min: 47.26 max: 117.92 avg: 82.33 LinearWithGradAccumulationAndAsyncCommunication_Stream 7
0.01 min: 5.60 max: 215.04 avg: 208.67 multi_tensor_apply_kernel_Stream 7
0.01 min: 1.54 max: 1483.07 avg: 107.85 unrolled_elementwise_kernel_Stream 7
0.01 min: 81.73 max: 85.25 avg: 82.97 ampere_fp16_s1688gemm_fp16_256x64_ldg8_f2f_nt_Stream 7
0.00 min: 7.65 max: 9.25 avg: 8.29 ln_bwd_kernel_Stream 7
0.00 min: 4.77 max: 8.06 avg: 6.07 reduce_kernel_Stream 7
0.00 min: 5.54 max: 7.01 avg: 5.96 ln_bwd_finalize_kernel_Stream 7
0.00 min: 1.41 max: 4.80 avg: 3.34 elementwise_kernel_Stream 7
0.00 min: 6.56 max: 8.00 avg: 6.97 CatArrayBatchedCopy_Stream 7
0.00 min: 3.10 max: 3.90 avg: 3.46 kernel3_Stream 7
0.00 min: 2.94 max: 3.62 avg: 3.19 autograd::engine::evaluate_function: BmmBackward0_Stream 7
0.00 min: 2.75 max: 3.46 avg: 3.11 ln_fwd_kernel_Stream 7
0.00 min: 2.53 max: 3.23 avg: 2.75 autograd::engine::evaluate_function: BaddbmmBackward0_Stream 7
0.00 min: 2.27 max: 2.98 avg: 2.48 kernel5_Stream 7
0.00 min: 4.61 max: 5.31 avg: 4.89 kernel6_Stream 7
0.00 min: 43.42 max: 556.22 avg: 299.82 ncclKernel_AllReduce_RING_LL_Sum_uint8_t_Stream 20
0.00 min: 3.07 max: 3.84 avg: 3.41 kernel4_Stream 7
0.00 min: 2.88 max: 3.36 avg: 3.06 aten::baddbmm_Stream 7
0.00 min: 2.17 max: 3.26 avg: 2.41 fused_dropout_kernel_vec_Stream 7
0.00 min: 2.46 max: 2.88 avg: 2.62 aten::bmm_Stream 7
0.00 min: 1.66 max: 2.02 avg: 1.76 scaled_upper_triang_masked_softmax_warp_backward_Stream 7
0.00 min: 1.60 max: 1.89 avg: 1.72 scaled_upper_triang_masked_softmax_warp_forward_Stream 7
0.00 min: 7.84 max: 9.02 avg: 8.02 DeviceRadixSortSingleTileKernel_Stream 7
0.00 min: 121.92 max: 121.92 avg: 121.92 ncclKernel_AllGather_RING_LL_Sum_int8_t_Stream 20
0.00 min: 4.80 max: 5.57 avg: 5.13 index_elementwise_kernel_Stream 7
0.00 min: 3.42 max: 5.50 avg: 4.17 indexing_backward_kernel_Stream 7
0.00 min: 3.55 max: 4.35 avg: 3.87 indexSelectLargeIndex_Stream 7
0.00 min: 3.26 max: 3.81 avg: 3.38 embedding_backward_feature_kernel_Stream 7
0.00 min: 1.38 max: 1.63 avg: 1.41 DeviceSelectSweepKernel_Stream 7
0.00 min: 1.38 max: 1.57 avg: 1.39 DeviceReduceSingleTileKernel_Stream 7
0.00 min: 0.96 max: 0.99 avg: 0.97 DeviceCompactInitKernel_Stream 7
0.00 min: 1.02 max: 1.22 avg: 1.06 elementwise_kernel_with_index_Stream 7
0.00 min: 14.50 max: 14.50 avg: 14.50 ncclKernel_AllReduce_RING_LL_Sum_float_Stream 40
0.48
-----------------------------Device 1-----------------------------
0.26 min: 27.17 max: 58101.85 avg: 8208.22 ncclKernel_SendRecv_RING_SIMPLE_Sum_int8_t_Stream 24
0.09 min: 108588.43 max: 108588.43 avg: 108588.43 ncclKernel_AllReduce_RING_LL_Sum_float_Stream 32
0.07 min: 215.30 max: 244.03 avg: 227.12 ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_stages_32x1_nt_Stream 7
0.05 min: 45.02 max: 162.56 avg: 107.00 autograd::engine::evaluate_function: LinearWithGradAccumulationAndAsyncCommunicationBackward_Stream 7
0.04 min: 169.60 max: 191.74 avg: 177.55 ampere_fp16_s16816gemm_fp16_128x64_ldg8_f2f_stages_64x3_tn_Stream 7
0.02 min: 47.46 max: 117.57 avg: 82.42 LinearWithGradAccumulationAndAsyncCommunication_Stream 7
0.01 min: 0.96 max: 3005.40 avg: 8.44 vectorized_elementwise_kernel_Stream 7
0.01 min: 49.95 max: 213.95 avg: 208.16 multi_tensor_apply_kernel_Stream 7
0.01 min: 81.63 max: 84.86 avg: 82.93 ampere_fp16_s1688gemm_fp16_256x64_ldg8_f2f_nt_Stream 7
0.01 min: 1.47 max: 364.32 avg: 94.75 unrolled_elementwise_kernel_Stream 7
0.00 min: 7.62 max: 10.05 avg: 8.40 ln_bwd_kernel_Stream 7
0.00 min: 4.61 max: 7.97 avg: 5.99 reduce_kernel_Stream 7
0.00 min: 5.31 max: 6.88 avg: 5.80 ln_bwd_finalize_kernel_Stream 7
0.00 min: 1.38 max: 4.80 avg: 3.27 elementwise_kernel_Stream 7
0.00 min: 6.40 max: 8.00 avg: 6.91 CatArrayBatchedCopy_Stream 7
0.00 min: 3.04 max: 3.90 avg: 3.36 kernel3_Stream 7
0.00 min: 2.85 max: 3.78 avg: 3.18 autograd::engine::evaluate_function: BmmBackward0_Stream 7
0.00 min: 2.69 max: 3.49 avg: 3.05 ln_fwd_kernel_Stream 7
0.00 min: 2.50 max: 3.23 avg: 2.72 autograd::engine::evaluate_function: BaddbmmBackward0_Stream 7
0.00 min: 2.24 max: 2.98 avg: 2.50 kernel5_Stream 7
0.00 min: 4.64 max: 5.41 avg: 4.88 kernel6_Stream 7
0.00 min: 55.04 max: 491.10 avg: 273.07 ncclKernel_AllReduce_RING_LL_Sum_uint8_t_Stream 20
0.00 min: 3.04 max: 3.81 avg: 3.36 kernel4_Stream 7
0.00 min: 2.81 max: 3.36 avg: 3.01 aten::baddbmm_Stream 7
0.00 min: 2.40 max: 2.88 avg: 2.55 aten::bmm_Stream 7
0.00 min: 2.14 max: 2.62 avg: 2.28 fused_dropout_kernel_vec_Stream 7
0.00 min: 1.63 max: 2.02 avg: 1.73 scaled_upper_triang_masked_softmax_warp_backward_Stream 7
0.00 min: 1.57 max: 1.89 avg: 1.68 scaled_upper_triang_masked_softmax_warp_forward_Stream 7
0.00 min: 164.67 max: 164.67 avg: 164.67 ncclKernel_AllGather_RING_LL_Sum_int8_t_Stream 20
0.00 min: 1.34 max: 1.63 avg: 1.37 DeviceSelectSweepKernel_Stream 7
0.00 min: 1.31 max: 1.44 avg: 1.34 DeviceReduceSingleTileKernel_Stream 7
0.00 min: 0.93 max: 0.99 avg: 0.96 DeviceCompactInitKernel_Stream 7
0.58
-----------------------------Device 2-----------------------------
0.24 min: 36.19 max: 48979.20 avg: 7991.36 ncclKernel_SendRecv_RING_SIMPLE_Sum_int8_t_Stream 24
0.11 min: 132201.84 max: 132201.84 avg: 132201.84 ncclKernel_AllReduce_RING_LL_Sum_float_Stream 32
0.07 min: 215.49 max: 243.68 avg: 227.16 ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_stages_32x1_nt_Stream 7
0.05 min: 45.18 max: 162.91 avg: 107.41 autograd::engine::evaluate_function: LinearWithGradAccumulationAndAsyncCommunicationBackward_Stream 7
0.04 min: 170.53 max: 191.87 avg: 177.73 ampere_fp16_s16816gemm_fp16_128x64_ldg8_f2f_stages_64x3_tn_Stream 7
0.02 min: 47.36 max: 118.69 avg: 82.51 LinearWithGradAccumulationAndAsyncCommunication_Stream 7
0.01 min: 0.99 max: 2996.61 avg: 8.44 vectorized_elementwise_kernel_Stream 7
0.01 min: 50.14 max: 213.79 avg: 208.02 multi_tensor_apply_kernel_Stream 7
0.01 min: 81.86 max: 85.60 avg: 82.98 ampere_fp16_s1688gemm_fp16_256x64_ldg8_f2f_nt_Stream 7
0.01 min: 1.47 max: 364.51 avg: 94.80 unrolled_elementwise_kernel_Stream 7
0.00 min: 7.65 max: 9.57 avg: 8.41 ln_bwd_kernel_Stream 7
0.00 min: 4.61 max: 7.90 avg: 6.01 reduce_kernel_Stream 7
0.00 min: 5.34 max: 7.04 avg: 5.89 ln_bwd_finalize_kernel_Stream 7
0.00 min: 1.41 max: 4.74 avg: 3.28 elementwise_kernel_Stream 7
0.00 min: 6.46 max: 7.94 avg: 6.93 CatArrayBatchedCopy_Stream 7
0.00 min: 3.04 max: 3.87 avg: 3.37 kernel3_Stream 7
0.00 min: 2.82 max: 3.74 avg: 3.21 autograd::engine::evaluate_function: BmmBackward0_Stream 7
0.00 min: 2.65 max: 3.49 avg: 3.07 ln_fwd_kernel_Stream 7
0.00 min: 2.53 max: 3.13 avg: 2.74 autograd::engine::evaluate_function: BaddbmmBackward0_Stream 7
0.00 min: 2.21 max: 3.01 avg: 2.49 kernel5_Stream 7
0.00 min: 4.64 max: 5.54 avg: 4.91 kernel6_Stream 7
0.00 min: 51.23 max: 455.87 avg: 253.55 ncclKernel_AllReduce_RING_LL_Sum_uint8_t_Stream 20
0.00 min: 3.04 max: 3.84 avg: 3.38 kernel4_Stream 7
0.00 min: 2.82 max: 3.33 avg: 3.01 aten::baddbmm_Stream 7
0.00 min: 2.40 max: 2.85 avg: 2.56 aten::bmm_Stream 7
0.00 min: 2.14 max: 2.56 avg: 2.29 fused_dropout_kernel_vec_Stream 7
0.00 min: 1.60 max: 2.02 avg: 1.74 scaled_upper_triang_masked_softmax_warp_backward_Stream 7
0.00 min: 1.57 max: 1.89 avg: 1.68 scaled_upper_triang_masked_softmax_warp_forward_Stream 7
0.00 min: 144.03 max: 144.03 avg: 144.03 ncclKernel_AllGather_RING_LL_Sum_int8_t_Stream 20
0.00 min: 1.38 max: 1.66 avg: 1.39 DeviceSelectSweepKernel_Stream 7
0.00 min: 1.34 max: 1.54 avg: 1.36 DeviceReduceSingleTileKernel_Stream 7
0.00 min: 0.96 max: 1.02 avg: 0.97 DeviceCompactInitKernel_Stream 7
0.58
-----------------------------Device 3-----------------------------
0.13 min: 151115.23 max: 151115.23 avg: 151115.23 ncclKernel_AllReduce_RING_LL_Sum_half_Stream 24
0.09 min: 216.03 max: 984.89 avg: 257.82 ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_stages_32x1_nt_Stream 7
0.06 min: 40.41 max: 66041.74 avg: 3934.81 ncclKernel_SendRecv_RING_SIMPLE_Sum_int8_t_Stream 28
0.05 min: 47.01 max: 162.65 avg: 114.10 autograd::engine::evaluate_function: LinearWithGradAccumulationAndAsyncCommunicationBackward_Stream 7
0.04 min: 181.53 max: 190.81 avg: 185.52 ampere_fp16_s16816gemm_fp16_128x64_ldg8_f2f_stages_64x3_tn_Stream 7
0.02 min: 50.34 max: 116.96 avg: 83.82 LinearWithGradAccumulationAndAsyncCommunication_Stream 7
0.01 min: 19.17 max: 214.46 avg: 208.26 multi_tensor_apply_kernel_Stream 7
0.01 min: 1.02 max: 3480.38 avg: 6.65 vectorized_elementwise_kernel_Stream 7
0.01 min: 1.44 max: 1481.71 avg: 75.28 unrolled_elementwise_kernel_Stream 7
0.01 min: 82.85 max: 86.27 avg: 84.06 ampere_fp16_s1688gemm_fp16_256x64_ldg8_f2f_nt_Stream 7
0.01 min: 515.39 max: 519.80 avg: 517.37 ampere_fp16_s16816gemm_fp16_64x64_ldg8_f2f_stages_64x6_tn_Stream 7
0.01 min: 489.56 max: 490.56 avg: 489.90 ampere_fp16_s16816gemm_fp16_64x64_sliced1x2_ldg8_f2f_stages_64x5_nn_Stream 7
0.00 min: 2.18 max: 19.49 avg: 6.88 reduce_kernel_Stream 7
0.00 min: 8.38 max: 9.82 avg: 8.92 ln_bwd_kernel_Stream 7
0.00 min: 6.27 max: 6.91 avg: 6.54 ln_bwd_finalize_kernel_Stream 7
0.00 min: 1.41 max: 15.43 avg: 4.80 elementwise_kernel_Stream 7
0.00 min: 7.49 max: 7.97 avg: 7.82 CatArrayBatchedCopy_Stream 7
0.00 min: 3.52 max: 3.90 avg: 3.70 kernel3_Stream 7
0.00 min: 3.17 max: 3.97 avg: 3.51 autograd::engine::evaluate_function: BmmBackward0_Stream 7
0.00 min: 3.07 max: 3.49 avg: 3.36 ln_fwd_kernel_Stream 7
0.00 min: 2.85 max: 3.20 avg: 3.02 autograd::engine::evaluate_function: BaddbmmBackward0_Stream 7
0.00 min: 2.66 max: 3.01 avg: 2.80 kernel5_Stream 7
0.00 min: 4.86 max: 5.54 avg: 5.17 kernel6_Stream 7
0.00 min: 3.52 max: 3.90 avg: 3.69 kernel4_Stream 7
0.00 min: 3.26 max: 3.42 avg: 3.31 aten::baddbmm_Stream 7
0.00 min: 2.78 max: 2.88 avg: 2.84 aten::bmm_Stream 7
0.00 min: 2.46 max: 2.62 avg: 2.52 fused_dropout_kernel_vec_Stream 7
0.00 min: 1.89 max: 2.05 avg: 1.93 scaled_upper_triang_masked_softmax_warp_backward_Stream 7
0.00 min: 1.82 max: 1.89 avg: 1.87 scaled_upper_triang_masked_softmax_warp_forward_Stream 7
0.00 min: 3.07 max: 3.55 avg: 3.24 index_elementwise_kernel_Stream 7
0.00 min: 8.96 max: 9.60 avg: 9.21 ln_bwd_tuned_kernel_Stream 7
0.00 min: 6.43 max: 6.88 avg: 6.70 ln_bwd_finalize_tuned_kernel_Stream 7
0.00 min: 94.69 max: 94.69 avg: 94.69 ncclKernel_AllReduce_RING_LL_Sum_float_Stream 44
0.00 min: 4.10 max: 4.16 avg: 4.14 ln_fwd_tuned_kernel_Stream 7
0.00 min: 1.15 max: 1.22 avg: 1.19 elementwise_kernel_with_index_Stream 7
0.00 min: 15.33 max: 19.14 avg: 17.23 ncclKernel_AllReduce_RING_LL_Sum_uint8_t_Stream 20
0.00 min: 1.38 max: 1.66 avg: 1.41 DeviceSelectSweepKernel_Stream 7
0.00 min: 1.38 max: 1.60 avg: 1.39 DeviceReduceSingleTileKernel_Stream 7
0.00 min: 0.96 max: 0.99 avg: 0.97 DeviceCompactInitKernel_Stream 7
0.00 min: 10.18 max: 10.18 avg: 10.18 ncclKernel_AllGather_RING_LL_Sum_int8_t_Stream 20
0.46
-----------------------------Device 0-----------------------------
  206045187.00 nccl:all_reduce
         24.00 nccl:_all_gather_base
-----------------------------Device 1-----------------------------
         24.00 nccl:_all_gather_base
          3.00 nccl:all_reduce
-----------------------------Device 2-----------------------------
         24.00 nccl:_all_gather_base
          3.00 nccl:all_reduce
-----------------------------Device 3-----------------------------
  206045187.00 nccl:all_reduce
         24.00 nccl:_all_gather_base
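The ~206 M-element all_reduce appears only on ranks 0 and 3, which matches the tied word-embedding gradient shared between the first and last pipeline stages (4096 hidden x 50304 padded vocab = 206,045,184, plus a few scalar reductions). A back-of-envelope cross-check against the measured link speed (a rough sketch: fp16 payload is assumed, and the 1.166 GB/s figure is treated as effective bandwidth without separating NCCL's algbw/busbw ring factors):

```python
# Estimate how long the embedding-gradient all_reduce alone takes on this link.
elements = 206_045_187          # from parser_nccl.py output, ranks 0 and 3
bytes_per_elem = 2              # assumed fp16 gradient buffer
payload_gb = elements * bytes_per_elem / 1e9
bw_gbps = 1.166                 # measured by NCCL all_reduce_perf on this box
est_seconds = payload_gb / bw_gbps
print(f"{est_seconds:.2f} s")   # ~0.35 s
```

Even this single collective costs hundreds of milliseconds per step, consistent with the large AllReduce kernel durations (76-151 ms range per call) in the profile above.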
5. PP4 VP2 test
cd /home/Megatron-LM
export MAX_JOBS=8
export NCCL_DEBUG=error
export NCCL_SOCKET_IFNAME=ens8
export NCCL_IB_DISABLE=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NVTE_FLASH_ATTN=0
export NVTE_FUSED_ATTN=0
nsys profile --stats=true -o cuda_profiling_report.nsys-rep -f true -t cuda,nvtx --gpu-metrics-device=0,1,2,3 \
    --capture-range=cudaProfilerApi --capture-range-end=stop \
    torchrun --nproc_per_node 4 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 7000 pretrain_gpt.py \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 4 \
    --num-layers-per-virtual-pipeline-stage 2 \
    --distributed-backend nccl \
    --num-layers 32 \
    --hidden-size 4096 \
    --ffn-hidden-size 11008 \
    --num-attention-heads 32 \
    --seq-length 32 \
    --max-position-embeddings 32 \
    --micro-batch-size 1 \
    --global-batch-size 16 \
    --train-iters 6 \
    --eval-iters 0 \
    --log-interval 3 \
    --profile \
    --profile-step-start 2 \
    --profile-step-end 3 \
    --profile-ranks 0 1 2 3 \
    --weight-decay 0.1 \
    --adam-beta1 0.9 \
    --adam-beta2 0.95 \
    --init-method-std 0.006 \
    --clip-grad 1.0 \
    --fp16 \
    --lr 6.0e-5 \
    --lr-decay-style cosine \
    --min-lr 6.0e-6 \
    --lr-warmup-fraction .001 \
    --lr-decay-iters 430000 \
    --use-mcore-models \
    --transformer-impl local \
    --seed 42 \
    --vocab-file ./gpt2-data/gpt2-vocab.json \
    --merge-file ./gpt2-data/gpt2-merges.txt \
    --data-path ./gpt2-data/gpt2_text_document
python3 nsys2json.py -f cuda_profiling_report.sqlite -o prof.json
python3 parser_prof.py prof.json
python3 parser_nccl.py prof.json
- Output:
-----------------------------Device 0-----------------------------
0.50 min: 23.94 max: 41342.41 avg: 7921.61 ncclKernel_SendRecv_RING_SIMPLE_Sum_int8_t_Stream 36
0.12 min: 23.33 max: 24482.04 avg: 1391.83 ncclKernel_SendRecv_RING_SIMPLE_Sum_int8_t_Stream 32
0.06 min: 215.65 max: 245.12 avg: 228.25 ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_stages_32x1_nt_Stream 7
0.05 min: 77877.81 max: 77877.81 avg: 77877.81 ncclKernel_AllReduce_RING_LL_Sum_half_Stream 24
0.04 min: 45.41 max: 165.09 avg: 108.46 autograd::engine::evaluate_function: LinearWithGradAccumulationAndAsyncCommunicationBackward_Stream 7
0.03 min: 1.02 max: 2480.07 avg: 27.84 vectorized_elementwise_kernel_Stream 7
0.03 min: 171.78 max: 222.56 avg: 182.60 ampere_fp16_s16816gemm_fp16_128x64_ldg8_f2f_stages_64x3_tn_Stream 7
0.01 min: 47.65 max: 117.95 avg: 82.34 LinearWithGradAccumulationAndAsyncCommunication_Stream 7
0.01 min: 5.63 max: 216.38 avg: 209.00 multi_tensor_apply_kernel_Stream 7
0.01 min: 82.27 max: 88.77 avg: 84.56 ampere_fp16_s1688gemm_fp16_256x64_ldg8_f2f_nt_Stream 7
0.01 min: 1.47 max: 1482.88 avg: 107.89 unrolled_elementwise_kernel_Stream 7
0.00 min: 7.58 max: 12.32 avg: 9.81 ln_bwd_kernel_Stream 7
0.00 min: 4.74 max: 10.78 avg: 7.27 reduce_kernel_Stream 7
0.00 min: 5.47 max: 9.70 avg: 7.19 ln_bwd_finalize_kernel_Stream 7
0.00 min: 2.91 max: 6.94 avg: 4.58 autograd::engine::evaluate_function: BmmBackward0_Stream 7
0.00 min: 6.59 max: 10.69 avg: 8.24 CatArrayBatchedCopy_Stream 7
0.00 min: 2.56 max: 5.92 avg: 3.99 autograd::engine::evaluate_function: BaddbmmBackward0_Stream 7
0.00 min: 2.27 max: 5.79 avg: 3.84 kernel5_Stream 7
0.00 min: 1.44 max: 6.34 avg: 3.39 elementwise_kernel_Stream 7
0.00 min: 3.14 max: 6.24 avg: 3.67 kernel3_Stream 7
0.00 min: 138.40 max: 761.95 avg: 450.18 ncclKernel_AllReduce_RING_LL_Sum_uint8_t_Stream 20
0.00 min: 2.75 max: 6.43 avg: 3.44 ln_fwd_kernel_Stream 7
0.00 min: 4.58 max: 8.22 avg: 6.12 kernel6_Stream 7
0.00 min: 3.10 max: 5.82 avg: 3.48 kernel4_Stream 7
0.00 min: 2.85 max: 5.31 avg: 3.10 aten::baddbmm_Stream 7
0.00 min: 1.66 max: 4.77 avg: 3.02 scaled_upper_triang_masked_softmax_warp_backward_Stream 7
0.00 min: 2.18 max: 5.79 avg: 2.47 fused_dropout_kernel_vec_Stream 7
0.00 min: 2.43 max: 5.28 avg: 2.66 aten::bmm_Stream 7
0.00 min: 266.78 max: 266.78 avg: 266.78 ncclKernel_AllGather_RING_LL_Sum_int8_t_Stream 20
0.00 min: 1.60 max: 4.22 avg: 1.76 scaled_upper_triang_masked_softmax_warp_forward_Stream 7
0.00 min: 169.22 max: 169.22 avg: 169.22 ncclKernel_AllReduce_RING_LL_Sum_float_Stream 44
0.00 min: 7.84 max: 9.86 avg: 8.18 DeviceRadixSortSingleTileKernel_Stream 7
0.00 min: 4.70 max: 6.40 avg: 5.23 index_elementwise_kernel_Stream 7
0.00 min: 3.55 max: 6.59 avg: 4.43 indexing_backward_kernel_Stream 7
0.00 min: 3.49 max: 6.21 avg: 3.95 indexSelectLargeIndex_Stream 7
0.00 min: 3.30 max: 5.73 avg: 3.73 embedding_backward_feature_kernel_Stream 7
0.00 min: 1.38 max: 1.66 avg: 1.42 DeviceSelectSweepKernel_Stream 7
0.00 min: 1.38 max: 1.47 avg: 1.39 DeviceReduceSingleTileKernel_Stream 7
0.00 min: 0.96 max: 0.99 avg: 0.98 DeviceCompactInitKernel_Stream 7
0.00 min: 1.02 max: 3.14 avg: 1.36 elementwise_kernel_with_index_Stream 7
0.88
-----------------------------Device 1-----------------------------
0.70 min: 43.81 max: 25483.82 avg: 8250.66 ncclKernel_SendRecv_RING_SIMPLE_Sum_int8_t_Stream 28
0.63 min: 34.30 max: 32534.59 avg: 7372.23 ncclKernel_SendRecv_RING_SIMPLE_Sum_int8_t_Stream 24
0.06 min: 215.30 max: 245.02 avg: 228.29 ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_stages_32x1_nt_Stream 7
0.06 min: 87092.54 max: 87092.54 avg: 87092.54 ncclKernel_AllReduce_RING_LL_Sum_float_Stream 36
0.04 min: 45.09 max: 145.76 avg: 107.02 autograd::engine::evaluate_function: LinearWithGradAccumulationAndAsyncCommunicationBackward_Stream 7
0.03 min: 170.43 max: 206.21 avg: 179.27 ampere_fp16_s16816gemm_fp16_128x64_ldg8_f2f_stages_64x3_tn_Stream 7
0.01 min: 48.19 max: 118.82 avg: 83.24 LinearWithGradAccumulationAndAsyncCommunication_Stream 7
0.01 min: 0.99 max: 1503.93 avg: 10.90 vectorized_elementwise_kernel_Stream 7
0.01 min: 50.14 max: 214.62 avg: 208.26 multi_tensor_apply_kernel_Stream 7
0.01 min: 82.02 max: 86.91 avg: 84.53 ampere_fp16_s1688gemm_fp16_256x64_ldg8_f2f_nt_Stream 7
0.01 min: 1.50 max: 364.54 avg: 94.76 unrolled_elementwise_kernel_Stream 7
0.00 min: 7.71 max: 12.19 avg: 9.94 ln_bwd_kernel_Stream 7
0.00 min: 4.61 max: 9.82 avg: 7.27 reduce_kernel_Stream 7
0.00 min: 5.47 max: 8.80 avg: 7.17 ln_bwd_finalize_kernel_Stream 7
0.00 min: 3.39 max: 6.75 avg: 5.02 kernel3_Stream 7
0.00 min: 1.41 max: 7.20 avg: 4.50 elementwise_kernel_Stream 7
0.00 min: 2.82 max: 6.34 avg: 4.67 autograd::engine::evaluate_function: BmmBackward0_Stream 7
0.00 min: 2.88 max: 6.14 avg: 4.52 ln_fwd_kernel_Stream 7
0.00 min: 6.53 max: 9.60 avg: 8.13 CatArrayBatchedCopy_Stream 7
0.00 min: 2.56 max: 5.50 avg: 4.01 autograd::engine::evaluate_function: BaddbmmBackward0_Stream 7
0.00 min: 2.24 max: 8.06 avg: 3.96 kernel5_Stream 7
0.00 min: 4.64 max: 14.53 avg: 6.31 kernel6_Stream 7
0.00 min: 3.20 max: 6.34 avg: 4.86 kernel4_Stream 7
0.00 min: 2.82 max: 5.79 avg: 4.40 aten::baddbmm_Stream 7
0.00 min: 2.40 max: 5.41 avg: 3.95 aten::bmm_Stream 7
0.00 min: 2.14 max: 5.34 avg: 3.73 fused_dropout_kernel_vec_Stream 7
0.00 min: 1.57 max: 4.51 avg: 3.10 scaled_upper_triang_masked_softmax_warp_forward_Stream 7
0.00 min: 1.63 max: 4.48 avg: 3.06 scaled_upper_triang_masked_softmax_warp_backward_Stream 7
0.00 min: 97.31 max: 146.53 avg: 121.92 ncclKernel_AllReduce_RING_LL_Sum_uint8_t_Stream 20
0.00 min: 1.34 max: 1.63 avg: 1.37 DeviceSelectSweepKernel_Stream 7
0.00 min: 1.31 max: 1.44 avg: 1.34 DeviceReduceSingleTileKernel_Stream 7
0.00 min: 0.96 max: 0.99 avg: 0.96 DeviceCompactInitKernel_Stream 7
0.00 min: 10.40 max: 10.40 avg: 10.40 ncclKernel_AllGather_RING_LL_Sum_int8_t_Stream 20
1.57
-----------------------------Device 2-----------------------------
0.73 min: 35.14 max: 25527.14 avg: 8370.15 ncclKernel_SendRecv_RING_SIMPLE_Sum_int8_t_Stream 28
0.06 min: 40.48 max: 14778.07 avg: 746.27 ncclKernel_SendRecv_RING_SIMPLE_Sum_int8_t_Stream 24
0.06 min: 89855.22 max: 89855.22 avg: 89855.22 ncclKernel_AllReduce_RING_LL_Sum_float_Stream 36
0.06 min: 215.42 max: 243.01 avg: 226.95 ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_stages_32x1_nt_Stream 7
0.04 min: 45.15 max: 158.37 avg: 106.52 autograd::engine::evaluate_function: LinearWithGradAccumulationAndAsyncCommunicationBackward_Stream 7
0.03 min: 171.62 max: 209.63 avg: 180.28 ampere_fp16_s16816gemm_fp16_128x64_ldg8_f2f_stages_64x3_tn_Stream 7
0.01 min: 49.57 max: 118.24 avg: 83.85 LinearWithGradAccumulationAndAsyncCommunication_Stream 7
0.01 min: 0.99 max: 1508.61 avg: 7.58 vectorized_elementwise_kernel_Stream 7
0.01 min: 49.98 max: 213.95 avg: 208.09 multi_tensor_apply_kernel_Stream 7
0.01 min: 82.05 max: 84.99 avg: 82.99 ampere_fp16_s1688gemm_fp16_256x64_ldg8_f2f_nt_Stream 7
0.01 min: 1.50 max: 364.00 avg: 94.79 unrolled_elementwise_kernel_Stream 7
0.00 min: 7.55 max: 9.57 avg: 8.39 ln_bwd_kernel_Stream 7
0.00 min: 4.64 max: 7.68 avg: 5.95 reduce_kernel_Stream 7
0.00 min: 5.47 max: 6.79 avg: 5.88 ln_bwd_finalize_kernel_Stream 7
0.00 min: 3.20 max: 46.18 avg: 5.65 ln_fwd_kernel_Stream 7
0.00 min: 3.68 max: 6.88 avg: 5.50 kernel3_Stream 7
0.00 min: 1.41 max: 7.26 avg: 4.99 elementwise_kernel_Stream 7
0.00 min: 2.24 max: 48.10 avg: 5.29 kernel5_Stream 7
0.00 min: 6.43 max: 7.58 avg: 6.79 CatArrayBatchedCopy_Stream 7
0.00 min: 2.88 max: 4.03 avg: 3.30 autograd::engine::evaluate_function: BmmBackward0_Stream 7
0.00 min: 2.56 max: 3.07 avg: 2.71 autograd::engine::evaluate_function: BaddbmmBackward0_Stream 7
0.00 min: 3.52 max: 6.53 avg: 5.29 kernel4_Stream 7
0.00 min: 134.88 max: 511.68 avg: 323.28 ncclKernel_AllReduce_RING_LL_Sum_uint8_t_Stream 20
0.00 min: 4.70 max: 5.25 avg: 4.88 kernel6_Stream 7
0.00 min: 3.14 max: 6.14 avg: 4.77 aten::baddbmm_Stream 7
0.00 min: 2.69 max: 5.66 avg: 4.47 aten::bmm_Stream 7
0.00 min: 2.37 max: 5.34 avg: 4.24 fused_dropout_kernel_vec_Stream 7
0.00 min: 1.76 max: 4.70 avg: 3.51 scaled_upper_triang_masked_softmax_warp_forward_Stream 7
0.00 min: 1.63 max: 1.98 avg: 1.73 scaled_upper_triang_masked_softmax_warp_backward_Stream 7
0.00 min: 157.50 max: 157.50 avg: 157.50 ncclKernel_AllGather_RING_LL_Sum_int8_t_Stream 20
0.00 min: 1.34 max: 1.66 avg: 1.39 DeviceSelectSweepKernel_Stream 7
0.00 min: 1.34 max: 1.44 avg: 1.36 DeviceReduceSingleTileKernel_Stream 7
0.00 min: 0.96 max: 0.99 avg: 0.97 DeviceCompactInitKernel_Stream 7
1.04
-----------------------------Device 3-----------------------------
0.07 min: 216.03 max: 984.41 avg: 257.61 ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_stages_32x1_nt_Stream 7
0.06 min: 87483.81 max: 87483.81 avg: 87483.81 ncclKernel_AllReduce_RING_LL_Sum_half_Stream 24
0.06 min: 24.16 max: 12072.96 avg: 641.48 ncclKernel_SendRecv_RING_SIMPLE_Sum_int8_t_Stream 28
0.04 min: 46.34 max: 165.25 avg: 111.93 autograd::engine::evaluate_function: LinearWithGradAccumulationAndAsyncCommunicationBackward_Stream 7
0.03 min: 179.68 max: 190.37 avg: 184.96 ampere_fp16_s16816gemm_fp16_128x64_ldg8_f2f_stages_64x3_tn_Stream 7
0.01 min: 49.95 max: 117.12 avg: 83.13 LinearWithGradAccumulationAndAsyncCommunication_Stream 7
0.01 min: 1.02 max: 2477.52 avg: 8.22 vectorized_elementwise_kernel_Stream 7
0.01 min: 18.66 max: 217.44 avg: 210.38 multi_tensor_apply_kernel_Stream 7
0.01 min: 1.60 max: 1491.89 avg: 75.95 unrolled_elementwise_kernel_Stream 7
0.01 min: 82.56 max: 87.90 avg: 83.73 ampere_fp16_s1688gemm_fp16_256x64_ldg8_f2f_nt_Stream 7
0.01 min: 512.35 max: 520.15 avg: 515.84 ampere_fp16_s16816gemm_fp16_64x64_ldg8_f2f_stages_64x6_tn_Stream 7
0.01 min: 477.63 max: 492.89 avg: 481.12 ampere_fp16_s16816gemm_fp16_64x64_sliced1x2_ldg8_f2f_stages_64x5_nn_Stream 7
0.00 min: 24.35 max: 59.77 avg: 37.00 ncclKernel_SendRecv_RING_SIMPLE_Sum_int8_t_Stream 32
0.00 min: 2.24 max: 19.65 avg: 6.79 reduce_kernel_Stream 7
0.00 min: 8.22 max: 11.84 avg: 8.76 ln_bwd_kernel_Stream 7
0.00 min: 1.47 max: 17.44 avg: 4.86 elementwise_kernel_Stream 7
0.00 min: 6.27 max: 9.63 avg: 6.51 ln_bwd_finalize_kernel_Stream 7
0.00 min: 7.33 max: 9.98 avg: 7.62 CatArrayBatchedCopy_Stream 7
0.00 min: 3.42 max: 4.19 avg: 3.72 kernel3_Stream 7
0.00 min: 3.10 max: 6.21 avg: 3.45 autograd::engine::evaluate_function: BmmBackward0_Stream 7
0.00 min: 3.01 max: 3.49 avg: 3.28 ln_fwd_kernel_Stream 7
0.00 min: 2.75 max: 5.92 avg: 2.98 autograd::engine::evaluate_function: BaddbmmBackward0_Stream 7
0.00 min: 2.59 max: 6.17 avg: 2.85 kernel5_Stream 7
0.00 min: 4.74 max: 7.14 avg: 5.06 kernel6_Stream 7
0.00 min: 3.39 max: 3.87 avg: 3.56 kernel4_Stream 7
0.00 min: 3.20 max: 3.39 avg: 3.25 aten::baddbmm_Stream 7
0.00 min: 2.72 max: 2.88 avg: 2.78 aten::bmm_Stream 7
0.00 min: 2.43 max: 2.56 avg: 2.47 fused_dropout_kernel_vec_Stream 7
0.00 min: 1.82 max: 4.58 avg: 1.92 scaled_upper_triang_masked_softmax_warp_backward_Stream 7
0.00 min: 1.79 max: 1.89 avg: 1.82 scaled_upper_triang_masked_softmax_warp_forward_Stream 7
0.00 min: 3.01 max: 6.18 avg: 3.33 index_elementwise_kernel_Stream 7
0.00 min: 8.58 max: 11.84 avg: 9.15 ln_bwd_tuned_kernel_Stream 7
0.00 min: 6.46 max: 9.31 avg: 6.82 ln_bwd_finalize_tuned_kernel_Stream 7
0.00 min: 4.03 max: 4.16 avg: 4.07 ln_fwd_tuned_kernel_Stream 7
0.00 min: 1.12 max: 3.84 avg: 1.25 elementwise_kernel_with_index_Stream 7
0.00 min: 38.14 max: 38.14 avg: 38.14 ncclKernel_AllGather_RING_LL_Sum_int8_t_Stream 20
0.00 min: 15.87 max: 21.60 avg: 18.74 ncclKernel_AllReduce_RING_LL_Sum_uint8_t_Stream 20
0.00 min: 1.44 max: 1.73 avg: 1.46 DeviceSelectSweepKernel_Stream 7
0.00 min: 1.41 max: 1.54 avg: 1.43 DeviceReduceSingleTileKernel_Stream 7
0.00 min: 0.99 max: 1.02 avg: 1.00 DeviceCompactInitKernel_Stream 7
0.00 min: 18.27 max: 18.27 avg: 18.27 ncclKernel_AllReduce_RING_LL_Sum_float_Stream 48
0.33
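Each device block above lists, per kernel and stream: the fraction of profiled time, then min/max/avg duration (μs, matching `munge_time` in the parsing script). As a sketch of how such a summary can be built, here is a small aggregation function; it assumes the kernel records have already been extracted from the nsys sqlite export as `(name, duration_us)` pairs (the sqlite query itself is omitted, and the names are illustrative):

```python
from collections import defaultdict

def summarize_kernels(records, total_us):
    """records: iterable of (kernel_name, duration_us) tuples for one device.
    Returns rows of (fraction_of_total, min, max, avg, name), sorted by
    time share descending, like the per-device tables above."""
    durations = defaultdict(list)
    for name, us in records:
        durations[name].append(us)
    rows = []
    for name, ds in durations.items():
        rows.append((sum(ds) / total_us, min(ds), max(ds), sum(ds) / len(ds), name))
    rows.sort(key=lambda r: r[0], reverse=True)
    return rows

# Toy usage with made-up numbers:
recs = [("ncclKernel_SendRecv", 100.0), ("ncclKernel_SendRecv", 300.0),
        ("gemm", 50.0)]
for frac, lo, hi, avg, name in summarize_kernels(recs, total_us=1000.0):
    print(f"{frac:.2f} min: {lo:.2f} max: {hi:.2f} avg: {avg:.2f} {name}")
```

Because fractions are computed per stream against wall-clock time, the per-device totals (the bare number at the end of each block) can exceed 1.0 when streams overlap.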
-----------------------------Device 0-----------------------------
206045187.00 nccl:all_reduce
14680064.00 nccl:send
14680064.00 nccl:recv
24.00 nccl:_all_gather_base
-----------------------------Device 1-----------------------------
16777216.00 nccl:recv
16777216.00 nccl:send
24.00 nccl:_all_gather_base
3.00 nccl:all_reduce
-----------------------------Device 2-----------------------------
16777216.00 nccl:recv
16777216.00 nccl:send
24.00 nccl:_all_gather_base
3.00 nccl:all_reduce
-----------------------------Device 3-----------------------------
206045187.00 nccl:all_reduce
14680064.00 nccl:recv
14680064.00 nccl:send
24.00 nccl:_all_gather_base
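On paper, interleaving should shrink the pipeline bubble: with p pipeline stages, m microbatches, and v model chunks per stage, the Megatron-LM paper gives a bubble ratio of (p-1)/(v·m), so PP4 VP2 should halve the bubble of plain PP4. A quick back-of-envelope check (the microbatch count of 8 is illustrative, not the value used in the runs above):

```python
def bubble_ratio(p, m, v=1):
    """Pipeline bubble time / ideal compute time for 1F1B with v interleaved
    model chunks per stage: (p - 1) / (v * m)."""
    return (p - 1) / (v * m)

# PP4, 8 microbatches: plain 1F1B vs. interleaved with 2 virtual stages
print(bubble_ratio(4, 8))       # 0.375
print(bubble_ratio(4, 8, v=2))  # 0.1875
```

The catch, visible in the SendRecv rows of the kernel tables above, is that interleaving also multiplies the number of point-to-point activation transfers per microbatch, which is why on this ~1.17 GB/s PCIe link the theoretical bubble savings are eaten by communication time.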
6. NCCL Bandwidth Test
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests/
make -j
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
* Output
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 52931 on vastai-NF5468M6 device 0 [0x01] NVIDIA GeForce RTX 3090
# Rank 1 Group 0 Pid 52931 on vastai-NF5468M6 device 1 [0x25] NVIDIA GeForce RTX 3090
# Rank 2 Group 0 Pid 52931 on vastai-NF5468M6 device 2 [0x41] NVIDIA GeForce RTX 3090
# Rank 3 Group 0 Pid 52931 on vastai-NF5468M6 device 3 [0x61] NVIDIA GeForce RTX 3090
# Rank 4 Group 0 Pid 52931 on vastai-NF5468M6 device 4 [0x81] NVIDIA GeForce RTX 3090
# Rank 5 Group 0 Pid 52931 on vastai-NF5468M6 device 5 [0xa1] NVIDIA GeForce RTX 3090
# Rank 6 Group 0 Pid 52931 on vastai-NF5468M6 device 6 [0xc1] NVIDIA GeForce RTX 3090
# Rank 7 Group 0 Pid 52931 on vastai-NF5468M6 device 7 [0xe1] NVIDIA GeForce RTX 3090
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 40.57 0.00 0.00 0 42.15 0.00 0.00 0
16 4 float sum -1 40.50 0.00 0.00 0 40.10 0.00 0.00 0
32 8 float sum -1 40.46 0.00 0.00 0 40.43 0.00 0.00 0
64 16 float sum -1 40.59 0.00 0.00 0 40.00 0.00 0.00 0
128 32 float sum -1 40.35 0.00 0.01 0 39.67 0.00 0.01 0
256 64 float sum -1 40.98 0.01 0.01 0 40.61 0.01 0.01 0
512 128 float sum -1 40.22 0.01 0.02 0 40.18 0.01 0.02 0
1024 256 float sum -1 40.61 0.03 0.04 0 40.22 0.03 0.04 0
2048 512 float sum -1 40.58 0.05 0.09 0 40.42 0.05 0.09 0
4096 1024 float sum -1 41.27 0.10 0.17 0 40.58 0.10 0.18 0
8192 2048 float sum -1 40.84 0.20 0.35 0 41.24 0.20 0.35 0
16384 4096 float sum -1 41.25 0.40 0.70 0 41.41 0.40 0.69 0
32768 8192 float sum -1 43.61 0.75 1.31 0 43.57 0.75 1.32 0
65536 16384 float sum -1 74.19 0.88 1.55 0 76.64 0.86 1.50 0
131072 32768 float sum -1 130.8 1.00 1.75 0 120.0 1.09 1.91 0
262144 65536 float sum -1 245.8 1.07 1.87 0 240.5 1.09 1.91 0
524288 131072 float sum -1 298.4 1.76 3.08 0 287.6 1.82 3.19 0
1048576 262144 float sum -1 702.4 1.49 2.61 0 724.4 1.45 2.53 0
2097152 524288 float sum -1 1270.3 1.65 2.89 0 1255.3 1.67 2.92 0
4194304 1048576 float sum -1 3495.2 1.20 2.10 0 3502.9 1.20 2.10 0
8388608 2097152 float sum -1 6755.2 1.24 2.17 0 6735.1 1.25 2.18 0
16777216 4194304 float sum -1 13705 1.22 2.14 0 13678 1.23 2.15 0
33554432 8388608 float sum -1 27779 1.21 2.11 0 27761 1.21 2.12 0
67108864 16777216 float sum -1 56601 1.19 2.07 0 56647 1.18 2.07 0
134217728 33554432 float sum -1 117937 1.14 1.99 0 118026 1.14 1.99 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 1.16643
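nccl-tests derives its two bandwidth columns from the measured time: algbw = size / time, and for all_reduce, busbw = algbw × 2(n−1)/n (the ring traffic correction factor). A small sketch reproducing the last out-of-place row of the table:

```python
def allreduce_busbw(size_bytes, time_us, n_ranks):
    """Reproduce nccl-tests' algbw/busbw columns for all_reduce:
    algbw = size / time; busbw = algbw * 2*(n-1)/n."""
    algbw = size_bytes / (time_us * 1e-6) / 1e9  # GB/s
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw

# Last out-of-place row above: 134217728 B in 117937 us across 8 GPUs
algbw, busbw = allreduce_busbw(134217728, 117937, 8)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")  # algbw=1.14 busbw=1.99
```

The ~1.17 GB/s average bus bandwidth reported here is what makes the extra communication introduced by the interleaved schedule so costly on this machine.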