DeepSpeed Profiling

2024-06-11 04:12

DeepSpeed's built-in profiler
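The flops profiler is typically enabled through the "flops_profiler" section of the DeepSpeed config. The sketch below is only a minimal illustration; the surrounding config values and the chosen profile_step are assumptions, matched to the step-10 summary shown below.

# Minimal sketch: enable the flops profiler via the DeepSpeed config dict
# passed to deepspeed.initialize(); all values here are illustrative.
ds_config = {
    "train_micro_batch_size_per_gpu": 80,
    "flops_profiler": {
        "enabled": True,
        "profile_step": 10,     # profile at step 10, as in the summary below
        "module_depth": -1,     # profile modules at all depths
        "top_modules": 3,       # how many top modules to list per depth
        "detailed": True,       # also print the detailed per-module profile
        "output_file": None,    # None prints to stdout
    },
}
# model_engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config, ...)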

-------------------------- DeepSpeed Flops Profiler --------------------------
Profile Summary at step 10:
Notations:
data parallel size (dp_size), model parallel size(mp_size),
number of parameters (params), number of multiply-accumulate operations(MACs),
number of floating-point operations (flops), floating-point operations per second (FLOPS),
fwd latency (forward propagation latency), bwd latency (backward propagation latency),
step (weights update latency), iter latency (sum of fwd, bwd and step latency)

world size:                                                   1
data parallel size:                                           1
model parallel size:                                          1
batch size per GPU:                                           80
params per gpu:                                               336.23 M
params of model = params per GPU * mp_size:                   336.23 M
fwd MACs per GPU:                                             3139.93 G
fwd flops per GPU:                                            6279.86 G
fwd flops of model = fwd flops per GPU * mp_size:             6279.86 G
fwd latency:                                                  76.67 ms
bwd latency:                                                  108.02 ms
fwd FLOPS per GPU = fwd flops per GPU / fwd latency:          81.9 TFLOPS
bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency:      116.27 TFLOPS
fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency):   102.0 TFLOPS
step latency:                                                 34.09 us
iter latency:                                                 184.73 ms
samples/second:                                               433.07

----------------------------- Aggregated Profile per GPU -----------------------------
Top modules in terms of params, MACs or fwd latency at different model depths:
depth 0:
    params      - {'BertForPreTrainingPreLN': '336.23 M'}
    MACs        - {'BertForPreTrainingPreLN': '3139.93 GMACs'}
    fwd latency - {'BertForPreTrainingPreLN': '76.39 ms'}
depth 1:
    params      - {'BertModel': '335.15 M', 'BertPreTrainingHeads': '32.34 M'}
    MACs        - {'BertModel': '3092.96 GMACs', 'BertPreTrainingHeads': '46.97 GMACs'}
    fwd latency - {'BertModel': '34.29 ms', 'BertPreTrainingHeads': '3.23 ms'}
depth 2:
    params      - {'BertEncoder': '302.31 M', 'BertLMPredictionHead': '32.34 M'}
    MACs        - {'BertEncoder': '3092.88 GMACs', 'BertLMPredictionHead': '46.97 GMACs'}
    fwd latency - {'BertEncoder': '33.45 ms', 'BertLMPredictionHead': '2.61 ms'}
depth 3:
    params      - {'ModuleList': '302.31 M', 'Embedding': '31.79 M', 'Linear': '31.26 M'}
    MACs        - {'ModuleList': '3092.88 GMACs', 'Linear': '36.23 GMACs'}
    fwd latency - {'ModuleList': '33.11 ms', 'BertPredictionHeadTransform': '1.83 ms'}
depth 4:
    params      - {'BertLayer': '302.31 M', 'LinearActivation': '1.05 M'}
    MACs        - {'BertLayer': '3092.88 GMACs', 'LinearActivation': '10.74 GMACs'}
    fwd latency - {'BertLayer': '33.11 ms', 'LinearActivation': '1.43 ms'}
depth 5:
    params      - {'BertAttention': '100.76 M', 'BertIntermediate': '100.76 M'}
    MACs        - {'BertAttention': '1031.3 GMACs', 'BertIntermediate': '1030.79 GMACs'}
    fwd latency - {'BertAttention': '19.83 ms', 'BertOutput': '4.38 ms'}
depth 6:
    params      - {'LinearActivation': '100.76 M', 'Linear': '100.69 M'}
    MACs        - {'LinearActivation': '1030.79 GMACs', 'Linear': '1030.79 GMACs'}
    fwd latency - {'BertSelfAttention': '16.29 ms', 'LinearActivation': '3.48 ms'}

------------------------------ Detailed Profile per GPU ------------------------------
Each module profile is listed after its name in the following order:
params, percentage of total params, MACs, percentage of total MACs, fwd latency, percentage of total fwd latency, fwd FLOPS

BertForPreTrainingPreLN(336.23 M, 100.00% Params, 3139.93 GMACs, 100.00% MACs, 76.39 ms, 100.00% latency, 82.21 TFLOPS,
  (bert): BertModel(335.15 M, 99.68% Params, 3092.96 GMACs, 98.50% MACs, 34.29 ms, 44.89% latency, 180.4 TFLOPS,
    (embeddings): BertEmbeddings(...)
    (encoder): BertEncoder(302.31 M, 89.91% Params, 3092.88 GMACs, 98.50% MACs, 33.45 ms, 43.79% latency, 184.93 TFLOPS,
      (FinalLayerNorm): FusedLayerNorm(...)
      (layer): ModuleList(302.31 M, 89.91% Params, 3092.88 GMACs, 98.50% MACs, 33.11 ms, 43.35% latency, 186.8 TFLOPS,
        (0): BertLayer(12.6 M, 3.75% Params, 128.87 GMACs, 4.10% MACs, 1.29 ms, 1.69% latency, 199.49 TFLOPS,
          (attention): BertAttention(4.2 M, 1.25% Params, 42.97 GMACs, 1.37% MACs, 833.75 us, 1.09% latency, 103.08 TFLOPS,
            (self): BertSelfAttention(3.15 M, 0.94% Params, 32.23 GMACs, 1.03% MACs, 699.04 us, 0.92% latency, 92.22 TFLOPS,
              (query): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 182.39 us, 0.24% latency, 117.74 TFLOPS, ...)
              (key): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 57.22 us, 0.07% latency, 375.3 TFLOPS, ...)
              (value): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 53.17 us, 0.07% latency, 403.91 TFLOPS, ...)
              (dropout): Dropout(...)
              (softmax): Softmax(...)
            )
            (output): BertSelfOutput(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 114.68 us, 0.15% latency, 187.26 TFLOPS,
              (dense): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 64.13 us, 0.08% latency, 334.84 TFLOPS, ...)
              (dropout): Dropout(...)
            )
          )
          (PreAttentionLayerNorm): FusedLayerNorm(...)
          (PostAttentionLayerNorm): FusedLayerNorm(...)
          (intermediate): BertIntermediate(4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 186.68 us, 0.24% latency, 460.14 TFLOPS,
            (dense_act): LinearActivation(4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 175.0 us, 0.23% latency, 490.86 TFLOPS, ...)
          )
          (output): BertOutput(4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 116.83 us, 0.15% latency, 735.28 TFLOPS,
            (dense): Linear(4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 65.57 us, 0.09% latency, 1310.14 TFLOPS, ...)
            (dropout): Dropout(...)
          )
        )
        ...
        (23): BertLayer(...)
      )
    )
    (pooler): BertPooler(...)
  )
  (cls): BertPreTrainingHeads(...)
)
------------------------------------------------------------------------------

Notes:

Lowercase "flops" refers to the amount of computation (the floating-point operation count); uppercase "FLOPS" refers to flops per second.

The backward pass costs roughly 2x the flops of the forward pass.

Multiply-accumulate operations (MACs) * 2 = flops, since each MAC counts as one multiply plus one add; e.g. an (m x k) by (k x n) matmul performs m*k*n MACs, i.e. 2*m*k*n flops.

The measured FLOPS can be compared with the GPU's theoretical peak to see how large the gap is.

By inspecting the params, MACs, latency, and throughput of higher-level modules, you can find modules made up of many "small" computations and check whether they can be optimized into a single "large" computation (kernel fusion).

Difference from the PyTorch profiler: DeepSpeed profiles from the Module perspective; PyTorch profiles from the operator perspective.

For the inference phase, use:

get_model_profile
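A minimal sketch of how get_model_profile can be called (the toy model and the input shape are placeholders; the keyword arguments follow the DeepSpeed flops profiler API):

import torch
from deepspeed.profiling.flops_profiler import get_model_profile

# Placeholder model; replace with the model to be profiled.
model = torch.nn.Linear(1024, 1024)

# get_model_profile builds a dummy input from input_shape, runs forward passes,
# and returns the totals (as strings when as_string=True).
flops, macs, params = get_model_profile(
    model=model,
    input_shape=(80, 1024),   # placeholder batch size and feature dim
    print_profile=True,       # print the per-module table shown above
    detailed=True,            # include the detailed per-module profile
    warm_up=1,                # warm-up passes before measuring
    as_string=True,
)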

For the training phase, use:

FlopsProfiler
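A minimal sketch of the training-phase pattern (model, data_loader, optimizer, and the loss computation are placeholders; the start/stop/print/end sequence follows the DeepSpeed tutorial):

from deepspeed.profiling.flops_profiler import FlopsProfiler

prof = FlopsProfiler(model)    # model is a placeholder nn.Module
profile_step = 10              # profile at this training step

for step, batch in enumerate(data_loader):
    if step == profile_step:
        prof.start_profile()   # attach hooks and start counting

    loss = model(batch)        # forward pass (placeholder loss)

    if step == profile_step:
        prof.stop_profile()
        flops = prof.get_total_flops()
        macs = prof.get_total_macs()
        params = prof.get_total_params()
        prof.print_model_profile(profile_step=profile_step)
        prof.end_profile()     # remove hooks and free profiler state

    loss.backward()            # backward pass
    optimizer.step()           # weight update
    optimizer.zero_grad()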

PyTorch Profiler:

torch.profiler.schedule(wait=5, warmup=2, active=6, repeat=2)

(Skip 5 steps, then run the profiler for 2 steps and discard the results (warmup), then run the profiler for 6 steps and keep the results (active)); repeat this cycle 2 times. The purpose is to skip the slow warmup phase at the very beginning.
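A sketch of how such a schedule plugs into torch.profiler.profile (the data loader, train_step, and the trace directory are placeholders); prof.step() must be called once per iteration so the schedule can advance:

from torch.profiler import profile, schedule, ProfilerActivity, tensorboard_trace_handler

my_schedule = schedule(wait=5, warmup=2, active=6, repeat=2)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=my_schedule,
    on_trace_ready=tensorboard_trace_handler("./log/profiler"),  # dump a trace after each active window
) as prof:
    for step, batch in enumerate(data_loader):   # data_loader / train_step are placeholders
        train_step(batch)
        prof.step()   # tell the profiler that one step has finished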

Options:

with_stack=True: records the call stack; adds some overhead.

activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]: CPU records PyTorch operators, CUDA records CUDA kernels.

profile_memory=True: records memory allocate/release events on both CPU and GPU, distinguishing Allocated from Reserved memory.
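A short sketch combining the options above (model and inputs are placeholders); sorting the summary table by self CUDA memory usage is one way to surface the largest allocators:

from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True,       # record call stacks (adds some overhead)
    profile_memory=True,   # record memory allocate/release events on CPU and GPU
) as prof:
    model(inputs)          # model / inputs are placeholders

print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))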

Custom scope (record_function):

from torch.profiler import profile, record_function, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    with record_function("model_forward"):
        model_engine(inputs)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
