Training - The global_step Parameter in PyTorch Lightning Distributed Training (accumulate_grad_batches)

Follow me on CSDN: https://spike.blog.csdn.net/
Article URL: https://blog.csdn.net/caroline_wendy/article/details/137640653

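In PyTorch Lightning, the accumulate_grad_batches argument of pl.Trainer accumulates gradients over several batches before the optimizer step runs. The backward pass still runs on every batch; only the optimizer update is deferred. This increases the effective batch size without raising the per-batch memory cost. For example, with accumulate_grad_batches=8, gradients from 8 batches are accumulated before the optimizer's .step() method is called.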
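As a minimal sketch of why accumulation behaves like a larger batch, the pure-Python snippet below compares the accumulated gradient of 8 micro-batches with the gradient of one batch holding all 16 samples. The toy model (a single weight w with a mean-squared-error loss), the data, and the helper name mse_grad are assumptions made for illustration. The sketch also assumes each micro-batch contribution is divided by accumulate_grad_batches and that all micro-batches have the same size.

```python
# Pure-Python illustration (no PyTorch needed): accumulating scaled per-batch
# gradients gives the same result as one gradient over the combined batch.

def mse_grad(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

# Hypothetical toy data: 16 samples of y = 3x + 1.
data = [(float(i), 3.0 * i + 1.0) for i in range(1, 17)]
w = 0.5
accumulate_grad_batches = 8
micro_batch_size = 2

micro_batches = [
    data[i:i + micro_batch_size] for i in range(0, len(data), micro_batch_size)
]

# Accumulate: one "backward" per micro-batch, no optimizer step in between.
accumulated_grad = 0.0
for batch in micro_batches[:accumulate_grad_batches]:
    accumulated_grad += mse_grad(w, batch) / accumulate_grad_batches

# Reference: one gradient over all 16 samples at once.
full_batch_grad = mse_grad(w, data)

print(accumulated_grad, full_batch_grad)
```

The two values agree up to floating-point rounding, while only one micro-batch of 2 samples has to be processed at a time.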

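How accumulate_grad_batches relates to global_step:

global_step is incremented each time the optimizer's .step() method is called.
With gradient accumulation, global_step grows more slowly than the number of batches.
Several batches contribute to a single global_step update.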
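The counting rule above can be sketched as a small simulation. This is not Lightning's implementation: the function name count_optimizer_steps is hypothetical, and the sketch ignores how leftover batches at the end of an epoch are handled.

```python
def count_optimizer_steps(num_batches: int, accumulate_grad_batches: int) -> int:
    """Walk over `num_batches` batches and count how often the optimizer would step."""
    global_step = 0
    for batch_idx in range(num_batches):
        # A backward pass would run here for every batch, adding to the stored gradients.
        is_accumulation_boundary = (batch_idx + 1) % accumulate_grad_batches == 0
        if is_accumulation_boundary:
            # optimizer.step() and optimizer.zero_grad() would run here.
            global_step += 1
    return global_step


steps_without_accumulation = count_optimizer_steps(16, 1)
steps_with_accumulation = count_optimizer_steps(16, 8)
print(steps_without_accumulation, steps_with_accumulation)  # prints: 16 2
```

With 16 batches, the step counter reaches 16 without accumulation but only 2 with accumulate_grad_batches=8.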

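For example, with accumulate_grad_batches=8, global_step increases by 1 only once every 8 batches. In multi-GPU training, global_step counts the steps taken on a single GPU; it is not summed across devices. The log looks like this: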

[INFO] [CL] global_step: 0, iter_step: 8
[INFO] [CL] global_step: 1, iter_step: 16
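Because global_step counts steps per GPU, the number of samples behind one global_step also depends on the number of devices. The sketch below assumes a data-parallel strategy in which every process receives a different batch. The function name and the example values (per-device batch size 4, 2 nodes) are hypothetical and are not taken from the run shown in the log. Exactly how a logged batch counter lines up with global_step depends on where in the training loop the log line is emitted.

```python
def samples_per_global_step(
    per_device_batch_size: int,
    accumulate_grad_batches: int,
    devices_per_node: int,
    num_nodes: int,
) -> int:
    """Samples contributing to one optimizer step under data-parallel training."""
    world_size = devices_per_node * num_nodes  # total number of processes / GPUs
    return per_device_batch_size * accumulate_grad_batches * world_size


# Hypothetical setup: batch size 4 per GPU, 8 accumulated batches, 1 GPU on each of 2 nodes.
effective_batch = samples_per_global_step(4, 8, devices_per_node=1, num_nodes=2)
print(effective_batch)  # prints: 64
```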

The pl.Trainer configuration:

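    import pytorch_lightning as pl  # or: import lightning.pytorch as pl

    trainer = pl.Trainer(
        accelerator="gpu",
        # ...
        accumulate_grad_batches=args.accumulate_grad,
        strategy=strategy,  # multi-node, multi-GPU configuration
        num_nodes=args.num_nodes,  # number of nodes
        devices=1,  # number of GPUs per node
    )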

Logging the step to wandb:

log = {'epoch': self.trainer.current_epoch, 'step': self.trainer.global_step}
wandb.log(log)
