LLM Fine-Tuning: Methods for Fine-Tuning Large Models

2023-11-29 06:30
Tags: methods, models, llm, tuning, fine, ft

This article introduces methods for fine-tuning large language models (LLM fine-tuning); we hope it serves as a useful reference for developers working on these problems.

Here we discuss fine-tuning Llama 2 with a couple of different recipes, covering two scenarios:

1. Parameter Efficient Model Fine-Tuning

This helps make the fine-tuning process more affordable, even on a single consumer-grade GPU. These methods keep the whole model frozen and add only tiny learnable parameters/layers to it, so we train just a very small portion of the parameters. The best-known methods in this category are LoRA, LLaMA-Adapter, and prefix tuning.
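
As a rough illustration of the idea behind LoRA, here is a minimal, self-contained sketch (not the implementation of any particular library): the pretrained linear layer is frozen, and a trainable low-rank update scaled by alpha/r is added on top. The layer size (4096) is only an example.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: frozen base weight plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep the pretrained weight frozen
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # y = base(x) + scaling * x A^T B^T  (only A and B receive gradients)
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")    # 2 * r * 4096 = 65,536 vs ~16.8M in the base layer
```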

These methods will address three aspects:

  1. Cost of full fine-tuning – these methods train only a small set of extra parameters instead of the full model, which makes it possible to run them on consumer GPUs.

  2. Cost of deployment – normally, each fine-tuned downstream model must be deployed as a separate full model; with these methods, only a small set of extra parameters (a few MB instead of several GB) needs to be stored per task. The pretrained model acts as a shared backbone, and these extra parameters act as task-specific heads on top of it.

  3. Catastrophic forgetting – these methods also help mitigate the forgetting of previously learned capabilities that can happen during fine-tuning.

The Hugging Face PEFT library provides an easy way to use these methods, and we make use of it here; see the PEFT documentation for more details.
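
A typical way to apply LoRA with the PEFT library looks roughly like the following sketch; the model name, target modules, and hyperparameters are illustrative and should be adapted to your setup (the Llama 2 checkpoint also requires access approval).

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

model_name = "meta-llama/Llama-2-7b-hf"   # illustrative; requires access to the weights
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # rank of the low-rank update
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections in Llama-style models
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # typically a fraction of a percent of all parameters
```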

2. Full/ Partial Parameter Fine-Tuning

Full-parameter fine-tuning has its own advantages. Within this approach, there are multiple strategies that can help:

  1. Keep the pretrained model frozen and fine-tune only the task head, for example the classification head.

  2. Keep the pretrained model frozen and add a few fully connected layers on top.

  3. Fine-tune all the layers.

You can also keep most of the layers frozen and fine-tune only a few. There are many techniques for choosing which layers to freeze or unfreeze, based on different criteria.
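
A minimal sketch of this freeze/unfreeze pattern, using a generic Hugging Face classification model as a stand-in (the model name and layer attribute paths are illustrative and differ between architectures):

```python
from transformers import AutoModelForSequenceClassification

# Illustrative checkpoint; any sequence-classification model works the same way.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Strategy 1: freeze the pretrained backbone and train only the classification head.
for param in model.base_model.parameters():
    param.requires_grad = False

# Variant: additionally unfreeze just the last two encoder layers.
for layer in model.base_model.encoder.layer[-2:]:
    for param in layer.parameters():
        param.requires_grad = True

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable parameter tensors")
```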


In this scenario, depending on the model size, you might need to go beyond one GPU, especially if the model does not fit into a single GPU for training; the Llama 2 7B model, for example, will not. The way to think about it is that you need enough GPU memory to hold the model parameters, the gradients, and the optimizer states. Each of these takes a multiple of your parameter count times the bytes per value of the precision you train in (fp32 is 4 bytes, fp16 and bf16 are 2 bytes). For example, the AdamW optimizer keeps two states per parameter, and in many cases these are kept in fp32. This means that, depending on how many layers you are training/unfreezing, the required GPU memory can grow beyond a single GPU.
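
A back-of-the-envelope version of that memory budget, under the assumptions stated in the comments (fp16/bf16 weights and gradients, fp32 AdamW states, activations ignored):

```python
def training_memory_gb(n_params: float,
                       param_bytes: int = 2,        # fp16/bf16 weights
                       grad_bytes: int = 2,         # fp16/bf16 gradients
                       optim_bytes: int = 8) -> float:  # AdamW: two fp32 states per parameter
    """Rough lower bound on GPU memory for full fine-tuning; activations are ignored."""
    return n_params * (param_bytes + grad_bytes + optim_bytes) / 1024**3

# Llama 2 7B: ~7e9 parameters -> roughly 78 GB before activations,
# which is why full fine-tuning does not fit on a single consumer GPU.
print(f"{training_memory_gb(7e9):.0f} GB")
```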

FSDP (Fully Sharded Data Parallel)

PyTorch provides the FSDP package for training models that do not fit into one GPU. FSDP lets you train a much larger model with the same amount of resources. Before FSDP there was DDP (Distributed Data Parallel), in which each GPU holds a full replica of the model and only the data is sharded; at the end of the backward pass the gradients are synchronized across GPUs.

FSDP extends this idea by sharding not only the data but also the model parameters, gradients, and optimizer states, so each GPU keeps only one shard of the model. This yields huge memory savings and lets us fit a much larger model into the same number of GPUs. As an example, with DDP the largest model you could fit into a GPU with 16 GB of memory is around 700M parameters; even with 4 such GPUs, you still cannot scale beyond the model size that fits into one GPU. With FSDP, however, you can fit a roughly 3B-parameter model onto those 4 GPUs, more than 4x larger.
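
A minimal single-node sketch of wrapping a model with FSDP (this assumes the script is launched with torchrun and is not the full llama-recipes training script; the model name is illustrative):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

# torchrun sets the rank/world-size environment variables used here.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # illustrative
model = FSDP(model, device_id=torch.cuda.current_device())

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
# ...regular training loop; FSDP shards parameters, gradients, and optimizer states.
```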

Please read more about FSDP, and get started with it, in the PyTorch documentation.

To boost the performance of fine-tuning with FSDP, we can make use of a number of features, such as:

  1. Mixed precision, which in FSDP is much more flexible than autocast. It gives the user control over setting the precision of model parameters, buffers, and gradients separately.

  2. Activation checkpointing, a technique that saves memory by discarding intermediate activations in the forward pass instead of keeping them in memory, at the cost of recomputing them in the backward pass. FSDP activation checkpointing is shard-aware, meaning it must be applied after wrapping the model with FSDP; our script does exactly that.

  3. auto_wrap_policy, which is the way to specify how FSDP should partition the model; there is default support for a transformer wrapping policy. This lets FSDP form each FSDP unit (partition of the model) based on the transformer block class in the model. To identify this class, look for the layer that wraps both the attention layer and the MLP. Having these finer-grained units helps FSDP optimize its communication cost. (See the sketch after this list for how these three features fit together.)
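
The sketch below shows, in rough outline, how these three features might be combined when wrapping a Llama-style model; it assumes the distributed setup from the earlier sketch and mirrors the general approach of the llama-recipes scripts rather than reproducing them. The precision policy and layer class are illustrative.

```python
import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing, checkpoint_wrapper, CheckpointImpl,
)
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # illustrative

# 1. Mixed precision: bf16 for parameters, gradient reduction, and buffers.
bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

# 3. auto_wrap_policy: one FSDP unit per transformer block (the layer that
#    wraps both the attention and the MLP; LlamaDecoderLayer for Llama 2).
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},
)

model = FSDP(
    model,
    mixed_precision=bf16_policy,
    auto_wrap_policy=wrap_policy,
    device_id=torch.cuda.current_device(),
)

# 2. Activation checkpointing, applied *after* FSDP wrapping so it is shard-aware.
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=functools.partial(
        checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT
    ),
    check_fn=lambda module: isinstance(module, LlamaDecoderLayer),
)
```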

That concludes this introduction to LLM fine-tuning methods; we hope it is helpful to developers.


