LLM Fine-Tuning: Fine-Tuning Methods for Large Models

2023-11-29 06:30

This article introduces fine-tuning (FT) methods for large language models (LLMs); we hope it is a useful reference for developers working on these problems.

Here we discuss fine-tuning Llama 2 with a couple of different recipes, covering two scenarios:

1. Parameter-Efficient Model Fine-Tuning

This helps make the fine-tuning process affordable even on a single consumer-grade GPU. These methods keep the whole model frozen and add only tiny learnable parameters/layers into the model, so we train a very small fraction of the total parameters. The best-known methods in this category are LoRA, LLaMA-Adapter, and prefix tuning.

These methods will address three aspects:

  1. Cost of full fine-tuning – these methods train only a small set of extra parameters instead of the full model, which makes it possible to run them on consumer GPUs.

  2. Cost of deployment – normally each fine-tuned downstream task requires deploying a separate model; with these methods, only a small set of extra parameters (a few MB instead of several GB) needs to be stored per task. The pretrained model can be treated as a shared backbone, with these extra parameters acting as task-specific heads.

  3. Catastrophic forgetting – because the pretrained weights stay frozen, these methods also help prevent the forgetting of previously learned capabilities that can happen during full fine-tuning.

The Hugging Face PEFT library provides an easy way to apply these methods, and we make use of it here; see the PEFT documentation for details.
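As a minimal sketch, here is how a causal LM could be wrapped with LoRA using the HF PEFT library. The model name and the LoRA hyperparameters (`r`, `lora_alpha`, `target_modules`) are illustrative assumptions, not values prescribed by this article:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Illustrative model name and hyperparameters (assumptions; tune for your task).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                  # rank of the low-rank update matrices
    lora_alpha=32,        # scaling factor for the LoRA updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in Llama-style models
)

# The base model stays frozen; only the tiny LoRA matrices are trainable.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

Since only the LoRA matrices receive gradients, saving a checkpoint stores just those few MB of extra parameters, which is exactly the deployment saving described above.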

2. Full/Partial Parameter Fine-Tuning

Full parameter fine-tuning has its own advantages; within this approach there are multiple strategies to choose from:

  1. Keep the pretrained model frozen and fine-tune only the task head, for example the classification head.

  2. Keep the pretrained model frozen and add a few fully connected layers on top.

  3. Fine-tune all the layers.

You can also keep most of the layers frozen and fine-tune only a few of them. There are many techniques for choosing which layers to freeze or unfreeze based on different criteria; a minimal sketch of partial unfreezing follows.
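As a minimal sketch (assuming a Hugging Face LLaMA-style module layout, where the decoder blocks live under `model.model.layers`; the model name is illustrative), partial freezing can be done by toggling `requires_grad`:

```python
from transformers import AutoModelForCausalLM

# Illustrative model; module paths assume a LLaMA-style layout (assumption).
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Freeze everything first.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze only the last two decoder blocks and the output head.
for block in model.model.layers[-2:]:
    for param in block.parameters():
        param.requires_grad = True
for param in model.lm_head.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```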


In this scenario, depending on the model size, you might need to go beyond one GPU, especially if your model does not fit into a single GPU for training; a Llama 2 7B model, for example, will not fit into one consumer GPU. The way to think about it is that you need enough GPU memory to hold the model parameters, the gradients, and the optimizer states. Each of these takes a multiple of your parameter count times the bytes per value for the precision you train in (fp32 is 4 bytes; fp16 and bf16 are 2 bytes). For example, the AdamW optimizer keeps two extra states for each parameter, and in many cases these are kept in fp32. This means that, depending on how many layers you train/unfreeze, the required GPU memory can grow beyond a single GPU; a rough back-of-the-envelope calculation is sketched below.
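As a rough sketch of that arithmetic (assuming bf16 weights and gradients plus two fp32 AdamW states per parameter; activations, buffers, and fragmentation are ignored):

```python
# Rough memory estimate for fully fine-tuning a 7B-parameter model.
# Assumptions: bf16 weights and gradients, AdamW with two fp32 states
# per parameter; activations and buffers are not counted here.
params = 7e9

weights_bytes = params * 2          # bf16: 2 bytes per parameter
grads_bytes = params * 2            # bf16 gradients
optimizer_bytes = params * 2 * 4    # two fp32 AdamW states per parameter

total_gib = (weights_bytes + grads_bytes + optimizer_bytes) / 2**30
print(f"~{total_gib:.0f} GiB before activations")  # roughly 78 GiB
```

Even under these optimistic assumptions the total is far beyond any single consumer GPU, which is why the sharding approach below matters.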

FSDP (Fully Sharded Data Parallel)

PyTorch provides the FSDP package for training models that do not fit into one GPU. FSDP lets you train a much larger model with the same amount of resources. Prior to FSDP there was DDP (Distributed Data Parallel), in which each GPU holds a full replica of the model and only the data is sharded; at the end of the backward pass, the gradients are synced across GPUs.

FSDP extends this idea by sharding not only the data but also the model parameters, gradients, and optimizer states, so each GPU keeps only one shard of the model. This results in huge memory savings that let us fit a much larger model into the same number of GPUs. As an example, with DDP the largest model you could fit into a GPU with 16 GB of memory is around 700M parameters, and even with 4 such GPUs you still cannot scale beyond the model size that fits into one GPU. With FSDP, however, you can fit a 3B model into those 4 GPUs, a more than 4x larger model.

For more detail, see the PyTorch FSDP documentation and the FSDP getting-started tutorial.
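As a minimal sketch of the basic wrapping step (expected to be launched with `torchrun`; `build_model` is a hypothetical stand-in for your own model constructor):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Expected launch: torchrun --nproc_per_node=<num_gpus> train.py
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = build_model()  # hypothetical stand-in for your model constructor
# After wrapping, parameters, gradients, and optimizer states are sharded
# across ranks; each GPU holds only its shard.
model = FSDP(model, device_id=local_rank)

# Construct the optimizer *after* wrapping so it sees the sharded parameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```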

To boost the performance of fine-tuning with FSDP, we can make use of a number of features, illustrated in the sketch after this list:

  1. Mixed Precision, which in FSDP is much more flexible than autocast: it gives the user control over the precision of model parameters, buffers, and gradients.

  2. Activation Checkpointing, a technique that saves memory by discarding intermediate activations in the forward pass instead of keeping them in memory, at the cost of recomputing them in the backward pass. FSDP activation checkpointing is shard-aware, meaning we need to apply it after wrapping the model with FSDP. Our script makes use of this.

  3. auto_wrap_policy, which specifies how FSDP should partition the model; there is default support for a transformer wrapping policy. This lets FSDP form each FSDP unit (a partition of the model) around the transformer block class of the model. To identify this class, look for the layer that wraps both the attention layer and the MLP. These more fine-grained units give FSDP more opportunities to overlap and optimize communication, which helps reduce communication cost.
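As a sketch combining all three features (assuming a Llama model from transformers, whose transformer block class is `LlamaDecoderLayer`; the bf16 settings are illustrative choices, not requirements):

```python
import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# 1. Mixed precision: control the precision of parameters, gradient
#    reduction, and buffers separately (bf16 everywhere is an assumption).
bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

# 3. auto_wrap_policy: form one FSDP unit per transformer block.
#    LlamaDecoderLayer is the class that wraps both attention and the MLP.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},
)

model = FSDP(
    model,  # a Llama model built earlier (see the previous sketch)
    mixed_precision=bf16_policy,
    auto_wrap_policy=wrap_policy,
    device_id=torch.cuda.current_device(),
)

# 2. Activation checkpointing: applied *after* FSDP wrapping because it is
#    shard-aware; each block's activations are recomputed in the backward pass.
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=functools.partial(
        checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT
    ),
    check_fn=lambda module: isinstance(module, LlamaDecoderLayer),
)
```

Wrapping at the decoder-block granularity is what lets FSDP gather only one block's parameters at a time during forward/backward, which is the communication-cost optimization the list item above describes.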
