LLM Fine-Tuning大模型FT方法

2023-11-29 06:30
文章标签 方法 模型 llm tuning fine ft

本文主要是介绍LLM Fine-Tuning大模型FT方法,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

Here we discuss fine-tuning Llama 2 with a couple of different recipes. We will cover two scenarios here:

1. Parameter Efficient Model Fine-Tuning

This helps make the fine-tuning process more affordable even on 1 consumer grade GPU. These methods enable us to keep the whole model frozen and to just add tiny learnable parameters/ layers into the model. In this way, we just train a very tiny portion of the parameters. The most famous method in this category is LORA, LLaMA Adapter and Prefix-tuning.

These methods will address three aspects:

  1. Cost of full fine-tuning – these methods only train a small set of extra parameters instead of the full model, this makes it possible to run these on consumer GPUs.

  2. Cost of deployment – for each fine-tuned downstream model we need to deploy a separate model; however, when using these methods, only a small set of parameters (few MB instead of several GBs) of the pretrained model can do the job. In this case, for each task we only add these extra parameters on top of the pretrained model so pretrained models can be assumed as backbone and these parameters as heads for the model on different tasks.

  3. Catastrophic forgetting — these methods also help with forgetting the first task that can happen in fine-tunings.

HF PEFT library provides an easy way of using these methods which we make use of here. Please read more here.

2. Full/ Partial Parameter Fine-Tuning

Full parameter fine-tuning has its own advantages, in this method there are multiple strategies that can help:

  1. Keep the pretrained model frozen and only fine-tune the task head for example, the classifier model.

  2. Keep the pretrained model frozen and add a few fully connected layers on the top.

  3. Fine-tuning on all the layers.

You can also keep most of the layers frozen and only fine-tune a few layers. There are many different techniques to choose from to freeze/unfreeze layers based on different criteria.

在这里插入图片描述

In this scenario depending on the model size, you might need to go beyond one GPU, especially if your model does not fit into one GPU for training. In this case Llama 2 7B parameter won’t fit into one gpu. The way you want to think about it is, you would need enough GPU memory to keep model parameters, gradients and optimizer states. Where each of these, depending on the precision you are training, can take up multiple times of your parameter count x precision( depending on if its fp32/ 4 bytes, fp16/2 bytes/ bf16/2 bytes). For example AdamW optimizer keeps 2 parameters for each of your parameters and in many cases these are kept in fp32. This implies that depending on how many layers you are training/ unfreezing your GPU memory can grow beyond one GPU.

FSDP (Fully Sharded Data Parallel)

Pytorch has the FSDP package for training models that do not fit into one GPU. FSDP lets you train a much larger model with the same amount of resources. Prior to FSDP was DDP (Distributed Data Parallel) where each GPU was holding a full replica of the model and would only shard the data. At the end of backward pass it would sync up the gradients.

FSDP extends this idea, not only sharding the data but also model parameters, gradients and optimizer states. This means each GPU will only keep one shard of the model. This will result in huge memory savings that enable us to fit a much larger model into the same number of GPU. As an example in DDP the most you could fit into a GPU with 16GB memory is a model around 700M parameters. So, suppose you had 4 GPUs, in this case even though you access 4 GPUs, you still can’t scale beyond the model size that can fit into one GPU. However with FSDP you can fit a 3B model into 4 GPUs, > 4x larger model.

Please read more on FSDP here & get started with FSDP here.

To boost the performance of fine-tuning with FSDP, we can make use a number of features such as:

  1. Mixed Precision which in FSDP is much more flexible compared to Autocast. It gives user control over setting precision for model parameters, buffers and gradients.

  2. Activation Checkpointing which is a technique to save memory by discarding the intermediate activation in forward pass instead of keeping it in the memory with the cost recomputing them in the backward pass. FSDP Activation checkpointing is shard aware meaning we need to apply it after wrapping the model with FSDP. In our script we are making use of that.

  3. auto_wrap_policy Which is the way to specify how FSDP would partition the model, there is default support for transformer wrapping policy. This allows FSDP to form each FSDP unit ( partition of the model ) based on the transformer class in the model. To identify this layer in the model, you need to look at the layer that wraps both the attention layer and MLP. This helps FSDP have more fine-grained units for communication that help with optimizing the communication cost.

这篇关于LLM Fine-Tuning大模型FT方法的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/431794

相关文章

Python中的魔术方法__new__详解

《Python中的魔术方法__new__详解》:本文主要介绍Python中的魔术方法__new__的使用,具有很好的参考价值,希望对大家有所帮助,如有错误或未考虑完全的地方,望不吝赐教... 目录一、核心意义与机制1.1 构造过程原理1.2 与 __init__ 对比二、核心功能解析2.1 核心能力2.2

Python Transformer 库安装配置及使用方法

《PythonTransformer库安装配置及使用方法》HuggingFaceTransformers是自然语言处理(NLP)领域最流行的开源库之一,支持基于Transformer架构的预训练模... 目录python 中的 Transformer 库及使用方法一、库的概述二、安装与配置三、基础使用:Pi

关于pandas的read_csv方法使用解读

《关于pandas的read_csv方法使用解读》:本文主要介绍关于pandas的read_csv方法使用,具有很好的参考价值,希望对大家有所帮助,如有错误或未考虑完全的地方,望不吝赐教... 目录pandas的read_csv方法解读read_csv中的参数基本参数通用解析参数空值处理相关参数时间处理相关

前端下载文件时如何后端返回的文件流一些常见方法

《前端下载文件时如何后端返回的文件流一些常见方法》:本文主要介绍前端下载文件时如何后端返回的文件流一些常见方法,包括使用Blob和URL.createObjectURL创建下载链接,以及处理带有C... 目录1. 使用 Blob 和 URL.createObjectURL 创建下载链接例子:使用 Blob

Java中使用Hutool进行AES加密解密的方法举例

《Java中使用Hutool进行AES加密解密的方法举例》AES是一种对称加密,所谓对称加密就是加密与解密使用的秘钥是一个,下面:本文主要介绍Java中使用Hutool进行AES加密解密的相关资料... 目录前言一、Hutool简介与引入1.1 Hutool简介1.2 引入Hutool二、AES加密解密基础

Pytest多环境切换的常见方法介绍

《Pytest多环境切换的常见方法介绍》Pytest作为自动化测试的主力框架,如何实现本地、测试、预发、生产环境的灵活切换,本文总结了通过pytest框架实现自由环境切换的几种方法,大家可以根据需要进... 目录1.pytest-base-url2.hooks函数3.yml和fixture结论你是否也遇到过

鸿蒙中Axios数据请求的封装和配置方法

《鸿蒙中Axios数据请求的封装和配置方法》:本文主要介绍鸿蒙中Axios数据请求的封装和配置方法,本文给大家介绍的非常详细,对大家的学习或工作具有一定的参考借鉴价值,需要的朋友参考下吧... 目录1.配置权限 应用级权限和系统级权限2.配置网络请求的代码3.下载在Entry中 下载AxIOS4.封装Htt

Redis实现延迟任务的三种方法详解

《Redis实现延迟任务的三种方法详解》延迟任务(DelayedTask)是指在未来的某个时间点,执行相应的任务,本文为大家整理了三种常见的实现方法,感兴趣的小伙伴可以参考一下... 目录1.前言2.Redis如何实现延迟任务3.代码实现3.1. 过期键通知事件实现3.2. 使用ZSet实现延迟任务3.3

idea maven编译报错Java heap space的解决方法

《ideamaven编译报错Javaheapspace的解决方法》这篇文章主要为大家详细介绍了ideamaven编译报错Javaheapspace的相关解决方法,文中的示例代码讲解详细,感兴趣的... 目录1.增加 Maven 编译的堆内存2. 增加 IntelliJ IDEA 的堆内存3. 优化 Mave

Java String字符串的常用使用方法

《JavaString字符串的常用使用方法》String是JDK提供的一个类,是引用类型,并不是基本的数据类型,String用于字符串操作,在之前学习c语言的时候,对于一些字符串,会初始化字符数组表... 目录一、什么是String二、如何定义一个String1. 用双引号定义2. 通过构造函数定义三、St