Why L1 norm for sparse models?

2024-06-14 12:32
Tags: l1 norm, sparse models


Explanation 1

Consider the vector $\vec{x} = (1, \varepsilon) \in \mathbb{R}^2$, where $\varepsilon > 0$ is small. The $l_1$ and $l_2$ norms of $\vec{x}$, respectively, are given by

$$\|\vec{x}\|_1 = 1 + \varepsilon, \qquad \|\vec{x}\|_2^2 = 1 + \varepsilon^2$$

Now say that, as part of some regularization procedure, we are going to reduce the magnitude of one of the elements of $\vec{x}$ by $\delta \leq \varepsilon$. If we change $x_1$ to $1 - \delta$, the resulting norms are

$$\|\vec{x} - (\delta, 0)\|_1 = 1 - \delta + \varepsilon, \qquad \|\vec{x} - (\delta, 0)\|_2^2 = 1 - 2\delta + \delta^2 + \varepsilon^2$$

On the other hand, reducing $x_2$ by $\delta$ gives norms

$$\|\vec{x} - (0, \delta)\|_1 = 1 - \delta + \varepsilon, \qquad \|\vec{x} - (0, \delta)\|_2^2 = 1 - 2\varepsilon\delta + \delta^2 + \varepsilon^2$$

The thing to notice here is that, for an $l_2$ penalty, regularizing the larger term $x_1$ results in a much greater reduction in norm than doing so to the smaller term $x_2 \approx 0$. For the $l_1$ penalty, however, the reduction is the same. Thus, when penalizing a model using the $l_2$ norm, it is highly unlikely that anything will ever be set to zero, since the reduction in $l_2$ norm going from $\varepsilon$ to $0$ is almost nonexistent when $\varepsilon$ is small. On the other hand, the reduction in $l_1$ norm is always equal to $\delta$, regardless of the quantity being penalized.

Another way to think of it: it's not so much that $l_1$ penalties encourage sparsity, but that $l_2$ penalties in some sense discourage sparsity by yielding diminishing returns as elements are moved closer to zero.
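As a quick numeric check of the argument above, here is a minimal sketch (NumPy and the particular values $\varepsilon = 0.01$, $\delta = 0.005$ are my assumptions, not part of the original explanation):

```python
import numpy as np

# Assumed small values for illustration (delta <= epsilon, as above).
eps, delta = 0.01, 0.005
x = np.array([1.0, eps])

def l1(v):
    return np.abs(v).sum()

def l2_sq(v):
    return (v ** 2).sum()

# Reduce the large component x1, or the small component x2, by delta.
x_minus_d1 = x - np.array([delta, 0.0])
x_minus_d2 = x - np.array([0.0, delta])

print("l1 drop when shrinking x1:  ", l1(x) - l1(x_minus_d1))       # = delta
print("l1 drop when shrinking x2:  ", l1(x) - l1(x_minus_d2))       # = delta (same)
print("l2^2 drop when shrinking x1:", l2_sq(x) - l2_sq(x_minus_d1)) # ~ 2*delta (large)
print("l2^2 drop when shrinking x2:", l2_sq(x) - l2_sq(x_minus_d2)) # ~ 2*eps*delta (tiny)
```

Shrinking either component buys the same $\delta$ of $l_1$ norm, while the $l_2$ payoff for shrinking the already-small component is roughly $2\varepsilon\delta$, i.e. nearly nothing.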

Explanation 2

By a sparse model, we mean a model in which many of the weights are 0. Let us therefore reason about how L1-regularization is more likely to create zero weights.

Consider a model consisting of the weights $(w_1, w_2, \dots, w_m)$.

With L1-regularization, you penalize the model by a loss function $L_1(w) = \sum_i |w_i|$.

With L2-regularization, you penalize the model by a loss function $L_2(w) = \frac{1}{2}\sum_i w_i^2$.

If using gradient descent, you will iteratively make the weights change in the opposite direction of the gradient with a step size $\eta$. Let us look at the gradients:

$$\frac{dL_1(w)}{dw} = \mathrm{sign}(w), \quad \text{where } \mathrm{sign}(w) = \left(\frac{w_1}{|w_1|}, \frac{w_2}{|w_2|}, \dots, \frac{w_m}{|w_m|}\right)$$

$$\frac{dL_2(w)}{dw} = w$$
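For concreteness, the penalties and their gradients can be written in a few lines of NumPy (a sketch; the function names and example weights are mine, not from the original answer):

```python
import numpy as np

def l1_penalty(w):
    # L1(w) = sum_i |w_i|
    return np.abs(w).sum()

def l2_penalty(w):
    # L2(w) = 1/2 * sum_i w_i^2
    return 0.5 * (w ** 2).sum()

def grad_l1(w):
    # dL1/dw = sign(w), componentwise w_i / |w_i|; np.sign returns 0 at w_i = 0,
    # where the derivative is not defined.
    return np.sign(w)

def grad_l2(w):
    # dL2/dw = w
    return np.array(w, dtype=float)

w = np.array([5.0, -0.2, 0.0])
print(grad_l1(w))  # [ 1. -1.  0.]
print(grad_l2(w))  # [ 5.  -0.2  0. ]
```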

If we plot the loss function and its derivative for a model consisting of just a single parameter, it looks like this for L1:

[Figure: the L1 penalty |w1| and its gradient for a single weight]

And like this for L2:

[Figure: the L2 penalty ½·w1² and its gradient for a single weight]

Notice that for $L_1$, the gradient is either $1$ or $-1$, except when $w_1 = 0$. That means that L1-regularization will move any weight towards 0 with the same step size, regardless of the weight's value. In contrast, you can see that the $L_2$ gradient decreases linearly towards 0 as the weight goes towards 0. Therefore, L2-regularization will also move any weight towards 0, but it will take smaller and smaller steps as the weight approaches 0.

Try to imagine that you start with a model with $w_1 = 5$ and use $\eta = \frac{1}{2}$. In the following picture, you can see how gradient descent using L1-regularization makes 10 updates $w_1 := w_1 - \eta \cdot \frac{dL_1(w)}{dw} = w_1 - 0.5 \cdot 1$, until reaching a model with $w_1 = 0$:

[Figure: the weight w1 under L1-regularized updates, reaching 0 after 10 steps]

In contrast, with L2-regularization where $\eta = \frac{1}{2}$, the gradient is $w_1$, so every step moves the weight only halfway towards 0. That is, we make the update $w_1 := w_1 - \eta \cdot \frac{dL_2(w)}{dw} = w_1 - 0.5 \cdot w_1$.

Therefore, the model never reaches a weight of 0, regardless of how many steps we take:

[Figure: the weight w1 under L2-regularized updates, halving each step but never reaching 0]

Note that L2-regularization can make a weight reach zero if the step size $\eta$ is high enough that a single step reaches zero or overshoots it. In practice, the loss function will also contain a term measuring the error of the model with respect to the given weights, and that term will also affect the gradient and hence the change in weights. What is shown in this example is just how the two types of regularization, on their own, contribute to a change in weights.
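The two toy trajectories can be reproduced with a short loop (a sketch under the same assumptions as above: a single weight starting at $w_1 = 5$, step size $\eta = \frac{1}{2}$, and no data-fit term in the loss; the guard at 0 simply keeps the L1 update from oscillating around zero):

```python
# Gradient descent on a single weight using only the regularization term.
eta = 0.5
w_l1, w_l2 = 5.0, 5.0

for step in range(1, 16):
    if w_l1 != 0.0:
        # L1 update: w1 := w1 - eta * sign(w1); fixed-size steps towards 0.
        w_l1 -= eta * (1.0 if w_l1 > 0 else -1.0)
    # L2 update: w1 := w1 - eta * w1; each step only halves the weight.
    w_l2 -= eta * w_l2
    print(f"step {step:2d}: L1-regularized w1 = {w_l1:4.1f}, L2-regularized w1 = {w_l2:.6f}")

# After 10 steps the L1-regularized weight is exactly 0 and stays there; the
# L2-regularized weight (5 * 0.5**step) keeps shrinking but never reaches 0.
```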




