Why L1 norm for sparse models?

2024-06-14 12:32
Tags: norm, models, l1, sparse


Explanation 1

Consider the vector $\vec{x} = (1, \varepsilon) \in \mathbb{R}^2$, where $\varepsilon > 0$ is small. The $\ell_1$ norm and the squared $\ell_2$ norm of $\vec{x}$, respectively, are given by

$$\|\vec{x}\|_1 = 1 + \varepsilon, \qquad \|\vec{x}\|_2^2 = 1 + \varepsilon^2$$

Now say that, as part of some regularization procedure, we are going to reduce the magnitude of one of the elements of $\vec{x}$ by $\delta \le \varepsilon$. If we change $x_1$ to $1 - \delta$, the resulting norms are

$$\|\vec{x} - (\delta, 0)\|_1 = 1 - \delta + \varepsilon, \qquad \|\vec{x} - (\delta, 0)\|_2^2 = 1 - 2\delta + \delta^2 + \varepsilon^2$$

On the other hand, reducing $x_2$ by $\delta$ gives norms

$$\|\vec{x} - (0, \delta)\|_1 = 1 - \delta + \varepsilon, \qquad \|\vec{x} - (0, \delta)\|_2^2 = 1 - 2\varepsilon\delta + \delta^2 + \varepsilon^2$$

The thing to notice here is that, for an $\ell_2$ penalty, regularizing the larger term $x_1$ results in a much greater reduction in norm than doing the same to the smaller term $x_2 \approx 0$. For the $\ell_1$ penalty, however, the reduction is the same. Thus, when penalizing a model using the $\ell_2$ norm, it is highly unlikely that anything will ever be set to zero, since the reduction in squared $\ell_2$ norm going from $\varepsilon$ to $0$ is almost nonexistent when $\varepsilon$ is small. On the other hand, the reduction in $\ell_1$ norm is always equal to $\delta$, regardless of which quantity is being penalized.
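The algebra above is easy to check numerically. The following is a minimal sketch, assuming NumPy and the illustrative values $\varepsilon = 0.01$ and $\delta = 0.005$ (not taken from the original text), that reproduces the four reductions:

```python
import numpy as np

eps, delta = 0.01, 0.005               # illustrative values with delta <= eps
x = np.array([1.0, eps])

def l1(v):
    return np.abs(v).sum()             # ||v||_1

def l2_sq(v):
    return (v ** 2).sum()              # ||v||_2^2

x_big_reduced = x - np.array([delta, 0.0])    # shrink the large element x1
x_small_reduced = x - np.array([0.0, delta])  # shrink the small element x2

# l1 norm: both reductions are identical and equal to delta
print(l1(x) - l1(x_big_reduced))     # ~0.005
print(l1(x) - l1(x_small_reduced))   # ~0.005

# squared l2 norm: shrinking x1 helps a lot, shrinking x2 barely matters
print(l2_sq(x) - l2_sq(x_big_reduced))    # 2*delta - delta**2      ~0.009975
print(l2_sq(x) - l2_sq(x_small_reduced))  # 2*eps*delta - delta**2  ~0.000075
```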

Another way to think of it: it's not so much that $\ell_1$ penalties encourage sparsity, but that $\ell_2$ penalties in some sense discourage sparsity by yielding diminishing returns as elements are moved closer to zero.

Explanation 2

By a sparse model, we mean a model in which many of the weights are 0. Let us therefore reason about why L1 regularization is more likely to create weights that are exactly 0.

Consider a model consisting of the weights $(w_1, w_2, \dots, w_m)$.

With L1 regularization, you penalize the model by the loss function $L_1(w) = \sum_i |w_i|$.

With L2 regularization, you penalize the model by the loss function $L_2(w) = \frac{1}{2}\sum_i w_i^2$.

If using gradient descent, you will iteratively make the weights change in the opposite direction of the gradient with a step size $\eta$. Let us look at the gradients:

$$\frac{dL_1(w)}{dw} = \mathrm{sign}(w), \quad \text{where } \mathrm{sign}(w) = \left(\frac{w_1}{|w_1|}, \frac{w_2}{|w_2|}, \dots, \frac{w_m}{|w_m|}\right)$$

$$\frac{dL_2(w)}{dw} = w$$
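As a quick sanity check of these two gradients, here is a small NumPy sketch; the weight vector is an arbitrary illustrative assumption, not from the original:

```python
import numpy as np

w = np.array([5.0, -3.0, 0.5, -0.01])   # arbitrary illustrative weights

grad_l1 = np.sign(w)   # dL1/dw: entries are +1 or -1 (0 only where w_i = 0)
grad_l2 = w            # dL2/dw for L2(w) = 0.5 * sum(w_i ** 2)

print(grad_l1)   # [ 1. -1.  1. -1.]        same magnitude for every weight
print(grad_l2)   # [ 5.   -3.    0.5 -0.01] shrinks as w_i approaches 0
```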

If we plot the loss function and its derivative for a model consisting of just a single parameter, it looks like this for L1:

[figure: the L1 penalty $|w_1|$ and its derivative $\mathrm{sign}(w_1)$]

And like this for L2:

[figure: the L2 penalty $\frac{1}{2}w_1^2$ and its derivative $w_1$]

Notice that for L1, the gradient is either $1$ or $-1$, except when $w_1 = 0$. That means that L1 regularization will move any weight towards 0 with the same step size, regardless of the weight's value. In contrast, you can see that the L2 gradient decreases linearly towards 0 as the weight approaches 0. Therefore, L2 regularization will also move any weight towards 0, but it will take smaller and smaller steps as the weight approaches 0.

Try to imagine that you start with a model with $w_1 = 5$ and use $\eta = \frac{1}{2}$. In the following picture, you can see how gradient descent using L1 regularization makes 10 updates $w_1 := w_1 - \eta \cdot \frac{dL_1(w)}{dw} = w_1 - 0.5 \cdot 1$, until it reaches a model with $w_1 = 0$:

[figure: the L1 gradient-descent updates, stepping from $w_1 = 5$ down to $w_1 = 0$ in increments of $0.5$]

In contrast, with L2 regularization where $\eta = \frac{1}{2}$, the gradient is $w_1$, causing every step to take the weight only halfway towards 0. That is, we make the update $w_1 := w_1 - \eta \cdot \frac{dL_2(w)}{dw} = w_1 - 0.5 \cdot w_1$.

Therefore, the model never reaches a weight of 0, regardless of how many steps we take:

[figure: the L2 gradient-descent updates, halving $w_1$ at each step and never reaching $0$]

Note that L2 regularization can set a weight exactly to zero if the step size $\eta$ is so large that a single step reaches or overshoots zero. In practice, however, the loss function will also contain a term measuring the error of the model with respect to the given weights, and that term also affects the gradient and hence the change in weights. What is shown in this example is just how the two types of regularization, on their own, contribute to a change in weights.
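The two update schedules described above are easy to reproduce in a few lines of plain Python. The sketch below is a minimal illustration under the same assumptions as the discussion above ($w_1 = 5$, $\eta = 0.5$, 10 steps, and no data-fit term):

```python
def descend(w, grad, eta=0.5, steps=10):
    """Run `steps` gradient-descent updates on the penalty term alone."""
    path = [w]
    for _ in range(steps):
        w = w - eta * grad(w)
        path.append(w)
    return path

def l1_grad(w):
    # dL1/dw = sign(w) for a single weight
    return (w > 0) - (w < 0)

def l2_grad(w):
    # dL2/dw = w for a single weight
    return w

l1_path = descend(5.0, l1_grad)   # [5.0, 4.5, 4.0, ..., 0.5, 0.0] -- hits 0 after 10 steps
l2_path = descend(5.0, l2_grad)   # [5.0, 2.5, 1.25, ...]          -- halved each step, never 0

print(l1_path)
print(l2_path)
```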
