Exclusive | Tencent AI Lab Oral Paper at AAAI 2018: Training L1-Regularized Models with Orthant-Wise Passive Descent Algorithms





Preface: Tencent AI Lab had 12 papers accepted to AAAI 2018, a top international conference in artificial intelligence held in New Orleans, USA. The Tencent Technology & Engineering official account exclusively translated the paper "Training L1-Regularized Models with Orthant-Wise Passive Descent Algorithms", which was accepted as an oral presentation at AAAI 2018. The paper was completed independently by Tencent AI Lab; the author is Jianqiao Wangni.


Summary (translated from Chinese)

L1-regularized models are a commonly used approach for analyzing high-dimensional data. For such models on modern large-scale internet data, better optimization algorithms can speed up convergence, and thereby significantly improve model accuracy within a limited time or reduce the demand on server resources. Classic stochastic gradient descent (SGD) applies to many models, including neural networks, but it cannot handle the non-differentiability of the L1 norm.


In this paper, we propose a new stochastic optimization method, the Orthant-Wise Passive Descent Algorithm (OPDA). The starting point is that the L1 norm is continuously differentiable within any single orthant; therefore, after each update, the model parameters are projected back onto the orthant of the previous iterate. We use the stochastic variance-reduced gradient (SVRG) method to generate the gradient direction and combine it with a quasi-Newton method to exploit second-order information of the loss function. We also propose a new orthant projection scheme that makes the algorithm converge to a sparser optimum. Theoretically, we prove that the algorithm converges linearly for strongly convex and smooth loss functions. On typical sparse datasets such as RCV1, we test the algorithm on L1/L2-regularized logistic regression under different parameter settings; it clearly outperforms the existing linearly convergent algorithm Proximal-SVRG, and it also surpasses Proximal-SGD and other algorithms in convolutional neural network (CNN) experiments, showing that the algorithm performs well on both convex and nonconvex objectives.

 

Full Text of the English Presentation

Hello, everyone, I am Jianqiao Wangni, from Tencent AI Lab.


1. Introduction: Learning sparse representations has been a very important task in data analysis. For example, genetic analysis in biology usually involves millions of genes for a single individual. In financial time-series prediction and online advertising, there are also many cases where the number of samples is even smaller than the data dimension, which is an ill-conditioned problem without a sparse prior. So, for conventional models such as logistic regression and linear regression, we add an L1-norm regularization, the sum of the absolute values of the parameters, to build robust applications on high-dimensional data; it is very powerful for learning sparse representations. To give an intuitive example: the blue areas are the constraint regions, the L1-norm ball on the left and the L2-norm ball on the right, while the red circles are the contours of the average squared loss. The intersection point between a ball and the contours is the solution of the corresponding regularized model. We can see that the solution of the L1-regularized model lies near the Y-axis, which means the element in the X dimension is closer to zero.

 

2. Formal Definition: Now we come to the analytical part. We study a regularized function P(x) that equals F(x) + R(x), where F(x) is the average of N loss functions, each depending on one data sample, and R(x) is the L1 regularization. We also assume that each loss function is twice differentiable, strongly convex and smooth, which are standard assumptions in convex optimization. The L1 norm itself is not differentiable. One of the most representative optimization methods is the proximal method, which iteratively takes a gradient descent step and then solves a proximal problem at the current point.
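As a concrete illustration of the proximal method mentioned here, below is a minimal sketch of one proximal gradient iteration for P(x) = F(x) + lambda * ||x||_1; the L1 proximal problem has the closed-form soft-thresholding solution. The function names and NumPy usage are illustrative, not taken from the paper.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1, i.e. soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def proximal_gradient_step(x, grad_F, step, lam):
    """One iteration of the proximal method for P(x) = F(x) + lam * ||x||_1:
    a gradient step on the smooth part F, then the L1 proximal problem."""
    return soft_threshold(x - step * grad_F(x), step * lam)
```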

 

3. Reference 1: Our primary reference is the orthant-wise limited-memory quasi-Newton method, OWL-QN, which is based on L-BFGS, a representative quasi-Newton method, and which overcame the obstacle that the L1 norm is not differentiable. This method restricts the updated parameter to stay within a certain orthant, because within any single orthant the absolute value function is actually differentiable. A key component of OWL-QN is the subgradient at zero points. The subgradient of the L1 regularization R(x) can be either positive lambda or negative lambda. Take the third branch as an example: we look at a single dimension, the i-th dimension, of the current point, X_i, and of the gradient, V_i. If X_i equals zero and V_i plus lambda is negative, then the subgradient is set to V_i plus positive lambda, since after subtracting this subgradient X_i becomes a positive value, and the subgradient of R(x) will still be positive lambda; this keeps the subgradient of the L1 norm consistent across one iteration.
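A minimal sketch of the OWL-QN-style pseudo-gradient rule described above, including the zero-coordinate branch discussed in the talk; the vectorized NumPy form and the names are my own, not taken from the paper.

```python
import numpy as np

def pseudo_gradient(x, grad_F, lam):
    """OWL-QN-style pseudo-gradient of F(x) + lam * ||x||_1.
    grad_F is the gradient vector of the smooth loss F at x (V in the talk).
    On nonzero coordinates, the sign of x_i determines the derivative of
    lam * |x_i|; on zero coordinates, the branch is chosen so that the L1
    subgradient stays consistent after one descent step (otherwise zero)."""
    nonzero_pg = np.where(x > 0, grad_F + lam,
                 np.where(x < 0, grad_F - lam, 0.0))
    right = grad_F + lam   # one-sided derivative when moving into x_i > 0
    left = grad_F - lam    # one-sided derivative when moving into x_i < 0
    zero_pg = np.where(right < 0, right, np.where(left > 0, left, 0.0))
    return np.where(x == 0, zero_pg, nonzero_pg)
```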

 

4. Reference 2: To process massive internet data, many optimization methods have been proposed to speed up training. Stochastic gradient descent, SGD, is a popular choice for optimization. However, SGD generally needs a decreasing step size to converge, so it only has a sublinear convergence rate. Recently, stochastic variance reduction methods such as SVRG and SAGA have been shown to converge without decreasing step sizes and to achieve linear convergence rates on smooth and strongly convex models. In SGD, the descent direction at the k-th step, v_k, is evaluated on a stochastic subset S_k of the dataset. In SVRG, we periodically compute a full gradient, which depends on all the data, at a reference point, say tilde-x. This full gradient forms the third term of the SVRG direction v_k; to keep the direction unbiased in expectation, we subtract a stochastic gradient at tilde-x computed on the same subset S_k.
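A minimal sketch of the SVRG estimate just described; grad_batch stands for the mini-batch gradient on the sampled subset S_k and is an illustrative name, not an API of any library.

```python
def svrg_direction(x, x_tilde, full_grad_tilde, grad_batch):
    """Variance-reduced gradient estimate of SVRG:
    v_k = grad_{S_k} F(x) - grad_{S_k} F(x_tilde) + grad F(x_tilde).
    full_grad_tilde is the full gradient at the reference point x_tilde,
    recomputed only periodically; grad_batch(w) returns the gradient of the
    loss on the sampled subset S_k at the point w."""
    return grad_batch(x) - grad_batch(x_tilde) + full_grad_tilde
```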

 

5. The First Step: Now we proceed to our method. Although OWL-QN is efficient, it can be further improved, for example by dropping the line-search procedure, or by using a stochastic gradient instead of the accurate but costly full gradient. Inspired by SVRG and OWL-QN, we develop a stochastic optimization method for L1-regularized models. This is more complicated than the smooth case. In the first step, we compute the SVRG estimate of the gradient of the loss function F(x); then we use the idea from OWL-QN to push the descent direction toward maintaining the same orthant after one iteration, and this orthant is used as the reference orthant. Our actual descent direction V_k is calculated as in the third equation here.
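Since the slide equation is not reproduced in this transcript, the following is only one plausible reading of this first step, given as a hedged sketch: form the SVRG estimate of the gradient of F and feed it into the OWL-QN-style sign rule from the earlier sketch, so the resulting direction refers to a consistent orthant. It reuses pseudo_gradient() and svrg_direction() defined above.

```python
def opda_first_step(x, x_tilde, full_grad_tilde, grad_batch, lam):
    """Hedged sketch of the first step: the SVRG estimate of grad F plays
    the role of V in the OWL-QN rule, producing a descent direction v_k
    whose coordinates are consistent with a reference orthant. This is an
    interpretation of the talk, not the exact equation on the slide."""
    svrg_grad = svrg_direction(x, x_tilde, full_grad_tilde, grad_batch)
    return pseudo_gradient(x, svrg_grad, lam)
```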

 

6. The Second-Order Information: At this point, we have an optional choice for the descent direction: we can use second-order information of the loss function, by computing the Hessian matrix at the current point or by estimating an approximate Hessian matrix, as in quasi-Newton methods. The descent direction D_k is then obtained by minimizing the following quadratic expansion around the current point. If we use L-BFGS, we do not actually need to perform the matrix inversion in the equation; it can be done through efficient matrix-vector multiplications. We may also directly assign V_k to D_k, as in a typical first-order method. After this step, the orthant of the direction D_k has to be consistent with V_k, which means that if some dimensions of D_k have signs different from those of V_k, they are aligned to zero; we denote the aligned version by P_k.
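A minimal sketch of the sign-consistency step just described: coordinates of the (possibly second-order) direction D_k whose signs disagree with V_k are zeroed, giving the aligned direction P_k. The function name is mine.

```python
import numpy as np

def align_with(d, v):
    """Zero out the coordinates of the direction d whose signs differ
    from those of v, so the aligned direction (P_k in the talk) stays in
    the orthant indicated by v."""
    return np.where(np.sign(d) == np.sign(v), d, 0.0)
```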

 

7. Final Updates: The calculation above does not explicitly involve the partial derivative of R(x), the L1 regularization, except as the alignment reference, since we want to avoid adding extra variance to the stochastic gradient. To make the solution sparse, we introduce a novel alignment operator that encourages zero elements: if X and Y have different signs, or the absolute value of X is below a threshold, X is forced to zero. With this alignment operator, each time we finish the previous calculation we check whether the next point lies in the same orthant as the current point; if not, the offending dimensions of the next point are set to zero. After this step, clearly more dimensions of X_k become exactly zero instead of small but nonzero values.
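A hedged sketch of the alignment operator described above, applied after the parameter update; the threshold eps is an illustrative parameter, and the paper's actual choice may differ.

```python
import numpy as np

def alignment_operator(x_next, x_curr, eps=1e-10):
    """Force a coordinate of the next point to zero if it leaves the
    orthant of the current point (different sign) or if its magnitude is
    below the threshold eps, so that the iterates become genuinely sparse."""
    keep = (np.sign(x_next) == np.sign(x_curr)) & (np.abs(x_next) > eps)
    return np.where(keep, x_next, 0.0)
```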

 

8. Convergence Analysis: In the paper, we prove that, under the assumptions of smoothness and strong convexity, our method converges at a linear rate. Due to the time limit, we will not go into the details of this heavy mathematical analysis.

 

9. Synthetic Experiments: To visualize the comparison, we plot the optimization trajectories on a simple two-dimensional synthetic function in this figure. Our method, OPDA, is shown by the red line, and the baseline, proximal gradient descent, is shown by the blue line. After the same number of iterations, we see that our method converges toward the minimum faster.

 

10. Experiments on Convex Cases: We also run experiments on logistic regression with both L2 and L1 regularization for binary classification. In this part, we compare our method with Proximal-SVRG, which is also linearly convergent. In the figure, the Y-axis shows the suboptimality, i.e., how far the current objective value is from the minimum, and the X-axis represents the number of data passes. We consistently find that OPDA runs faster than Proximal-SVRG across different step sizes and different L2 and L1 regularization strengths.

 

11. Experiments on Deep Learning: We also conducted experiments on L1-regularized convolutional neural networks, or sparse CNNs, to demonstrate the efficiency in the nonconvex case. This application is also useful for reducing the parameter size of neural networks. The red line represents our method, and the blue line is Proximal-SVRG. We test different scales of L1 regularization. We see that OPDA converges faster than Proximal-SVRG, and the gap widens as the L1 regularization becomes stronger. In fact, due to the orthant-wise nature of our method, many dimensions of the descent direction and of the updated points are forced to zero during alignment, which effectively reduces the step size; nevertheless, our method still converges much faster than the proximal methods with the same step size. This shows that our proposed alignment operator does calibrate the direction for the better, making the overall framework more efficient in terms of iterations, with only negligible extra arithmetic operations.

 

