【Machine Learning】Other Stuff

2024-01-11 08:44
文章标签 learning machine stuff

本文主要是介绍【Machine Learning】Other Stuff,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

本笔记基于清华大学《机器学习》的课程讲义中有关机器学习的此前未提到的部分,基本为笔者在考试前一两天所作的Cheat Sheet。内容较多,并不详细,主要作为复习和记忆的资料。

Robust Machine Learning

Attack: PGD

  • max ⁡ δ ∈ Δ L o s s ( f θ ( x + δ ) , y ) \max_{\delta\in \Delta}Loss(f_\theta(x+\delta),y) maxδΔLoss(fθ(x+δ),y): find δ \delta δ that maximizes the loss
  • How to compute δ \delta δ?
    • Projected Gradient Descent: GD then project back to Δ \Delta Δ
    • Fast Gradient Sign Method: For example, Δ = { δ : ∣ δ ∣ ∞ ≤ ϵ } \Delta =\{\delta:|\delta|_\infty\le \epsilon\} Δ={δ:δϵ}. Then as learning rate goes to ∞ \infty , we always go to corner. Then δ = η ⋅ s i g n ( ∇ x L θ ( x + δ t , y ) ) \delta=\eta\cdot sign(\nabla_xL_\theta(x+\delta_t,y)) δ=ηsign(xLθ(x+δt,y)), we only take the sign.
    • PGD: runs FGSM multiple times.

Defend: Adversarial Training

  • Daskin’s Theorem:
    ∂ ∂ θ max ⁡ δ L ( f θ ( x + δ , y ) ) = ∂ ∂ θ L ( f θ ( x + δ ∗ , y ) ) \frac{\partial}{\partial \theta }\max_\delta L(f_\theta(x+\delta,y))=\frac{\partial }{\partial \theta}L(f_\theta(x+\delta^*,y)) θδmaxL(fθ(x+δ,y))=θL(fθ(x+δ,y))
    Only care about the worst attack samples. For a batch of samples, find δ ∗ \delta^* δ, then do GD on θ \theta θ based on δ ∗ \delta^* δ.

  • Models become more semantically meaningful by adversarial training

Robust Feature Learning

  • For a dog photo x x x, attack it to x ′ x' x, so that f ( x ′ ) = c a t f(x')=cat f(x)=cat. This ( x ′ , c a t ) (x',cat) (x,cat) training set gives model the “non-robust” features.
  • For train image x x x, generate from random initialization x τ x_\tau xτ such that g ( x ) ≈ g ( x τ ) g(x)\approx g(x_\tau) g(x)g(xτ). x τ x_\tau xτ is also a good training sample that gives similar robust feature to the model.

False Sense of Security

  • Backward pass differentiable approximation(ignore the gradient that is hard to compute).
  • Take multiple samples to solve the randomness.

Provable Robust Certificates

  • Classification problem: use histogram, pick the largest color nearby.
  • Compare the histogram centered at x x x and x + δ x+\delta x+δ:
    • Worst case: Greedy filling(*)

Hyperparameter

Bayesian Optimization

  • estimate the parameters as Gaussian.
  • explore both the great-variance point and the low point.

Gradient descent

  • Directly use GD to find hyperparameter: memory storage problem
  • SGD with momentum: v t + 1 = γ t v t − ( 1 − γ t ) ∇ w L ( w t , θ , t ) v_{t+1}=\gamma_tv_t-(1-\gamma_t)\nabla _wL(w_t,\theta ,t) vt+1=γtvt(1γt)wL(wt,θ,t)
    • Store w t + 1 , v t + 1 w_{t+1},v_{t+1} wt+1,vt+1 and ∇ w L ( w t , θ , t ) \nabla _wL(w_t,\theta,t) wL(wt,θ,t), we can recover the whole chain
  • Need: Continuous

Random Search

  • Work better than grid search

Best Arm identification

  • Successive Halving(SH) Algorithm(*)

    • Per round use B / log ⁡ 2 ( n ) B/\log_2(n) B/log2(n), log ⁡ 2 ( n ) \log_2(n) log2(n) rounds totally.
    • Each round, every survivor use the same budget. Only half of them survive in each round.
  • Assume v 1 > v 2 ≥ . . . ≥ v n , Δ i = v 1 − v i v_1>v_2\ge ...\ge v_n,\Delta_i=v_1-v_i v1>v2...vn,Δi=v1vi. With probability 1 − δ 1-\delta 1δ, the algorithm finds the best arm with B = O ( max ⁡ i > 1 i Δ i 2 log ⁡ n log ⁡ ( log ⁡ n / δ ) ) B=O(\max_{i>1}\frac{i}{\Delta_i^2}\log n\log(\log n/\delta)) B=O(maxi>1Δi2ilognlog(logn/δ)) assuming v i ∈ [ 0 , 1 ] v_i\in [0,1] vi[0,1]

    • Proof. Concentration inequaility, probability that 1 3 \frac{1}{3} 31 of 3 4 N r \frac{3}{4}N_r 43Nr smaller ones are greater than best one is bounded, union bound for all round.
  • Hyperparameter tuning

    • B ≥ 2 log ⁡ 2 ( n ) ( n + ∑ i = 2... n γ ˉ − 1 ( v i − v 1 2 ) ) B\ge 2\log_2(n)\left(n+\sum_{i=2...n}\bar{\gamma}^{-1}(\frac{v_i-v_1}{2})\right) B2log2(n)(n+i=2...nγˉ1(2viv1))

Neural Architecture Search

  • Reinforcement learning
  • ProxylessNAS: each architecture has weight α i \alpha_i αi to be chosen. Choose a binary variable, that is only one architecture exists, and the probability is about α i \alpha_i αi. ∂ L ∂ α i ≈ ∂ L ∂ g i ∂ p i ∂ α i \frac{\partial L}{\partial \alpha_i}\approx\frac{\partial L}{\partial g_i}\frac{\partial p_i}{\partial \alpha_i} αiLgiLαipi, g i = 0 / 1 g_i=0/1 gi=0/1, ∂ L / ∂ g j \partial L/\partial g_j L/gj means the influence of architecture j j j to L L L.

Machine Learning Augmented

Differencial Privacy

  • Randomness is essential.

    • If we have a non-trivial deterministic algorithm, we can change data base A A A to B B B one by one until their output is the same. There exists a pair of databases differ by one row, then we know about that row.
  • Database: histogram N ∣ X ∣ N^{|X|} NX for discrete set X X X(the categories).

  • M M M is ( ϵ , δ ) (\epsilon,\delta ) (ϵ,δ)-differentially privacy if ∀ x , y ∈ N ∣ X ∣ , ∣ x − y ∣ 1 ≤ 1 , S ⊆ M ( N ∣ X ∣ ) \forall x,y\in N^{|X|},|x-y|_1\le 1,S\subseteq M(N^{|X|}) x,yNX,xy11,SM(NX)
    P ( M ( x ) ∈ S ) ≤ e ϵ P ( M ( y ) ∈ S ) + δ P(M(x)\in S)\le e^{\epsilon}P(M(y)\in S)+\delta P(M(x)S)eϵP(M(y)S)+δ

  • Differencial privacy is immune to post-processing. Proof:

    • For deterministic function: Proved easily by inverse function
    • Random function is convex combination of deterministic function, that is, each deterministic function is chosen with some probability.
  • With ( ϵ , 0 ) (\epsilon,0) (ϵ,0)-differential privacy(DP mechanism), the voting result will not change too much by changing one’s vote.
    E a ∼ f ( M ( x ) ) [ u i ( a ) ] = ∑ a ∈ A u i ( a ) Pr ⁡ f ( M ( x ) ) [ a ] ≤ ∑ a ∈ A u i ( a ) exp ⁡ ( ϵ ) Pr ⁡ f ( M ( y ) ) [ a ] = exp ⁡ ( ϵ ) E a ∼ f ( M ( y ) ) [ u i ( a ) ] \begin{align*} E_{a\sim f(M(x))}[u_i(a)]&=\sum_{a\in A} u_i(a)\Pr_{f(M(x))}[a]\\&\le \sum_{a\in A}u_i(a)\exp(\epsilon)\Pr_{f(M(y))}[a]\\&=\exp(\epsilon)E_{a\sim f(M(y))}[u_i(a)] \end{align*} Eaf(M(x))[ui(a)]=aAui(a)f(M(x))Pr[a]aAui(a)exp(ϵ)f(M(y))Pr[a]=exp(ϵ)Eaf(M(y))[ui(a)]
    f f f is a map to the feature, and we only consider the expected voting utility u i ( a ) u_i(a) ui(a) about the feature a a a.

  • Laplace Mechanism: reach ( ϵ , 0 ) (\epsilon,0) (ϵ,0)-DP by just adding a random gaussian noise to f ( x ) , x ∈ N ∣ X ∣ f(x),x\in N^{|X|} f(x),xNX

    • Assume f : N ∣ X ∣ → R k f:N^{|X|}\to \R^{k} f:NXRk

    • Definition: M L ( x , f , ϵ ) = f ( x ) + ( Y 1 , . . , Y k ) M_L(x,f,\epsilon)=f(x)+(Y_1,..,Y_k) ML(x,f,ϵ)=f(x)+(Y1,..,Yk)

      • Y i Y_i Yi is iid random variable from L a p ( Δ f ϵ ) Lap(\frac{\Delta f }{\epsilon}) Lap(ϵΔf)
      • l 1 l_1 l1 sensitive of f f f is Δ f = max ⁡ x , y , ∣ x − y ∣ 1 ≤ 1 ∣ f ( x ) − f ( y ) ∣ 1 \Delta f=\max_{x,y,|x-y|_1\le 1}|f(x)-f(y)|_1 Δf=maxx,y,xy11f(x)f(y)1. How sensitive f f f is by changing one entry a little.
      • Probability density: L a p ( b ) = 1 2 b exp ⁡ ( − ∣ x ∣ b ) Lap(b)=\frac{1}{2b}\exp(-\frac{|x|}{b}) Lap(b)=2b1exp(bx). Variance σ 2 = 2 b 2 \sigma^2=2b^2 σ2=2b2.
    • Proof. The probability density of M L ( x , f , ϵ ) M_L(x,f,\epsilon) ML(x,f,ϵ) is p x ( z ) = ∏ i = 1 k exp ⁡ ( − ϵ ∣ f ( x ) i − z i ∣ Δ f ) p_x(z)=\prod_{i=1}^k\exp\left(-\frac{\epsilon|f(x)_i-z_i|}{\Delta f}\right) px(z)=i=1kexp(Δfϵf(x)izi). Easily prove p x ( z ) / p y ( z ) ≤ exp ⁡ ( ϵ ) p_x(z)/p_y(z)\le \exp(\epsilon) px(z)/py(z)exp(ϵ)

    • Accuracy: for δ ∈ ( 0 , 1 ] \delta\in(0,1] δ(0,1],
      Pr ⁡ [ ∣ f ( x ) − M L ( x , f , ϵ ) ∣ ∞ ≥ ln ⁡ ( k δ ) ⋅ ( Δ f ϵ ) ] ≤ δ \Pr\left[|f(x)-M_L(x,f,\epsilon)|_\infty\ge \ln\left(\frac{k}{\delta}\right)\cdot \left(\frac{\Delta f}{\epsilon}\right)\right]\le \delta Pr[f(x)ML(x,f,ϵ)ln(δk)(ϵΔf)]δ

      • Directly prove by Pr ⁡ [ ∣ Y ∣ ≥ t ⋅ b ] = exp ⁡ ( − t ) \Pr[|Y|\ge t\cdot b]=\exp(-t) Pr[Ytb]=exp(t) for Y ∼ L a p ( b ) Y\sim Lap(b) YLap(b)

Big data

  • Idea: data distribution rarely changes.
  • Replace B-tree with a model. Know err_min err_max because you only care about the data in your database
    • advantage: less storage, faster lookups, more parallelism, hardware accelators
    • use linear model(fast)
    • Do exponential search since the prediction is good enough
    • Bloom Filter: is the key in my set=>maybe yes/no

Low Rank Apporximation

  • A ∈ R n × m A\in \mathbb{R}^{n\times m} ARn×m, [ A ] k = arg ⁡ min ⁡ r a n k ( A ′ ) ≤ k ∣ A − A ′ ∣ F [A]_k=\arg \min_{rank(A')\le k} |A-A'|_F [A]k=argminrank(A)kAAF
  • Learn S ∈ R p × m S\in \mathbb{R}^{p\times m} SRp×m and do low rank decomposition for S A SA SA, then do some recover.

Recified flow

  • Given empirical observations about distribution π 0 \pi_0 π0 and π 1 \pi_1 π1. Find a transport map T T T that T ( Z ) ∼ π 1 T(Z)\sim \pi_1 T(Z)π1 for Z ∼ π 0 Z\sim \pi_0 Zπ0.
    • E.g. π 0 \pi_0 π0 is gaussian noise, π 1 \pi_1 π1 are data containing pictures. We want to find a map from noise to picture(diffusion) that has some features.
    • If T ( X 0 ) = X 1 T(X_0)=X_1 T(X0)=X1, then we want v ( t X 0 + ( 1 − t ) X 1 , t ) = d X d t = X 1 − X 0 v(tX_0+(1-t)X_1,t)=\frac{dX}{dt}=X_1-X_0 v(tX0+(1t)X1,t)=dtdX=X1X0. The transformation is linear. So that one cluster will map to another clusters.
    • Minimize the loss E [ ∥ ( X 1 − X 0 ) − v ( X t , t ) ∥ 2 ] \mathbb{E}[\|(X_1-X_0)-v(X_t,t)\|^2] E[(X1X0)v(Xt,t)2]
  • Algorithm:
    • randomly connect π 0 , π 1 \pi_0,\pi_1 π0,π1
    • minimize loss of $\theta $ for v θ v_\theta vθ
    • connect π 0 , π 1 \pi_0,\pi_1 π0,π1 by v θ v_\theta vθ again, and repeat minimization.

Rope Attention

  • Embed position to q m , k n q_m,k_n qm,kn for attention by rotation in complex field
    • β \beta β-base system: focus on different accuracy level
  • use inner product of q m , k n q_m,k_n qm,kn to represent their similarity
  • Softmax attention Attention ( Q , K , V ) = softmax ( Q K ⊤ ) V \text{Attention}(Q,K,V)=\text{softmax}(QK^\top)V Attention(Q,K,V)=softmax(QK)V
  • Faster computation
    • linear
    • softmax separately
    • design Q K ⊤ QK^\top QK

这篇关于【Machine Learning】Other Stuff的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/593787

相关文章

简单的Q-learning|小明的一维世界(3)

简单的Q-learning|小明的一维世界(1) 简单的Q-learning|小明的一维世界(2) 一维的加速度世界 这个世界,小明只能控制自己的加速度,并且只能对加速度进行如下三种操作:增加1、减少1、或者不变。所以行动空间为: { u 1 = − 1 , u 2 = 0 , u 3 = 1 } \{u_1=-1, u_2=0, u_3=1\} {u1​=−1,u2​=0,u3​=1}

简单的Q-learning|小明的一维世界(2)

上篇介绍了小明的一维世界模型 、Q-learning的状态空间、行动空间、奖励函数、Q-table、Q table更新公式、以及从Q值导出策略的公式等。最后给出最简单的一维位置世界的Q-learning例子,从给出其状态空间、行动空间、以及稠密与稀疏两种奖励函数的设置方式。下面将继续深入,GO! 一维的速度世界 这个世界,小明只能控制自己的速度,并且只能对速度进行如下三种操作:增加1、减

ZOJ 3324 Machine(线段树区间合并)

这道题网上很多代码是错误的,由于后台数据水,他们可以AC。 比如这组数据 10 3 p 0 9 r 0 5 r 6 9 输出应该是 0 1 1 所以有的人直接记录该区间是否被覆盖过的方法是错误的 正确方法应该是记录这段区间的最小高度(就是最接近初始位置的高度),和最小高度对应的最长左区间和右区间 开一个sum记录这段区间最小高度的块数,min_v 记录该区间最小高度 cover

Learning Memory-guided Normality for Anomaly Detection——学习记忆引导的常态异常检测

又是一篇在自编码器框架中研究使用记忆模块的论文,可以看做19年的iccv的论文的衍生,在我的博客中对19年iccv这篇论文也做了简单介绍。韩国人写的,应该是吧,这名字听起来就像。 摘要abstract 我们解决异常检测的问题,即检测视频序列中的异常事件。基于卷积神经网络的异常检测方法通常利用代理任务(如重建输入视频帧)来学习描述正常情况的模型,而在训练时看不到异常样本,并在测试时使用重建误

Learning Temporal Regularity in Video Sequences——视频序列的时间规则性学习

Learning Temporal Regularity in Video Sequences CVPR2016 无监督视频异常事件检测早期工作 摘要 由于对“有意义”的定义不明确以及场景混乱,因此在较长的视频序列中感知有意义的活动是一个具有挑战性的问题。我们通过在非常有限的监督下使用多种来源学习常规运动模式的生成模型(称为规律性)来解决此问题。体来说,我们提出了两种基于自动编码器的方法,以

COD论文笔记 Adaptive Guidance Learning for Camouflaged Object Detection

论文的主要动机、现有方法的不足、拟解决的问题、主要贡献和创新点如下: 动机: 论文的核心动机是解决伪装目标检测(COD)中的挑战性任务。伪装目标检测旨在识别和分割那些在视觉上与周围环境高度相似的目标,这对于计算机视觉来说是非常困难的任务。尽管深度学习方法在该领域取得了一定进展,但现有方法仍面临有效分离目标和背景的难题,尤其是在伪装目标与背景特征高度相似的情况下。 现有方法的不足之处: 过于

One-Shot Imitation Learning

发表时间:NIPS2017 论文链接:https://readpaper.com/pdf-annotate/note?pdfId=4557560538297540609&noteId=2424799047081637376 作者单位:Berkeley AI Research Lab, Work done while at OpenAI Yan Duan†§ , Marcin Andrychow

Introduction to Deep Learning with PyTorch

1、Introduction to PyTorch, a Deep Learning Library 1.1、Importing PyTorch and related packages import torch# supports:## image data with torchvision## audio data with torchaudio## text data with t

《Learning To Count Everything》CVPR2021

摘要 论文提出了一种新的方法来解决视觉计数问题,即在给定类别中仅有少量标注实例的情况下,对任何类别的对象进行计数。将计数问题视为一个少样本回归任务,并提出了一种新颖的方法,该方法通过查询图像和查询图像中的少量示例对象来预测图像中所有感兴趣对象的存在密度图。此外,还提出了一种新颖的适应策略,使网络能够在测试时仅使用新类别中的少量示例对象来适应任何新的视觉类别。为了支持这一任务,作者还引入了一个包含

One-Shot Imitation Learning with Invariance Matching for Robotic Manipulation

发表时间:5 Jun 2024 论文链接:https://readpaper.com/pdf-annotate/note?pdfId=2408639872513958656&noteId=2408640378699078912 作者单位:Rutgers University Motivation:学习一个通用的policy,可以执行一组不同的操作任务,是机器人技术中一个有前途的新方向。然而,