本文主要是介绍【Machine Learning】Other Stuff,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!
本笔记基于清华大学《机器学习》的课程讲义中有关机器学习的此前未提到的部分,基本为笔者在考试前一两天所作的Cheat Sheet。内容较多,并不详细,主要作为复习和记忆的资料。
Robust Machine Learning
Attack: PGD
- max δ ∈ Δ L o s s ( f θ ( x + δ ) , y ) \max_{\delta\in \Delta}Loss(f_\theta(x+\delta),y) maxδ∈ΔLoss(fθ(x+δ),y): find δ \delta δ that maximizes the loss
- How to compute δ \delta δ?
- Projected Gradient Descent: GD then project back to Δ \Delta Δ
- Fast Gradient Sign Method: For example, Δ = { δ : ∣ δ ∣ ∞ ≤ ϵ } \Delta =\{\delta:|\delta|_\infty\le \epsilon\} Δ={δ:∣δ∣∞≤ϵ}. Then as learning rate goes to ∞ \infty ∞, we always go to corner. Then δ = η ⋅ s i g n ( ∇ x L θ ( x + δ t , y ) ) \delta=\eta\cdot sign(\nabla_xL_\theta(x+\delta_t,y)) δ=η⋅sign(∇xLθ(x+δt,y)), we only take the sign.
- PGD: runs FGSM multiple times.
Defend: Adversarial Training
-
Daskin’s Theorem:
∂ ∂ θ max δ L ( f θ ( x + δ , y ) ) = ∂ ∂ θ L ( f θ ( x + δ ∗ , y ) ) \frac{\partial}{\partial \theta }\max_\delta L(f_\theta(x+\delta,y))=\frac{\partial }{\partial \theta}L(f_\theta(x+\delta^*,y)) ∂θ∂δmaxL(fθ(x+δ,y))=∂θ∂L(fθ(x+δ∗,y))
Only care about the worst attack samples. For a batch of samples, find δ ∗ \delta^* δ∗, then do GD on θ \theta θ based on δ ∗ \delta^* δ∗. -
Models become more semantically meaningful by adversarial training
Robust Feature Learning
- For a dog photo x x x, attack it to x ′ x' x′, so that f ( x ′ ) = c a t f(x')=cat f(x′)=cat. This ( x ′ , c a t ) (x',cat) (x′,cat) training set gives model the “non-robust” features.
- For train image x x x, generate from random initialization x τ x_\tau xτ such that g ( x ) ≈ g ( x τ ) g(x)\approx g(x_\tau) g(x)≈g(xτ). x τ x_\tau xτ is also a good training sample that gives similar robust feature to the model.
False Sense of Security
- Backward pass differentiable approximation(ignore the gradient that is hard to compute).
- Take multiple samples to solve the randomness.
Provable Robust Certificates
- Classification problem: use histogram, pick the largest color nearby.
- Compare the histogram centered at x x x and x + δ x+\delta x+δ:
- Worst case: Greedy filling(*)
Hyperparameter
Bayesian Optimization
- estimate the parameters as Gaussian.
- explore both the great-variance point and the low point.
Gradient descent
- Directly use GD to find hyperparameter: memory storage problem
- SGD with momentum: v t + 1 = γ t v t − ( 1 − γ t ) ∇ w L ( w t , θ , t ) v_{t+1}=\gamma_tv_t-(1-\gamma_t)\nabla _wL(w_t,\theta ,t) vt+1=γtvt−(1−γt)∇wL(wt,θ,t)
- Store w t + 1 , v t + 1 w_{t+1},v_{t+1} wt+1,vt+1 and ∇ w L ( w t , θ , t ) \nabla _wL(w_t,\theta,t) ∇wL(wt,θ,t), we can recover the whole chain
- Need: Continuous
Random Search
- Work better than grid search
Best Arm identification
-
Successive Halving(SH) Algorithm(*)
- Per round use B / log 2 ( n ) B/\log_2(n) B/log2(n), log 2 ( n ) \log_2(n) log2(n) rounds totally.
- Each round, every survivor use the same budget. Only half of them survive in each round.
-
Assume v 1 > v 2 ≥ . . . ≥ v n , Δ i = v 1 − v i v_1>v_2\ge ...\ge v_n,\Delta_i=v_1-v_i v1>v2≥...≥vn,Δi=v1−vi. With probability 1 − δ 1-\delta 1−δ, the algorithm finds the best arm with B = O ( max i > 1 i Δ i 2 log n log ( log n / δ ) ) B=O(\max_{i>1}\frac{i}{\Delta_i^2}\log n\log(\log n/\delta)) B=O(maxi>1Δi2ilognlog(logn/δ)) assuming v i ∈ [ 0 , 1 ] v_i\in [0,1] vi∈[0,1]
- Proof. Concentration inequaility, probability that 1 3 \frac{1}{3} 31 of 3 4 N r \frac{3}{4}N_r 43Nr smaller ones are greater than best one is bounded, union bound for all round.
-
Hyperparameter tuning
- B ≥ 2 log 2 ( n ) ( n + ∑ i = 2... n γ ˉ − 1 ( v i − v 1 2 ) ) B\ge 2\log_2(n)\left(n+\sum_{i=2...n}\bar{\gamma}^{-1}(\frac{v_i-v_1}{2})\right) B≥2log2(n)(n+∑i=2...nγˉ−1(2vi−v1))
Neural Architecture Search
- Reinforcement learning
- ProxylessNAS: each architecture has weight α i \alpha_i αi to be chosen. Choose a binary variable, that is only one architecture exists, and the probability is about α i \alpha_i αi. ∂ L ∂ α i ≈ ∂ L ∂ g i ∂ p i ∂ α i \frac{\partial L}{\partial \alpha_i}\approx\frac{\partial L}{\partial g_i}\frac{\partial p_i}{\partial \alpha_i} ∂αi∂L≈∂gi∂L∂αi∂pi, g i = 0 / 1 g_i=0/1 gi=0/1, ∂ L / ∂ g j \partial L/\partial g_j ∂L/∂gj means the influence of architecture j j j to L L L.
Machine Learning Augmented
Differencial Privacy
-
Randomness is essential.
- If we have a non-trivial deterministic algorithm, we can change data base A A A to B B B one by one until their output is the same. There exists a pair of databases differ by one row, then we know about that row.
-
Database: histogram N ∣ X ∣ N^{|X|} N∣X∣ for discrete set X X X(the categories).
-
M M M is ( ϵ , δ ) (\epsilon,\delta ) (ϵ,δ)-differentially privacy if ∀ x , y ∈ N ∣ X ∣ , ∣ x − y ∣ 1 ≤ 1 , S ⊆ M ( N ∣ X ∣ ) \forall x,y\in N^{|X|},|x-y|_1\le 1,S\subseteq M(N^{|X|}) ∀x,y∈N∣X∣,∣x−y∣1≤1,S⊆M(N∣X∣)
P ( M ( x ) ∈ S ) ≤ e ϵ P ( M ( y ) ∈ S ) + δ P(M(x)\in S)\le e^{\epsilon}P(M(y)\in S)+\delta P(M(x)∈S)≤eϵP(M(y)∈S)+δ -
Differencial privacy is immune to post-processing. Proof:
- For deterministic function: Proved easily by inverse function
- Random function is convex combination of deterministic function, that is, each deterministic function is chosen with some probability.
-
With ( ϵ , 0 ) (\epsilon,0) (ϵ,0)-differential privacy(DP mechanism), the voting result will not change too much by changing one’s vote.
E a ∼ f ( M ( x ) ) [ u i ( a ) ] = ∑ a ∈ A u i ( a ) Pr f ( M ( x ) ) [ a ] ≤ ∑ a ∈ A u i ( a ) exp ( ϵ ) Pr f ( M ( y ) ) [ a ] = exp ( ϵ ) E a ∼ f ( M ( y ) ) [ u i ( a ) ] \begin{align*} E_{a\sim f(M(x))}[u_i(a)]&=\sum_{a\in A} u_i(a)\Pr_{f(M(x))}[a]\\&\le \sum_{a\in A}u_i(a)\exp(\epsilon)\Pr_{f(M(y))}[a]\\&=\exp(\epsilon)E_{a\sim f(M(y))}[u_i(a)] \end{align*} Ea∼f(M(x))[ui(a)]=a∈A∑ui(a)f(M(x))Pr[a]≤a∈A∑ui(a)exp(ϵ)f(M(y))Pr[a]=exp(ϵ)Ea∼f(M(y))[ui(a)]
f f f is a map to the feature, and we only consider the expected voting utility u i ( a ) u_i(a) ui(a) about the feature a a a. -
Laplace Mechanism: reach ( ϵ , 0 ) (\epsilon,0) (ϵ,0)-DP by just adding a random gaussian noise to f ( x ) , x ∈ N ∣ X ∣ f(x),x\in N^{|X|} f(x),x∈N∣X∣
-
Assume f : N ∣ X ∣ → R k f:N^{|X|}\to \R^{k} f:N∣X∣→Rk
-
Definition: M L ( x , f , ϵ ) = f ( x ) + ( Y 1 , . . , Y k ) M_L(x,f,\epsilon)=f(x)+(Y_1,..,Y_k) ML(x,f,ϵ)=f(x)+(Y1,..,Yk)
- Y i Y_i Yi is iid random variable from L a p ( Δ f ϵ ) Lap(\frac{\Delta f }{\epsilon}) Lap(ϵΔf)
- l 1 l_1 l1 sensitive of f f f is Δ f = max x , y , ∣ x − y ∣ 1 ≤ 1 ∣ f ( x ) − f ( y ) ∣ 1 \Delta f=\max_{x,y,|x-y|_1\le 1}|f(x)-f(y)|_1 Δf=maxx,y,∣x−y∣1≤1∣f(x)−f(y)∣1. How sensitive f f f is by changing one entry a little.
- Probability density: L a p ( b ) = 1 2 b exp ( − ∣ x ∣ b ) Lap(b)=\frac{1}{2b}\exp(-\frac{|x|}{b}) Lap(b)=2b1exp(−b∣x∣). Variance σ 2 = 2 b 2 \sigma^2=2b^2 σ2=2b2.
-
Proof. The probability density of M L ( x , f , ϵ ) M_L(x,f,\epsilon) ML(x,f,ϵ) is p x ( z ) = ∏ i = 1 k exp ( − ϵ ∣ f ( x ) i − z i ∣ Δ f ) p_x(z)=\prod_{i=1}^k\exp\left(-\frac{\epsilon|f(x)_i-z_i|}{\Delta f}\right) px(z)=∏i=1kexp(−Δfϵ∣f(x)i−zi∣). Easily prove p x ( z ) / p y ( z ) ≤ exp ( ϵ ) p_x(z)/p_y(z)\le \exp(\epsilon) px(z)/py(z)≤exp(ϵ)
-
Accuracy: for δ ∈ ( 0 , 1 ] \delta\in(0,1] δ∈(0,1],
Pr [ ∣ f ( x ) − M L ( x , f , ϵ ) ∣ ∞ ≥ ln ( k δ ) ⋅ ( Δ f ϵ ) ] ≤ δ \Pr\left[|f(x)-M_L(x,f,\epsilon)|_\infty\ge \ln\left(\frac{k}{\delta}\right)\cdot \left(\frac{\Delta f}{\epsilon}\right)\right]\le \delta Pr[∣f(x)−ML(x,f,ϵ)∣∞≥ln(δk)⋅(ϵΔf)]≤δ- Directly prove by Pr [ ∣ Y ∣ ≥ t ⋅ b ] = exp ( − t ) \Pr[|Y|\ge t\cdot b]=\exp(-t) Pr[∣Y∣≥t⋅b]=exp(−t) for Y ∼ L a p ( b ) Y\sim Lap(b) Y∼Lap(b)
-
Big data
- Idea: data distribution rarely changes.
- Replace B-tree with a model. Know err_min err_max because you only care about the data in your database
- advantage: less storage, faster lookups, more parallelism, hardware accelators
- use linear model(fast)
- Do exponential search since the prediction is good enough
- Bloom Filter: is the key in my set=>maybe yes/no
Low Rank Apporximation
- A ∈ R n × m A\in \mathbb{R}^{n\times m} A∈Rn×m, [ A ] k = arg min r a n k ( A ′ ) ≤ k ∣ A − A ′ ∣ F [A]_k=\arg \min_{rank(A')\le k} |A-A'|_F [A]k=argminrank(A′)≤k∣A−A′∣F
- Learn S ∈ R p × m S\in \mathbb{R}^{p\times m} S∈Rp×m and do low rank decomposition for S A SA SA, then do some recover.
Recified flow
- Given empirical observations about distribution π 0 \pi_0 π0 and π 1 \pi_1 π1. Find a transport map T T T that T ( Z ) ∼ π 1 T(Z)\sim \pi_1 T(Z)∼π1 for Z ∼ π 0 Z\sim \pi_0 Z∼π0.
- E.g. π 0 \pi_0 π0 is gaussian noise, π 1 \pi_1 π1 are data containing pictures. We want to find a map from noise to picture(diffusion) that has some features.
- If T ( X 0 ) = X 1 T(X_0)=X_1 T(X0)=X1, then we want v ( t X 0 + ( 1 − t ) X 1 , t ) = d X d t = X 1 − X 0 v(tX_0+(1-t)X_1,t)=\frac{dX}{dt}=X_1-X_0 v(tX0+(1−t)X1,t)=dtdX=X1−X0. The transformation is linear. So that one cluster will map to another clusters.
- Minimize the loss E [ ∥ ( X 1 − X 0 ) − v ( X t , t ) ∥ 2 ] \mathbb{E}[\|(X_1-X_0)-v(X_t,t)\|^2] E[∥(X1−X0)−v(Xt,t)∥2]
- Algorithm:
- randomly connect π 0 , π 1 \pi_0,\pi_1 π0,π1
- minimize loss of $\theta $ for v θ v_\theta vθ
- connect π 0 , π 1 \pi_0,\pi_1 π0,π1 by v θ v_\theta vθ again, and repeat minimization.
Rope Attention
- Embed position to q m , k n q_m,k_n qm,kn for attention by rotation in complex field
- β \beta β-base system: focus on different accuracy level
- use inner product of q m , k n q_m,k_n qm,kn to represent their similarity
- Softmax attention Attention ( Q , K , V ) = softmax ( Q K ⊤ ) V \text{Attention}(Q,K,V)=\text{softmax}(QK^\top)V Attention(Q,K,V)=softmax(QK⊤)V
- Faster computation
- linear
- softmax separately
- design Q K ⊤ QK^\top QK⊤
这篇关于【Machine Learning】Other Stuff的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!