This article walks through the pmf-automl source code; hopefully it offers a useful reference for developers working with the project.
- arXiv paper (includes the appendix, but in small print)
- Probabilistic Matrix Factorization for Automated Machine Learning - NIPS 2018 paper (larger print, but no appendix)
- Probabilistic Matrix Factorization for Automated Machine Learning - code: https://github.com/rsheth80/pmf-automl
Table of contents
- A first look at the project files
- Training the PMF model
- Data splitting
- Initial latent variables
- Model definition and training
- Defining the D Gaussian processes
- Solving the covariance matrix of the posterior distribution
- The transform_forward and transform_backward functions
- The top-level design of the get_cov function
- The RBF kernel
- The White kernel
- Recap of computing the covariance matrix
- The meaning of the GP forward function's return values
A first look at the project files
Open all_normalized_accuracy_with_pipelineID.csv in JupyterLab. From the repo README:

> all_normalized_accuracy_with_pipelineID.zip contains the performance observations from running 42K pipelines on 553 OpenML datasets. The task was classification and the performance metric was balanced accuracy. Unzip prior to running code.
Rows correspond to pipeline IDs, columns to dataset IDs, and each element is a balanced accuracy.
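A minimal sketch for inspecting the performance matrix (it assumes pandas is installed, the zip has been extracted next to the notebook, and that the pipeline ID is the first column; that last detail is an assumption):

```python
import pandas as pd

# Load the unzipped performance matrix; rows = pipelines, columns = datasets.
df = pd.read_csv('all_normalized_accuracy_with_pipelineID.csv', index_col=0)
print(df.shape)         # roughly (42000, 553): pipelines x datasets
print(df.iloc[:3, :3])  # balanced accuracies; NaN where a run is missing
```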
A quick look at pipelines.json shows essentially only two kinds of preprocessor: pca and polynomial.
Training the PMF model
Data splitting
```python
Ytrain, Ytest, Ftrain, Ftest = get_data()

>>> Ytrain.shape
Out[2]: (42000, 464)
>>> Ytest.shape
Out[3]: (42000, 89)
>>> Ftrain.shape
Out[4]: (464, 46)
>>> Ftest.shape
Out[5]: (89, 46)
```
The train/test split is over datasets (columns), not pipelines: 89 datasets form the test set and 464 the training set. Judging from the shapes, Ftrain and Ftest hold the 46 per-dataset (meta-)features for the training and test datasets, respectively.
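get_data() lives in the repo; conceptually it splits the 553 dataset columns into train and test. Here is a minimal illustrative sketch of such a column split (not the repo's actual implementation; the stand-in matrix, the seed, and the shuffling are all assumptions):

```python
import numpy as np

rng = np.random.RandomState(0)       # hypothetical seed
Y = rng.rand(42000, 553)             # stand-in for the real performance matrix

perm = rng.permutation(Y.shape[1])   # shuffle the 553 dataset columns
test_ids, train_ids = perm[:89], perm[89:]

Ytrain, Ytest = Y[:, train_ids], Y[:, test_ids]
print(Ytrain.shape, Ytest.shape)     # (42000, 464) (42000, 89)
```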
Initial latent variables
```python
imp = sklearn.impute.SimpleImputer(missing_values=np.nan, strategy='mean')
X = sklearn.decomposition.PCA(Q).fit_transform(imp.fit(Ytrain).transform(Ytrain))

>>> X.shape
Out[7]: (42000, 20)
```
As I currently understand it, the whole training process fits the latent variables X through the GPs, and these latents are initialized with PCA.
Missing values in the training set are mean-imputed, and PCA then reduces each of the 42K pipelines from its 464 training-dataset performance values to a 20-dimensional latent vector.
From the paper: the elements of $Y$ are given as a nonlinear function of the latent variables, $y_{n,d} = f_d(x_n) + \epsilon$, where $\epsilon$ is independent Gaussian noise.
Here $Y$ refers to the entire $42000 \times 464$ matrix, so $X$ is the latent embedding of the pipeline space; with latent dimension $Q = 20$, $X$ has shape $42000 \times 20$.
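To make the generative assumption concrete, here is a toy sketch (illustrative only: tiny sizes, numpy instead of the repo's torch code, and a fixed unit lengthscale) that draws one dataset's column as a GP function of the shared latents plus noise:

```python
import numpy as np

rng = np.random.RandomState(0)
N, Q = 6, 2                                    # toy sizes (real: 42000, 20)
X = rng.randn(N, Q)                            # shared pipeline latents

# RBF kernel over the latents: k(x, x') = exp(-0.5 * ||x - x'||^2),
# with a small jitter on the diagonal for numerical stability.
sqdist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sqdist) + 1e-8 * np.eye(N)

f_d = rng.multivariate_normal(np.zeros(N), K)  # latent function for dataset d
y_d = f_d + 0.1 * rng.randn(N)                 # y_{n,d} = f_d(x_n) + eps
```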
Model definition and training
The top-level definition of the model:
```python
kernel = kernels.Add(kernels.RBF(Q, lengthscale=None), kernels.White(Q))
m = gplvm.GPLVM(Q, X, Ytrain, kernel, N_max=N_max, D_max=batch_size)
optimizer = torch.optim.SGD(m.parameters(), lr=lr)
m = train(m, optimizer, f_callback=f_callback, f_stop=f_stop)
```
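Assuming the repo's kernels module follows the standard GP definitions (an assumption; kernels.py is not quoted here), Add(RBF, White) should evaluate to the RBF Gram matrix plus an i.i.d. noise term on the diagonal. A numpy sketch of what that composite kernel computes:

```python
import numpy as np

def rbf(X, lengthscale=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 * lengthscale^2))
    sqdist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sqdist / lengthscale ** 2)

def white(X, variance=1e-2):
    # i.i.d. noise: contributes only to the diagonal
    return variance * np.eye(X.shape[0])

X_toy = np.random.randn(4, 2)
K = rbf(X_toy) + white(X_toy)  # what kernels.Add(RBF, White) evaluates to
```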
`f_callback` and `f_stop` are both local functions:
```python
def f_callback(m, v, it, t):
    varn_list.append(transform_forward(m.variance).item())
    logpr_list.append(m().item() / m.D)
    if it == 1:
        t_list.append(t)
    else:
        t_list.append(t_list[-1] + t)
    if save_checkpoint and not (it % checkpoint_period):
        torch.save(m.state_dict(), fn_checkpoint + '_it%d.pt' % it)
    print('it=%d, f=%g, varn=%g, t: %g'
          % (it, logpr_list[-1], transform_forward(m.variance), t_list[-1]))

def f_stop(m, v, it, t):
    if it >= maxiter - 1:
        print('maxiter (%d) reached' % maxiter)
        return True
    return False
```
Now look at the training function `train`. Note that the model's forward pass `m()` returns the negative log-likelihood, which is why the loop backpropagates `nll` directly:
```python
def train(m, optimizer, f_callback=None, f_stop=None):
    it = 0
    while True:
        try:
            t = time.time()
            optimizer.zero_grad()
            nll = m()
            nll.backward()
            optimizer.step()
            it += 1
            t = time.time() - t
            if f_callback is not None:
                f_callback(m, nll, it, t)
            # f_stop should not be a substantial portion of total iteration time
            if f_stop is not None and f_stop(m, nll, it, t):
                break
        except KeyboardInterrupt:
            break
    return m
```
Equation (5) of the paper, the per-dataset GP negative log marginal likelihood over the $N_d$ observed entries $\mathbf{y}_{c(d)}$ of dataset $d$:

$$\mathrm{NLL}_d = \frac{1}{2}\left(N_d \log(2\pi) + \log\lvert \mathbf{C}_d \rvert + \mathbf{y}_{c(d)}^\top \mathbf{C}_d^{-1} \mathbf{y}_{c(d)}\right)$$
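As a sanity check, equation (5) can be transcribed directly. A sketch (my own, not the repo's code), where C is the kernel matrix over the latents of the pipelines observed on dataset d, including the White-noise diagonal:

```python
import numpy as np

def gp_nll(y_obs, C):
    """Eq. (5): negative log marginal likelihood of one dataset's
    N_d observed entries, with covariance C = K + sigma_n^2 * I."""
    N_d = y_obs.shape[0]
    L = np.linalg.cholesky(C)                  # C must be positive definite
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))  # C^{-1} y
    logdet = 2.0 * np.log(np.diag(L)).sum()    # log|C| via the Cholesky factor
    return 0.5 * (N_d * np.log(2.0 * np.pi) + logdet + y_obs @ alpha)
```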
That concludes this walkthrough of the pmf-automl source code; hopefully it is helpful.