Forests of randomized trees

2023-10-18 13:59
This article introduces forests of randomized trees.

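Like other classifiers, forest classifiers have to be fitted with two arrays: an array X of shape [n_samples, n_features] holding the training samples, and an array Y of shape [n_samples] holding the target values of the training samples.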
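
As a minimal sketch of that fitting interface, the toy data below (two samples with two features each) is made up purely for illustration; RandomForestClassifier, fit and predict are the regular scikit-learn API.

```python
from sklearn.ensemble import RandomForestClassifier

# X has shape [n_samples, n_features]: two samples, two features each
X = [[0, 0], [1, 1]]
# Y has shape [n_samples]: one target value per training sample
Y = [0, 1]

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf = clf.fit(X, Y)

# Predict the class of a new, unseen sample
prediction = clf.predict([[0.9, 0.9]])
print(prediction)
```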

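In extremely randomized trees (ExtraTreesClassifier and ExtraTreesRegressor), randomness goes one step further in the way splits are computed. As in random forests, a random subset of candidate features is used, but instead of searching for the most discriminative threshold, a threshold is drawn at random for each candidate feature, and the best of these randomly generated thresholds is picked as the splitting rule. This usually reduces the variance of the model a bit more, at the expense of a slightly larger bias.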
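
The split rule described above can be sketched in a few lines of NumPy. This is an illustrative toy under stated assumptions, not scikit-learn's implementation: the helper names (gini, weighted_gini, random_threshold_split), the use of Gini impurity as the criterion, and the synthetic data are all hypothetical choices made for this example.

```python
import numpy as np


def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def weighted_gini(x, y, threshold):
    """Impurity after splitting on x <= threshold, weighted by child size."""
    left = x <= threshold
    n_left = left.sum()
    n_right = len(y) - n_left
    if n_left == 0 or n_right == 0:
        return gini(y)  # degenerate split: nothing is separated
    return (n_left * gini(y[left]) + n_right * gini(y[~left])) / len(y)


def random_threshold_split(X, y, n_candidates, rng):
    """Draw one random threshold per candidate feature and keep the best."""
    features = rng.choice(X.shape[1], size=n_candidates, replace=False)
    best = None
    for f in features:
        # The threshold is drawn uniformly in the feature's range, not optimized
        t = rng.uniform(X[:, f].min(), X[:, f].max())
        score = weighted_gini(X[:, f], y, t)
        if best is None or score < best[2]:
            best = (int(f), float(t), float(score))
    return best


rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 4))
y = (X[:, 2] > 0.5).astype(int)  # only feature 2 is informative

feature, threshold, impurity = random_threshold_split(X, y, n_candidates=2, rng=rng)
print(feature, threshold, impurity)
```

A random forest would instead scan all possible thresholds of each candidate feature; skipping that search is what makes the extra-trees split cheaper and more random.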

import numpy as np
import matplotlib.pyplot as plt

from sklearn import clone
from sklearn.datasets import load_iris
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              AdaBoostClassifier)
from sklearn.tree import DecisionTreeClassifier

# Parameters
n_classes = 3
n_estimators = 30
plot_colors = "ryb"
# The colormap value is missing in the source; a red-yellow-blue map is
# assumed here because it matches plot_colors
cmap = plt.cm.RdYlBu
plot_step = 0.02  # fine step width for decision surface contours
plot_step_coarser = 0.5  # step widths for coarse classifier guesses
RANDOM_SEED = 13  # fix the seed on each iteration

# Load data
iris = load_iris()

plot_idx = 1

models = [DecisionTreeClassifier(max_depth=None),
          RandomForestClassifier(n_estimators=n_estimators),
          ExtraTreesClassifier(n_estimators=n_estimators),
          AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                             n_estimators=n_estimators)]

for pair in ([0, 1], [0, 2], [2, 3]):
    for model in models:
        # We only take the two corresponding features
        X = iris.data[:, pair]
        y = iris.target

        # Shuffle
        idx = np.arange(X.shape[0])
        np.random.seed(RANDOM_SEED)
        np.random.shuffle(idx)
        X = X[idx]
        y = y[idx]

        # Standardize
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        X = (X - mean) / std

        # Train a fresh, unfitted copy of the model on this feature pair
        model = clone(model)
        model.fit(X, y)
        scores = model.score(X, y)

        # Create a title for each column and the console by using str() and
        # slicing away useless parts of the string
        model_title = str(type(model)).split(".")[-1][:-2][:-len("Classifier")]
        model_details = model_title
        if hasattr(model, "estimators_"):
            model_details += " with {} estimators".format(len(model.estimators_))
        print(model_details + " with features", pair, "has a score of", scores)

        plt.subplot(3, 4, plot_idx)
        if plot_idx <= len(models):
            # Add a title at the top of each column
            plt.title(model_title)

        # Now plot the decision boundary using a fine mesh as input to a
        # filled contour plot
        x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
        y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                             np.arange(y_min, y_max, plot_step))

        # Plot either a single DecisionTreeClassifier or alpha blend the
        # decision surfaces of the ensemble of classifiers
        if isinstance(model, DecisionTreeClassifier):
            Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
            Z = Z.reshape(xx.shape)
            cs = plt.contourf(xx, yy, Z, cmap=cmap)
        else:
            # Choose alpha blend level with respect to the number of estimators
            # that are in use (noting that AdaBoost can use fewer estimators
            # than its maximum if it achieves a good enough fit early on)
            estimator_alpha = 1.0 / len(model.estimators_)
            for tree in model.estimators_:
                Z = tree.predict(np.c_[xx.ravel(), yy.ravel()])
                Z = Z.reshape(xx.shape)
                cs = plt.contourf(xx, yy, Z, alpha=estimator_alpha, cmap=cmap)

        # Build a coarser grid to plot a set of ensemble classifications
        # to show how these are different to what we see in the decision
        # surfaces. These points are regularly spaced and do not have a
        # black outline
        xx_coarser, yy_coarser = np.meshgrid(
            np.arange(x_min, x_max, plot_step_coarser),
            np.arange(y_min, y_max, plot_step_coarser))
        Z_points_coarser = model.predict(
            np.c_[xx_coarser.ravel(), yy_coarser.ravel()]
        ).reshape(xx_coarser.shape)
        cs_points = plt.scatter(xx_coarser, yy_coarser, s=15,
                                c=Z_points_coarser, cmap=cmap,
                                edgecolors="none")

        # Plot the training points, these are clustered together and have a
        # black outline
        for i, c in zip(range(n_classes), plot_colors):
            idx = np.where(y == i)[0]
            plt.scatter(X[idx, 0], X[idx, 1], c=c, edgecolors="k",
                        label=iris.target_names[i])

        plot_idx += 1  # move on to the next plot in sequence

plt.suptitle("Classifiers on feature subsets of the Iris dataset")
plt.show()


ExtraTreesClassifier()  # 0.95 score
RandomForestClassifier()  # 0.94 score
AdaBoost(DecisionTree(max_depth=3))  # 0.94 score
DecisionTree(max_depth=None)  # 0.94 score
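
The listing above gives scores without saying how they were obtained. One plausible way to compute comparable numbers is cross-validation; the sketch below assumes 10-fold cross-validation on all four Iris features with 30 estimators, which is an assumption of this example rather than something the article states, so the printed values need not match the listing.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

models = {
    "ExtraTreesClassifier": ExtraTreesClassifier(n_estimators=30, random_state=0),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=30, random_state=0),
    "AdaBoost(DecisionTree(max_depth=3))": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=3), n_estimators=30, random_state=0),
    "DecisionTree(max_depth=None)": DecisionTreeClassifier(max_depth=None, random_state=0),
}

results = {}
for name, model in models.items():
    # Assumed protocol: 10-fold cross-validation on all four features
    scores = cross_val_score(model, iris.data, iris.target, cv=10)
    results[name] = (scores.mean(), scores.std())
    print("%-40s mean=%.3f std=%.3f" % (name, scores.mean(), scores.std()))
```

Reporting the standard deviation next to the mean makes it possible to see whether a change such as a different max_depth affects the spread of the scores rather than their average.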

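Increasing max_depth for AdaBoost lowers the standard deviation of the scores, but the average score does not improve.
1. Vary max_depth for the DecisionTreeClassifier and the AdaBoostClassifier, for example try max_depth=3 for the DecisionTreeClassifier or max_depth=None for the AdaBoostClassifier.
It is worth noting that in RandomForests and ExtraTrees each tree is built independently of the others, so they can be fitted in parallel on multiple cores. AdaBoost builds its estimators sequentially and therefore cannot use multiple cores.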
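
A minimal sketch of parallel fitting through the n_jobs parameter of RandomForestClassifier follows. The synthetic dataset and the choice of one versus two jobs are assumptions made for illustration, and no particular speed-up is implied: on small data the overhead of parallelism can outweigh the gain. AdaBoostClassifier has no n_jobs parameter.

```python
from time import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to have something to fit
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

timings = {}
for n_jobs in (1, 2):
    forest = RandomForestClassifier(n_estimators=100, n_jobs=n_jobs,
                                    random_state=0)
    t0 = time()
    forest.fit(X, y)  # trees are independent, so they can be built in parallel
    timings[n_jobs] = time() - t0
    print("n_jobs=%d: fitted %d trees in %.3fs"
          % (n_jobs, len(forest.estimators_), timings[n_jobs]))
```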

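The main parameters to adjust are n_estimators and max_features. The former is the number of trees in the forest: the larger the better, but also the longer it takes to compute. In addition, results stop getting noticeably better beyond a critical number of trees. The latter is the size of the random subset of features to consider when splitting a node: the lower the value, the greater the reduction in variance, but also the greater the increase in bias. Empirically, good values are max_features=n_features for regression problems and max_features=sqrt(n_features) for classification problems, where n_features is the number of features in the data.

Good results are often achieved by setting max_depth=None in combination with min_samples_split=2, that is, by fully developing the trees. Keep in mind that these values are usually not optimal and may produce models that consume a lot of memory. The best parameter values should always be cross-validated. Also note that random forests use bootstrap samples by default (bootstrap=True), while extra-trees use the whole dataset by default (bootstrap=False). When bootstrap sampling is used, the generalization error can be estimated on the left-out (out-of-bag) samples by setting oob_score=True.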
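
The parameters discussed above can be tried out as follows. This is a sketch under stated assumptions: the Iris dataset, the 100 trees, and the small max_features grid are arbitrary choices for illustration. It shows the out-of-bag estimate obtained with oob_score=True and a cross-validated search over max_features with GridSearchCV.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Out-of-bag estimate: requires bootstrap sampling (the random forest default)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, oob_score=True,
                                random_state=0)
forest.fit(X, y)
print("OOB accuracy estimate:", forest.oob_score_)

# Cross-validate max_features instead of trusting a rule of thumb
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid={"max_features": [1, 2, "sqrt", None]},
    cv=5,
)
search.fit(X, y)
print("Best max_features:", search.best_params_["max_features"])
```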


from time import time

import matplotlib.pyplot as plt

from sklearn.datasets import fetch_olivetti_faces
from sklearn.ensemble import ExtraTreesClassifier

# Number of cores to use to perform parallel fitting of the forest model
n_jobs = 1

# Load the faces dataset
data = fetch_olivetti_faces()
X = data.images.reshape((len(data.images), -1))
y = data.target

mask = y < 5  # Limit to 5 classes
X = X[mask]
y = y[mask]

# Build a forest and compute the pixel importances
print("Fitting ExtraTreesClassifier on faces data with %d cores..." % n_jobs)
t0 = time()
forest = ExtraTreesClassifier(n_estimators=1000,
                              max_features=128,
                              n_jobs=n_jobs,
                              random_state=0)
forest.fit(X, y)
print("done in %0.3fs" % (time() - t0))
importances = forest.feature_importances_
importances = importances.reshape(data.images[0].shape)

# Plot pixel importances (the plotting call is missing in the source;
# a matshow heat map is assumed here)
plt.matshow(importances, cmap=plt.cm.hot)
plt.title("Pixel importances with forests of trees")
plt.show()

6. Totally Random Trees Embedding
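RandomTreesEmbedding implements an unsupervised transformation of the data. Using a forest of completely random trees, RandomTreesEmbedding encodes each data point by the indices of the leaves it ends up in. This index is then encoded in a one-of-K manner, leading to a high-dimensional, sparse binary coding. This coding can be computed very efficiently and can then be used as a basis for other learning tasks. The size and sparsity of the code can be influenced by choosing the number of trees and the maximum depth per tree. For each tree in the ensemble, the coding contains exactly one entry set to one. The size of the coding is at most n_estimators * 2 ** max_depth, the maximum number of leaves in the forest.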
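
The size and sparsity properties described above can be checked on a small example. The random input data and the values n_estimators=5 and max_depth=2 are assumptions chosen for illustration; with them the coding can have at most 5 * 2 ** 2 = 20 columns.

```python
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding

rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 2))  # arbitrary 2-D points

n_estimators, max_depth = 5, 2
embedder = RandomTreesEmbedding(n_estimators=n_estimators,
                                max_depth=max_depth, random_state=0)
X_sparse = embedder.fit_transform(X)  # sparse one-of-K leaf coding

# Each sample activates exactly one leaf per tree
ones_per_row = np.asarray(X_sparse.sum(axis=1)).ravel()

print("coding shape:", X_sparse.shape)
print("upper bound on columns:", n_estimators * 2 ** max_depth)
print("ones per sample:", np.unique(ones_per_row))
```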

example: Hashing feature transformation using Totally Random Trees
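In high-dimensional spaces, linear classifiers often achieve excellent accuracy. For sparse binary data, BernoulliNB is particularly well suited. The bottom row of the figure compares the decision boundary obtained by BernoulliNB in the transformed space with that of an ExtraTreesClassifier forest learned on the original data.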

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_circles
from sklearn.ensemble import RandomTreesEmbedding, ExtraTreesClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.naive_bayes import BernoulliNB

# make a synthetic dataset
X, y = make_circles(factor=0.5, random_state=0, noise=0.05)

# use RandomTreesEmbedding to transform data
hasher = RandomTreesEmbedding(n_estimators=10, random_state=0, max_depth=3)
X_transformed = hasher.fit_transform(X)

# Visualize result after dimensionality reduction using truncated SVD
svd = TruncatedSVD(n_components=2)
X_reduced = svd.fit_transform(X_transformed)

# Learn a Naive Bayes classifier on the transformed data
nb = BernoulliNB()
nb.fit(X_transformed, y)

# Learn an ExtraTreesClassifier for comparison
trees = ExtraTreesClassifier(max_depth=3, n_estimators=10, random_state=0)
trees.fit(X, y)

# scatter plot of original and reduced data
fig = plt.figure(figsize=(9, 8))

ax = plt.subplot(221)
ax.scatter(X[:, 0], X[:, 1], c=y, s=50)
ax.set_title("Original Data (2d)")
ax.set_yticks(())

ax = plt.subplot(222)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, s=50)
ax.set_title("Truncated SVD reduction (2d) of transformed data (%dd)" %
             X_transformed.shape[1])
ax.set_yticks(())

# Plot the decision in original space. For that, we will assign a color to each
# point in the mesh [x_min, x_max] x [y_min, y_max].
h = .01
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# transform grid using RandomTreesEmbedding
transformed_grid = hasher.transform(np.c_[xx.ravel(), yy.ravel()])
y_grid_pred = nb.predict_proba(transformed_grid)[:, 1]

ax = plt.subplot(223)
ax.set_title("Naive Bayes on Transformed data")
ax.pcolormesh(xx, yy, y_grid_pred.reshape(xx.shape))
ax.scatter(X[:, 0], X[:, 1], c=y, s=50)
ax.set_ylim(-1.4, 1.4)
ax.set_xlim(-1.4, 1.4)
ax.set_yticks(())

# predict on the untransformed grid using ExtraTreesClassifier
y_grid_pred = trees.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

ax = plt.subplot(224)
ax.set_title("ExtraTrees predictions")
ax.pcolormesh(xx, yy, y_grid_pred.reshape(xx.shape))
ax.scatter(X[:, 0], X[:, 1], c=y, s=50)
ax.set_ylim(-1.4, 1.4)
ax.set_xlim(-1.4, 1.4)

plt.show()
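
The transform-then-classify combination used in the example can also be packaged as a single estimator. The sketch below is one possible arrangement, not part of the original example: chaining the two steps with make_pipeline and holding out a test set with train_test_split are choices made here for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

X, y = make_circles(factor=0.5, random_state=0, noise=0.05)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The embedding produces sparse binary features; BernoulliNB consumes them
model = make_pipeline(
    RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0),
    BernoulliNB(),
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("held-out accuracy:", accuracy)
```

Wrapping both steps in a pipeline ensures new data goes through the same fitted embedding before it reaches the classifier.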



