First, a picture of an actual forest might come to your mind when you hear the words "Random Forest". If so, you thought just like me, and there is nothing wrong with that: from one perspective, the random forest model really does work like a forest. Just as a collection of trees makes up a forest, an ensemble of decision trees makes up a Random Forest.
Below is a quick bullet-point introduction to Random Forest.
- Random Forest is one of the main ensemble techniques.
- It is one of the many supervised learning algorithms.
- We can use this technique for both regression and classification problems.
It is advisable to have some knowledge of decision trees and ensemble techniques before learning Random Forest. I want this article to stay focused on Random Forest itself.
Here is what you will learn through this article:
a) What is the concept behind the Bootstrapping and Aggregation techniques?
b) How does Random Forest actually work?
c) How do you build a Random Forest with the Python scikit-learn library?
Bootstrapping and Aggregation
The Random Forest algorithm mainly relies on the techniques of Bootstrapping and Aggregation. Let's understand these techniques in a fun way, with a small story.
Imagine you want to find the answers to two math questions, and you don't know the actual formulas to solve them. One is a 'yes' or 'no' type question, and the answer to the other is a continuous number. At that moment, five of your friends (all pretty good at math) come over and ask (and the conversation goes on like this):
Hey, can we go somewhere outside together?
You: No, first I have to solve these math problems.
Oh, can you show the questions to all of us?
Instead of showing the questions in full to all your friends, you randomly split them into five splits; each split carries some information about the questions that may or may not be present in another split. In this way, you just want to challenge their abilities.
Your five friends each start solving the questions in their own way, so you end up with five answers to both questions. That means you get five different continuous-number answers for the first question and five 'yes' or 'no' answers for the other.
After getting these answers, you need to reach one final answer for each question. Now the task is yours: you take the mean of the five numbers and fix that as the answer to the first question, and you take the mode of the categorical values and put it under the second one.
Hurray!
You have just done bootstrapping and aggregation. Yes, let me explain which is which. First, splitting the questions in a random way, with replacement, is the bootstrap technique. Next, aggregating the five answers into one is the aggregation technique.
Bootstrapping is simply a statistical resampling technique that involves randomly sampling data from the dataset with replacement (which means the same data point can be picked many times).
Aggregation means combining the final results from the decision trees, either as the average of the results or as the most-voted one.
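Here is a minimal NumPy sketch of both ideas; the toy records and the five "answers" are invented for illustration:

import numpy as np

rng = np.random.default_rng(42)
records = np.arange(10)  # a toy dataset of 10 records

# Bootstrapping: sample with replacement, so the same record can appear more than once
bootstrap_sample = rng.choice(records, size=len(records), replace=True)
print(bootstrap_sample)

# Aggregation for a continuous answer: take the mean
numeric_answers = np.array([4.2, 3.9, 4.1, 4.4, 4.0])  # five friends' numeric answers
print(numeric_answers.mean())

# Aggregation for a 'yes'/'no' answer: take the mode (majority vote)
votes = np.array(["yes", "no", "yes", "yes", "no"])
values, counts = np.unique(votes, return_counts=True)
print(values[counts.argmax()])  # -> 'yes'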
Random Forest Algorithm
It is one of the most powerful machine learning algorithms. When you have to reach a conclusion about an incident that did not happen in your presence, the words of many people are always better than the words of a single person. That is exactly what we are doing with the Random Forest algorithm.
Let's see how it actually works, for a better understanding. Suppose you have a dataset consisting of 1000 records (the rows) and 100 features (the columns). We pick some number n of records and some number m of features from the dataset and give them to a decision tree. We then pick another random subset and give it to another decision tree, and we follow this procedure until every decision tree has its own subset. The number of decision trees is specified by the practitioner when building the model. Each decision tree processes the data it has been given and produces its own output.
Finally, an aggregation step combines these outputs into the final result, as in the sketch below. An ideal model should have low bias and low variance. A single decision tree may give results with low bias but high variance; to mitigate this high-variance problem, Random Forest lends us a helping hand.
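To make the procedure concrete, here is a hedged from-scratch sketch built out of single scikit-learn decision trees. The generated dataset stands in for the 1000 x 100 example above, and the tree and feature counts are illustrative choices of mine; in practice you would simply use RandomForestClassifier, shown later.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=100, random_state=0)  # toy stand-in for the 1000 x 100 dataset

n_trees, m_features = 10, 10  # illustrative choices for the number of trees and of features per tree
trees, feature_sets = [], []

for _ in range(n_trees):
    rows = rng.choice(len(X), size=len(X), replace=True)           # bootstrap: sample records with replacement
    cols = rng.choice(X.shape[1], size=m_features, replace=False)  # pick a random subset of features
    trees.append(DecisionTreeClassifier().fit(X[np.ix_(rows, cols)], y[rows]))
    feature_sets.append(cols)

# Aggregation: every tree votes, the majority wins
votes = np.array([t.predict(X[:, cols]) for t, cols in zip(trees, feature_sets)])
majority_vote = (votes.mean(axis=0) >= 0.5).astype(int)  # labels are 0/1, so mean >= 0.5 is the mode
print((majority_vote == y).mean())  # training accuracy of the toy forest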
If the model has high bias, it leads to underfitting.
If it has high variance, it leads to overfitting.
Underfitting occurs when the model cannot capture the trends in the data. Underfitting happens in class when students do not cover the given syllabus and perform poorly on exam questions from both inside and outside the syllabus.
Overfitting occurs when the model fits the training data too well. Overfitting happens in class when students study the syllabus very well but, even for exam questions that come from outside the syllabus, they only write what they read in the given syllabus.
Hope the explanation above helps you come to a better understanding of Random Forest.
Implementation of Random Forest using Python Scikit-Learn
As I said before, it can be used for both classification and regression. There are two classes related to Random Forest in the sklearn.ensemble library. Import the appropriate Random Forest class using the code below, depending on the problem.
For classification problems,
from sklearn.ensemble import RandomForestClassifier
For regression problems,
from sklearn.ensemble import RandomForestRegressor
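The regressor shares the same fit/predict interface as the classifier used below; the main difference is that the trees' outputs are averaged instead of voted on. A minimal sketch on invented toy data:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.random.rand(200, 3)  # invented toy features
y = X.sum(axis=1)           # invented toy target

reg = RandomForestRegressor(n_estimators=100).fit(X, y)
print(reg.predict(X[:5]))   # each prediction is an average over the trees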
Let's create an object of the RandomForestClassifier class:
clsf = RandomForestClassifier()
We can specify hyperparameters inside the class like this:
clsf = RandomForestClassifier(n_estimators = 100)
Here, n_estimators is the hyperparameter that specifies the number of decision trees we want in the model. There are many such knobs a practitioner can set when building the model (they are called hyperparameters because they are chosen before training rather than learned from the data). To know more about the Random Forest hyperparameters, take a look at the documentation below.
RandomForestClassifier documentation.
RandomForestRegressor documentation.
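Beyond n_estimators, a few of the commonly used hyperparameters look like this; the values here are illustrative guesses of mine, not recommendations:

clsf = RandomForestClassifier(
    n_estimators=100,     # number of decision trees in the forest
    max_depth=10,         # cap tree depth to curb overfitting
    max_features="sqrt",  # number of features considered at each split
    random_state=42,      # fix the randomness for reproducible runs
)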
Train the model using the following code:
clsf.fit(x_train, y_train)
The whole training process is one line of code; the scikit-learn library reduces our work to a great extent.
Let's test the trained model using the following code:
prediction_result = clsf.predict(x_test)
Hurray! We finally predicted something with the Random Forest algorithm.
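Putting all the pieces together, here is a minimal end-to-end run; the iris dataset, the 70/30 split, and the seed are my choices for illustration, not part of the article:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clsf = RandomForestClassifier(n_estimators=100, random_state=42)
clsf.fit(x_train, y_train)

prediction_result = clsf.predict(x_test)
print(accuracy_score(y_test, prediction_result))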
Personally, Random Forest is my favorite algorithm. It has given me better results than other algorithms in many projects, but not in all cases. Hyperparameter tuning will usually help you get better accuracy.
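One standard way to do that tuning is a cross-validated grid search; a short sketch, with an arbitrary example grid of mine, reusing x_train and y_train from the run above:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)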
Happy Coding!
Happy Learning!!
Translated from: https://medium.com/swlh/random-forest-algorithm-in-laymans-language-ffd31b3cfd29