Spark ML机器学习库评估指标示例

2024-05-28 02:58

本文主要是介绍Spark ML机器学习库评估指标示例,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

本文主要对 Spark ML库下模型评估指标的讲解,以下代码均以Jupyter Notebook进行讲解,Spark版本为2.4.5。模型评估指标位于包org.apache.spark.ml.evaluation下。

模型评估指标是指测试集的评估指标,而不是训练集的评估指标

1、回归评估指标

RegressionEvaluator

Evaluator for regression, which expects two input columns: prediction and label.

评估指标支持以下几种:

val metricName: Param[String]

  • "rmse" (default): root mean squared error
  • "mse": mean squared error
  • "r2": R2 metric
  • "mae": mean absolute error

Examples

# import dependencies
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.evaluation.RegressionEvaluator// Load training data
val data = spark.read.format("libsvm").load("/data1/software/spark/data/mllib/sample_linear_regression_data.txt")val lr = new LinearRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)// Fit the model
val lrModel = lr.fit(training)// Summarize the model over the training set and print out some metrics
val trainingSummary = lrModel.summary
println(s"Train MSE: ${trainingSummary.meanSquaredError}")
println(s"Train RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"Train MAE: ${trainingSummary.meanAbsoluteError}")
println(s"Train r2: ${trainingSummary.r2}")val predictions = lrModel.transform(test)// 计算精度
val evaluator = new RegressionEvaluator().setLabelCol("label").setPredictionCol("prediction").setMetricName("mse")
val accuracy = evaluator.evaluate(predictions)
print(s"Test MSE: ${accuracy}")

输出:

Train MSE: 101.57870147367461
Train RMSE: 10.078625971513905
Train MAE: 8.108865602095849
Train r2: 0.039467152584195975Test MSE: 114.28454406581636

2、分类评估指标

2.1 BinaryClassificationEvaluator

Evaluator for binary classification, which expects two input columns: rawPrediction and label. The rawPrediction column can be of type double (binary 0/1 prediction, or probability of label 1) or of type vector (length-2 vector of raw predictions, scores, or label probabilities).

评估指标支持以下几种:

val metricName: Param[String]
param for metric name in evaluation (supports "areaUnderROC" (default), "areaUnderPR")

Examples

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator// Load training data
val data = spark.read.format("libsvm").load("/data1/software/spark/data/mllib/sample_libsvm_data.txt")val Array(train, test) = data.randomSplit(Array(0.8, 0.2))val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)// Fit the model
val lrModel = lr.fit(train)// Summarize the model over the training set and print out some metrics
val trainSummary = lrModel.summary
println(s"Train accuracy: ${trainSummary.accuracy}")
println(s"Train weightedPrecision: ${trainSummary.weightedPrecision}")
println(s"Train weightedRecall: ${trainSummary.weightedRecall}")
println(s"Train weightedFMeasure: ${trainSummary.weightedFMeasure}")val predictions = lrModel.transform(test)
predictions.show(5)// 模型评估
val evaluator = new BinaryClassificationEvaluator().setLabelCol("label").setRawPredictionCol("rawPrediction").setMetricName("areaUnderROC")
val auc = evaluator.evaluate(predictions)
print(s"Test AUC: ${auc}")val mulEvaluator = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("prediction").setMetricName("weightedPrecision")
val precision = evaluator.evaluate(predictions)
print(s"Test weightedPrecision: ${precision}")

输出结果:

Train accuracy: 0.9873417721518988
Train weightedPrecision: 0.9876110961486668
Train weightedRecall: 0.9873417721518987
Train weightedFMeasure: 0.9873124561568825+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(692,[122,123,148...|[0.29746771419036...|[0.57382336211209...|       0.0|
|  0.0|(692,[125,126,127...|[0.42262389447949...|[0.60411095396791...|       0.0|
|  0.0|(692,[126,127,128...|[0.74220898710237...|[0.67747871191347...|       0.0|
|  0.0|(692,[126,127,128...|[0.77729372618481...|[0.68509655708828...|       0.0|
|  0.0|(692,[127,128,129...|[0.70928896866149...|[0.67024402884354...|       0.0|
+-----+--------------------+--------------------+--------------------+----------+Test AUC: 1.0Test weightedPrecision: 1.0

2.2 MulticlassClassificationEvaluator

Evaluator for multiclass classification, which expects two input columns: prediction and label.

注:既然适用于多分类,当然适用于上面的二分类

评估指标支持如下几种:

val metricName: Param[String]
param for metric name in evaluation (supports "f1" (default), "weightedPrecision", "weightedRecall", "accuracy")

Examples

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}// Load the data stored in LIBSVM format as a DataFrame.
val data = spark.read.format("libsvm").load("/data1/software/spark/data/mllib/sample_libsvm_data.txt")// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(data)
// Automatically identify categorical features, and index them.
val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(4) // features with > 4 distinct values are treated as continuous..fit(data)// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))// Train a DecisionTree model.
val dt = new DecisionTreeClassifier().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures")// Convert indexed labels back to original labels.
val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)// Chain indexers and tree in a Pipeline.
val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)// Make predictions.
val predictions = model.transform(testData)// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction").setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${(1.0 - accuracy)}")

输出结果:

+--------------+-----+--------------------+
|predictedLabel|label|            features|
+--------------+-----+--------------------+
|           0.0|  0.0|(692,[95,96,97,12...|
|           0.0|  0.0|(692,[122,123,124...|
|           0.0|  0.0|(692,[122,123,148...|
|           0.0|  0.0|(692,[126,127,128...|
|           0.0|  0.0|(692,[126,127,128...|
+--------------+-----+--------------------+
only showing top 5 rowsTest Error = 0.040000000000000036

欢迎关注微信公众号

这篇关于Spark ML机器学习库评估指标示例的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/1009351

相关文章

python实现pdf转word和excel的示例代码

《python实现pdf转word和excel的示例代码》本文主要介绍了python实现pdf转word和excel的示例代码,文中通过示例代码介绍的非常详细,对大家的学习或者工作具有一定的参考学习价... 目录一、引言二、python编程1,PDF转Word2,PDF转Excel三、前端页面效果展示总结一

在MyBatis的XML映射文件中<trim>元素所有场景下的完整使用示例代码

《在MyBatis的XML映射文件中<trim>元素所有场景下的完整使用示例代码》在MyBatis的XML映射文件中,trim元素用于动态添加SQL语句的一部分,处理前缀、后缀及多余的逗号或连接符,示... 在MyBATis的XML映射文件中,<trim>元素用于动态地添加SQL语句的一部分,例如SET或W

Redis延迟队列的实现示例

《Redis延迟队列的实现示例》Redis延迟队列是一种使用Redis实现的消息队列,本文主要介绍了Redis延迟队列的实现示例,文中通过示例代码介绍的非常详细,对大家的学习或者工作具有一定的参考学习... 目录一、什么是 Redis 延迟队列二、实现原理三、Java 代码示例四、注意事项五、使用 Redi

在Pandas中进行数据重命名的方法示例

《在Pandas中进行数据重命名的方法示例》Pandas作为Python中最流行的数据处理库,提供了强大的数据操作功能,其中数据重命名是常见且基础的操作之一,本文将通过简洁明了的讲解和丰富的代码示例,... 目录一、引言二、Pandas rename方法简介三、列名重命名3.1 使用字典进行列名重命名3.编

Python使用Colorama库美化终端输出的操作示例

《Python使用Colorama库美化终端输出的操作示例》在开发命令行工具或调试程序时,我们可能会希望通过颜色来区分重要信息,比如警告、错误、提示等,而Colorama是一个简单易用的Python库... 目录python Colorama 库详解:终端输出美化的神器1. Colorama 是什么?2.

Go Gorm 示例详解

《GoGorm示例详解》Gorm是一款高性能的GolangORM库,便于开发人员提高效率,本文介绍了Gorm的基本概念、数据库连接、基本操作(创建表、新增记录、查询记录、修改记录、删除记录)等,本... 目录1. 概念2. 数据库连接2.1 安装依赖2.2 连接数据库3. 数据库基本操作3.1 创建表(表关

Python视频剪辑合并操作的实现示例

《Python视频剪辑合并操作的实现示例》很多人在创作视频时都需要进行剪辑,本文主要介绍了Python视频剪辑合并操作的实现示例,文中通过示例代码介绍的非常详细,对大家的学习或者工作具有一定的参考学习... 目录介绍安装FFmpegWindowsMACOS安装MoviePy剪切视频合并视频转换视频结论介绍

python多进程实现数据共享的示例代码

《python多进程实现数据共享的示例代码》本文介绍了Python中多进程实现数据共享的方法,包括使用multiprocessing模块和manager模块这两种方法,具有一定的参考价值,感兴趣的可以... 目录背景进程、进程创建进程间通信 进程间共享数据共享list实践背景 安卓ui自动化框架,使用的是

SpringBoot基于MyBatis-Plus实现Lambda Query查询的示例代码

《SpringBoot基于MyBatis-Plus实现LambdaQuery查询的示例代码》MyBatis-Plus是MyBatis的增强工具,简化了数据库操作,并提高了开发效率,它提供了多种查询方... 目录引言基础环境配置依赖配置(Maven)application.yml 配置表结构设计demo_st

SpringCloud集成AlloyDB的示例代码

《SpringCloud集成AlloyDB的示例代码》AlloyDB是GoogleCloud提供的一种高度可扩展、强性能的关系型数据库服务,它兼容PostgreSQL,并提供了更快的查询性能... 目录1.AlloyDBjavascript是什么?AlloyDB 的工作原理2.搭建测试环境3.代码工程1.