[Spark MLlib] Logistic Regression: A Spam Classifier and a Standalone Maven Project

This post walks through building a spam classifier with logistic regression in Spark MLlib, then packaging and running it as a standalone project built with Maven.

  • http://blog.csdn.net/u011239443/article/details/51655469
  • A spam classifier using logistic regression trained with SGD
    package com.oreilly.learningsparkexamples.scala

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.regression.LabeledPoint

    object MLlib {

      def main(args: Array[String]) {
        val conf = new SparkConf().setAppName(s"MLlib example")
        val sc = new SparkContext(conf)

        // Load 2 types of emails from text files: spam and ham (non-spam).
        // Each line has text from one email.
        val spam = sc.textFile("files/spam.txt")
        val ham = sc.textFile("files/ham.txt")

        // Create a HashingTF instance to map email text to vectors of 100 features.
        val tf = new HashingTF(numFeatures = 100)
        // Each email is split into words, and each word is mapped to one feature.
        val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
        val hamFeatures = ham.map(email => tf.transform(email.split(" ")))

        // Create LabeledPoint datasets for positive (spam) and negative (ham) examples.
        val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
        val negativeExamples = hamFeatures.map(features => LabeledPoint(0, features))
        val trainingData = positiveExamples ++ negativeExamples
        trainingData.cache() // Cache data since Logistic Regression is an iterative algorithm.

        // Create a Logistic Regression learner which uses SGD.
        val lrLearner = new LogisticRegressionWithSGD()
        // Run the actual learning algorithm on the training data.
        val model = lrLearner.run(trainingData)

        // Test on a positive example (spam) and a negative one (ham).
        // First apply the same HashingTF feature transformation used on the training data.
        val posTestExample = tf.transform("O M G GET cheap stuff by sending money to ...".split(" "))
        val negTestExample = tf.transform("Hi Dad, I started studying Spark the other ...".split(" "))
        // Now use the learned model to predict spam/ham for new emails.
        println(s"Prediction for positive test example: ${model.predict(posTestExample)}")
        println(s"Prediction for negative test example: ${model.predict(negTestExample)}")

        sc.stop()
      }
    }
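    A quick aside on HashingTF: each word is hashed into one of the 100 buckets, and the resulting vector holds per-bucket term counts, so no vocabulary has to be stored. A minimal local sketch (illustrative only; the exact bucket indices depend on the hash function):

      import org.apache.spark.mllib.feature.HashingTF

      val tf = new HashingTF(numFeatures = 100)
      // "cheap" occurs twice, so its bucket gets the value 2.0.
      val vec = tf.transform("get cheap stuff cheap".split(" ").toSeq)
      println(vec) // a 100-dimensional SparseVector, e.g. (100,[i1,i2,i3],[1.0,1.0,2.0])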

     

     

    spam.txt
    Dear sir, I am a Prince in a far kingdom you have not heard of.  I want to send you money via wire transfer so please ...
    Get Viagra real cheap!  Send money right away to ...
    Oh my gosh you can be really strong too with these drugs found in the rainforest. Get them cheap right now ...
    YOUR COMPUTER HAS BEEN INFECTED!  YOU MUST RESET YOUR PASSWORD.  Reply to this email with your password and SSN ...
    THIS IS NOT A SCAM!  Send money and get access to awesome stuff really cheap and never have to ...

     

    ham.txt
    
    Dear Spark Learner, Thanks so much for attending the Spark Summit 2014!  Check out videos of talks from the summit at ...
    Hi Mom, Apologies for being late about emailing and forgetting to send you the package.  I hope you and bro have been ...
    Wow, hey Fred, just heard about the Spark petabyte sort.  I think we need to take time to try it out immediately ...
    Hi Spark user list, This is my first question to this list, so thanks in advance for your help!  I tried running ...
    Thanks Tom for your email.  I need to refer you to Alice for this one.  I haven't yet figured out that part either ...
    Good job yesterday!  I was attending your talk, and really enjoyed it.  I want to try out GraphX ...
    Summit demo got whoops from audience!  Had to let you know. --Joe

     

    • Packaging the Scala program with Maven

     

    ├── pom.xml
    ├── README.md
    └── src
        └── main
            └── scala
                └── com
                    └── oreilly
                        └── learningsparkexamples
                            └── scala
                                └── MLlib.scala

     

    MLlib.scala contains the Scala code above; pom.xml is the configuration file Maven uses to build the project:


    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <groupId>my.demo</groupId>
      <artifactId>sparkdemo</artifactId>
      <version>1.0-SNAPSHOT</version>

      <properties>
        <!-- Java version for compilation
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
        -->
        <encoding>UTF-8</encoding>
        <scala.tools.version>2.10</scala.tools.version>
        <!-- Put the Scala version of the cluster -->
        <scala.version>2.10.5</scala.version>
      </properties>

      <dependencies>
        <dependency> <!-- Spark dependency -->
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_2.10</artifactId>
          <version>1.6.1</version>
          <scope>provided</scope>
        </dependency>
        <dependency> <!-- Spark dependency -->
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-mllib_2.10</artifactId>
          <version>1.6.1</version>
          <scope>provided</scope>
        </dependency>
        <dependency>
          <groupId>org.scala-lang</groupId>
          <artifactId>scala-library</artifactId>
          <version>2.10.5</version>
        </dependency>
      </dependencies>

      <build>
        <pluginManagement>
          <plugins>
            <plugin>
              <!-- Compiles the Scala sources -->
              <groupId>net.alchim31.maven</groupId>
              <artifactId>scala-maven-plugin</artifactId>
              <version>3.1.5</version>
            </plugin>
          </plugins>
        </pluginManagement>
        <plugins>
          <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <executions>
              <execution>
                <id>scala-compile-first</id>
                <phase>process-resources</phase>
                <goals>
                  <goal>add-source</goal>
                  <goal>compile</goal>
                </goals>
              </execution>
              <execution>
                <id>scala-test-compile</id>
                <phase>process-test-resources</phase>
                <goals>
                  <goal>testCompile</goal>
                </goals>
              </execution>
            </executions>
          </plugin>
        </plugins>
      </build>
    </project>





  • Mapping the imports to their dependencies:
    
    import org.apache.spark.{SparkConf, SparkContext}
     

    The required dependency configuration is:

      <dependency> <!-- Spark dependency -->
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.6.1</version>
        <scope>provided</scope>
      </dependency>

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.regression.LabeledPoint

    The required dependency configuration is:

      <dependency> <!-- Spark dependency -->
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.10</artifactId>
        <version>1.6.1</version>
        <scope>provided</scope>
      </dependency>

    When configuring, make sure the Spark and Scala versions match your cluster; you can open spark-shell to check them:
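    For example, inside spark-shell (output is illustrative; yours will reflect your installation):

      scala> sc.version
      res0: String = 1.6.1

      scala> scala.util.Properties.versionString
      res1: String = version 2.10.5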

     

     

    Once configured, run the following command from the directory containing pom.xml:

    mvn clean && mvn compile && mvn package
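    Note that mvn package already runs the compile phase, so the middle step is redundant but harmless. If you want to confirm exactly which Spark and Scala artifacts Maven resolved, the standard dependency report is useful:

    mvn dependency:tree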

     

    If mvn has trouble downloading dependencies, see section 3, "Build GitHub Spark Runnable Distribution", of this post: http://www.cnblogs.com/xiaoyesoso/p/5489822.html

     

    • Running the project on Spark

    After the Maven build and packaging finish, a target folder appears under the directory containing pom.xml:

    ├── target
    │   ├── classes
    │   │   └── com
    │   │       └── oreilly
    │   │           └── learningsparkexamples
    │   │               └── scala
    │   │                   ├── MLlib$$anonfun$1.class
    │   │                   ├── MLlib$$anonfun$2.class
    │   │                   ├── MLlib$$anonfun$3.class
    │   │                   ├── MLlib$$anonfun$4.class
    │   │                   ├── MLlib.class
    │   │                   └── MLlib$.class
    │   ├── classes.-475058802.timestamp
    │   ├── maven-archiver
    │   │   └── pom.properties
    │   ├── maven-status
    │   │   └── maven-compiler-plugin
    │   │       └── compile
    │   │           └── default-compile
    │   │               ├── createdFiles.lst
    │   │               └── inputFiles.lst
    │   └── sparkdemo-1.0-SNAPSHOT.jar
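    The artifact to submit is sparkdemo-1.0-SNAPSHOT.jar. To double-check that the compiled classes made it into the jar, the standard JDK jar tool works (illustrative):

    jar tf target/sparkdemo-1.0-SNAPSHOT.jar | grep MLlib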

     

     

    Finally, run the command below to submit the job (note the locations of the two text files, spam.txt and ham.txt: the program reads them from the relative path files/):

     

    ${SPARK_HOME}/bin/spark-submit --class ${package.name}.${class.name} ${PROJECT_HOME}/target/*.jar
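    With the package and class name used in the code above and the jar produced by Maven, the concrete invocation looks like this (run from the project root, assuming SPARK_HOME points at your Spark installation):

    ${SPARK_HOME}/bin/spark-submit --class com.oreilly.learningsparkexamples.scala.MLlib target/sparkdemo-1.0-SNAPSHOT.jar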

     

    The output:

    caizhenwei@caizhenwei-Inspiron-3847:~/桌面/learning-spark$ vim mini-complete-example/src/main/scala/com/oreilly/learningsparkexamples/mini/scala/MLlib.scala
    caizhenwei@caizhenwei-Inspiron-3847:~/桌面/learning-spark$ ../bin-spark-1.6.1/bin/spark-submit --class com.oreilly.learningsparkexamples.scala.MLlib ./mini-complete-example/target/sparkdemo-1.0-SNAPSHOT.jar
    16/06/03 13:23:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    16/06/03 13:23:23 WARN Utils: Your hostname, caizhenwei-Inspiron-3847 resolves to a loopback address: 127.0.1.1; using 172.16.111.93 instead (on interface eth0)
    16/06/03 13:23:23 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
    16/06/03 13:23:24 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
    16/06/03 13:23:26 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
    16/06/03 13:23:26 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
    Prediction for positive test example: 1.0
    Prediction for negative test example: 0.0
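    The model labels the spam-like example 1.0 and the ham-like example 0.0, as hoped. As an optional extension (a sketch, not part of the original program), training-set accuracy could be computed before sc.stop():

      // Compare the model's predictions against the training labels.
      val predictionAndLabel = trainingData.map(p => (model.predict(p.features), p.label))
      val accuracy = predictionAndLabel.filter { case (pred, label) => pred == label }.count.toDouble /
        trainingData.count
      println(s"Training accuracy: $accuracy")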

     
