[Spark MLlib] Logistic Regression: A Spam Classifier, and Building a Standalone Project with Maven

This article walks through building a spam classifier with logistic regression in Spark MLlib, and packaging it as a standalone project with Maven.

  • http://blog.csdn.net/u011239443/article/details/51655469
  • A spam classifier using logistic regression trained with SGD
    package com.oreilly.learningsparkexamples.scala

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.regression.LabeledPoint

    object MLlib {

      def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("MLlib example")
        val sc = new SparkContext(conf)

        // Load 2 types of emails from text files: spam and ham (non-spam).
        // Each line has text from one email.
        val spam = sc.textFile("files/spam.txt")
        val ham = sc.textFile("files/ham.txt")

        // Create a HashingTF instance to map email text to vectors of 100 features.
        val tf = new HashingTF(numFeatures = 100)
        // Each email is split into words, and each word is mapped to one feature.
        val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
        val hamFeatures = ham.map(email => tf.transform(email.split(" ")))

        // Create LabeledPoint datasets for positive (spam) and negative (ham) examples.
        val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
        val negativeExamples = hamFeatures.map(features => LabeledPoint(0, features))
        val trainingData = positiveExamples ++ negativeExamples
        trainingData.cache() // Cache data since Logistic Regression is an iterative algorithm.

        // Create a Logistic Regression learner which uses SGD.
        val lrLearner = new LogisticRegressionWithSGD()
        // Run the actual learning algorithm on the training data.
        val model = lrLearner.run(trainingData)

        // Test on a positive example (spam) and a negative one (ham).
        // First apply the same HashingTF feature transformation used on the training data.
        val posTestExample = tf.transform("O M G GET cheap stuff by sending money to ...".split(" "))
        val negTestExample = tf.transform("Hi Dad, I started studying Spark the other ...".split(" "))
        // Now use the learned model to predict spam/ham for new emails.
        println(s"Prediction for positive test example: ${model.predict(posTestExample)}")
        println(s"Prediction for negative test example: ${model.predict(negTestExample)}")

        sc.stop()
      }
    }
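To make the feature step concrete, here is a minimal, Spark-free sketch of the hashing trick that `HashingTF` is built on. This is an assumption-laden illustration: MLlib 1.x buckets each term by a non-negative `hashCode mod numFeatures`, and the `HashingSketch` object and its method names below are hypothetical, not part of any Spark API.

```scala
// Sketch of the hashing trick (assumption: mirrors MLlib's
// term.hashCode-mod-numFeatures bucketing; names are illustrative only).
object HashingSketch {
  // Map an Int to a non-negative residue mod `mod` (Java's % can be negative).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Turn a sequence of words into a sparse term-frequency map
  // over `numFeatures` hash buckets.
  def termFrequencies(words: Seq[String], numFeatures: Int = 100): Map[Int, Double] =
    words.groupBy(w => nonNegativeMod(w.hashCode, numFeatures))
         .map { case (idx, ws) => idx -> ws.size.toDouble }

  def main(args: Array[String]): Unit = {
    val tf = termFrequencies("get cheap stuff cheap".split(" "))
    // "cheap" appears twice, so its bucket holds the count 2.0.
    println(tf)
  }
}
```

Collisions (two distinct words landing in the same bucket) are the price of a fixed 100-dimensional vector; a larger `numFeatures` makes them rarer.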

     

     

    spam.txt
    Dear sir, I am a Prince in a far kingdom you have not heard of.  I want to send you money via wire transfer so please ...
    Get Viagra real cheap!  Send money right away to ...
    Oh my gosh you can be really strong too with these drugs found in the rainforest. Get them cheap right now ...
    YOUR COMPUTER HAS BEEN INFECTED!  YOU MUST RESET YOUR PASSWORD.  Reply to this email with your password and SSN ...
    THIS IS NOT A SCAM!  Send money and get access to awesome stuff really cheap and never have to ...

    ham.txt
    Dear Spark Learner, Thanks so much for attending the Spark Summit 2014!  Check out videos of talks from the summit at ...
    Hi Mom, Apologies for being late about emailing and forgetting to send you the package.  I hope you and bro have been ...
    Wow, hey Fred, just heard about the Spark petabyte sort.  I think we need to take time to try it out immediately ...
    Hi Spark user list, This is my first question to this list, so thanks in advance for your help!  I tried running ...
    Thanks Tom for your email.  I need to refer you to Alice for this one.  I haven't yet figured out that part either ...
    Good job yesterday!  I was attending your talk, and really enjoyed it.  I want to try out GraphX ...
    Summit demo got whoops from audience!  Had to let you know. --Joe

     

    • Packaging the Scala program with Maven

     

    ├── pom.xml
    ├── README.md
    ├── src
    │   └── main
    │       └── scala
    │           └── com
    │               └── oreilly
    │                   └── learningsparkexamples
    │                       └── scala
    │                           └── MLlib.scala

     

    MLlib.scala is the Scala code shown above (note the directory layout must match the package declaration); pom.xml is the Maven build configuration file:


    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <groupId>my.demo</groupId>
      <artifactId>sparkdemo</artifactId>
      <version>1.0-SNAPSHOT</version>
      <properties>
        <!-- Java version for compilation:
             <maven.compiler.source>1.7</maven.compiler.source>
             <maven.compiler.target>1.7</maven.compiler.target> -->
        <encoding>UTF-8</encoding>
        <scala.tools.version>2.10</scala.tools.version>
        <!-- Put the Scala version of the cluster -->
        <scala.version>2.10.5</scala.version>
      </properties>
      <dependencies>
        <dependency> <!-- Spark dependency -->
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_2.10</artifactId>
          <version>1.6.1</version>
          <scope>provided</scope>
        </dependency>
        <dependency> <!-- Spark dependency -->
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-mllib_2.10</artifactId>
          <version>1.6.1</version>
          <scope>provided</scope>
        </dependency>
        <dependency>
          <groupId>org.scala-lang</groupId>
          <artifactId>scala-library</artifactId>
          <version>2.10.5</version>
        </dependency>
      </dependencies>
      <build>
        <pluginManagement>
          <plugins>
            <plugin>
              <!-- Compiles the Scala sources -->
              <groupId>net.alchim31.maven</groupId>
              <artifactId>scala-maven-plugin</artifactId>
              <version>3.1.5</version>
            </plugin>
          </plugins>
        </pluginManagement>
        <plugins>
          <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <executions>
              <execution>
                <id>scala-compile-first</id>
                <phase>process-resources</phase>
                <goals>
                  <goal>add-source</goal>
                  <goal>compile</goal>
                </goals>
              </execution>
              <execution>
                <id>scala-test-compile</id>
                <phase>process-test-resources</phase>
                <goals>
                  <goal>testCompile</goal>
                </goals>
              </execution>
            </executions>
          </plugin>
        </plugins>
      </build>
    </project>





  • Regarding the imports:

    import org.apache.spark.{SparkConf, SparkContext}

    The dependency configuration this requires is:

      <dependency> <!-- Spark dependency -->
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.6.1</version>
        <scope>provided</scope>
      </dependency>

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.regression.LabeledPoint

    The dependency configuration these require is:

      <dependency> <!-- Spark dependency -->
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.10</artifactId>
        <version>1.6.1</version>
        <scope>provided</scope>
      </dependency>


     

     

     

    When configuring, make sure the Spark and Scala versions match your cluster; you can open spark-shell and check its startup banner, which prints both versions.

    Once the configuration is done, run the following in the directory containing pom.xml:

    mvn clean && mvn compile && mvn package
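Incidentally, Maven's `package` phase runs every earlier phase of the default lifecycle, including `compile`, so the chain above can be shortened to a single invocation (standard Maven behavior, not specific to this project):

```shell
# `mvn package` already implies `compile`; `clean` runs first as a separate lifecycle.
mvn clean package
```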

     

    If mvn has trouble downloading dependencies, see section "3. Bulid GitHub Spark Runnable Distribution" of this post: http://www.cnblogs.com/xiaoyesoso/p/5489822.html

     

    • Running the project on Spark

    After mvn finishes compiling and packaging, a target folder appears in the directory containing pom.xml:

    ├── target
    │   ├── classes
    │   │   └── com
    │   │       └── oreilly
    │   │           └── learningsparkexamples
    │   │               └── scala
    │   │                   ├── MLlib$$anonfun$1.class
    │   │                   ├── MLlib$$anonfun$2.class
    │   │                   ├── MLlib$$anonfun$3.class
    │   │                   ├── MLlib$$anonfun$4.class
    │   │                   ├── MLlib.class
    │   │                   └── MLlib$.class
    │   ├── classes.-475058802.timestamp
    │   ├── maven-archiver
    │   │   └── pom.properties
    │   ├── maven-status
    │   │   └── maven-compiler-plugin
    │   │       └── compile
    │   │           └── default-compile
    │   │               ├── createdFiles.lst
    │   │               └── inputFiles.lst
    │   └── sparkdemo-1.0-SNAPSHOT.jar

     

     

    Finally, run the following command to submit the job (note where the two text files must be located):

     

    ${SPARK_HOME}/bin/spark-submit --class ${package.name}.${class.name} ${PROJECT_HOME}/target/*.jar
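With the placeholders filled in for this project, the command looks like the sketch below. The jar path is illustrative and assumes you run from the project root; adjust it to your layout.

```shell
# The class name comes from the package/object declaration in MLlib.scala;
# SPARK_HOME is assumed to point at a Spark 1.6.x installation.
${SPARK_HOME}/bin/spark-submit \
  --class com.oreilly.learningsparkexamples.scala.MLlib \
  ./target/sparkdemo-1.0-SNAPSHOT.jar
```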

     

    Run output:

    caizhenwei@caizhenwei-Inspiron-3847:~/桌面/learning-spark$ vim mini-complete-example/src/main/scala/com/oreilly/learningsparkexamples/mini/scala/MLlib.scala
    caizhenwei@caizhenwei-Inspiron-3847:~/桌面/learning-spark$ ../bin-spark-1.6.1/bin/spark-submit --class com.oreilly.learningsparkexamples.scala.MLlib ./mini-complete-example/target/sparkdemo-1.0-SNAPSHOT.jar
    16/06/03 13:23:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    16/06/03 13:23:23 WARN Utils: Your hostname, caizhenwei-Inspiron-3847 resolves to a loopback address: 127.0.1.1; using 172.16.111.93 instead (on interface eth0)
    16/06/03 13:23:23 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
    16/06/03 13:23:24 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
    16/06/03 13:23:26 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
    16/06/03 13:23:26 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
    Prediction for positive test example: 1.0
    Prediction for negative test example: 0.0

     



