[Gandalf] Testing Mahout 0.9 (Patched with 1329-3.patch) on Hadoop 2.2.0 with Bayesian Text Classification


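Introduction

This post follows the previous article, "[Gandalf] Patching Mahout 0.9 to Support Hadoop 2.2.0":

http://blog.csdn.net/u010967382/article/details/39088035

With Mahout 0.9 patched and compiled successfully, Bayesian text classification is used here to test Mahout 0.9's compatibility with Hadoop 2.2.0.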

Reposting is welcome; please credit the source:

http://blog.csdn.net/u010967382/article/details/39088285

Step 1: Upload the 20news files to HDFS

yarn@singletest:~/Mahout/mahout-distribution-0.7$ hadoop fs -ls /workspace/mahout/week4/data/20news

Found 2 items

drwxr-xr-x - yarn supergroup 0 2014-09-04 21:52 /workspace/mahout/week4/data/20news/20news-bydate-test

drwxr-xr-x - yarn supergroup 0 2014-09-04 21:57 /workspace/mahout/week4/data/20news/20news-bydate-train

Step 2: Create sequence files from the data

yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ ./mahout seqdirectory -i /workspace/mahout/week4/data/20news -o /workspace/mahout/week4/data/20news_seq

yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ hadoop fs -ls /workspace/mahout/week4/data/20news_seq

Found 1 items

-rw-r--r-- 1 yarn supergroup 37064977 2014-09-04 22:12 /workspace/mahout/week4/data/20news_seq/chunk-0

Step 3: Convert the sequence files to vectors

yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ ./mahout seq2sparse -i /workspace/mahout/week4/data/20news_seq/ -o /workspace/mahout/week4/data/20news_vectors -lnorm -nv -wt tfidf

yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ hadoop fs -ls /workspace/mahout/week4/data/20news_vectors

Found 7 items

drwxr-xr-x - yarn supergroup 0 2014-09-04 22:20 /workspace/mahout/week4/data/20news_vectors/df-count

-rw-r--r-- 1 yarn supergroup 1937084 2014-09-04 22:18 /workspace/mahout/week4/data/20news_vectors/dictionary.file-0

-rw-r--r-- 1 yarn supergroup 1890053 2014-09-04 22:20 /workspace/mahout/week4/data/20news_vectors/frequency.file-0

drwxr-xr-x - yarn supergroup 0 2014-09-04 22:19 /workspace/mahout/week4/data/20news_vectors/tf-vectors

drwxr-xr-x - yarn supergroup 0 2014-09-04 22:21 /workspace/mahout/week4/data/20news_vectors/tfidf-vectors

drwxr-xr-x - yarn supergroup 0 2014-09-04 22:18 /workspace/mahout/week4/data/20news_vectors/tokenized-documents

drwxr-xr-x - yarn supergroup 0 2014-09-04 22:18 /workspace/mahout/week4/data/20news_vectors/wordcount
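To illustrate what the `-wt tfidf` weighting in step 3 means, here is a minimal Python sketch of textbook TF-IDF. It is only an illustration: the function name `tfidf` and the toy documents are made up for this example, and the formula (raw term count times log(N/df)) is the generic textbook form, not necessarily the exact formula or normalization that Mahout applies.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document (textbook TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors

# Toy corpus (hypothetical); "hadoop" appears in every document.
docs = [["hadoop", "mahout", "bayes"], ["hadoop", "yarn"], ["hadoop", "mahout"]]
vectors = tfidf(docs)
```

A term that occurs in every document (here "hadoop") gets weight zero, while a term unique to one document (here "bayes") gets the highest weight, which is why TF-IDF vectors tend to be more useful for classification than raw counts.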

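Step 4: Split the vectors into a training set and a test set

Parameters:

-tr: output path for the training set
-te: output path for the test set
-rp: the percentage of the whole data set to use as test data; the command below sets it to 20%.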

yarn@singletest:~/Mahout/mahout-distribution-0.7/bin$ ./mahout split -i /workspace/mahout/week4/data/20news_vectors/tfidf-vectors -tr /workspace/mahout/week4/data/train-vectors -te /workspace/mahout/week4/data/test-vectors -rp 20 -ow -seq -xm sequential
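The idea behind a percentage-based random hold-out such as `-rp 20` can be sketched in a few lines of Python. This is an illustration of the concept only, not Mahout's implementation: the function `split_vectors`, the fixed seed, and the `doc-N` keys are all hypothetical, and Mahout's own random selection will not necessarily yield exactly 20% (the test run in step 6 classifies 3851 instances).

```python
import random

def split_vectors(keys, test_pct, seed=42):
    """Randomly assign about test_pct percent of the keys to the test set."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(keys)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_pct / 100)
    # The first n_test shuffled keys become test data, the rest training data.
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical document keys standing in for the tfidf vectors.
train, test = split_vectors([f"doc-{i}" for i in range(1000)], test_pct=20)
```

Every vector ends up in exactly one of the two sets, so the model is later evaluated only on documents it was not trained on.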

Step 5: Train the model

yarn@singletest:~/Mahout/mahout-distribution-0.9/bin$ ./mahout trainnb -i /workspace/mahout/week4/data/train-vectors -el -o /workspace/mahout/week4/nbmodel -li /workspace/mahout/week4/labindex -ow -c

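View the generated label index (the two labels are the names of the two top-level directories uploaded in step 1):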

yarn@singletest:~$ hadoop fs -text /workspace/mahout/week4/labindex

20news-bydate-test 0

20news-bydate-train 1
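One plausible explanation for these two labels is that, with `-el`, the label is taken from the first path component of each sequence-file key. The Python sketch below illustrates that idea under this assumption; it is not Mahout's code, the helper `label_from_key` is hypothetical, and the keys (including the document numbers) are invented examples, since the real keys are not shown in this post.

```python
def label_from_key(key):
    """Return the first path component of a key such as '/label/sub/doc'."""
    # Splitting '/a/b/c' on '/' gives ['', 'a', 'b', 'c'], so index 1 is 'a'.
    return key.split("/")[1]

# Hypothetical keys; assumed to mirror the directory layout uploaded in step 1.
keys = [
    "/20news-bydate-test/alt.atheism/53068",
    "/20news-bydate-train/alt.atheism/49960",
    "/20news-bydate-train/sci.space/60151",
]

# Assign a numeric index to each distinct label, in sorted order.
labels = sorted({label_from_key(k) for k in keys})
label_index = {label: i for i, label in enumerate(labels)}
```

Under this assumption, running seqdirectory on the parent `20news` directory makes the top-level directory names the labels, whereas running it on a single directory such as `20news-bydate-train` would make the newsgroup names the labels.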

View the trained model:

yarn@singletest:~$ hadoop fs -ls /workspace/mahout/week4/nbmodel

Found 1 items

-rw-r--r-- 1 yarn supergroup 2437874 2014-09-05 23:09 /workspace/mahout/week4/nbmodel/naiveBayesModel.bin

Step 6: Test

yarn@singletest:~/Mahout/mahout-distribution-0.9/bin$ ./mahout testnb -i /workspace/mahout/week4/data/test-vectors -m /workspace/mahout/week4/nbmodel -l /workspace/mahout/week4/labindex -ow -o /workspace/mahout/week4/20news-test-result -c

Note: the input path given to -i for testing is the test set produced by the split in step 4.

Test results:

14/09/05 23:18:09 INFO test.TestNaiveBayesDriver: Complementary Results:

=======================================================

Summary

-------------------------------------------------------

Correctly Classified Instances : 2887 74.9675%

Incorrectly Classified Instances : 964 25.0325%

Total Classified Instances : 3851

=======================================================

Confusion Matrix

-------------------------------------------------------

a b <--Classified as

1131 413 | 1544 a = 20news-bydate-test

551 1756 | 2307 b = 20news-bydate-train

=======================================================

Statistics

-------------------------------------------------------

Kappa 0.486

Accuracy 74.9675%

Reliability 49.7892%

Reliability (standard deviation) 0.4314

14/09/05 23:18:09 INFO driver.MahoutDriver: Program took 17504 ms (Minutes: 0.29173333333333334)
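The accuracy and kappa figures in the log can be checked by hand from the confusion matrix. The sketch below uses the standard definitions (accuracy as the diagonal sum over the total, and Cohen's kappa from observed versus chance agreement); the function name `accuracy_and_kappa` is made up for this example, and the assumption that Mahout reports Cohen's kappa is supported only by the fact that the numbers agree.

```python
def accuracy_and_kappa(matrix):
    """Compute accuracy and Cohen's kappa from a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    observed = correct / total                      # observed agreement (accuracy)
    row_totals = [sum(row) for row in matrix]       # instances per actual class
    col_totals = [sum(col) for col in zip(*matrix)] # instances per predicted class
    # Agreement expected by chance, given the row and column totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Confusion matrix copied from the test output above.
confusion = [[1131, 413], [551, 1756]]
accuracy, kappa = accuracy_and_kappa(confusion)
print(f"Accuracy {accuracy:.4%}, Kappa {kappa:.3f}")
```

Here 1131 + 1756 = 2887 correct out of 3851 gives 74.9675%, and the kappa rounds to 0.486, matching the reported statistics.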
