Repost: A best-practice guide to installing and using Nutch and related frameworks

2024-06-23 17:58


Source: http://user.qzone.qq.com/281032878/blog/1342675154#!app=2&via=QZ.HashRefresh&pos=1362131478 (Chinese installing and using instruction - The best guidance in installing and using Nutch in China)
HD original video download: http://pan.baidu.com/share/home?uk=3157595467
HD compressed video download: http://pan.baidu.com/share/home?uk=1913680455


一、nutch1.2
二、nutch1.5.1
三、nutch2.0
四、Configure SSH
五、Install a Hadoop cluster (pseudo-distributed mode) and run Nutch
六、Install a Hadoop cluster (fully distributed mode) and run Nutch
七、Configure Ganglia to monitor the Hadoop and HBase clusters
八、Configure Snappy compression for Hadoop
九、Configure LZO compression for Hadoop
十、Configure a ZooKeeper cluster to run HBase
十一、Configure an HBase cluster to run nutch-2.1 (Region Servers may go down because of memory problems)
十二、Configure an Accumulo cluster to run nutch-2.1 (there is a bug in Gora)
十三、Configure a Cassandra cluster to run nutch-2.1 (Cassandra uses a decentralized architecture)
十四、Configure a standalone MySQL server to run nutch-2.1
十五、Use DataFileAvroStore as the data store for nutch-2.1
十六、Use AvroStore as the data store for nutch-2.1
十七、Configure SOLR
十八、Nagios monitoring
十九、Configure Splunk
二十、Configure Pig
二十一、Configure Hive
二十二、Configure a Hadoop 2.x cluster



一、nutch1.2
 The steps are much the same as in 二; in step 5 (configure the build path) two extra operations are needed: in the Package Explorer on the left, right-click the nutch1.2 folder > Build Path > Configure Build Path... > select the Source tab > Default output folder: change nutch1.2/bin to nutch1.2/_bin; then right-click the bin folder under the nutch1.2 folder in the Package Explorer > Team > Revert.
 In 二, the yellow-highlighted parts are version-number differences, the red parts do not exist in version 1.2, and the green parts are the places that differ, as follows:
 1、Add JARs... >  nutch1.2 > lib, select all of the .jar files > OK
 2、crawl-urlfilter.txt
 3、Rename crawl-urlfilter.txt.template to crawl-urlfilter.txt
 4、Edit crawl-urlfilter.txt, changing
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
-.
 to the filter for the site being crawled (compare step 7 of 二)
 5、cd /home/ysc/workspace/nutch1.2
 nutch1.2 is a complete search engine, while nutch1.5.1 is only a crawler. nutch1.2 can submit its index to SOLR or generate a LUCENE index directly; nutch1.5.1 can only submit its index to SOLR:
 1、cd /home/ysc
 2、wget http://mirrors.tuna.tsinghua.edu.cn/apache/tomcat/tomcat-7/v7.0.29/bin/apache-tomcat-7.0.29.tar.gz
 3、tar -xvf apache-tomcat-7.0.29.tar.gz
 4、In the Package Explorer on the left, right-click the build.xml file under the nutch1.2 folder > Run As > Ant Build... > select the war target > Run
 5、cd /home/ysc/workspace/nutch1.2/build
 6、unzip nutch-1.2.war -d nutch-1.2
 7、cp -r nutch-1.2 /home/ysc/apache-tomcat-7.0.29/webapps
 8、vi /home/ysc/apache-tomcat-7.0.29/webapps/nutch-1.2/WEB-INF/classes/nutch-site.xml
 Add the following configuration:
 <property>
  <name>searcher.dir</name>
  <value>/home/ysc/workspace/nutch1.2/data</value>
  <description>
  Path to root of crawl.  This directory is searched (in
  order) for either the file search-servers.txt, containing a list of
  distributed search servers, or the directory "index" containing
  merged indexes, or the directory "segments" containing segment
  indexes.
  </description>
</property>
9、vi /home/ysc/apache-tomcat-7.0.29/conf/server.xml

<Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443"/>
Change it to:
<Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443" URIEncoding="utf-8"/>
11、./startup.sh
12、Visit: http://localhost:8080/nutch-1.2/

二、nutch1.5.1
1、Download and unpack Eclipse (the IDE)
 Download from: http://www.eclipse.org/downloads/ (choose Eclipse IDE for Java EE Developers)
2、Install the Subclipse plug-in (SVN client)
 Update site: http://subclipse.tigris.org/update_1.8.x
3、Install the IvyDE plug-in (to download the dependency JARs)
 Update site: http://www.apache.org/dist/ant/ivyde/updatesite/
4、Check out the code
 File > New > Project > SVN > Check out projects from SVN
 Create a new repository location > URL: https://svn.apache.org/repos/asf/nutch/tags/release-1.5.1/ > select the URL > Finish
 When the New Project wizard appears, choose Java Project > Next, enter Project name: nutch1.5.1 > Finish
5、Configure the build path
 In the Package Explorer on the left, right-click the nutch1.5.1 folder > Build Path > Configure Build Path...
> select the Source tab > select src > Remove > Add Folder... > select src/bin, src/java, src/test and src/testresources (for the plug-ins, also select the src/java and src/test folders under each plug-in directory beneath src/plugin) > OK
 Switch to the Libraries tab >
 Add Class Folder... > select nutch1.5.1/conf > OK
 Add JARs... > select the JAR files under the lib directory of each plug-in directory beneath src/plugin > OK
 Add Library... > IvyDE Managed Dependencies > Next > Main > Ivy File > Browse > ivy/ivy.xml > Finish
 Switch to the Order and Export tab >
 Select conf > Top
6、Run ANT
 In the Package Explorer on the left, right-click the build.xml file under the nutch1.5.1 folder > Run As > Ant Build
 In the Package Explorer on the left, right-click the nutch1.5.1 folder > Refresh
 In the Package Explorer on the left, right-click the nutch1.5.1 folder > Build Path > Configure Build Path... > select the Libraries tab > Add Class Folder... > select build > OK
7、Edit the configuration files nutch-site.xml and regex-urlfilter.txt
 Rename nutch-site.xml.template to nutch-site.xml
 Rename regex-urlfilter.txt.template to regex-urlfilter.txt
 In the Package Explorer on the left, right-click the nutch1.5.1 folder > Refresh
 Add the following properties to nutch-site.xml:
<property>
  <name>http.agent.name</name>
  <value>nutch</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
 Edit regex-urlfilter.txt, replacing
# accept anything else
+.
 with:
+^http://([a-z0-9]*\.)*news.163.com/
-.
8、Development and debugging
 In the Package Explorer on the left, right-click the nutch1.5.1 folder > New > Folder > Folder name: urls
 Under the new urls directory create a text file named url whose content is: http://news.163.com
 Open the org.apache.nutch.crawl.Crawl class under src/java, right-click > Run As > Run Configurations > Arguments > in the Program arguments box enter: urls -dir data -depth 3 > Run
 Set breakpoints where needed and use Debug As > Java Application
9、Inspect the results
 To inspect the segments directory:
 Open the org.apache.nutch.segment.SegmentReader class under src/java
 Right-click > Run As > Java Application; the console prints the command's usage
 Right-click > Run As > Run Configurations > Arguments > in the Program arguments box enter: -dump data/segments/*  data/segments/dump
 Open the file data/segments/dump/dump in a text editor to see what is stored in the segments
 To inspect the crawldb, open the org.apache.nutch.crawl.CrawlDbReader class under src/java
 Right-click > Run As > Java Application; the console prints the command's usage
 Right-click > Run As > Run Configurations > Arguments > in the Program arguments box enter: data/crawldb -stats
 The console prints the crawldb statistics
 To inspect the linkdb, open the org.apache.nutch.crawl.LinkDbReader class under src/java
 Right-click > Run As > Java Application; the console prints the command's usage
 Right-click > Run As > Run Configurations > Arguments > in the Program arguments box enter: data/linkdb -dump data/linkdb_dump
 Open the file data/linkdb_dump/part-00000 in a text editor to see what is stored in the linkdb
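 The same checks can also be run from the command line instead of Eclipse. A minimal sketch using the readdb, readseg and readlinkdb subcommands of bin/nutch, run from runtime/local after the ant build (the paths assume the data directory produced in step 8):
 cd /home/ysc/workspace/nutch1.5.1/runtime/local
 # crawldb statistics (CrawlDbReader)
 bin/nutch readdb ../../data/crawldb -stats
 # dump the newest segment (SegmentReader)
 s=`ls -d ../../data/segments/2* | tail -1`
 bin/nutch readseg -dump $s ../../data/segments/dump
 # dump the linkdb (LinkDbReader)
 bin/nutch readlinkdb ../../data/linkdb -dump ../../data/linkdb_dump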
10、全网分步骤抓取
 在左部Package Explorer的 nutch1.5.1文件夹下的build.xml文件上单击右键 > Run As > Ant Build
 cd  /home/ysc/workspace/nutch1.5.1/runtime/local
 #准备URL列表
 wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
 gunzip content.rdf.u8.gz
 mkdir dmoz
 bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/url
 #注入URL
 bin/nutch inject crawl/crawldb dmoz
 #生成抓取列表
 bin/nutch generate crawl/crawldb crawl/segments
 #第一次抓取
 s1=`ls -d crawl/segments/2* | tail -1`
 echo $s1
 #抓取网页
 bin/nutch fetch $s1
 #解析网页
 bin/nutch parse $s1
 #更新URL状态
 bin/nutch updatedb crawl/crawldb $s1
 #第二次抓取
 bin/nutch generate crawl/crawldb crawl/segments -topN 1000
 s2=`ls -d crawl/segments/2* | tail -1`
 echo $s2
 bin/nutch fetch $s2
 bin/nutch parse $s2
 bin/nutch updatedb crawl/crawldb $s2
 #第三次抓取
 bin/nutch generate crawl/crawldb crawl/segments -topN 1000
 s3=`ls -d crawl/segments/2* | tail -1`
 echo $s3
 bin/nutch fetch $s3
 bin/nutch parse $s3
 bin/nutch updatedb crawl/crawldb $s3
 #生成反向链接库
 bin/nutch invertlinks crawl/linkdb -dir crawl/segments
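 The generate/fetch/parse/updatedb rounds above can also be scripted rather than typed one by one. A minimal sketch, assuming it is run from runtime/local with the URLs already injected as above; ROUNDS and TOPN are arbitrary values:
 #!/bin/bash
 ROUNDS=3
 TOPN=1000
 for ((i = 1; i <= ROUNDS; i++)); do
   bin/nutch generate crawl/crawldb crawl/segments -topN $TOPN || break
   segment=`ls -d crawl/segments/2* | tail -1`
   echo "round $i: $segment"
   bin/nutch fetch $segment
   bin/nutch parse $segment
   bin/nutch updatedb crawl/crawldb $segment
 done
 # then run the invertlinks command shown above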
 #Install SOLR
 cd /home/ysc/
 wget http://mirror.bjtu.edu.cn/apache/lucene/solr/3.6.1/apache-solr-3.6.1.tgz
 tar -xvf apache-solr-3.6.1.tgz
 cd apache-solr-3.6.1/example
 
 NUTCH_RUNTIME_HOME=/home/ysc/workspace/nutch1.5.1/runtime/local
 APACHE_SOLR_HOME=/home/ysc/apache-solr-3.6.1
 If the page content should be stored in the index, change the following line in schema.xml
 <field name="content" type="text" stored="false" indexed="true"/>
 to
 <field name="content" type="text" stored="true" indexed="true"/>
 #Start the SOLR server
 java -jar start.jar
 http://127.0.0.1:8983/solr/admin/stats.jsp
 #Submit the index
 bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
 #Or crawl and index to SOLR in one step
 bin/nutch crawl urls -dir data -depth 2 -topN 100 -solr http://127.0.0.1:8983/solr/
 All documents:
 http://127.0.0.1:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on
 Documents whose title contains "网易" (NetEase):
 http://127.0.0.1:8983/solr/select/?q=title%3A%E7%BD%91%E6%98%93&version=2.2&start=0&rows=10&indent=on
 #Install Luke, the Lucene index inspector
 cd /home/ysc/
 wget http://luke.googlecode.com/files/lukeall-3.5.0.jar
 java -jar lukeall-3.5.0.jar
 Path: /home/ysc/apache-solr-3.6.1/example/solr/data
 #Install the mmseg4j Chinese tokenizer
 cd /home/ysc/
 wget http://mmseg4j.googlecode.com/files/mmseg4j-1.8.5.zip
 unzip mmseg4j-1.8.5.zip -d  mmseg4j-1.8.5
 
 APACHE_SOLR_HOME=/home/ysc/apache-solr-3.6.1
 mkdir $APACHE_SOLR_HOME/example/solr/lib
 mkdir $APACHE_SOLR_HOME/example/solr/dic
 cp mmseg4j-1.8.5/mmseg4j-all-1.8.5.jar $APACHE_SOLR_HOME/example/solr/lib
 cp mmseg4j-1.8.5/data/*.dic $APACHE_SOLR_HOME/example/solr/dic
 
 将${APACHE_SOLR_HOME}/example/solr/conf/schema.xml文件中的
 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
 和
 <tokenizer class="solr.StandardTokenizerFactory"/>
 替换为
 <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="/home/ysc/apache-solr-3.6.1/example/solr/dic"/>
 
 #Restart the SOLR server
 java -jar start.jar
 Open the org.apache.nutch.indexer.solr.SolrIndexer class under src/java
 Right-click > Run As > Java Application; the console prints the command's usage
 Right-click > Run As > Run Configurations > Arguments > in the Program arguments box enter: http://127.0.0.1:8983/solr/ data/crawldb -linkdb data/linkdb data/segments/*
 Re-open the index with Luke and you will see that the Chinese word segmentation has taken effect.

三、nutch2.0
 The steps for nutch2.0 are the same as for nutch1.5.1 in 二, but the following must be configured before step 8 (development and debugging):
 In the Package Explorer on the left, right-click the nutch2.0 folder > New > Folder > Folder name: data, then choose one of the following ways to store the data:
 1、使用mysql作为数据存储
  1)、在nutch2.0/conf/nutch-site.xml中加入如下配置:
 <property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
</property>
  2)、将nutch2.0/conf/gora.properties文件中的  
  gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
gora.sqlstore.jdbc.user=sa
gora.sqlstore.jdbc.password=
  修改为
  gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://127.0.0.1:3306/nutch2
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=ROOT
  3)、打开nutch2.0/ivy/ivy.xml中的mysql-connector-java依赖
  4)、sudo apt-get install mysql-server
 2、使用hbase作为数据存储
  1)、在nutch2.0/conf/nutch-site.xml中加入如下配置:
 <property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
</property>
  2)、打开nutch2.0/ivy/ivy.xml中的gora-hbase依赖
  3)、cd /home/ysc
  4)、wget http://mirror.bit.edu.cn/apache/hbase/hbase-0.90.5/hbase-0.90.5.tar.gz
  5)、tar -xvf hbase-0.90.5.tar.gz
  6)、vi  hbase-0.90.5/conf/hbase-site.xml
   加入以下配置:
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/ysc/hbase-0.90.5-database</value>
  </property>
7)、hbase-0.90.5/bin/start-hbase.sh
  8)、Add /home/ysc/hbase-0.90.5/hbase-0.90.5.jar to the Eclipse build path of the development environment

四、Configure SSH
 Three machines, devcluster01, devcluster02, devcluster03; perform the following on each machine:
 1、sudo vi /etc/hosts
 加入以下配置:
 192.168.1.1 devcluster01
 192.168.1.2 devcluster02
 192.168.1.3 devcluster03
 2、Install the SSH service:
  sudo apt-get install openssh-server
 3、(press Enter whenever prompted)
  ssh-keygen -t rsa
  This command creates a .ssh directory under the user's home directory containing two files: id_rsa, the RSA private key, which must be kept safe and never disclosed, and id_rsa.pub, the matching public key, which may be shared.
 4、cp .ssh/id_rsa.pub .ssh/authorized_keys
 Copy the contents of /home/ysc/.ssh/authorized_keys from the three machines devcluster01, devcluster02 and devcluster03, merge them into one file, and use it to replace /home/ysc/.ssh/authorized_keys on every machine.
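 A small sketch of that merge, run once from devcluster01 (it assumes the ysc account can still log in to the other machines with a password at this point):
 for h in devcluster02 devcluster03; do
   ssh ysc@$h cat /home/ysc/.ssh/id_rsa.pub
 done >> /home/ysc/.ssh/authorized_keys
 for h in devcluster02 devcluster03; do
   scp /home/ysc/.ssh/authorized_keys ysc@$h:/home/ysc/.ssh/
 done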
 When running on devcluster01, the two commands below use hosts devcluster02 and devcluster03
 When running on devcluster02, the two commands below use hosts devcluster01 and devcluster03
 When running on devcluster03, the two commands below use hosts devcluster01 and devcluster02
 5、ssh-copy-id -i .ssh/id_rsa.pub ysc@devcluster02
 6、ssh-copy-id -i .ssh/id_rsa.pub ysc@devcluster03
 The two commands above simply append the .ssh/id_rsa.pub public key to the .ssh/authorized_keys file under the user's home directory on the remote host.

五、Install a Hadoop cluster (pseudo-distributed mode) and run Nutch
 The steps are much the same as for the fully distributed mode in 六 below; only one machine, devcluster01, is needed, so all of the highlighted hostnames are set to devcluster01, and step 11 is not needed.

六、Install a Hadoop cluster (fully distributed mode) and run Nutch
 Three machines: devcluster01, devcluster02, devcluster03 (set the hostname with vi /etc/hostname)
 Log in to devcluster01 as user ysc:
 1、cd /home/ysc
 2、wget http://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-1.1.1/hadoop-1.1.1-bin.tar.gz
 3、tar -xvf hadoop-1.1.1-bin.tar.gz
 4、cd  hadoop-1.1.1
 5、vi conf/masters
  替换内容为 :
  devcluster01
 6、vi conf/slaves
  替换内容为 :
  devcluster02
  devcluster03
 7、vi conf/core-site.xml
  加入配置:
  <property>
    <name>fs.default.name</name>
    <value>hdfs://devcluster01:9000</value>
    <description>
       Where to find the Hadoop Filesystem through the network.
       Note 9000 is not the default port.
       (This is slightly changed from previous versions which didnt have "hdfs")
    </description>
  </property>
    <property>
     <name>hadoop.security.authorization</name>
      <value>true</value>
    </property>
编辑conf/hadoop-policy.xml
 8、vi conf/hdfs-site.xml
  加入配置:
<property>
  <name>dfs.name.dir</name>
  <value>/home/ysc/dfs/filesystem/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/home/ysc/dfs/filesystem/data</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.block.size</name>
  <value>671088640</value>
  <description>The default block size for new files.</description>
</property>
 9、vi conf/mapred-site.xml
  加入配置:
<property>
  <name>mapred.job.tracker</name>
  <value>devcluster01:9001</value>
  <description>
    The host and port that the MapReduce job tracker runs at. If
    "local", then jobs are run in-process as a single map and
    reduce task.
    Note 9001 is not the default port.
  </description>
</property>  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
  <description>If true, then multiple instances of some reduce tasks
               may be executed in parallel.</description>
</property>  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
  <description>If true, then multiple instances of some map tasks
               may be executed in parallel.</description>
</property>  <name>mapred.child.java.opts</name>
  <value>-Xmx2000m</value>
</property>  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
  <description>
    the core number of host
  </description>
</property>  <name>mapred.map.tasks</name>
  <value>4</value>
</property>  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
    <description>
    define mapred.map tasks to be number of slave hosts.the best number is the  number of slave hosts plus the core numbers of per host
    </description>
</property>  <name>mapred.reduce.tasks</name>
  <value>4</value>
  <description>
    define mapred.reduce tasks to be number of slave hosts.the best number is the  number of slave hosts plus the core numbers of per host
  </description>
</property>  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
  <description>If the job outputs are to compressed as SequenceFiles, how should they be compressed? Should be one of NONE, RECORD or BLOCK.
  </description>
</property>  <name>mapred.output.compress</name>
  <value>true</value>
  <description>Should the job outputs be compressed?
  </description>
</property>  <name>mapred.compress.map.output</name>
  <value>true</value>
  <description>Should the outputs of the maps be compressed before being                sent across the network. Uses SequenceFile compression.
  </description>
</property>  <name>mapred.system.dir</name>
  <value>/home/ysc/mapreduce/system</value>
</property>  <name>mapred.local.dir</name>
  <value>/home/ysc/mapreduce/local</value>
</property>
 10、vi conf/hadoop-env.sh
  追加:
export JAVA_HOME=/home/ysc/jdk1.7.0_05
  export HADOOP_HEAPSIZE=2000
  #替换掉默认的垃圾回收器,因为默认的垃圾回收器在多线程环境下会有更多的wait等待
  export HADOOP_OPTS="-server -Xmn256m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"
 11、复制HADOOP文件
  scp -r /home/ysc/hadoop-1.1.1 ysc@devcluster02:/home/ysc/hadoop-1.1.1
  scp -r /home/ysc/hadoop-1.1.1 ysc@devcluster03:/home/ysc/hadoop-1.1.1
 12、sudo vi /etc/profile
  追加并重启系统:
  export PATH=/home/ysc/hadoop-1.1.1/bin:$PATH
 13、格式化名称节点并启动集群
  hadoop namenode -format
  start-all.sh
 14、cd /home/ysc/workspace/nutch1.5.1/runtime/deploy
  mkdir urls
  echo http://news.163.com > urls/url
  hadoop dfs -put urls urls
  bin/nutch crawl urls -dir data -depth 2 -topN 100
 15、Visit http://localhost:50030 to see the JobTracker status, http://localhost:50060 to see the TaskTracker status, and http://localhost:50070 to see the NameNode and the overall state of the distributed filesystem, and to browse its files and logs
 16、Stop the cluster with stop-all.sh
 17、If the NameNode and the SecondaryNameNode are not on the same machine, add the following to conf/hdfs-site.xml on the SecondaryNameNode:
   <property>
     <name>dfs.http.address</name>
     <value>namenode:50070</value>
   </property>
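 Before going further, a quick smoke test of the new cluster can be run on devcluster01 (a sketch; /tmp/smoke is an arbitrary path):
 hadoop dfsadmin -report     # the datanodes devcluster02 and devcluster03 should be listed as live
 hadoop fs -mkdir /tmp/smoke
 hadoop fs -put /etc/hosts /tmp/smoke/hosts
 hadoop fs -cat /tmp/smoke/hosts
 hadoop fs -rmr /tmp/smoke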

七、Configure Ganglia to monitor the Hadoop and HBase clusters
 1、Server side (installed on the master, devcluster01)
  1)、ssh devcluster01
  2)、addgroup ganglia
           adduser --ingroup ganglia ganglia
  3)、sudo apt-get install  ganglia-monitor ganglia-webfront gmetad
   //Note: on Ubuntu 10.04 the ganglia-webfront package is called ganglia-webfrontend
   //If the install fails, run sudo apt-get update; if the update fails, remove the offending source entry
  4)、vi /etc/ganglia/gmond.conf
   Find setuid = yes and change it to setuid = no;
   then find name in the cluster block and change it to name = "hadoop-cluster";
  5)、sudo apt-get install rrdtool
  6)、vi /etc/ganglia/gmetad.conf
   Add data sources to this file, i.e. the other two monitored nodes:
   data_source "hadoop-cluster" devcluster01:8649 devcluster02:8649 devcluster03:8649
   gridname "Hadoop"
 2、数据源端(安装到所有slaves上)
  1)、ssh devcluster02
   addgroup ganglia
   adduser --ingroup ganglia ganglia 
   sudo apt-get install  ganglia-monitor

  2)、ssh devcluster03
   addgroup ganglia
   adduser --ingroup ganglia ganglia 
   sudo apt-get install  ganglia-monitor

  3)、ssh devcluster01
   scp /etc/ganglia/gmond.conf devcluster02:/etc/ganglia/gmond.conf
   scp /etc/ganglia/gmond.conf devcluster03:/etc/ganglia/gmond.conf
 3、配置WEB
  1)、ssh devcluster01
  2)、sudo ln -s /usr/share/ganglia-webfrontend /var/www/ganglia
  3)、vi /etc/apache2/apache2.conf
   添加:
   ServerName devcluster01
 4、重启服务
  1)、ssh devcluster02
   sudo /etc/init.d/ganglia-monitor restart
   ssh devcluster03
   sudo /etc/init.d/ganglia-monitor restart
  2)、ssh devcluster01
   sudo /etc/init.d/ganglia-monitor restart
   sudo /etc/init.d/gmetad restart
   sudo /etc/init.d/apache2 restart
 5、访问页面
  http:// devcluster01/ganglia
 6、集成hadoop
  1)、ssh devcluster01
  2)、cd /home/ysc/hadoop-1.1.1
  3)、vi conf/hadoop-metrics2.properties
  # versions newer than 0.20 use ganglia31
  *.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
  *.sink.ganglia.period=10
  # default for supportsparse is false
  *.sink.ganglia.supportsparse=true
 *.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both
 *.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40
  #广播IP地址,这是缺省的,统一设该值(只能用组播地址239.2.11.71)
  namenode.sink.ganglia.servers=239.2.11.71:8649
  datanode.sink.ganglia.servers=239.2.11.71:8649
  jobtracker.sink.ganglia.servers=239.2.11.71:8649
  tasktracker.sink.ganglia.servers=239.2.11.71:8649
  maptask.sink.ganglia.servers=239.2.11.71:8649
  reducetask.sink.ganglia.servers=239.2.11.71:8649
  dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
  dfs.period=10
  dfs.servers=239.2.11.71:8649
  mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
  mapred.period=10
  mapred.servers=239.2.11.71:8649
  jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
  jvm.period=10
  jvm.servers=239.2.11.71:8649
  4)、scp conf/hadoop-metrics2.properties root@devcluster02:/home/ysc/hadoop-1.1.1/conf/hadoop-metrics2.properties
  5)、scp conf/hadoop-metrics2.properties root@devcluster03:/home/ysc/hadoop-1.1.1/conf/hadoop-metrics2.properties
  6)、stop-all.sh
  7)、start-all.sh
 7、集成hbase
  1)、ssh devcluster01
  2)、cd /home/ysc/hbase-0.92.2
  3)、vi conf/hadoop-metrics.properties(只能用组播地址239.2.11.71)
   hbase.extendedperiod = 3600
   hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
   hbase.period=10
   hbase.servers=239.2.11.71:8649
   jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
   jvm.period=10
   jvm.servers=239.2.11.71:8649
   rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
   rpc.period=10
   rpc.servers=239.2.11.71:8649
  4)、scp conf/hadoop-metrics.properties root@devcluster02:/home/ysc/hbase-0.92.2/conf/hadoop-metrics.properties
  5)、scp conf/hadoop-metrics.properties root@devcluster03:/home/ysc/hbase-0.92.2/conf/hadoop-metrics.properties
  6)、stop-hbase.sh
  7)、start-hbase.sh

八、Configure Snappy compression for Hadoop
 1、wget http://snappy.googlecode.com/files/snappy-1.0.5.tar.gz
 2、tar -xzvf snappy-1.0.5.tar.gz
 3、cd snappy-1.0.5
 4、./configure
 5、make
 6、make install
 7、scp /usr/local/lib/libsnappy* devcluster01:/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/
 scp /usr/local/lib/libsnappy* devcluster02:/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/
 scp /usr/local/lib/libsnappy* devcluster03:/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/
 8、vi /etc/profile
  追加:
  export LD_LIBRARY_PATH=/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64
 9、修改mapred-site.xml
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
    <description>If the job outputs are to compressed as SequenceFiles, how should
        they be compressed? Should be one of NONE, RECORD or BLOCK.
    </description>
  </property>
  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
    <description>Should the job outputs be compressed?
    </description>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
    <description>Should the outputs of the maps be compressed before being
        sent across the network. Uses SequenceFile compression.
    </description>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
    <description>If the map outputs are compressed, how should they be
        compressed?
    </description>
  </property>
  <property>
    <name>mapred.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
    <description>If the job outputs are compressed, how should they be compressed?
    </description>
  </property>

九、Configure LZO compression for Hadoop
 1、wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
 2、tar -zxvf lzo-2.06.tar.gz
 3、cd lzo-2.06
 4、./configure --enable-shared
 5、make
 6、make install
 7、scp /usr/local/lib/liblzo2.* devcluster01:/lib/x86_64-linux-gnu
 scp /usr/local/lib/liblzo2.* devcluster02:/lib/x86_64-linux-gnu
 scp /usr/local/lib/liblzo2.* devcluster03:/lib/x86_64-linux-gnu
 8、wget http://hadoop-gpl-compression.apache-extras.org.codespot.com/files/hadoop-gpl-compression-0.1.0-rc0.tar.gz
 9、tar -xzvf hadoop-gpl-compression-0.1.0-rc0.tar.gz
 10、cd hadoop-gpl-compression-0.1.0
 11、cp lib/native/Linux-amd64-64/* /home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/
 12、cp hadoop-gpl-compression-0.1.0.jar /home/ysc/hadoop-1.1.1/lib/(这里hadoop集群的版本要和compression使用的版本一致)
 13、scp -r /home/ysc/hadoop-1.1.1/lib devcluster02:/home/ysc/hadoop-1.1.1/
 scp -r /home/ysc/hadoop-1.1.1/lib devcluster03:/home/ysc/hadoop-1.1.1/
 14、vi /etc/profile
  追加:
  export LD_LIBRARY_PATH=/home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64
 15、修改core-site.xml
  <property>
    <name>io.compression.codecs</name>
    <value>com.hadoop.compression.lzo.LzoCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec</value>
    <description>A list of the compression codec classes that can be used
        for compression/decompression.</description>
  </property>
  <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
  <property>
    <name>fs.trash.interval</name>
    <value>1440</value>
    <description>Number of minutes between trash checkpoints.
    If zero, the trash feature is disabled.
    </description>
  </property>
 16、修改mapred-site.xml
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
    <description>If the job outputs are to compressed as SequenceFiles, how should
        they be compressed? Should be one of NONE, RECORD or BLOCK.
    </description>
  </property>
  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
    <description>Should the job outputs be compressed?
    </description>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
    <description>Should the outputs of the maps be compressed before being
        sent across the network. Uses SequenceFile compression.
    </description>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
    <description>If the map outputs are compressed, how should they be
        compressed?
    </description>
  </property>
  <property>
    <name>mapred.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
    <description>If the job outputs are compressed, how should they be compressed?
    </description>
  </property>

十、Configure a ZooKeeper cluster to run HBase
 1、ssh devcluster01
 2、cd /home/ysc
 3、wget http://mirror.bjtu.edu.cn/apache/zookeeper/stable/zookeeper-3.4.5.tar.gz
 4、tar -zxvf  zookeeper-3.4.5.tar.gz
 5、cd zookeeper-3.4.5
 6、cp conf/zoo_sample.cfg  conf/zoo.cfg
 7、vi conf/zoo.cfg
  修改:dataDir=/home/ysc/zookeeper
  添加:
   server.1=devcluster01:2888:3888
   server.2=devcluster02:2888:3888
   server.3=devcluster03:2888:3888
   maxClientCnxns=100
 8、scp -r  zookeeper-3.4.5  devcluster01:/home/ysc
 scp -r  zookeeper-3.4.5  devcluster02:/home/ysc
 scp -r  zookeeper-3.4.5  devcluster03:/home/ysc
 9、分别在三台机器上面执行:
  ssh devcluster01
  mkdir /home/ysc/zookeeper(注:dataDir是zookeeper的数据目录,需要手动创建)
  echo 1 > /home/ysc/zookeeper/myid
  ssh devcluster02
  mkdir /home/ysc/zookeeper
  echo 2 > /home/ysc/zookeeper/myid
  ssh devcluster03
  mkdir /home/ysc/zookeeper
  echo 3 > /home/ysc/zookeeper/myid
 10、分别在三台机器上面执行:
  cd /home/ysc/zookeeper-3.4.5
  bin/zkServer.sh start
  bin/zkCli.sh -server devcluster01:2181
  bin/zkServer.sh status
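 A quick way to confirm that every node is serving, using ZooKeeper's four-letter stat command (a sketch; it assumes netcat is installed):
 for h in devcluster01 devcluster02 devcluster03; do
   echo -n "$h: "; echo stat | nc $h 2181 | grep Mode
 done
 One node should report Mode: leader and the other two Mode: follower.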

十一、Configure an HBase cluster to run nutch-2.1 (Region Servers may go down because of memory problems)
 1、nutch-2.1 uses gora-0.2.1, and gora-0.2.1 uses hbase-0.90.4; hbase-0.90.4 is incompatible with hadoop-1.1.1, hbase-0.94.4 is incompatible with gora-0.2.1, and hbase-0.92.2 works fine. HBase also depends on the system clocks being synchronized, with a skew of no more than 30 s.
 sudo apt-get install ntp
 sudo ntpdate -u 210.72.145.44
 2、HBase is a database and uses a large number of file handles at the same time. The default limit of 1024 used by most Linux systems is not enough. The nproc limit of the hbase user also has to be raised; if it is too low, OutOfMemoryError exceptions can occur under load.
  vi /etc/security/limits.conf
  Add:
   ysc soft nproc 32000
   ysc hard nproc 32000
   ysc soft nofile 32768
   ysc hard nofile 32768
 vi /etc/pam.d/common-session
 添加:
   session required  pam_limits.so
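 To confirm the new limits are in effect, log in again as ysc and check (a minimal sketch):
 ulimit -n    # open files, expect 32768
 ulimit -u    # max user processes, expect 32000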
 3、登陆master,下载并解压hbase
  ssh devcluster01
  cd /home/ysc
  wget http://apache.etoak.com/hbase/hbase-0.92.2/hbase-0.92.2.tar.gz
  tar -zxvf hbase-0.92.2.tar.gz
  cd hbase-0.92.2
 4、修改配置文件hbase-env.sh
  vi conf/hbase-env.sh
  追加:
  export JAVA_HOME=/home/ysc/jdk1.7.0_05
  export HBASE_MANAGES_ZK=false
  export HBASE_HEAPSIZE=10000
  #替换掉默认的垃圾回收器,因为默认的垃圾回收器在多线程环境下会有更多的wait等待
  export HBASE_OPTS="-server -Xmn256m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"
 5、修改配置文件hbase-site.xml
  vi conf/hbase-site.xml
  <property> 
   <name>hbase.rootdir</name> 
   <value>hdfs://devcluster01:9000/hbase</value>    
  </property>
  <property> 
   <name>hbase.cluster.distributed</name> 
   <value>true</value> 
  </property> 
  <property>  
   <name>hbase.zookeeper.quorum</name>       
   <value>devcluster01,devcluster02,devcluster03</value>  
  </property>
  <property>
   <name>hfile.block.cache.size</name>
   <value>0.25</value>
   <description>
    Percentage of maximum heap (-Xmx setting) to allocate to block cache
    used by HFile/StoreFile. Default of 0.25 means allocate 25%.
    Set to 0 to disable but it's not recommended.
   </description>
  </property>
  <property>
   <name>hbase.regionserver.global.memstore.upperLimit</name>
   <value>0.4</value>
   <description>Maximum size of all memstores in a region server before new
     updates are blocked and flushes are forced. Defaults to 40% of heap
   </description>
  </property>
    <property>
   <name>hbase.regionserver.global.memstore.lowerLimit</name>
   <value>0.35</value>
   <description>When memstores are being forced to flush to make room in
    memory, keep flushing until we hit this mark. Defaults to 35% of heap.
    This value equal to hbase.regionserver.global.memstore.upperLimit causes
    the minimum possible flushing to occur when updates are blocked due to
    memstore limiting.
   </description>
    </property>
  <property>
   <name>hbase.hregion.majorcompaction</name>
   <value>0</value>
   <description>The time (in miliseconds) between 'major' compactions of all
    HStoreFiles in a region.  Default: 1 day.
    Set to 0 to disable automated major compactions.
   </description>
  </property>
 6、修改配置文件regionservers
  vi conf/regionservers
  devcluster01
  devcluster02
  devcluster03
  7、Because HBase is built on top of Hadoop, the hadoop*.jar used by Hadoop and the one shipped with HBase must match. Replace the hadoop*.jar under HBase's lib directory with the one from the Hadoop installation to avoid version conflicts.
  cp  /home/ysc/hadoop-1.1.1/hadoop-core-1.1.1.jar  /home/ysc/hbase-0.92.2/lib
  rm  /home/ysc/hbase-0.92.2/lib/hadoop-core-1.0.3.jar
 8、复制文件到regionservers
  scp -r /home/ysc/hbase-0.92.2 devcluster01:/home/ysc
  scp -r /home/ysc/hbase-0.92.2 devcluster02:/home/ysc
  scp -r /home/ysc/hbase-0.92.2 devcluster03:/home/ysc
 9、启动hadoop并创建目录
  hadoop fs -mkdir /hbase
 10、管理HBase集群:
  启动初始 HBase 集群:
   bin/start-hbase.sh
  停止HBase 集群:
   bin/stop-hbase.sh
  启动额外备份主服务器,可以启动到 9 个备份服务器 (总数10 个):
   bin/local-master-backup.sh start 1
   bin/local-master-backup.sh start 2 3
  启动更多 regionservers, 支持到 99 个额外regionservers (总100个):
   bin/local-regionservers.sh start 1
   bin/local-regionservers.sh start 2 3 4 5
  停止备份主服务器:
   cat /tmp/hbase-ysc-1-master.pid |xargs kill -9
  停止单独 regionserver:
   bin/local-regionservers.sh stop 1
  使用HBase命令行模式:
   bin/hbase shell
 11、web界面
  http://devcluster01:60010
  http://devcluster01:60030
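 Once the cluster is up, the HBase shell can be used for a quick end-to-end check (a sketch; t_test is a throwaway table name):
 cd /home/ysc/hbase-0.92.2
 echo "status 'simple'" | bin/hbase shell
 echo "create 't_test', 'cf'" | bin/hbase shell
 echo "list" | bin/hbase shell
 echo "disable 't_test'" | bin/hbase shell
 echo "drop 't_test'" | bin/hbase shell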
 12、如运行nutch2.1则方法一:
  cp conf/hbase-site.xml /home/ysc/nutch-2.1/conf
  cd /home/ysc/nutch-2.1
  ant
  cd runtime/deploy
  unzip -d apache-nutch-2.1 apache-nutch-2.1.job
  rm  apache-nutch-2.1.job
  cd apache-nutch-2.1
  rm lib/hbase-0.90.4.jar
  cp /home/ysc/hbase-0.92.2/hbase-0.92.2.jar  lib
  zip -r ../apache-nutch-2.1.job ./*
  cd ..
  rm -r apache-nutch-2.1
 13、如运行nutch2.1则方法二:
  cp conf/hbase-site.xml /home/ysc/nutch-2.1/conf
  cd /home/ysc/nutch-2.1
  cp /home/ysc/hbase-0.92.2/hbase-0.92.2.jar  lib
  ant
  cd runtime/deploy
   zip -d apache-nutch-2.1.job lib/hbase-0.90.4.jar
  To enable Snappy compression for the HBase tables used by Nutch:
  1、vi conf/gora-hbase-mapping.xml
   Add the attribute compression="SNAPPY" to the family element
 2、mkdir /home/ysc/hbase-0.92.2/lib/native/Linux-amd64-64
 3、cp /home/ysc/hadoop-1.1.1/lib/native/Linux-amd64-64/* /home/ysc/hbase-0.92.2/lib/native/Linux-amd64-64
 4、vi /home/ysc/hbase-0.92.2/conf/hbase-site.xml
  增加:
                <property>
                        <name>hbase.regionserver.codecs</name>
                        <value>snappy</value>
                </property>

十二、Configure an Accumulo cluster to run nutch-2.1 (there is a bug in Gora)
 1、wget http://apache.etoak.com/accumulo/1.4.2/accumulo-1.4.2-dist.tar.gz
 2、tar -xzvf accumulo-1.4.2-dist.tar.gz
 3、cd accumulo-1.4.2
 4、cp conf/examples/3GB/standalone/* conf
 5、vi conf/accumulo-env.sh
  export HADOOP_HOME=/home/ysc/cluster3
  export ZOOKEEPER_HOME=/home/ysc/zookeeper-3.4.5
  export JAVA_HOME=/home/jdk1.7.0_01
  export ACCUMULO_HOME=/home/ysc/accumulo-1.4.2
 6、vi conf/slaves
  devcluster01
  devcluster02
  devcluster03
 7、vi conf/masters
  devcluster01
 8、vi conf/accumulo-site.xml
  <property>
    <name>instance.zookeeper.host</name>
    <value>host6:2181,host8:2181</value>
    <description>comma separated list of zookeeper servers</description>
  </property>
  <property>
    <name>logger.dir.walog</name>
    <value>walogs</value>
    <description>The directory used to store write-ahead logs on the local filesystem. It is possible to specify a comma-separated list of directories.</description>
  </property>
  <property>
    <name>instance.secret</name>
    <value>ysc</value>
    <description>A secret unique to a given instance that all servers must know in order to communicate with one another.
        Change it before initialization. To change it later use ./bin/accumulo org.apache.accumulo.server.util.ChangeSecret [oldpasswd] [newpasswd],
        and then update this file.
    </description>
  </property>
  <property>
    <name>tserver.memory.maps.max</name>
    <value>3G</value>
  </property>
  <property>
    <name>tserver.cache.data.size</name>
    <value>50M</value>
  </property>
  <property>
    <name>tserver.cache.index.size</name>
    <value>512M</value>
  </property>
  <property>
    <name>trace.password</name>
    <!--
   change this to the root user's password, and/or change the user below
     -->
    <value>ysc</value>
  </property>
  <property>
    <name>trace.user</name>
    <value>root</value>
  </property>
 9、bin/accumulo init
 10、bin/start-all.sh
 11、bin/stop-all.sh
 12、Web UI: http://devcluster01:50095/
 Configure nutch-2.1 to use Accumulo as the Gora data store:
 1、cd  /home/ysc/nutch-2.1
 2、vi  conf/gora.properties
  Add:
  gora.datastore.default=org.apache.gora.accumulo.store.AccumuloStore
  gora.datastore.accumulo.mock=false
  gora.datastore.accumulo.instance=accumulo
  gora.datastore.accumulo.zookeepers=host6,host8
  gora.datastore.accumulo.user=root
  gora.datastore.accumulo.password=ysc
 3、vi  conf/nutch-site.xml
  增加:
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.accumulo.store.AccumuloStore</value>
  </property>
 4、vi ivy/ivy.xml
  增加:
  <dependency org="org.apache.gora" name="gora-accumulo" rev="0.2.1" conf="*->default" />
 5、升级accumulo
  cp /home/ysc/accumulo-1.4.2/lib/accumulo-core-1.4.2.jar  /home/ysc/nutch-2.1/lib
  cp /home/ysc/accumulo-1.4.2/lib/accumulo-start-1.4.2.jar  /home/ysc/nutch-2.1/lib
  cp /home/ysc/accumulo-1.4.2/lib/cloudtrace-1.4.2.jar  /home/ysc/nutch-2.1/lib
 6、ant
 7、cd runtime/deploy
 8、删除旧jar
  zip -d apache-nutch-2.1.job lib/accumulo-core-1.4.0.jar
  zip -d apache-nutch-2.1.job lib/accumulo-start-1.4.0.jar
   zip -d apache-nutch-2.1.job lib/cloudtrace-1.4.2.jar

十三、Configure a Cassandra cluster to run nutch-2.1 (Cassandra uses a decentralized architecture)
 1、vi /etc/hosts (note: log in to every machine and resolve localhost to the machine's real address)
  192.168.1.1       localhost
 2、wget http://labs.mop.com/apache-mirror/cassandra/1.2.0/apache-cassandra-1.2.0-bin.tar.gz
 3、tar -xzvf  apache-cassandra-1.2.0-bin.tar.gz
 4、cd apache-cassandra-1.2.0
 5、vi conf/cassandra-env.sh
  增加:
  MAX_HEAP_SIZE="4G"
  HEAP_NEWSIZE="800M"
 6、vi conf/log4j-server.properties
  修改:
  log4j.appender.R.File=/home/ysc/cassandra/system.log
 7、vi conf/cassandra.yaml
  修改:
  cluster_name: 'Cassandra  Cluster'
  data_file_directories:
      - /home/ysc/cassandra/data
  commitlog_directory: /home/ysc/cassandra/commitlog
   saved_caches_directory: /home/ysc/cassandra/saved_caches
   listen_address: 192.168.1.1
   rpc_address: 192.168.1.1
   thrift_max_message_length_in_mb: 1024
 8、vi bin/stop-server
  增加:
  user=`whoami`
  pgrep -u $user -f cassandra | xargs kill -9
 9、复制cassandra到其他节点:
  cd ..
  scp -r apache-cassandra-1.2.0 devcluster02:/home/ysc
  scp -r apache-cassandra-1.2.0 devcluster03:/home/ysc
  分别在devcluster02和devcluster03上面修改:
  vi conf/cassandra.yaml
   listen_address: 192.168.1.2
   rpc_address: 192.168.1.2
  vi conf/cassandra.yaml
   listen_address: 192.168.1.3
   rpc_address: 192.168.1.3
 10、分别在3个节点上面运行
  bin/cassandra
  bin/cassandra -f   参数 -f 的作用是让 Cassandra 以前端程序方式运行,这样有利于调试和观察日志信息,而在实际生产环境中这个参数是不需要的(即 Cassandra 会以 daemon 方式运行)
 11、bin/nodetool -host devcluster01 ring
        bin/nodetool -host devcluster01 info
 12、bin/stop-server
 13、bin/cassandra-cli
 Configure nutch-2.1 to use Cassandra as the Gora data store:
 1、cd  /home/ysc/nutch-2.1
 2、vi  conf/gora.properties
  增加:
  gora.cassandrastore.servers=host2:9160,host6:9160,host8:9160
 3、vi  conf/nutch-site.xml
  增加:
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.cassandra.store.CassandraStore</value>
  </property>
 4、vi ivy/ivy.xml
  增加:
  <dependency org="org.apache.gora" name="gora-cassandra" rev="0.2.1" conf="*->default" />
 5、升级cassandra
  cp /home/ysc/apache-cassandra-1.2.0/lib/apache-cassandra-1.2.0.jar  /home/ysc/nutch-2.1/lib
  cp /home/ysc/apache-cassandra-1.2.0/lib/apache-cassandra-thrift-1.2.0.jar  /home/ysc/nutch-2.1/lib
  cp /home/ysc/apache-cassandra-1.2.0/lib/jline-1.0.jar  /home/ysc/nutch-2.1/lib
 6、ant
 7、cd runtime/deploy
 8、删除旧jar
  zip -d apache-nutch-2.1.job lib/cassandra-thrift-1.1.2.jar
  zip -d apache-nutch-2.1.job lib/jline-0.9.1.jar

十四、Configure a standalone MySQL server to run nutch-2.1
 1、apt-get install mysql-server mysql-client
 2、vi /etc/mysql/my.cnf
  修改:
  bind-address            = 221.194.43.2
  在[client]下增加:
  default-character-set=utf8
  在[mysqld]下增加:
  default-character-set=utf8
 3、mysql -uroot -pysc
  SHOW VARIABLES LIKE '%character%';
 4、service mysql restart
 5、mysql -uroot -pysc
  GRANT ALL PRIVILEGES ON *.* TO root@"%" IDENTIFIED BY "ysc";
 6、vi conf/gora-sql-mapping.xml
  修改字段的长度
  <primarykey column="id" length="333"/>
  <field name="content" column="content" />
  <field name="text" column="text" length="19892"/>
 7、启动nutch之后登陆mysql
   ALTER TABLE webpage MODIFY COLUMN content MEDIUMBLOB;
   ALTER TABLE webpage MODIFY COLUMN text MEDIUMTEXT;
   ALTER TABLE webpage MODIFY COLUMN title MEDIUMTEXT;
   ALTER TABLE webpage MODIFY COLUMN reprUrl MEDIUMTEXT;
   ALTER TABLE webpage MODIFY COLUMN baseUrl MEDIUMTEXT;
   ALTER TABLE webpage MODIFY COLUMN typ MEDIUMTEXT;
   ALTER TABLE webpage MODIFY COLUMN inlinks MEDIUMBLOB;
   ALTER TABLE webpage MODIFY COLUMN outlinks MEDIUMBLOB;
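  If the database is created by hand rather than by gora's createDatabaseIfNotExist=true (see the JDBC URL in the next block), a UTF-8 default can be set explicitly, and the column changes above can be verified afterwards. A sketch, assuming the database name nutch from that URL:
  mysql -uroot -pysc -e "CREATE DATABASE IF NOT EXISTS nutch DEFAULT CHARACTER SET utf8;"
  mysql -uroot -pysc -e "SHOW CREATE TABLE nutch.webpage\G"    # run after Nutch has created the webpage table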
 Configure nutch-2.1 to use MySQL as the Gora data store:
 1、cd  /home/ysc/nutch-2.1
 2、vi  conf/gora.properties
  Add:
   gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
 gora.sqlstore.jdbc.url=jdbc:mysql://host2:3306/nutch?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=utf8
  gora.sqlstore.jdbc.user=root
  gora.sqlstore.jdbc.password=ysc
 3、vi  conf/nutch-site.xml
  增加:
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.sql.store.SqlStore </value>
  </property>
  <property>
    <name>encodingdetector.charset.min.confidence</name>
    <value>1</value>
    <description>A integer between 0-100 indicating minimum confidence value
    for charset auto-detection. Any negative value disables auto-detection.
    </description>
  </property>
 4、vi ivy/ivy.xml
  增加:
  <dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

十五、Use DataFileAvroStore as the data store for nutch-2.1
 1、cd  /home/ysc/nutch-2.1
 2、vi  conf/gora.properties
  增加:
  gora.datafileavrostore.output.path=datafileavrostore
  gora.datafileavrostore.input.path=datafileavrostore
 3、vi  conf/nutch-site.xml
  增加:
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.avro.store.DataFileAvroStore</value>
  </property>
  <property>
    <name>encodingdetector.charset.min.confidence</name>
    <value>1</value>
    <description>A integer between 0-100 indicating minimum confidence value
    for charset auto-detection. Any negative value disables auto-detection.
    </description>
  </property>

十六、Use AvroStore as the data store for nutch-2.1
 1、cd  /home/ysc/nutch-2.1
 2、vi  conf/gora.properties
  增加:
  gora.avrostore.codec.type=BINARY
  gora.avrostore.input.path=avrostore
  gora.avrostore.output.path=avrostore
 3、vi  conf/nutch-site.xml
  增加:
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.avro.store.AvroStore</value>
  </property>
  <property>
    <name>encodingdetector.charset.min.confidence</name>
    <value>1</value>
    <description>A integer between 0-100 indicating minimum confidence value
    for charset auto-detection. Any negative value disables auto-detection.
    </description>
  </property>

十七、Configure SOLR
 Configure Tomcat:
 1、wget http://www.fayea.com/apache-mirror/tomcat/tomcat-7/v7.0.35/bin/apache-tomcat-7.0.35.tar.gz
 2、tar -xzvf apache-tomcat-7.0.35.tar.gz
 3、cd apache-tomcat-7.0.35
 4、vi conf/server.xml
 增加URIEncoding="UTF-8":
  <Connector port="8080" protocol="HTTP/1.1"
       connectionTimeout="20000"
       redirectPort="8443" URIEncoding="UTF-8"/>
 5、mkdir conf/Catalina
 6、mkdir conf/Catalina/localhost
 7、vi conf/Catalina/localhost/solr.xml
 增加:
  <Context path="/solr">
   <Environment name="solr/home" type="java.lang.String" value="/home/ysc/solr/configuration/" override="false"/>
  </Context>
 8、cd ..
 Download SOLR:
 1、wget http://mirrors.tuna.tsinghua.edu.cn/apache/lucene/solr/4.1.0/solr-4.1.0.tgz
 2、tar -xzvf solr-4.1.0.tgz
 Set up the SOLR home directory:
 1、mkdir /home/ysc/solr
 2、cp -r solr-4.1.0/example/solr  /home/ysc/solr/configuration
 3、unzip solr-4.1.0/example/webapps/solr.war -d /home/ysc/apache-tomcat-7.0.35/webapps/solr
 Configure the schema:
 1、Copy the schema:
  cp /home/ysc/nutch-1.6/conf/schema-solr4.xml /home/ysc/solr/configuration/collection1/conf/schema.xml
 2、vi /home/ysc/solr/configuration/collection1/conf/schema.xml
  Under <fields> add:
  <field name="_version_" type="long" indexed="true" stored="true"/>
 Install the mmseg4j Chinese tokenizer:
 1、wget http://mmseg4j.googlecode.com/files/mmseg4j-1.9.1.v20130120-SNAPSHOT.zip
 2、unzip mmseg4j-1.9.1.v20130120-SNAPSHOT.zip
 3、cp mmseg4j-1.9.1-SNAPSHOT/dist/* /home/ysc/apache-tomcat-7.0.35/webapps/solr/WEB-INF/lib
 4、unzip mmseg4j-1.9.1-SNAPSHOT/dist/mmseg4j-core-1.9.1-SNAPSHOT.jar -d  mmseg4j-1.9.1-SNAPSHOT/dist/mmseg4j-core-1.9.1-SNAPSHOT
 5、mkdir /home/ysc/dic
 6、cp   mmseg4j-1.9.1-SNAPSHOT/dist/mmseg4j-core-1.9.1-SNAPSHOT/data/* /home/ysc/dic
 7、vi /home/ysc/solr/configuration/collection1/conf/schema.xml
  In the file, replace
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  and
  <tokenizer class="solr.StandardTokenizerFactory"/>
  with
  <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="/home/ysc/dic"/>
 Install APR (the Apache Portable Runtime, used by the Tomcat native connector):
 1、wget http://apache.spd.co.il/apr/apr-1.4.6.tar.gz
 2、tar -xzvf apr-1.4.6.tar.gz
 3、cd apr-1.4.6
 4、./configure
 5、make
 6、make  install
 Install APR-util:
 2、tar -xzvf apr-util-1.5.1.tar.gz
 3、cd apr-util-1.5.1
 4、./configure --with-apr=/usr/local/apr
 5、make
 6、make  install
 Install the Tomcat native library:
 2、tar -zxvf tomcat-native-1.1.24-src.tar.gz
 3、cd tomcat-native-1.1.24-src/jni/native
 4、./configure --with-apr=/usr/local/apr \
                --with-java-home=/home/ysc/jdk1.7.0_01 \
                --with-ssl=no \
                --prefix=/home/ysc/apache-tomcat-7.0.35
 5、make
 6、make  install
 7、vi /etc/profile
 增加:
 export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/ysc/apache-tomcat-7.0.35/lib:/usr/local/apr/lib
 8、source /etc/profile
 Start Tomcat:
 cd apache-tomcat-7.0.35
 bin/catalina.sh start
 http://devcluster01:8080/solr/
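 A quick check that the core answers queries (a sketch, assuming the default collection1 core from the example configuration):
 curl "http://devcluster01:8080/solr/collection1/select?q=*:*&rows=0&wt=json"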

十八、Nagios monitoring
 Server side:
 1、apt-get install apache2 nagios3 nagios-nrpe-plugin
  输入密码:nagiosadmin
 2、apt-get install nagios3-doc
 3、vi /etc/nagios3/conf.d/hostgroups_nagios2.cfg
   define hostgroup {
     hostgroup_name  nagios-servers
     alias           nagios servers
     members         devcluster01,devcluster02,devcluster03
   }
 4、cp  /etc/nagios3/conf.d/localhost_nagios2.cfg /etc/nagios3/conf.d/devcluster01_nagios2.cfg
  vi /etc/nagios3/conf.d/devcluster01_nagios2.cfg
  替换:
   g/localhost/s//devcluster01/g
   g/127.0.0.1/s//192.168.1.1/g
 5、cp  /etc/nagios3/conf.d/localhost_nagios2.cfg /etc/nagios3/conf.d/devcluster02_nagios2.cfg
  vi /etc/nagios3/conf.d/devcluster02_nagios2.cfg
  替换:
   g/localhost/s//devcluster02/g
   g/127.0.0.1/s//192.168.1.2/g
 6、cp  /etc/nagios3/conf.d/localhost_nagios2.cfg /etc/nagios3/conf.d/devcluster03_nagios2.cfg
  vi /etc/nagios3/conf.d/devcluster03_nagios2.cfg
  替换:
   g/localhost/s//devcluster03/g
   g/127.0.0.1/s//192.168.1.3/g
  Change hostgroup_name to nagios-servers
  Add:
   # check that web services are running
   define service {
     hostgroup_name                  nagios-servers
     service_description             HTTP
     check_command                   check_http
     use                             generic-service
     notification_interval           0 ; set > 0 if you want to be renotified
    }
    define service {
     hostgroup_name                  nagios-servers
     service_description             SSH
     check_command                   check_ssh
     use                             generic-service
     notification_interval           0 ; set > 0 if you want to be renotified
   }
 8、vi /etc/nagios3/conf.d/extinfo_nagios2.cfg
  将hostgroup_name改为nagios-servers
  增加:
   define hostextinfo{
     hostgroup_name   nagios-servers
     notes            nagios-servers
   #       notes_url        http://webserver.localhost.localdomain/hostinfo.pl?host=netware1
     icon_image       base/debian.png
     icon_image_alt   Debian GNU/Linux
     vrml_image       debian.png
     statusmap_image  base/debian.gd2
     }
 9、sudo /etc/init.d/nagios3 restart
  10、Visit http://devcluster01/nagios3/
  Username: nagiosadmin, password: nagiosadmin
 Client side (on each monitored node):
 1、apt-get install nagios-nrpe-server
 2、vi /etc/nagios/nrpe.cfg
  替换:
  g/127.0.0.1/s//192.168.1.1/g
 3、sudo /etc/init.d/nagios-nrpe-server restart

十九、Configure Splunk
 1、wget http://download.splunk.com/releases/5.0.2/splunk/linux/splunk-5.0.2-149561-Linux-x86_64.tgz
 2、tar -zxvf splunk-5.0.2-149561-Linux-x86_64.tgz
 3、cd splunk
 4、bin/splunk start --answer-yes --no-prompt --accept-license
 5、访问http://devcluster01:8000
  用户名:admin 密码:changeme
 6、添加数据 -> 从 UDP 端口 -> UDP 端口 *: 1688 -> 来源类型 从列表 log4j -> 保存
 7、配置hadoop
  vi /home/ysc/hadoop-1.1.1/conf/log4j.properties
  修改:
   log4j.rootLogger=${hadoop.root.logger}, EventCounter, SYSLOG
  增加:
   log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender 
   log4j.appender.SYSLOG.facility=local1 
   log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout 
   log4j.appender.SYSLOG.layout.ConversionPattern=%p %c{2}: %m%n 
   log4j.appender.SYSLOG.SyslogHost=host6:1688
   log4j.appender.SYSLOG.threshold=INFO 
   log4j.appender.SYSLOG.Header=true
   log4j.appender.SYSLOG.FacilityPrinting=true 
 8、配置hbase
  vi /home/ysc/hbase-0.92.2/conf/log4j.properties
  修改:
   log4j.rootLogger=${hbase.root.logger},SYSLOG
  增加:
   log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender 
   log4j.appender.SYSLOG.facility=local1 
   log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout 
   log4j.appender.SYSLOG.layout.ConversionPattern=%p %c{2}: %m%n 
   log4j.appender.SYSLOG.SyslogHost=host6:1688
   log4j.appender.SYSLOG.threshold=INFO 
   log4j.appender.SYSLOG.Header=true
   log4j.appender.SYSLOG.FacilityPrinting=true
 9、配置nutch
  vi /home/lanke/ysc/nutch-2.1-hbase/conf/log4j.properties
  修改:
   log4j.rootLogger=INFO,DRFA,SYSLOG
  增加:
   log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender 
   log4j.appender.SYSLOG.facility=local1 
   log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout 
   log4j.appender.SYSLOG.layout.ConversionPattern=%p %c{2}: %m%n 
   log4j.appender.SYSLOG.SyslogHost=host6:1688
   log4j.appender.SYSLOG.threshold=INFO 
   log4j.appender.SYSLOG.Header=true
   log4j.appender.SYSLOG.FacilityPrinting=true
 10、启动hadoop和hbase
  start-all.sh
  start-hbase.sh
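 A simple way to confirm that the UDP input is receiving data, using bash's /dev/udp pseudo-device (a sketch; the host and port must match the SyslogHost values configured above):
 echo "INFO test: hello from $(hostname)" > /dev/udp/host6/1688
 The line should then show up in a Splunk search.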

二十、Configure Pig
 1、wget http://labs.mop.com/apache-mirror/pig/pig-0.11.0/pig-0.11.0.tar.gz
 2、tar -xzvf pig-0.11.0.tar.gz
 3、cd pig-0.11.0
 4、vi /etc/profile
  增加:
  export PIG_HOME=/home/ysc/pig-0.11.0
  export PATH=$PIG_HOME/bin:$PATH
 5、source /etc/profile
 6、cp conf/log4j.properties.template conf/log4j.properties
 7、vi conf/log4j.properties
 8、pig
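 A small smoke test (a sketch; the first line lists the HDFS root, the second runs a trivial script in local mode so it works without touching the cluster):
 pig -e "fs -ls /"
 pig -x local -e "A = LOAD '/etc/passwd' USING PigStorage(':'); B = LIMIT A 3; DUMP B;"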

二十一、Configure Hive
 1、wget http://mirrors.cnnic.cn/apache/hive/hive-0.10.0/hive-0.10.0.tar.gz
 2、tar -xzvf hive-0.10.0.tar.gz
 3、cd hive-0.10.0
 4、vi /etc/profile
  增加:
  export HIVE_HOME=/home/ysc/hive-0.10.0
  export PATH=$HIVE_HOME/bin:$PATH
 5、source /etc/profile
 6、cp conf/hive-log4j.properties.template conf/hive-log4j.properties
 7、vi conf/hive-log4j.properties
  替换:
  log4j.appender.EventCounter=org.apache.hadoop.metrics.jvm.EventCounter
  为:
   log4j.appender.EventCounter=org.apache.hadoop.log.metrics.EventCounter
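 A minimal smoke test once the configuration is in place (a sketch; the table name is arbitrary):
 hive -e "CREATE TABLE IF NOT EXISTS smoke_test (id INT); SHOW TABLES; DROP TABLE smoke_test;"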

二十二、Configure a Hadoop 2.x cluster
 1、wget http://labs.mop.com/apache-mirror/hadoop/common/hadoop-2.0.2-alpha/hadoop-2.0.2-alpha.tar.gz
 2、tar -xzvf hadoop-2.0.2-alpha.tar.gz
 3、cd hadoop-2.0.2-alpha
 4、vi etc/hadoop/hadoop-env.sh
  追加:
export JAVA_HOME=/home/ysc/jdk1.7.0_05
  export HADOOP_HEAPSIZE=2000
 5、vi etc/hadoop/core-site.xml
  <property>
   <name>fs.defaultFS</name>
   <value>hdfs://devcluster01:9000</value>
   <description>
      Where to find the Hadoop Filesystem through the network.
      Note 9000 is not the default port.
      (This is slightly changed from previous versions which didnt have "hdfs")
   </description>
   </property>
   <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
    <description>The size of buffer for use in sequence files.
    The size of this buffer should probably be a multiple of hardware
    page size (4096 on Intel x86), and it determines how much data is
    buffered during read and write operations.</description>
  </property>
 6、vi etc/hadoop/mapred-site.xml
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>    <name>mapred.job.reduce.input.buffer.percent</name>
    <value>1</value>
    <description>The percentage of memory- relative to the maximum heap size- to
    retain map outputs during the reduce. When the shuffle is concluded, any
    remaining map outputs in memory must consume less than this threshold before
    the reduce can begin.
    </description>
  </property>    <name>mapred.job.shuffle.input.buffer.percent</name>
    <value>1</value>
    <description>The percentage of memory to be allocated from the maximum heap
    size to storing map outputs during the shuffle.
    </description>
  </property>    <name>mapred.inmem.merge.threshold</name>
    <value>0</value>
    <description>The threshold, in terms of the number of files
    for the in-memory merge process. When we accumulate threshold number of files
    we initiate the in-memory merge and spill to disk. A value of 0 or less than
    0 indicates we want to DON'T have any threshold and instead depend only on
    the ramfs's memory consumption to trigger the merge.
    </description>
  </property>    <name>io.sort.factor</name>
    <value>100</value>
    <description>The number of streams to merge at once while sorting
    files.  This determines the number of open file handles.</description>
  </property>    <name>io.sort.mb</name>
    <value>240</value>
    <description>The total amount of buffer memory to use while sorting
    files, in megabytes.  By default, gives each merge stream 1MB, which
    should minimize seeks.</description>
  </property>
    <property>
      <name>mapred.map.output.compression.codec</name>
      <value>org.apache.hadoop.io.compress.SnappyCodec</value>
      <description>If the map outputs are compressed, how should they be
          compressed?
      </description>
    </property>      <name>mapred.output.compression.codec</name>
      <value>org.apache.hadoop.io.compress.SnappyCodec</value>
      <description>If the job outputs are compressed, how should they be compressed?
      </description>
    </property>
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
    <description>If the job outputs are to compressed as SequenceFiles, how should
        they be compressed? Should be one of NONE, RECORD or BLOCK.
    </description>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2000m</value>
  </property>    <name>mapred.output.compress</name>
    <value>true</value>
    <description>Should the job outputs be compressed?
    </description>
  </property>    <name>mapred.compress.map.output</name>
    <value>true</value>
    <description>Should the outputs of the maps be compressed before being
        sent across the network. Uses SequenceFile compression.
    </description>
  </property>    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>5</value>
  </property>    <name>mapred.map.tasks</name>
    <value>15</value>
  </property>    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>5</value>
   <description>
   define mapred.map tasks to be number of slave hosts.the best number is the  number of slave hosts plus the core numbers of per host
   </description>
  </property>    <name>mapred.reduce.tasks</name>
    <value>15</value>
    <description>
   define mapred.reduce tasks to be number of slave hosts.the best number is the  number of slave hosts plus the core numbers of per host
    </description>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/home/ysc/mapreduce/system</value>
  </property>    <name>mapred.local.dir</name>
    <value>/home/ysc/mapreduce/local</value>
  </property>    <name>mapreduce.job.counters.max</name>
    <value>12000</value>
    <description>Limit on the number of counters allowed per job.
    </description>
  </property>
 7、vi etc/hadoop/yarn-site.xml
  <property>   
    <name>yarn.resourcemanager.resource-tracker.address</name>  
    <value>devcluster01:8031</value>
   </property>  
   <property> 
    <name>yarn.resourcemanager.address</name>    
    <value>devcluster01:8032</value> 
   </property>
   <property>   
    <name>yarn.resourcemanager.scheduler.address</name> 
    <value>devcluster01:8030</value>
   </property>
   <property> 
    <name>yarn.resourcemanager.admin.address</name> 
    <value>devcluster01:8033</value>  
   </property>  
   <property>   
    <name>yarn.resourcemanager.webapp.address</name>   
    <value>devcluster01:8088</value> 
   </property> 
   <property>  
    <description>Classpath for typical applications.</description>
    <name>yarn.application.classpath</name> 
    <value>      
    $HADOOP_CONF_DIR,     
    $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,   
    $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,      
    $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,  
    $YARN_HOME/*,$YARN_HOME/lib/*  
    </value> 
   </property>
   <property> 
    <name>yarn.nodemanager.aux-services</name> 
    <value>mapreduce.shuffle</value> 
   </property>  
   <property>   
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name> 
    <value>org.apache.hadoop.mapred.ShuffleHandler</value> 
   </property> 
   <property>  
    <name>yarn.nodemanager.local-dirs</name>     <value>/home/ysc/h2/data/1/yarn/local,/home/ysc/h2/data/2/yarn/local,/home/ysc/h2/data/3/yarn/local</value> 
   </property>
   <property>
    <name>yarn.nodemanager.log-dirs</name>      <value>/home/ysc/h2/data/1/yarn/logs,/home/ysc/h2/data/2/yarn/logs,/home/ysc/h2/data/3/yarn/logs</value> 
   </property> 
   <property>  
    <description>Where to aggregate logs</description>
    <name>yarn.nodemanager.remote-app-log-dir</name>   
    <value>/home/ysc/h2/var/log/hadoop-yarn/apps</value>
   </property>   
   <property>   
    <name>mapreduce.jobhistory.address</name>  
    <value>devcluster01:10020</value>
   </property>  
   <property>   
    <name>mapreduce.jobhistory.webapp.address</name>  
    <value>devcluster01:19888</value>
   </property>  
 8、vi etc/hadoop/hdfs-site.xml
  <property> 
   <name>dfs.permissions.superusergroup</name> 
   <value>root</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/ysc/dfs/filesystem/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/ysc/dfs/filesystem/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>6710886400</value>
    <description>The default block size for new files.</description>
  </property>
 9、启动hadoop
  bin/hdfs namenode -format
  sbin/start-dfs.sh
  sbin/start-yarn.sh
 10、访问管理页面
  http://devcluster01:8088
  http://devcluster01:50070
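 A short sketch for verifying the new cluster, run from the hadoop-2.0.2-alpha directory:
 bin/hdfs dfsadmin -report    # live datanodes
 bin/yarn node -list          # registered nodemanagers
 bin/hadoop fs -mkdir -p /tmp/smoke
 bin/hadoop fs -ls /tmp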
