Scraping Dynamic Pages with PhantomJS from Java

Source: http://blog.csdn.net/kaka0930/article/details/68941932

1. PhantomJS download mirror: http://npm.taobao.org/dist/phantomjs/

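2. PhantomJS embeds the WebKit engine (the engine Safari uses, and the one Chrome was built on before it forked to Blink). It loads pages without a UI, and the result matches what a browser shows, that is, the page after its JavaScript has run. That makes it a powerful tool for crawling or fetching dynamic pages.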

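3. I tried HttpUnit first, and it did not work. The examples I found online would not run and threw too many errors. There is also no documentation: HttpUnit dates from 2008 and its official site has almost nothing on it. With no material to consult, I gave up on it.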

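4. I then switched to PhantomJS and found it to be a mainstream choice for crawling dynamic pages. Of course, dynamic crawling itself has never been the problem; the problem is speed. Driving a browser engine such as WebKit directly is fairly cumbersome, and the speed is not great.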

5. I developed with Java + PhantomJS on Windows and deployed to Ubuntu.

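First, installation. On Windows, just download and unzip. If you want to run the phantomjs command directly from cmd, add the bin directory that contains phantomjs.exe to PATH. The program here does not rely on PATH, though: it uses an absolute path that is built from the project's own directory, which makes the project easier to move. The workflow is: the Java program launches PhantomJS, PhantomJS runs a js file that fetches the given page, and the content is passed back to Java.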

Here is the code. Java side first:

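    import java.io.BufferedReader;
    import java.io.File;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    public class JSUtil
    {
        // When moving to another environment, change the "phantomjs.exe" at the end of exePath,
        // because that binary only runs on Windows. The directories under the project root
        // must also match what exePath expects, otherwise the call fails.
        private static final String projectPath = System.getProperty("user.dir");
        private static final String jsPath = projectPath + File.separator + "huicong.js";
        private static final String exePath = projectPath + File.separator + "phantomjs" + File.separator + "bin"
                + File.separator + "phantomjs.exe";

        public static void main(String[] args) throws IOException
        {
            // Test call: just pass in a URL.
            String html = getParseredHtml2("http://huisheng99.b2b.hc360.com/");
            System.out.println("html: " + html);
        }

        // Runs PhantomJS with the js file and the URL, then reads the result back from the process output.
        public static String getParseredHtml2(String url) throws IOException
        {
            // Passing the arguments separately keeps paths that contain spaces intact.
            ProcessBuilder pb = new ProcessBuilder(exePath, jsPath, url);
            // Merge stderr into stdout so the child cannot block on an unread error pipe.
            pb.redirectErrorStream(true);
            Process p = pb.start();
            StringBuilder sb = new StringBuilder();
            // PhantomJS output encoding defaults to UTF-8.
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8)))
            {
                String tmp;
                while ((tmp = br.readLine()) != null)
                {
                    sb.append(tmp);
                }
            }
            // Keep at most 200 characters following the first "companyServiceMod" marker.
            String[] result = sb.toString().split("companyServiceMod");
            String result2 = "";
            if (result.length >= 2)
            {
                result2 = result[1];
                if (result2.length() > 200)
                {
                    result2 = result2.substring(0, 200);
                }
            }
            return result2;
        }
    }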
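The comment at the top of the class says the executable name has to change when the project moves from Windows to Ubuntu. One way to avoid editing the source each time is to choose the name from the `os.name` system property. This is only a sketch: the class `PhantomPaths` and its methods are hypothetical names, and it assumes the same `phantomjs/bin` layout under the project directory on every platform.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

final class PhantomPaths
{
    // Picks the binary file name from an os.name value such as "Windows 10" or "Linux".
    static String binaryName(String osName)
    {
        boolean windows = osName.toLowerCase(Locale.ROOT).startsWith("windows");
        return windows ? "phantomjs.exe" : "phantomjs";
    }

    // Assumes the phantomjs/bin layout used by JSUtil, below the given project directory.
    static Path executable(String projectDir, String osName)
    {
        return Paths.get(projectDir, "phantomjs", "bin", binaryName(osName));
    }
}
```

A call such as `PhantomPaths.executable(System.getProperty("user.dir"), System.getProperty("os.name"))` would then yield the path to hand to the process launcher on either system.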

Next, the js file:

    var page = require('webpage').create(),
        system = require('system'),
        t, address;
    // File-system module, used only to write test output. It can be commented out in production for speed.
    var fs = require("fs");
    // system.args[0] is the script path, so a length of 1 means no URL was passed.
    if (system.args.length === 1) {
        console.log('Usage: huicong.js <some URL>');
        // This line matters: every exit path must call phantom.exit(), or PhantomJS never stops.
        phantom.exit();
    }
    page.settings.loadImages = false; // skip images to speed up loading
    page.settings.resourceTimeout = 10000; // give up on a resource after 10 seconds
    // Viewport size only matters for screenshots; it has no use otherwise.
    page.viewportSize = {
        width: 1280,
        height: 800
    };
    // Block hosts that take a long time, such as Baidu ads, to speed things up.
    var block_urls = ['baidu.com'];
    page.onResourceRequested = function(requestData, request) {
        for (var i = 0; i < block_urls.length; i++) {
            if (requestData.url.indexOf(block_urls[i]) !== -1) {
                request.abort();
                //console.log(requestData.url + " aborted");
                return;
            }
        }
    };
    t = Date.now(); // measure how long loading takes
    address = system.args[1];
    page.open(address, function(status) {
        if (status !== 'success') {
            console.log('FAIL to load the address');
        } else {
            t = Date.now() - t;
            // This was originally meant to extract the element directly. Anything reachable through
            // document should work, but it did not work for my page, so I split the string in Java instead.
            // var ua = page.evaluate(function() {
            //     return document.getElementById('companyServiceMod').innerHTML;
            // });
            // fs.write("qq.html", ua, 'w');
            // console.log("test qq: " + ua);
            // Whatever is printed with console.log is what gets sent back to Java.
            console.log('Loading time ' + t + ' msec');
            console.log(page.content);
        }
        // Exit on both the success and the failure path.
        phantom.exit();
    });

Put the js file at the path specified in the Java program; the two must match. I recommend the project root.

I put it in the project root, with the file name huicong.js.
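Since PhantomJS keeps running until the script calls `phantom.exit()`, a script that hangs would also hang a Java caller that reads the output stream to the end. A possible guard, sketched below, is to send the output to a temporary file and wait with a timeout. `TimedProcess` and `run` are hypothetical names, the command list is supplied by the caller, and the output is assumed to be UTF-8.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class TimedProcess
{
    // Runs the command, waits at most timeoutSeconds, and returns whatever it printed so far.
    static String run(List<String> command, long timeoutSeconds)
    {
        File out = null;
        try
        {
            out = File.createTempFile("phantom-out", ".txt");
            ProcessBuilder pb = new ProcessBuilder(command);
            pb.redirectErrorStream(true); // stderr goes to the same file
            pb.redirectOutput(out);       // no pipe to drain, so waiting cannot deadlock
            Process p = pb.start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS))
            {
                // The process did not finish in time: kill it and wait for it to die.
                p.destroyForcibly();
                p.waitFor();
            }
            return new String(Files.readAllBytes(out.toPath()), StandardCharsets.UTF_8);
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the process", e);
        }
        finally
        {
            if (out != null)
            {
                out.delete();
            }
        }
    }
}
```

With the paths from the Java class above, a call could look like `TimedProcess.run(Arrays.asList(exePath, jsPath, url), 30)`; the 30-second limit is an arbitrary choice for illustration.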

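6. One big problem is speed. According to an explanation given on Stack Overflow, around 10 seconds is normal if you take a screenshot, which gives a feel for how slow it can be.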

I then searched Stack Overflow and found a very good answer:

http://stackoverflow.com/questions/42703760/phantomjs-open-too-slow

Many thanks to its author. It comes down to three points:

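6.1. Use a better computer.

6.2. Do not load images. See the js file above.

6.3. Block ads and similar requests. See the js file above. With these in place I cut the time down to 2 s.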
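To check whether these measures actually help, the Java side can read the load time that the js file prints as `Loading time <n> msec` before the page content. The sketch below works on the raw text returned by PhantomJS (before the snippet is cut out); `LoadTime` and `parseMillis` are hypothetical names.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LoadTime
{
    // Matches the line printed by console.log('Loading time ' + t + ' msec') in the js file.
    private static final Pattern LOADING_TIME = Pattern.compile("Loading time (\\d{1,18}) msec");

    // Returns the reported load time in milliseconds, or empty if the line is absent
    // (for example when the script printed the failure message instead).
    static OptionalLong parseMillis(String phantomOutput)
    {
        Matcher m = LOADING_TIME.matcher(phantomOutput);
        return m.find() ? OptionalLong.of(Long.parseLong(m.group(1))) : OptionalLong.empty();
    }
}
```

Because the Java reader above joins lines without separators, the pattern does not rely on the line ending and still matches when the HTML follows immediately.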

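7. My goal was to extract a QQ link from inside a div, but I could not work out how to do it through the DOM. So the script returns the whole page and the Java side parses it by hand with string operations. Selectors of some kind could probably be used here, but I did not look into that.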
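The manual string parsing can be made a little more targeted than cutting 200 characters after the marker. The sketch below looks for the first `href` after the `companyServiceMod` marker using a regular expression. It rests on an assumption that is not confirmed by the page itself: that the QQ link's URL contains `qq.com`. `QqLinkExtractor` and `firstQqLink` are hypothetical names, and a regex is a rough tool for HTML, so this is a starting point only.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QqLinkExtractor
{
    // Assumption: the wanted link is an href whose value contains "qq.com".
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*[\"']([^\"']*qq\\.com[^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns the first matching href value that appears after the marker, if any.
    static Optional<String> firstQqLink(String html, String marker)
    {
        int start = html.indexOf(marker);
        if (start < 0)
        {
            return Optional.empty();
        }
        Matcher m = HREF.matcher(html);
        // find(int) restarts the search at the marker position.
        return m.find(start) ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Called as `QqLinkExtractor.firstQqLink(pageHtml, "companyServiceMod")`, it returns an empty `Optional` when either the marker or a matching link is missing, so the caller can tell the two outcomes apart from a found link.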
