Java Development Notes II (Jsoup Crawler)

2024-06-19 18:12
Tags: java, development, notes, crawler, jsoup


Jsoup Crawler

Java can be used to write crawlers too!

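Jsoup's key types are:

Document: the document object; each parsed HTML page is one Document

Element: an element object; a Document contains many Element objects

Node: the base type of the parsed tree; elements, text, and comments are all nodes, and Document and Element are themselves subclasses of Node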
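To make the relationship between these types concrete, here is a minimal sketch, assuming jsoup is on the classpath. The class name NodeKinds and the sample HTML are made up for illustration; it walks the direct children of a paragraph and reports which ones are elements and which are text nodes.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of jsoup.
final class NodeKinds {
    /** Describes each direct child node of the first <p> element in the given HTML. */
    static List<String> describeParagraph(String html) {
        Document doc = Jsoup.parse(html);      // the whole page
        Element p = doc.selectFirst("p");      // one element inside it, or null
        List<String> kinds = new ArrayList<>();
        if (p == null) {
            return kinds;
        }
        for (Node child : p.childNodes()) {    // elements and text are both nodes
            if (child instanceof Element) {
                kinds.add("element:" + ((Element) child).tagName());
            } else if (child instanceof TextNode) {
                kinds.add("text:" + ((TextNode) child).text());
            } else {
                kinds.add("other:" + child.nodeName());
            }
        }
        return kinds;
    }
}
```

For the input `<p>Hello <b>world</b>!</p>` the paragraph has three child nodes: a text node, the `b` element, and another text node.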

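Jsoup's main methods are:

static Connection connect(String url): creates a connection to the URL

static Document parse(File in, String charsetName): parses a file into a Document

static Document parse(String html): parses an HTML string into a Document

(These are the main entry points, but the code below extracts its data mostly by combining a Document with CSS selectors.)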
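A small offline sketch of that Document plus CSS selector combination, assuming jsoup is on the classpath. The class name SelectDemo and the HTML string are invented for illustration and only imitate the shape of a movie detail page; the selectors are the same kind the crawler uses.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the HTML below is made up, not a real page.
final class SelectDemo {
    static final String HTML =
            "<div id=\"content\"><h1><span>Example Title</span>"
            + "<span class=\"year\">(2024)</span></h1></div>"
            + "<div id=\"info\"><span property=\"v:genre\">Drama</span>"
            + "<span property=\"v:genre\">Crime</span></div>";

    /** Text of the first span under the page heading. */
    static String title(Document doc) {
        return doc.select("#content > h1 > span:nth-child(1)").text();
    }

    /** Text of the span carrying the "year" class. */
    static String year(Document doc) {
        return doc.select("#content > h1 > span.year").text();
    }

    /** One entry per element matched by the attribute selector. */
    static List<String> genres(Document doc) {
        List<String> out = new ArrayList<>();
        for (Element e : doc.select("[property=v:genre]")) {
            out.add(e.text());
        }
        return out;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);  // parse(String html), no network needed
        System.out.println(title(doc) + " " + year(doc) + " " + genres(doc));
    }
}
```

Testing selectors against a saved HTML string like this is a convenient way to check them before pointing the crawler at a live site.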

Crawler example (Douban)

    // Fragment of a crawler class. It relies on Apache HttpClient 4.x, fastjson and jsoup,
    // plus the fields myIPPool, myUAPool, myCookies, CONNECTION_TIME_OUT and a
    // logError(...) method, none of which are shown in this post.

    /**
     * Fetch proxy IPs by calling an external script.
     */
    public void initIPPool() {
        System.out.println("Fetching proxy IPs...");
        try {
            // This logic was originally written in Python, so it is simply invoked here;
            // the script is listed further down.
            Process proc = Runtime.getRuntime().exec("python getIP.py");
            try (BufferedReader in = new BufferedReader(new InputStreamReader(proc.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
            proc.waitFor();
        } catch (Exception e) {
            System.out.println(e.toString());
        }
        System.out.println("Proxy IPs fetched");
    }

    /**
     * Load proxy IPs from the file that stores them.
     */
    public void loadIPPool() {
        File file = new File("ipPool.txt");
        List<String> list = new ArrayList<>();
        synchronized (this) {
            // try-with-resources closes the reader exactly once
            try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
                String line;
                // Read one line at a time; readLine() returns null at end of file
                while ((line = reader.readLine()) != null) {
                    list.add(line);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        System.out.println(list);
        myIPPool = list.toArray(new String[0]);
        System.out.println("Proxy IP pool loaded");
    }

    public String crawlOnce(Integer start) {
        StringBuilder finalResult = new StringBuilder();
        Random random = new Random();
        // Request URL
        String url = "http://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=&start=" + start;
        HttpGet request = new HttpGet(url);

        // Pick a random proxy from the pool until one has the form host:port
        String proxyIp = myIPPool[random.nextInt(myIPPool.length)];
        while (proxyIp.split(":").length != 2) {
            proxyIp = myIPPool[random.nextInt(myIPPool.length)];
        }
        String[] hostPort = proxyIp.split(":");
        HttpHost proxy = new HttpHost(hostPort[0], Integer.parseInt(hostPort[1]));

        // Trust every certificate; no identity verification
        SSLContextBuilder builder = new SSLContextBuilder();
        PoolingHttpClientConnectionManager cm = null;
        SSLConnectionSocketFactory sslsf = null;
        try {
            builder.loadTrustMaterial(null, new TrustStrategy() {
                @Override
                public boolean isTrusted(X509Certificate[] x509Certificates, String s) throws CertificateException {
                    return true;
                }
            });
            sslsf = new SSLConnectionSocketFactory(builder.build(),
                    new String[]{"SSLv2Hello", "SSLv3", "TLSv1", "TLSv1.2"}, null, NoopHostnameVerifier.INSTANCE);
            Registry<ConnectionSocketFactory> registry = RegistryBuilder.<ConnectionSocketFactory>create()
                    .register("http", new PlainConnectionSocketFactory())
                    .register("https", sslsf)
                    .build();
            cm = new PoolingHttpClientConnectionManager(registry);
            cm.setMaxTotal(200); // max connections
        } catch (Exception e) {
            System.out.println(e.toString());
            return "";
        }

        // Set up proxy authentication
        CredentialsProvider provider = new BasicCredentialsProvider();
        // First argument: the proxy host. Second: the proxy username and password;
        // leave both empty if the proxy needs none
        provider.setCredentials(new AuthScope(proxy), new UsernamePasswordCredentials("", ""));
        // Build the CloseableHttpClient
        CloseableHttpClient httpClient = HttpClients.custom()
                .setSSLSocketFactory(sslsf)
                .setConnectionManager(cm)
                .setConnectionManagerShared(true)
                .setDefaultCredentialsProvider(provider)
                .build();
        RequestConfig config = RequestConfig.custom()
                .setProxy(proxy)
                .setConnectTimeout(CONNECTION_TIME_OUT)
                .setConnectionRequestTimeout(CONNECTION_TIME_OUT)
                .setSocketTimeout(CONNECTION_TIME_OUT)
                .build();
        request.setConfig(config);

        // Add request headers
        request.addHeader("User-Agent", myUAPool[random.nextInt(myUAPool.length)]);
        request.addHeader("Cookie", myCookies[random.nextInt(myCookies.length)]);
        request.addHeader("Accept-Language", "zh-CN,zh;q=0.9");
        request.addHeader("Sec-Fetch-Mode", "cors");
        request.addHeader("Sec-Fetch-Site", "same-origin");

        HttpResponse response;
        BufferedReader rd;
        try {
            response = httpClient.execute(request);
            rd = new BufferedReader(new InputStreamReader(response.getEntity().getContent()));
        } catch (IOException e) {
            logError(start);
            return "";
        }

        String line;
        StringBuilder result = new StringBuilder();
        while (true) {
            try {
                line = rd.readLine();
                if (line == null) {
                    break;
                }
            } catch (IOException e) {
                logError(start);
                break;
            }
            // The request returned an HTML page instead of JSON
            if (line.equals("") || line.charAt(0) == '<') {
                break;
            }
            result.append(line);
        }
        System.out.println(start + "--result:" + result);

        JSONObject res = JSONObject.parseObject(result.toString());
        if (res == null || !res.containsKey("data")) {
            logError(start);
            return "";
        }
        JSONArray jsonArray = res.getJSONArray("data");
        for (int i = 0; i < jsonArray.size(); i++) {
            JSONObject jo = jsonArray.getJSONObject(i);
            // Crawl the movie details through the detail link
            finalResult.append(crawlDetails(jo.getString("url")));
        }
        return finalResult.toString();
    }

    // Crawl one detail page and return its fields as a comma-separated line
    private String crawlDetails(String url) {
        StringBuilder result = new StringBuilder();
        Random random = new Random();
        try {
            String proxyIp = myIPPool[random.nextInt(myIPPool.length)];
            String[] hostPort = proxyIp.split(":");
            // myUAPool can simply hold a few browser User-Agent strings written by hand
            Connection con = Jsoup.connect(url)
                    .proxy(hostPort[0], Integer.parseInt(hostPort[1]))
                    .userAgent(myUAPool[random.nextInt(myUAPool.length)])
                    .header("Accept-Language", "zh-CN,zh;q=0.9")
                    .header("Cookie", myCookies[random.nextInt(myCookies.length)])
                    .timeout(CONNECTION_TIME_OUT); // connection timeout
            // Send the request once and parse the page
            Document document = con.get();
            String info = document.select("#info").text();

            // ID: skips the 33-character prefix https://movie.douban.com/subject/ and the trailing slash
            result.append(url.substring(33, url.length() - 1));
            // Title
            result.append(",").append(document.select("#content > h1 > span:nth-child(1)").text());
            // Year
            result.append(",").append(document.select("#content > h1 > span.year").text());
            // Director
            result.append(",").append(document.select("#info > span:nth-child(1) > span.attrs > a").text());
            // Screenwriter
            result.append(",").append(document.select("#info > span:nth-child(3) > span.attrs").text());
            // Cast
            result.append(",").append(document.select("#info > span.actor > span.attrs").text());
            // Genre
            result.append(",").append(document.select("[property=v:genre]").text());
            // Country/region of production
            result.append(",").append(info.substring(info.indexOf("制片国家/地区: "), info.indexOf(" 语言:"))
                    .substring("制片国家/地区: ".length()));
            // Language
            if (info.contains(" 上映日期:")) {
                result.append(",").append(info.substring(info.indexOf("语言: "), info.indexOf(" 上映日期:"))
                        .substring("语言: ".length()));
            } else {
                result.append(",").append(info.substring(info.indexOf("语言: ")).substring("语言: ".length()));
            }
            // Runtime, read from the content attribute of the v:runtime element
            result.append(",").append(document.select("[property=v:runtime]").attr("content"));
            // Rating
            result.append(",").append(
                    document.select("#interest_sectl > div > div.rating_self.clearfix > strong").text());
            // Share of 5-star down to 1-star ratings
            for (int i = 1; i <= 5; i++) {
                result.append(",").append(document.select("#interest_sectl > div.rating_wrap.clearbox > div"
                        + ".ratings-on-weight > div:nth-child(" + i + ") > span.rating_per").text());
            }
            // Number of ratings
            result.append(",").append(document.select("[property=v:votes]").text());
            // Number of comments
            result.append(",").append(document.select("#comments-section > div.mod-hd > h2 > span > a").text());
            System.out.println(proxyIp + " " + result);
        } catch (IOException e) {
            System.out.println(e.toString());
        }
        return result.append("\n").toString();
    }
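The country and language fields above are cut out of the #info text with indexOf and substring. When a label is missing from a page, indexOf returns -1 and substring throws StringIndexOutOfBoundsException, which catch (IOException) does not handle. One possible way to make that step tolerant is a small helper like the sketch below; the class and method names (InfoFields, between) are hypothetical, and the labels in main are ASCII stand-ins for the Chinese labels on the real page.

```java
// Hypothetical helper, not part of the original crawler.
final class InfoFields {
    /**
     * Returns the text that follows label, up to nextLabel or the end of the string.
     * Returns "" when label does not occur; nextLabel may be null.
     */
    static String between(String info, String label, String nextLabel) {
        int start = info.indexOf(label);
        if (start < 0) {
            return "";                       // label missing: no exception, empty field
        }
        start += label.length();
        int end = (nextLabel == null) ? -1 : info.indexOf(nextLabel, start);
        if (end < 0) {
            end = info.length();             // no following label: read to the end
        }
        return info.substring(start, end).trim();
    }

    public static void main(String[] args) {
        String info = "Country: US Language: English Release: 1994-09-10";
        System.out.println(between(info, "Country: ", " Language:"));
        System.out.println(between(info, "Language: ", " Release:"));
    }
}
```

With a helper of this shape, the if/else around the release-date label would no longer be needed, because a missing next label simply means "read to the end".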

Code for fetching proxy IPs

    # coding=UTF-8
    import json

    import requests


    class FreeIP:
        def __init__(self):
            # Proxy list site
            self.url = "http://proxylist.fatezero.org/proxy.list"
            self.headers = {"User-Agent": "replace this with your browser's User-Agent"}

        def check_ip(self, ip_list):
            correct_ip = []
            for ip in ip_list:
                if len(correct_ip) > 10:  # adjust to your needs, or comment out
                    break
                ip_port = "{}:{}".format(ip["host"], ip["port"])
                proxies = {'https': ip_port}
                try:
                    # If the IP echoed back by this site matches the proxy's IP, the proxy works
                    response = requests.get('https://icanhazip.com/', proxies=proxies,
                                            timeout=3).text  # the timeout can be changed
                    if response.strip() == ip["host"]:
                        # print("usable IP: {}".format(ip_port))
                        correct_ip.append(ip_port)
                except Exception:
                    # print("unusable IP: {}".format(ip_port))
                    continue
            return correct_ip

        def run(self):
            response = requests.get(url=self.url).content.decode()
            ip_list = []
            for proxy_str in response.split('\n'):
                try:
                    proxy_json = json.loads(proxy_str)
                    if proxy_json["anonymity"] == "high_anonymous" and proxy_json["type"] == "https":
                        ip_list.append({"host": proxy_json["host"], "port": proxy_json["port"]})
                except Exception:
                    # Skip blank or malformed lines
                    continue
            correct_ip = self.check_ip(ip_list)
            file_path = 'ipPool.txt'
            # Write the usable proxies to this file
            with open(file_path, mode='w', encoding='utf-8') as file_obj:
                for i in correct_ip:
                    file_obj.write(i + "\n")


    if __name__ == '__main__':
        ip = FreeIP()
        ip.run()
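The script writes one host:port entry per line to ipPool.txt. On the Java side, crawlOnce retries until it draws an entry with exactly two parts, while crawlDetails splits whatever it draws, so a blank or malformed line can surface as an ArrayIndexOutOfBoundsException or NumberFormatException. A possible alternative is to filter the lines once when the pool is loaded. The sketch below is only an illustration; ProxyLines, isValid and keepValid are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original crawler.
final class ProxyLines {
    /** True when the line has the form host:port with a port in 1..65535. */
    static boolean isValid(String line) {
        String[] parts = line.trim().split(":");
        if (parts.length != 2 || parts[0].isEmpty()) {
            return false;
        }
        try {
            int port = Integer.parseInt(parts[1]);
            return port > 0 && port <= 65535;
        } catch (NumberFormatException e) {
            return false;                    // port is not a number
        }
    }

    /** Keeps only the well-formed entries, trimmed, in their original order. */
    static List<String> keepValid(List<String> lines) {
        List<String> valid = new ArrayList<>();
        for (String line : lines) {
            if (isValid(line)) {
                valid.add(line.trim());
            }
        }
        return valid;
    }
}
```

Calling keepValid on the list read in loadIPPool, before converting it to an array, would let both crawl methods assume every entry splits cleanly.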
