scrapy--子类CrawlSpider中间件

2024-08-29 11:04

本文主要是介绍scrapy--子类CrawlSpider中间件,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

 免责声明:本文仅做分享参考~

目录

CrawlSpider

介绍

xj.py

中间件

部分middlewares.py

wyxw.py 

完整的middlewares.py


CrawlSpider

介绍

CrawlSpider类:定义了一些规则来做数据爬取,从爬取的网页中获取链接并进行继续爬取.

创建方式:scrapy genspider -t crawl 爬虫名 爬虫域名
scrapy genspider -t crawl zz zz.com(子类就是牛,因为可以继承啊,还可以有自己的方法.)
spider类--》ZzscSpider类
spider类--》CrawLSpider类--》XjSpider类
CrawLSpider类:定义了一些规则来做数据爬取,从爬取的网页中获取链接并进行继续爬取

这个子类注意好 规则 , 有助于帮你跟进追踪类似数据(多页,类似html结构),  是这个子类最大的特点.

--正则匹配 !!!

新疆xxedu

xj.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Ruleclass XjSpider(CrawlSpider):name = "xj"# allowed_domains = ["xj.com"]start_urls = ["https://www.xjie.edu.cn/tzgg1.htm"]# 规则,根据正则匹配: \ 转义, ^ 开始, $ 结束, () 分组, [] 范围, | 或, + 1次或多次, * 0次或多次# 获取详情页链接:# rules = (Rule(LinkExtractor(allow=r"Items/"), callback="parse_item", follow=True),)# follow属性为True,会自动跟进详情页链接,并调用parse_item方法处理.rules = (Rule(LinkExtractor(allow=r"info/1061/.*?\.htm"),callback="parse_item",follow=False,),Rule(LinkExtractor(allow=r"tzgg1/.*?\.htm"), follow=True),)count = 0# 匹配100次,就进入parse_item方法100次.def parse_item(self, response):# print(response.request.headers) #查看伪装的请求头self.count += 1# 基于详情页,获取标题:title = response.xpath("//h3/text()").get()print(title, self.count)  # 竟然也自动拿到了分页的url.item = {}# item["domain_id"] = response.xpath('//input[@id="sid"]/@value').get()# item["name"] = response.xpath('//div[@id="name"]').get()# item["description"] = response.xpath('//div[@id="description"]').get()return item

中间件

scrapy中有两个中间件:

下载中间件:位于引擎和下载器中间.

爬虫中间件:位于引擎和爬虫中间.(一般不用)


下载中间件的作用:

用来篡改请求和响应,比如篡改请求:加一些请求头,加代理等等;

篡改响应就是更改响应的内容.

# 下载器

拦截请求 --伪装

拦截响应 --修改返回内容

 

 


部分middlewares.py

settings.py里面的是全局设置,在这你可以自己定义爬虫的伪装~~

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.htmlfrom scrapy import signals# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter# 引擎把请求对象交给下载器,下载器把响应对象交给引擎.class ScrapyDemo1SpiderMiddleware:# Not all methods need to be defined. If a method is not defined,# scrapy acts as if the spider middleware does not modify the# passed objects.@classmethoddef from_crawler(cls, crawler):# This method is used by Scrapy to create your spiders.s = cls()crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)return sdef process_spider_input(self, response, spider):# Called for each response that goes through the spider# middleware and into the spider.# Should return None or raise an exception.return Nonedef process_spider_output(self, response, result, spider):# Called with the results returned from the Spider, after# it has processed the response.# Must return an iterable of Request, or item objects.for i in result:yield idef process_spider_exception(self, response, exception, spider):# Called when a spider or process_spider_input() method# (from other spider middleware) raises an exception.# Should return either None or an iterable of Request or item objects.passdef process_start_requests(self, start_requests, spider):# Called with the start requests of the spider, and works# similarly to the process_spider_output() method, except# that it doesn’t have a response associated.# Must return only requests (not items).for r in start_requests:yield rdef spider_opened(self, spider):spider.logger.info("Spider opened: %s" % spider.name)class ScrapyDemo1DownloaderMiddleware:# Not all methods need to be defined. If a method is not defined,# scrapy acts as if the downloader middleware does not modify the# passed objects.@classmethoddef from_crawler(cls, crawler):# This method is used by Scrapy to create your spiders.s = cls()crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)return s# 发起请求执行的方法def process_request(self, request, spider):# Called for each request that goes through the downloader# middleware.# Must either:# - return None: continue processing this request# - or return a Response object# - or return a Request object# - or raise IgnoreRequest: process_exception() methods of#   installed downloader middleware will be called## if spider.name == "wyxw": # 判断,如果是指定爬虫,则修改请求头.request.headers["User-Agent"] = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
# cookie伪装# request.cookies["name"] = "value"# 如果更换ip地址,使用代理:# request.meta['proxy'] = 'https://ip:端口'print(request)return None# 得到响应执行的方法#def process_response(self, request, response, spider):# Called with the response returned from the downloader.# Must either;# - return a Response object# - return a Request object# - or raise IgnoreRequestreturn response# 处理异常执行的方法def process_exception(self, request, exception, spider):# Called when a download handler or a process_request()# (from other downloader middleware) raises an exception.# Must either:# - return None: continue processing this exception# - return a Response object: stops process_exception() chain# - return a Request object: stops process_exception() chainpassdef spider_opened(self, spider):spider.logger.info("Spider opened: %s" % spider.name)

为什么要用中间件的原因:

网页是需要滚动加载数据的,所以借助自动化工具去拿到所有的页面数据,利用scrapy中 间件整合自动化工具.

# 网易新闻爬取四个板块数据

# 国内 国际 军事 航空 因为它们的页面结构一致,取值方式一致.

 四个版块的数据 通过访问url 只能拿到50条数据 其中有一个版块需要手动点击 加载更多才可以继续把数据加载完整 其它的几个版块几乎都是一直滚动滚动条,直到数据完全加载为止

解决问题?--> 滚动条的操作 requests scrapy也没有,只有自动化技术才可以实现滚动滚动条 怎么样才可以scrapy结合自动化技术呢?

使用自动化技术 加载所有数据 !


wyxw.py 

爬虫文件

import scrapy"""新闻页面的数据是通过滚动翻页加载
如果要获取所有的数据:
解决办法:传统的方式:通过网络面板看分页加载的请求是什么自动化工具的方式:让自动化工具实现鼠标滚动"""class WyxwSpider(scrapy.Spider):name = "wyxw"# allowed_domains = ["wyxw.com"]start_urls = ["https://news.163.com/"]count = 0def parse(self, response):#  解析四个版块的urllis = response.xpath('//div[@class="ns_area list"]/ul/li')# 获取四个版块的li# lis[1:3] lis[1] lis[2]# lis[4:6] lis[4] lis[5]target_li = lis[1:3] + lis[4:6]# print(target_li)# 循环li 拿到每一个,li的a标签的hraf属性值 再发起请求for li in target_li:href = li.xpath("./a/@href").get()# 发起请求yield scrapy.Request(url=href, callback=self.news_parse)# 解析每个版块的数据# HtmlResponse(元素) 还是 scrapy的response?是自己返回的HtmlResponse,所以要以元素面板为主# 因为数据是通过自动化工具获取到的,自动化获取的数据都是元素面板数据def news_parse(self, response):divs = response.xpath('//div[@class="ndi_main"]/div')  # 从元素面板看的# divs = response.xpath('//div[@class="hidden"]/div')  # 从响应内容看的for div in divs:self.count += 1title = div.xpath(".//h3/a/text()").get()  # 从元素面板看的# title = div.xpath('./a/text()').get()  # 从响应内容看的print(title, self.count)

完整的middlewares.py

process_request 函数里面 伪装请求.

process_response 函数里面 拦截请求,返回我们想要的数据.

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import scrapy
from scrapy import signals# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
from DrissionPage._pages.chromium_page import ChromiumPage
from scrapy.http import HtmlResponseclass Scrapy4SpiderMiddleware:# Not all methods need to be defined. If a method is not defined,# scrapy acts as if the spider middleware does not modify the# passed objects.@classmethoddef from_crawler(cls, crawler):# This method is used by Scrapy to create your spiders.s = cls()crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)return sdef process_spider_input(self, response, spider):# Called for each response that goes through the spider# middleware and into the spider.# Should return None or raise an exception.return Nonedef process_spider_output(self, response, result, spider):# Called with the results returned from the Spider, after# it has processed the response.# Must return an iterable of Request, or item objects.for i in result:yield idef process_spider_exception(self, response, exception, spider):# Called when a spider or process_spider_input() method# (from other spider middleware) raises an exception.# Should return either None or an iterable of Request or item objects.passdef process_start_requests(self, start_requests, spider):# Called with the start requests of the spider, and works# similarly to the process_spider_output() method, except# that it doesn’t have a response associated.# Must return only requests (not items).for r in start_requests:yield rdef spider_opened(self, spider):spider.logger.info("Spider opened: %s" % spider.name)class Scrapy4DownloaderMiddleware:# Not all methods need to be defined. If a method is not defined,# scrapy acts as if the downloader middleware does not modify the# passed objects.@classmethoddef from_crawler(cls, crawler):# This method is used by Scrapy to create your spiders.s = cls()crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)return s# 发起请求执行的方法def process_request(self, request, spider):c_dict = {'cookiesu': '291715258858774', ' device_id': '2bad6b34106f8be1d2762204c306aa5b',' smidV2': '20240509204738adc267d3b66fca8adf0d37b8ac9a1e8800dbcffc52451ef70', ' s': 'ak145lfb3s',' __utma': '1.1012417897.1720177183.1720177183.1720435739.2',' __utmz': '1.1720435739.2.2.utmcsr=baidu|utmccn=(organic)|utmcmd=organic',' acw_tc': '2760827c17230343015564537e9504eebc062a87e8d3ba4425778641e1164e',' xq_a_token': 'fb0f503ef881090db449e976c330f1f2d626c371', ' xqat': 'fb0f503ef881090db449e976c330f1f2d626c371',' xq_r_token': '967c806e113fbcb1e314d5ef2dc20f1dd8e66be3',' xq_id_token': 'eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOi0xLCJpc3MiOiJ1YyIsImV4cCI6MTcyNTMyNDc3NSwiY3RtIjoxNzIzMDM0MzAxODI4LCJjaWQiOiJkOWQwbjRBWnVwIn0.I0pylYRqjdDOE0QbAuLFr5rUCFG1Sj13kTEHAdap2fEse0i12LG7a-14rhvQKsK9G0F7VZSWDYBsI82mqHimsBbWvgefkOy-g8V_nhb-ntGBubbVHlfjv7y-3GDatzcZPfgYvu7m0wEu77PcWdKqJR-KwZsZjQVwaKiHcFuvFUmpfYN942D6YY2MIzgWJxSQCX_t4f1YujRJRsKLvrq9QbVIqvJu-SpPT1SJfSPT9e7h_ERkU0QOsmgARJVfivkAAFM2cyb7HKJsHQSqVSU6hcIq4CMs5r90IsrZ4fOL5cUqaGNT58qkjx-flta27QhCIeHxexi1K95TKZTcD8EygA',' u': '291715258858774',' Hm_lvt_1db88642e346389874251b5a1eded6e3': '1722350318,1722863131,1722925588,1723034326',' Hm_lpvt_1db88642e346389874251b5a1eded6e3': '1723034326', ' HMACCOUNT': '13108745FF137EDD',' .thumbcache_f24b8bbe5a5934237bbc0eda20c1b6e7': 'YEsBB9JFA5Q4gQHJIe1Lx6JjvpZzcuUljYTfjFKm3lmCSpRZMpoNmnSBV0UptK3ripTe4xifyqRUZO/LEPx6Iw%3D%3D',' ssxmod_itna': 'WqIx9DgD0jD==0dGQDHWWoeeqBKorhBoC7ik8qGN6xYDZDiqAPGhDC4bUxD5poPoqWW3pi4kiYr23PZF2E2GaeBm8EXTDU4i8DCwiK=ODem=D5xGoDPxDeDAQKiTDY4DdjpNv=DEDeKDRDAQDzLdyDGfBDYP9QqDgSqDBGOdDKqGgzTxD0TxNaiqq8GKKvkd5qjbDAwGgniq9D0UdxBLxAax9+j9kaBUg8ZaPT2jx5eGuDG6DOqGmSfb3zdNPvAhWmY4sm75Ymbq4n75e8YervGPrPuDNh0wKY7DoBGp=GDLMxDfT0bD',' ssxmod_itna2': 'WqIx9DgD0jD==0dGQDHWWoeeqBKorhBoC7ik4A=W=e4D/D0hq7P7phOF4WG2WCxjKD2WYD=='}if spider.name == 'xueqiu':# ua  Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0request.headers['referer'] = 'https://xueqiu.com/'request.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0'# 失效 request.headers['cookie'] = 'cookiesu=291715258858774; device_id=2bad6b34106f8be1d2762204c306aa5b; smidV2=20240509204738adc267d3b66fca8adf0d37b8ac9a1e8800dbcffc52451ef70; s=ak145lfb3s; __utma=1.1012417897.1720177183.1720177183.1720435739.2; __utmz=1.1720435739.2.2.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; acw_tc=276077b217228631300332345e2956486032bd992e9f60b5689d9b248c1491; xq_a_token=fb0f503ef881090db449e976c330f1f2d626c371; xq_r_token=967c806e113fbcb1e314d5ef2dc20f1dd8e66be3; xq_id_token=eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOi0xLCJpc3MiOiJ1YyIsImV4cCI6MTcyNTMyNDc3NSwiY3RtIjoxNzIyODYzMDg5MDIwLCJjaWQiOiJkOWQwbjRBWnVwIn0.AdH2II1N2kGggG7ZgmP8MOgNPdxMoewCSK-gyWYBkw7zFExn6gaYV6YU8ReNkp2F5CBohxjZYyVyLtn98MJfk9dDwIe8ypTgXLkI_a5R1O1og5Fy6BeFv6FUJqgp8EVt8EvHBOYfRNl9iGtgrO3V_R0fJXq1aJTpV8lopNwEAzQbHRK58uXcbaoOwkUcX8MOv6XR-eGqnHYRSJ35P769atb6vF05LqutQphcairWpGGgWJc9fMhVBym_GkOxy4_AWaURWf8Zpge7dJQszkCo-ljPbBP94vz3zM_PTnussZV3jeTRmacaJcHTee6mlE00hrtrAFZNf7UIjnpqbdzvjw; u=291715258858774; Hm_lvt_1db88642e346389874251b5a1eded6e3=1720435739,1722235054,1722350318,1722863131; HMACCOUNT=13108745FF137EDD; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1722863143; .thumbcache_f24b8bbe5a5934237bbc0eda20c1b6e7=ha9Q037kb9An+HmEECE3+OwwxbDqAMnKP2QzosxWbRaWA89HDsKTwe/0XwLZXhS9S5OvxrWGFu6LMFW2iVI9xw%3D%3D; ssxmod_itna=QqmxgD0Cq4ciDXYDHiuYYK0=e4D5Dk3DA2LOni44qGNjKYDZDiqAPGhDC4fUerbrq5=7oC03qeiB4hvpkKrmDKijF/qqeDHxY=DUa7aeDxpq0rD74irDDxD3wxneD+D04kguOqi3DhxGQD3qGylR=DA3tDbh=uDiU/DDUOB4G2D7UyiDDli0TeA8mk7CtUCQ03xGUxIqqQ7qDMUeGX87Fe7db86PhMIwaHamPuKCiDtqD94m=DbRL3vB6lWW+r7hGrGBvNGGYQKDqNQ7eeKDmjvfhu2GDile4tKOG5nBpan/zeDDfj0bD===; ssxmod_itna2=QqmxgD0Cq4ciDXYDHiuYYK0=e4D5Dk3DA2LOni4A=c==D/QKDFxnAO9aP7QmHGcDYK4xD==='#     request.cookies['键'] = '值'#     request.cookies['cookiesu'] = '291715258858774'for key,value in c_dict.items():request.cookies[key] = value# print(request)# 如果更换ip地址,使用代理# request.meta['proxy'] = 'https://ip:端口'return None# 得到响应执行的方法# 结合自动化工具def process_response(self, request, response, spider):# url必须是四个版块的其中一个if spider.name == 'wyxw':url = request.url# url = 'https://news.163.com/domestic/'dp = ChromiumPage()dp.get(url)# 滚动到页面底部 一次性不能完成while True:is_block = dp.ele('.load_more_tip').attr('style')if is_block == 'display: block;':#     已经到最后了breakdp.scroll.to_bottom()#     国内页面需要额外手动点击try:# 只有一个页面能够点击加载更多,加入异常处理,如果出现了异常,进行下一次循环click_tag = dp.ele('text:加载更多')click_tag.click()except:continue#     创建响应对象,把自动化工具得到的html数据放入到对象中# 返回给引擎,由引擎交给spider#     HtmlResponse响应对象'''class HtmlResponse:def __init__(self,request,url,body):HtmlResponse(url=url,request=request,body=dp.html)   '''return HtmlResponse(url=url,request=request,body=dp.html,encoding='utf-8')return response# 遇到异常执行的方法def process_exception(self, request, exception, spider):# Called when a download handler or a process_request()# (from other downloader middleware) raises an exception.# Must either:# - return None: continue processing this exception# - return a Response object: stops process_exception() chain# - return a Request object: stops process_exception() chainpassdef spider_opened(self, spider):spider.logger.info("Spider opened: %s" % spider.name)

这篇关于scrapy--子类CrawlSpider中间件的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/1117679

相关文章

Python3中Sanic中间件的使用

《Python3中Sanic中间件的使用》Sanic框架中的中间件是一种强大的工具,本文就来介绍Python3中Sanic中间件的使用,具有一定的参考价值,感兴趣的可以了解一下... 目录Sanic 中间件的工作流程中间件的使用1. 全局中间件2. 路由中间件3. 异常处理中间件4. 异步中间件5. 优先级

开源分布式数据库中间件

转自:https://www.csdn.net/article/2015-07-16/2825228 MyCat:开源分布式数据库中间件 为什么需要MyCat? 虽然云计算时代,传统数据库存在着先天性的弊端,但是NoSQL数据库又无法将其替代。如果传统数据易于扩展,可切分,就可以避免单机(单库)的性能缺陷。 MyCat的目标就是:低成本地将现有的单机数据库和应用平滑迁移到“云”端

c++11工厂子类实现自注册的两种方法

文章目录 一、产品类构建1. 猫基类与各品种猫子类2.狗基类与各品种狗子类 二、工厂类构建三、客户端使用switch-case实现调用不同工厂子类四、自注册方法一:公开注册函数显式注册五、自注册方法二:构造函数隐形注册总结 一、产品类构建 1. 猫基类与各品种猫子类 class Cat {public:virtual void Printer() = 0;};class

获取所有classpath指定包下类的所有子类

1.问题 开发过程中,有时需要找到所有classpath下,特定包下某个类的所有子类,如何做到? 2. 实现 比较常见的解决方案是自己遍历目录,查找所有.class文件。 下面这个方法使用spring工具类实现,简化过程,不再需要自己遍历目录 /*** 获取在指定包下某个class的所有非抽象子类** @param parentClass 父类* @param packagePat

基于shard-jdbc中间件,实现数据分库分表

一、水平分割 1、水平分库 1)、概念: 以字段为依据,按照一定策略,将一个库中的数据拆分到多个库中。 2)、结果 每个库的结构都一样;数据都不一样; 所有库的并集是全量数据; 2、水平分表 1)、概念 以字段为依据,按照一定策略,将一个表中的数据拆分到多个表中。 2)、结果 每个表的结构都一样;数据都不一样; 所有表的并集是全量数据; 二、Shard-jdbc 中间件 1、架构图 2、特点

zdppy+vue3+onlyoffice文档管理系统实战 20240906 上课笔记 整合权限校验中间件

基于角色方法的中间件基本用法 import zdppy_api as apiimport zdppy_apimidauthasync def index(request):return api.resp.success()async def login(request):token = zdppy_apimidauth.get_role_token(role="admin")return ap

泛型第二课,派生子类、属性类型、方法重写、泛型擦除

子类(实现类) 子类与父类|接口一样使用泛型子类指定具体的类型子类与父类|接口 同时擦除类型子类泛型,父类|接口 擦除错误:不能子类擦除,父类|接口泛型 package com.pkushutong.genericity3;/*** 父类为泛型类* 1、属性* 2、方法* * 要么同时擦除,要么子类大于等于父类的类型* 不能子类擦除,父类泛型* 1、属性类型* 父类中,随父类型定

python scrapy爬虫框架 抓取BOSS直聘平台 数据可视化统计分析

使用python scrapy实现BOSS直聘数据抓取分析 前言   随着金秋九月的悄然而至,我们迎来了业界俗称的“金九银十”跳槽黄金季,周围的朋友圈中弥漫着探索新机遇的热烈氛围。然而,作为深耕技术领域的程序员群体,我们往往沉浸在代码的浩瀚宇宙中,享受着解决技术难题的乐趣,却也不经意间与职场外部的风云变幻保持了一定的距离,对行业动态或许仅有一鳞半爪的了解,甚至偶有盲区。   但正是这份对技术

【大数据Java基础-JAVA 面向对象14】面向对象的特征二:继承性 (三) 关键字:super以及子类对象实例化全过程

关键字:super 1.super 关键字可以理解为:父类的 2.可以用来调用的结构: 属性、方法、构造器 3.super调用属性、方法: 3.1 我们可以在子类的方法或构造器中。通过使用"super.属性"或"super.方法"的方式,显式的调用父类中声明的属性或方法。但是,通常情况下,我们习惯省略"super." 3.2 特殊情况:当子类和父类中定义了同名的属性时,我们要想在子类中调用父类

阿里中间件——diamond

一、前言        最近工作不忙闲来无事,仔细分析了公司整个项目架构,发现用到了很多阿里巴巴集团开源的框架,今天要介绍的是中间件diamond. 二、diamond学习笔记       1、diamond简介       diamond是一个管理持久配置(持久配置是指配置数据会持久化到磁盘和数据库中)的系统。无可厚非,淘宝内部正在使用diamond,在淘宝内部的绝大多数系统的配置都是由