python3_scrapy: crawling Douyu mobile-app data packets (streamer photos from the 颜值 / beauty category)

This article describes how to use Python 3 and Scrapy to crawl the JSON data served to the Douyu mobile app and download streamer photos from the 颜值 (beauty) category; hopefully it is a useful reference for anyone tackling a similar problem.

1. Capturing the mobile app's data packets (tutorials for the capture tools are easy to find online)
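Once the capture tool reveals the API URL, it is worth confirming outside Scrapy that the endpoint returns JSON in the expected shape. A minimal sanity check with the requests library might look like the sketch below; the URL is the one captured for this article, the User-Agent mirrors the mobile one configured later in settings.py, and the API may of course change or require extra headers over time.

# Quick check of the captured API endpoint (assumes the URL is still valid).
import requests

url = "http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset=0"
headers = {"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_3 like Mac OS X)"}

resp = requests.get(url, headers=headers, timeout=10)
data = resp.json().get("data", [])
print(len(data))           # should print up to 20 records
if data:
    print(data[0].keys())  # look for game_name, room_id, vertical_src, nickname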


2. Writing the program
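The files below fit into the standard layout generated by scrapy startproject Douyu. The placement of start.py at the project root is an assumption on my part; the rest follows from the imports used in the code and the IMAGES_STORE path pointing at the spiders directory.

Douyu/                  # project root
├── scrapy.cfg
├── start.py            # helper script that launches the crawl (shown last)
└── Douyu/
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── douyu.py    # the spider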

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class DouyuItem(scrapy.Item):
    # define the fields for your item here like:
    # 1. Category (section) the streamer belongs to
    game_name = scrapy.Field()
    # 2. Streamer's room number
    room_id = scrapy.Field()
    # 3. URL of the streamer's cover photo
    vertical_src = scrapy.Field()
    # 4. Streamer's nickname
    nickname = scrapy.Field()
    # 5. Name under which the downloaded photo is saved
    message = scrapy.Field()

douyu.py

# -*- coding: utf-8 -*-
import scrapy
import json
from Douyu.items import DouyuItem


class DouyuSpider(scrapy.Spider):
    name = 'douyu'
    allowed_domains = ['douyucdn.cn']
    # 0. JSON endpoint captured from the Douyu mobile app for the 颜值 (beauty) category
    baseURL = "http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset="
    offset = 0
    start_urls = [baseURL + str(offset)]

    def parse(self, response):
        # 1. Decode the JSON response into a Python object (here a dict);
        #    the value under the "data" key is a list whose elements are dicts
        data_list = json.loads(response.text)["data"]
        # 2. Stop when "data" is empty; otherwise keep paging
        if not data_list:
            return
        for data in data_list:
            item = DouyuItem()
            item["game_name"] = data["game_name"]
            item["room_id"] = data["room_id"]
            item["vertical_src"] = data["vertical_src"]
            item["nickname"] = data["nickname"]
            # str() keeps the concatenation safe even if room_id comes back as a number
            item["message"] = str(data["game_name"]) + str(data["room_id"]) + str(data["nickname"])
            # 3. Hand the item over to the pipeline
            yield item
        # 4. This JSON API is crawled by paging; there is no HTML to extract with xpath()
        self.offset += 20
        # 5. The new request goes back to the engine, which hands it to the scheduler's queue
        yield scrapy.Request(self.baseURL + str(self.offset), callback=self.parse)
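For reference, each element of the "data" list that the spider iterates over is assumed to look roughly like the record below. The values are made up; only the four keys accessed in parse() matter here.

# Illustrative record from the "data" list (values invented for illustration).
sample = {
    "room_id": "289836",
    "game_name": "颜值",
    "nickname": "some_streamer",
    "vertical_src": "https://rpic.douyucdn.cn/example_vertical.jpg",
    # ... the API returns additional fields that this spider ignores
}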

pipelines.py

# -*- coding: utf-8 -*-
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# 1. Import the ImagesPipeline class from scrapy/pipelines/images.py in the Scrapy package
import os
import scrapy
from Douyu.settings import IMAGES_STORE as images_store
from scrapy.pipelines.images import ImagesPipeline


class DouyuPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        """Download the image."""
        image_link = item["vertical_src"]
        # 2. The pipeline hands the image URL to the downloader
        yield scrapy.Request(image_link)

    def item_completed(self, results, item, info):
        """Rename the downloaded image."""
        # 3. Pull the storage path out of results:
        #    results is a list of (success, info) tuples; the info element is a dict
        #    holding the image url, the relative storage path and an MD5 checksum
        #    (the comprehension below iterates over the outer results list)
        img_path = [x["path"] for ok, x in results if ok]
        # 4. Rename the downloaded image: os.rename(src_name, dst_name)
        #    os.path.join works whether or not IMAGES_STORE ends with a slash
        os.rename(os.path.join(images_store, img_path[0]),
                  os.path.join(images_store, item["message"] + ".jpg"))
        return item
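As a reference for the list comprehension above, the results argument that Scrapy passes to item_completed() is a list of (success, info) tuples. With the single image requested per item here, it is assumed to look roughly like this sketch (URL and hash values are made up):

# Illustrative value of `results` for one successfully downloaded image.
results = [
    (True, {
        "url": "https://rpic.douyucdn.cn/example_vertical.jpg",      # original image URL
        "path": "full/0a1b2c3d4e5f67890a1b2c3d4e5f6789.jpg",          # path relative to IMAGES_STORE
        "checksum": "9f3b2a0c...",                                    # MD5 checksum of the image
    }),
]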

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for Douyu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Douyu'

SPIDER_MODULES = ['Douyu.spiders']
NEWSPIDER_MODULE = 'Douyu.spiders'

# 1. Directory where the downloaded images are stored (IMAGES_STORE)
# An alternative way to write the path: IMAGES_STORE = "E:\\Python\\课程代码\\Douyu\\Douyu\\spiders\\"
IMAGES_STORE = "E:/Python/课程代码/Douyu/Douyu/spiders/"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# 2. Use a mobile user agent so the requests look like they come from the app
USER_AGENT = 'Mozilla/5.0 (iPhone 84; CPU iPhone OS 10_3_3 like Mac OS X) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.0 MQQBrowser/7.8.0 Mobile/14G60 Safari/8536.25 MttCustomUA/2 QBWebViewType/1 WKType/1'

# Obey robots.txt rules
# 3. Ignore robots.txt (set to False or simply comment the line out)
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Douyu.middlewares.DouyuSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'Douyu.middlewares.DouyuDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# 4. Enable the item pipeline
ITEM_PIPELINES = {
    'Douyu.pipelines.DouyuPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
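Two practical notes on these settings. First, Scrapy's ImagesPipeline depends on the Pillow library, so it must be installed in the environment (pip install Pillow). Second, the hard-coded Windows path only works on that one machine; a more portable alternative, which is my own suggestion rather than part of the original project, is to build IMAGES_STORE relative to the settings module:

# Portable alternative for IMAGES_STORE (an assumption, not from the original
# project): store images in an "images" folder next to this settings.py file.
import os

IMAGES_STORE = os.path.join(os.path.dirname(os.path.abspath(__file__)), "images")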

start.py

# -*- coding:utf-8 -*-
from scrapy import cmdline

cmdline.execute("scrapy crawl douyu".split())
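Assuming start.py sits in the project root next to scrapy.cfg, running python start.py is equivalent to running scrapy crawl douyu from that directory; once the crawl finishes, the renamed photos appear under the IMAGES_STORE directory configured in settings.py.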

3. Results of the image crawl


