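Scraping Doutula (斗图啦) Memes with Multithreading
Goal
Scrape the memes from the first 100 pages.
Without further ado, here is the result:
This was only an exercise, so I stopped the program partway through. Even so, it is clear that multithreaded scraping is really fast.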
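Approach
1. First write the page-scraping code without multithreading.
2. Then scrape with multiple threads, using the producer-consumer pattern.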
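Before the real scraper, here is a minimal, network-free sketch of the producer-consumer pattern from step 2. It is not the code from this article: the names STOP, producer, consumer and run_pipeline are made up for illustration, the "images" are just strings, and it shuts threads down with sentinel values rather than by checking whether a queue is empty.

```python
import threading
from queue import Queue

STOP = object()  # hypothetical sentinel: tells a worker thread to exit


def producer(page_queue, item_queue):
    """Take page numbers and emit three fake image names per page."""
    while True:
        page = page_queue.get()
        if page is STOP:
            break
        for i in range(3):
            item_queue.put(f"page{page}-img{i}")


def consumer(item_queue, results, lock):
    """Take image names and record them (stands in for downloading)."""
    while True:
        item = item_queue.get()
        if item is STOP:
            break
        with lock:
            results.append(item)


def run_pipeline(pages, n_producers=2, n_consumers=4):
    page_queue, item_queue = Queue(), Queue()
    results, lock = [], threading.Lock()

    for page in pages:
        page_queue.put(page)
    # One sentinel per producer, queued after all the real pages.
    for _ in range(n_producers):
        page_queue.put(STOP)

    producers = [
        threading.Thread(target=producer, args=(page_queue, item_queue))
        for _ in range(n_producers)
    ]
    consumers = [
        threading.Thread(target=consumer, args=(item_queue, results, lock))
        for _ in range(n_consumers)
    ]
    for t in producers + consumers:
        t.start()

    # Once every producer has finished, no more items can arrive,
    # so it is safe to tell the consumers to stop.
    for t in producers:
        t.join()
    for _ in range(n_consumers):
        item_queue.put(STOP)
    for t in consumers:
        t.join()
    return sorted(results)
```

Calling run_pipeline(range(1, 4)) returns nine names, three for each of pages 1 to 3. The point of the sentinels is that no thread ever has to guess whether an empty queue means "finished" or "not filled yet".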
Implementation
'''
@Description: Scrape memes from Doutula
@Author: sikaozhifu
@Date: 2020-06-11 14:20:53
@LastEditTime: 2020-06-11 15:36:21
@LastEditors: Please set LastEditors
'''
import requests
from lxml import etree
import os
from urllib import request
import re
import threading
from queue import Queue, Empty


class Producer(threading.Thread):
    """Takes page URLs from page_queue and puts (img_url, filename) pairs on img_queue."""

    def __init__(self, page_queue, img_queue, *args, **kwargs):
        super(Producer, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.img_queue = img_queue

    def run(self):
        while True:
            try:
                # get_nowait() avoids blocking forever if another thread
                # takes the last page between an empty() check and get()
                url = self.page_queue.get_nowait()
            except Empty:
                break
            self.parse_url(url)

    def parse_url(self, url):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        text = response.text
        html = etree.HTML(text)
        imgs = html.xpath('//div[@class = "page-content text-center"]//img[@class != "gif"]')
        for img in imgs:
            img_url = img.get('data-original')
            alt = img.get('alt')
            # Use a regex to strip punctuation (half-width and full-width)
            # that is not allowed in Windows 10 file names
            alt = re.sub(r'[\??。\.!!,,\/:\*<>|"]', '', alt)
            suffix = os.path.splitext(img_url)[1]
            filename = alt + suffix
            self.img_queue.put((img_url, filename))


class Consumer(threading.Thread):
    """Takes (img_url, filename) pairs from img_queue and downloads them."""

    def __init__(self, page_queue, img_queue, *args, **kwargs):
        super(Consumer, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.img_queue = img_queue

    def run(self):
        while True:
            try:
                # Wait up to 5 seconds instead of blocking forever
                img_url, filename = self.img_queue.get(timeout=5)
            except Empty:
                # Nothing arrived in time: stop once every page has been
                # claimed, otherwise keep waiting for the producers
                if self.page_queue.empty():
                    break
                continue
            request.urlretrieve(img_url, 'images/' + filename)
            print(filename + ' downloaded!')


def main():
    page_queue = Queue(100)
    img_queue = Queue(1000)
    # urlretrieve does not create the target directory
    os.makedirs('images', exist_ok=True)
    for x in range(1, 101):
        url = 'https://www.doutula.com/photo/list/?page=%s' % x
        page_queue.put(url)

    # 15 producer threads parse pages, 64 consumer threads download images
    for x in range(15):
        t = Producer(page_queue, img_queue)
        t.start()

    for x in range(64):
        t = Consumer(page_queue, img_queue)
        t.start()


if __name__ == "__main__":
    main()
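The file-name step in the scraper (strip characters from the alt text, then append the extension from the image URL) can also be pulled out into a small helper that is easy to test on its own. The sketch below is only one possible version: safe_filename and its fallback parameter are hypothetical names, and it assumes the goal is simply to drop the characters Windows rejects in file names.

```python
import os
import re
from urllib.parse import urlsplit

# Characters Windows does not allow in file names: \ / : * ? " < > |
_UNSAFE = re.compile(r'[\\/:*?"<>|]')


def safe_filename(alt, img_url, fallback="image"):
    """Build a file name from an image's alt text and its URL extension."""
    # Fall back to a fixed stem when alt is missing or becomes empty.
    stem = _UNSAFE.sub("", alt or "") or fallback
    # Take the extension from the URL path so a query string is ignored.
    suffix = os.path.splitext(urlsplit(img_url).path)[1]
    return stem + suffix
```

For example, safe_filename('why?/so:"sad"', 'https://example.com/a/b.jpg?x=1') gives 'whysosad.jpg'. Two images with the same alt text would still overwrite each other, so a real scraper may also want to add a counter or a hash to the name.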
You can also watch the names of the downloaded images print to the console, and they fly by:
Summary
Multithreaded scraping is a real pleasure. Love it.