Python crawler framework: looter

2024-02-27 08:38
Tags: framework, crawler, python, looter


The well-known pyspider and scrapy need no introduction, so today let's look at looter instead.

Installation

First install Python 3 (version 3.6 or later is required), then run pip install looter
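Both prerequisites can be checked from Python itself. The following is a minimal sketch that uses only the standard library; the helper names python_ok and looter_installed are invented here for illustration and are not part of looter.

```python
import importlib.util
import sys


def python_ok(version_info=sys.version_info):
    """Return True if the interpreter is Python 3.6 or later."""
    return tuple(version_info[:2]) >= (3, 6)


def looter_installed():
    """Return True if a top-level package named 'looter' can be imported."""
    return importlib.util.find_spec('looter') is not None


if __name__ == '__main__':
    print('Python >= 3.6:', python_ok())
    print('looter installed:', looter_installed())
```

If the second line prints False, pip install looter was probably run against a different interpreter than the one executing this script.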

λ looter -h
Looter, a python package designed for web crawler lovers :)
Author: alphardex  QQ:2582347430
If any suggestion, please contact me.
Thank you for cooperation!

Usage:
  looter genspider <name> [--async]
  looter shell [<url>]
  looter (-h | --help | --version)

Options:
  -h --help        Show this screen.
  --version        Show version.
  --async          Use async instead of concurrent.

Image crawler

λ looter shell https://konachan.com/post
Available objects:
    url           The url of the site you crawled.
    res           The response of the site.
    tree          The element source tree to be parsed.
Available functions:
    fetch         Send HTTP request to the site and parse it as a tree. [has async version]
    view          View the page in your browser. (test rendering)
    links         Get the links of the page.
    save          Save what you crawled as a file. (json or csv)
Examples:
    Get all the <li> elements of a <ul> table:
    >>> items = tree.css('ul li')
    Get the links with a regex pattern:
    >>> items = links(res, pattern=r'.*/(jpeg|image)/.*')
For more info, plz refer to documentation:
    [looter]: https://looter.readthedocs.io/en/latest/
>>> imgs = tree.css('a.directlink::attr(href)').extract()
>>> imgs[1:10]
['https://konachan.com/jpeg/<hash>/<file name>.jpg',
 'https://konachan.com/image/<hash>/<file name>.jpg',
 ...]
(nine direct image URLs in total, each under /jpeg/ or /image/; the long percent-encoded file names are abbreviated here)
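Write the URLs to a text file, one per line (import Path first if the shell has not already made it available):

>>> from pathlib import Path
>>> Path('konachan.txt').write_text('\n'.join(imgs))

Then download everything in the list from a regular terminal:

wget -i konachan.txt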
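Where wget is not available, the same download step can be approximated in Python. This is a minimal sketch using only the standard library, not something looter provides: the function names filter_image_urls, local_name and download_all are hypothetical, and the regex is the one shown in the shell banner's links() example. Some sites reject the default urllib User-Agent, so treat the download part as an assumption to verify.

```python
import re
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlsplit
from urllib.request import urlretrieve

# Same pattern as the links() example printed by `looter shell`.
IMAGE_PATTERN = re.compile(r'.*/(jpeg|image)/.*')


def filter_image_urls(urls):
    """Keep only URLs that contain a /jpeg/ or /image/ path segment."""
    return [u for u in urls if IMAGE_PATTERN.match(u)]


def local_name(url):
    """Use the last path segment as the file name, decoding %20 and friends."""
    return unquote(PurePosixPath(urlsplit(url).path).name)


def download_all(list_file='konachan.txt', dest='.'):
    """Download every matching URL listed in list_file (one URL per line)."""
    urls = Path(list_file).read_text().splitlines()
    for url in filter_image_urls(urls):
        urlretrieve(url, str(Path(dest) / local_name(url)))


if __name__ == '__main__':
    download_all()
```

Decoding the file name is a design choice: wget keeps names close to the URL, while unquote turns %20 back into spaces, which is easier to read but may need quoting in a shell.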


Crawling V2EX

import time
import looter as lt
from pprint import pprint
from concurrent import futures

domain = 'https://www.v2ex.com'
total = []


def crawl(url):
    # Fetch one listing page and parse it into a selector tree
    tree = lt.fetch(url)
    items = tree.css('#TopicsNode .cell')
    for item in items:
        data = {}
        data['title'] = item.css('span.item_title a::text').extract_first()
        data['author'] = item.css('span.small.fade strong a::text').extract_first()
        data['source'] = f"{domain}{item.css('span.item_title a::attr(href)').extract_first()}"
        # Topics without replies have no counter element, so default to 0
        reply = item.css('a.count_livid::text').extract_first()
        data['reply'] = int(reply) if reply else 0
        pprint(data)
        total.append(data)
    # Pause between pages to avoid hammering the site
    time.sleep(1)


if __name__ == '__main__':
    # range(1, 10) yields pages 1 through 9
    tasklist = [f'{domain}/go/python?p={n}' for n in range(1, 10)]
    for task in tasklist:
        crawl(task)
    lt.save(total, name='v2ex.csv', sort_by='reply', order='desc')
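The script imports concurrent.futures but then crawls the pages one after another. A thread pool is one way to put that import to use. The sketch below is an assumption about how that might look, not code from the original article: crawl_all is a hypothetical helper, and fake_crawl stands in for the real crawl() so the example runs without network access.

```python
import time
from concurrent import futures


def crawl_all(crawl, tasklist, max_workers=4):
    """Call crawl(url) for every URL in tasklist on a small thread pool."""
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() returns results in input order and re-raises worker exceptions
        return list(executor.map(crawl, tasklist))


def fake_crawl(url):
    """Stand-in for crawl(): sleeps briefly instead of fetching the page."""
    time.sleep(0.1)
    return url


if __name__ == '__main__':
    domain = 'https://www.v2ex.com'
    tasklist = [f'{domain}/go/python?p={n}' for n in range(1, 10)]
    start = time.perf_counter()
    crawl_all(fake_crawl, tasklist)
    print(f'finished in {time.perf_counter() - start:.2f}s')
```

With the real crawl(), appending to the shared total list from several threads is safe in CPython because list.append is atomic, but a small max_workers value is advisable since the per-page sleep no longer spaces out requests across the whole run.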

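This crawls the first 9 pages of the python node (range(1, 10) stops before 10) and saves the topics sorted by reply count in descending order.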

,author,reply,source,title
0,chinesehuazhou,127,https://www.v2ex.com/t/562327#reply127,10 行 Python 代码,批量压缩图片 500 张,简直太强大了(内有公号宣传,不喜勿进)
1,chinesehuazhou,103,https://www.v2ex.com/t/557286#reply103,len(x) 击败 x.len(),从内置函数看 Python 的设计思想(内有公号宣传,不喜勿进)
2,nfroot,73,https://www.v2ex.com/t/555249#reply73,面对 Python 的强大和难用性表示深深的迷茫,莫非打开方式不对?
3,css3,58,https://www.v2ex.com/t/554724#reply58,你们用什么工具来管理 Python 的库啊?
4,Northxw,54,https://www.v2ex.com/t/558529#reply54,花式反爬之某众点评网
5,akmonde,48,https://www.v2ex.com/t/559926#reply48,Python 项目移植到其他机器,要求全 Linux 系统适配
6,kayseen,47,https://www.v2ex.com/t/562683#reply47,这道 Python 题目有大神会做吗?
7,hellomacos,41,https://www.v2ex.com/t/562413#reply41,老生常谈的问题:如何学好 Python
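The layout of the file above (a leading index column with an empty header, columns in alphabetical order, rows sorted by reply from high to low) can be reproduced with the standard library. This is only a sketch of the observable result, not looter's actual implementation of lt.save; the function save_sorted and its parameters are hypothetical.

```python
import csv
import io


def save_sorted(rows, fileobj, sort_by, order='asc'):
    """Sort a list of dicts by one key and write it as CSV with an index column."""
    ordered = sorted(rows, key=lambda r: r[sort_by], reverse=(order == 'desc'))
    # Alphabetical column order, matching the header seen in v2ex.csv
    fields = sorted(ordered[0]) if ordered else []
    writer = csv.writer(fileobj)
    writer.writerow([''] + fields)
    for index, row in enumerate(ordered):
        writer.writerow([index] + [row[f] for f in fields])
    return ordered


if __name__ == '__main__':
    # Made-up sample rows, not data from the crawl
    sample = [
        {'title': 'first', 'author': 'alice', 'source': 'u1', 'reply': 3},
        {'title': 'second', 'author': 'bob', 'source': 'u2', 'reply': 10},
    ]
    buffer = io.StringIO()
    save_sorted(sample, buffer, sort_by='reply', order='desc')
    print(buffer.getvalue())
```

Writing to a file object rather than a file name keeps the function easy to test; pass open('v2ex.csv', 'w', newline='', encoding='utf-8') to write to disk.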

WeChat official account: 苏生不惑




http://www.chinasem.cn/article/751843
