This post records the code for scraping classical Chinese poems with Scrapy; hopefully it is a useful reference for developers solving similar problems. Follow along with me to learn together!
Create the project
scrapy startproject poems
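This generates the standard project skeleton. The layout below is what `scrapy startproject` produces in recent Scrapy versions; minor details may differ by version:

```
poems/
    scrapy.cfg            # deployment configuration
    poems/
        __init__.py
        items.py          # item (data structure) definitions
        middlewares.py
        pipelines.py
        settings.py       # project settings
        spiders/          # spiders live here
            __init__.py
```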
Create the spider
cd poems\poems\spiders
scrapy genspider <name> <domain>
scrapy genspider poem_spider www.gushiwen.org

In poem_spider.py, change the start URL:
start_urls = ['https://www.gushiwen.org/default_1.aspx']
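The trailing `_1` in the entry URL looks like a page number. As a quick sketch, the listing pages could be built by analogy; note this URL pattern is an assumption inferred from the start URL, and the spider below follows the site's own "next" link instead:

```python
# Hypothetical: build the first few listing-page URLs by filling in
# the page number; inferred from the shape of the start URL.
base = "https://www.gushiwen.org/default_{}.aspx"
pages = [base.format(n) for n in range(1, 4)]
print(pages[0])  # https://www.gushiwen.org/default_1.aspx
```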
Define the data structure in items.py
class PoemsItem(scrapy.Item):
    title = scrapy.Field()    # title
    dynasty = scrapy.Field()  # dynasty
    author = scrapy.Field()   # author
    content = scrapy.Field()  # body text
    tags = scrapy.Field()     # tags
Configure settings.py
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3648.400 QQBrowser/10.4.3319.400"
Create a launcher script, main.py
from scrapy.cmdline import execute
import sys
import os

sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(["scrapy", "crawl", "poem_spider"])
Write the spider
Debug with the Scrapy shell
scrapy shell https://www.gushiwen.org/default_1.aspx
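Inside the shell you can try selectors interactively before committing them to the spider. A typical session might look like this (output elided; the selectors are the ones used in the spider below):

```
>>> response.css(".left .sons")                          # one node per poem
>>> response.css(".left .sons b::text").extract_first()  # first title
>>> response.css(".source a::text").extract()            # dynasty/author pairs
```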
# -*- coding: utf-8 -*-
import scrapy
from poems.items import PoemsItem


class PoemSpiderSpider(scrapy.Spider):
    name = 'poem_spider'  # spider name
    allowed_domains = ['www.gushiwen.org']  # allowed domains
    start_urls = ['https://www.gushiwen.org/default_1.aspx']  # entry URL

    def parse(self, response):
        docs = response.css(".left .sons")
        for doc in docs:
            poem_item = PoemsItem()
            poem_item['title'] = doc.css("b::text").extract()[0]
            poem_item['dynasty'], poem_item['author'] = doc.css(".source a::text").extract()
            poem_item['content'] = "".join(doc.css(".contson::text").extract()).strip()
            poem_item['tags'] = ",".join(doc.css(".tag a::text").extract())
            yield poem_item
        # Follow the "next page" link, if present
        next_link = response.css(".pagesright .amore::attr(href)")
        if next_link:
            next_link = next_link[0].extract()
            yield scrapy.Request("https://www.gushiwen.org" + next_link)
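The string handling in `parse` is easy to check in isolation. Below is a minimal sketch with made-up extracted values (the sample strings are placeholders, not real scraped data):

```python
# doc.css("...::text").extract() returns a list of strings;
# the spider joins the fragments and strips surrounding whitespace.
content_parts = ["  床前明月光，", "疑是地上霜。  "]
content = "".join(content_parts).strip()

# .source a::text yields exactly two links per poem (dynasty, author),
# so tuple unpacking fills both fields at once.
dynasty, author = ["唐代", "李白"]

# Tags are collapsed into one comma-separated string.
tags = ",".join(["唐诗三百首", "月亮", "思乡"])

print(content)
print(dynasty, author, tags)
```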
Save to a JSON file
scrapy crawl poem_spider -o test.json

Save to a CSV file
scrapy crawl poem_spider -o test.csv
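Each exported row carries the five fields defined in `PoemsItem`. The record below is a hand-written illustration of the JSON shape, not actual crawler output:

```python
import json

# Hypothetical item, mirroring the field names in items.py.
item = {
    "title": "静夜思",
    "dynasty": "唐代",
    "author": "李白",
    "content": "床前明月光，疑是地上霜。举头望明月，低头思故乡。",
    "tags": "唐诗三百首,月亮,思乡",
}

# ensure_ascii=False keeps the Chinese text readable in the output file.
print(json.dumps(item, ensure_ascii=False))
```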
That wraps up this note on scraping poems with Scrapy; I hope the article helps fellow programmers!