Scraping Web Novel Content

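This article shows how to scrape the content of a web novel with Python, using requests to fetch the pages and parsel to parse them. It is intended as a reference for developers working on similar problems.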

Code

# @Time: 2024/1/27 16:26
# @Author: 马龙强
# @File: 爬取飞卢小说内容.py
# @software: PyCharm
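"""
URL: https://b.faloo.com/724903_1.html
Data: novel content / chapter titles
Analysis: for VIP content, fetch the chapter images and obtain the text through text recognition (OCR)
"""
"""
Implementation steps
"""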
import requests
import re
import parsel
# Request the novel's catalog page
link = 'https://b.faloo.com/724903.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}
# Fetch the catalog page, which links to every chapter
html_data = requests.get(url=link, headers=headers).text
selector_1 = parsel.Selector(html_data)
# Novel name
name = selector_1.css('#novelName::text').get()
# Chapter URLs
href = selector_1.css('.DivTd3 a::attr(href)').getall()
# print(href)
for index in href:
    # Chapter links are protocol-relative, so prepend the scheme
    url = 'https:' + index
    # Send the request, reusing the browser-like headers defined above
    response = requests.get(url=url, headers=headers)
    # print(response)
    # print(response.text)
    # Three ways to extract the title:
    # 1. Regular expression
    #    title = re.findall('<h1>玄幻：我！天命大反派   (.*?)</h1>', response.text)[0]
    # 2. CSS selector
    #    selector = parsel.Selector(response.text)
    #    title = selector.css('.c_l_title h1::text').get()
    # 3. XPath
    #    title = selector.xpath('//*[@class="c_l_title"]/h1/text()').get()
    # get() returns the first match as a string; getall() returns every match as a list
    # Turn response.text into a parseable object
    selector = parsel.Selector(response.text)
    # Extract the chapter title (the part after the novel name)
    title = selector.css('.c_l_title h1::text').get().split('  ')[-1]
    # Extract the chapter text, joining the list of paragraphs into one string
    content = '\n'.join(selector.css('.noveContent p::text').getall())
    # Save the data to a local txt file:
    # name + '.txt' is the file name; mode 'a' appends each chapter
    with open(name + '.txt', mode='a', encoding='utf-8') as f:
        # Write the title, then the chapter text
        f.write(title)
        f.write('\n')
        f.write(content)
        f.write('\n')
    print(title)
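
The loop above assumes that every href is protocol-relative, that every heading separates the novel name from the chapter name with two spaces, and that the novel name is a valid file name. The following is a minimal offline sketch of more defensive versions of those three steps, using only the standard library. The helper names build_chapter_url, clean_title, and safe_filename are hypothetical and not part of the original script, and the sample inputs are made up for illustration rather than taken from the site.

```python
import re
from urllib.parse import urljoin

# Catalog URL from the script above; used as the base for resolving links
BASE_URL = 'https://b.faloo.com/724903.html'


def build_chapter_url(href, base=BASE_URL):
    """Resolve a chapter href against the catalog URL.

    urljoin handles protocol-relative ('//host/path'), root-relative
    ('/path') and absolute links, so no manual 'https:' prefix is needed.
    """
    return urljoin(base, href)


def clean_title(raw_title):
    """Return the last piece of a heading split on runs of 2+ whitespace.

    Falls back to the whole heading when there is no such separator,
    and to an empty string when the heading is missing.
    """
    parts = [p for p in re.split(r'\s{2,}', (raw_title or '').strip()) if p]
    return parts[-1] if parts else ''


def safe_filename(name, fallback='novel'):
    """Replace characters that Windows forbids in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', name or '').strip()
    return cleaned or fallback


if __name__ == '__main__':
    print(build_chapter_url('//b.faloo.com/724903_1.html'))
    print(clean_title('Novel Name   Chapter 1'))
    print(safe_filename('My: Novel?') + '.txt')
```

In the scraping loop, these helpers would stand in for 'https:' + index, the .split('  ')[-1] call, and name + '.txt' respectively. Passing the result of .get() through clean_title or safe_filename also avoids an AttributeError or TypeError when a selector matches nothing and returns None.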