AI Web Scraping: Automatically Fetching Baidu's Real-Time Hot Search List


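Task and goal: automatically fetch the title and hot-search index of every entry on Baidu's real-time hot search list.

Title element: <div class="c-single-text-ellipsis"> 东部战区台岛战巡演练模拟动画 </div>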
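To make the target concrete, here is a minimal sketch of pulling the title text out of markup like the element above. It uses Python's standard-library html.parser rather than Selenium so that it runs without a browser; the class name TitleExtractor and the sample string are illustrative assumptions, not part of the original script.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text of every <div class="c-single-text-ellipsis">."""

    TARGET_CLASS = "c-single-text-ellipsis"

    def __init__(self):
        super().__init__()
        self.titles = []
        self._depth = 0    # > 0 while the parser is inside a target div
        self._buffer = []  # text fragments of the current target div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1  # nested div inside a target div
        elif self.TARGET_CLASS in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.titles.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


sample = '<div class="c-single-text-ellipsis"> 东部战区台岛战巡演练模拟动画 </div>'
parser = TitleExtractor()
parser.feed(sample)
parser.close()
print(parser.titles)
```

The same class could be fed the string returned by driver.page_source, although the Selenium locators used later in this article are the more direct route.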

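Step 1: enter the following prompt into DeepSeek:

You are a Python web-scraping expert. Write a Python script that completes the following scraping task:

Create a new Excel file, topbaidu.xlsx, in the folder F:\aivideo.

Set the chromedriver path to: "D:\Program Files\chromedriver125\chromedriver.exe"

Use Selenium to open the page: https://top.baidu.com/board?tab=realtime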

The request headers are:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
Accept-Encoding: gzip, deflate, br, zstd
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
Cache-Control: max-age=0
Connection: keep-alive
Host: top.baidu.com
Referer: https://top.baidu.com/board?platform=pc&tab=homepage&sa=pc_index_homepage_all
Sec-Ch-Ua: "Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"
Sec-Ch-Ua-Mobile: ?0
Sec-Ch-Ua-Platform: "Windows"
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: same-origin
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36
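Headers copied from the browser's developer tools, like the list above, are easier to reuse once they are in a Python dict. The following is a minimal sketch that assumes one "Name: value" pair per line; the function name parse_headers is hypothetical and the sample contains only a few of the headers.

```python
def parse_headers(raw: str) -> dict:
    """Turn 'Name: value' lines (as copied from DevTools) into a dict."""
    headers = {}
    for line in raw.splitlines():
        # Split on the first colon only, so URLs in values stay intact.
        name, sep, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if sep and name and value:  # skip blank lines and empty values
            headers[name] = value
    return headers


raw_headers = """
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
Cache-Control: max-age=0
Referer: https://top.baidu.com/board?platform=pc&tab=homepage&sa=pc_index_homepage_all
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36
"""
headers = parse_headers(raw_headers)
for name, value in headers.items():
    print(f"{name}: {value}")
```

Splitting on the first colon matters here because values such as the Referer URL contain colons of their own.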

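Parse the page source and print it.

Locate the div elements with class="c-single-text-ellipsis", extract their text as the hot search titles, and save it to column 1 of topbaidu.xlsx.

Locate the div elements with class="hot-index_1Bl1a", extract their text as the hot-search index, and save it to column 2 of topbaidu.xlsx.

Notes:

Print progress information to the screen at every step.

Pause for a random 1 to 10 seconds after each page is parsed.

Set the request headers to cope with anti-scraping measures.

DataFrame.append was deprecated in pandas 1.4.0 and removed in pandas 2.0. To avoid the resulting error, use the concat function instead of append.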
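The last note in the prompt can be illustrated with a short sketch of replacing DataFrame.append with pandas.concat. The column names and rows below are made-up sample data, not values from the article.

```python
import pandas as pd

df = pd.DataFrame([{"title": "first entry", "hot_index": "100"}])
new_rows = [{"title": "second entry", "hot_index": "90"}]

# Before pandas 2.0 this was often written as:
#     df = df.append(new_rows, ignore_index=True)
# concat takes a list of DataFrames and returns a new one.
df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
print(df)
```

Passing ignore_index=True renumbers the rows from 0, so the result does not carry duplicate index labels from the two inputs.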

Step 2: review the Python code generated by DeepSeek:

import os
import time
import random

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

# Path to chromedriver
chromedriver_path = "D:\\Program Files\\chromedriver125\\chromedriver.exe"

# Request headers from the prompt. Only 'user-agent' is applied further down
# (as a Chrome argument); Selenium does not send the rest of this dict.
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-encoding': 'gzip, deflate, br, zstd',
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'cache-control': 'max-age=0',
    'referer': 'https://top.baidu.com/board?platform=pc&tab=homepage&sa=pc_index_homepage_all',
    'sec-ch-ua': '"Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
}

# Configure Chrome options (only the User-Agent is overridden)
chrome_options = Options()
chrome_options.add_argument(f'user-agent={headers["user-agent"]}')

# Set up the chromedriver service
service = Service(chromedriver_path)

# Launch Chrome
driver = webdriver.Chrome(service=service, options=chrome_options)

# Open the page
url = 'https://top.baidu.com/board?tab=realtime'
driver.get(url)

# Wait for the page to finish loading (fixed delay)
time.sleep(5)

# Get the page source and print it
html = driver.page_source
print("Page source:")
print(html)

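# Create the Excel file with its header row (to_excel needs openpyxl installed)
os.makedirs('F:\\aivideo', exist_ok=True)
excel_path = os.path.join('F:\\aivideo', 'topbaidu.xlsx')
df = pd.DataFrame(columns=['Title', 'Hot index'])
df.to_excel(excel_path, index=False)

# Locate the hot search titles and index values
titles = driver.find_elements(By.CLASS_NAME, 'c-single-text-ellipsis')
indices = driver.find_elements(By.CLASS_NAME, 'hot-index_1Bl1a')

# Collect each title together with its index value
data = []
for title, index in zip(titles, indices):
    title_text = title.text
    index_text = index.text

    # Print progress
    print(f"Title: {title_text}")
    print(f"Hot index: {index_text}")

    # Add the row to the list
    data.append({'Title': title_text, 'Hot index': index_text})

# Random pause once the page has been parsed (the prompt asks for one pause
# per page, not one per entry)
time.sleep(random.randint(1, 10))

# Use concat to add the rows to the DataFrame, then save
df = pd.concat([df, pd.DataFrame(data)], ignore_index=True)
df.to_excel(excel_path, index=False)

# Close the browser
driver.quit()

print("Hot search data saved to the Excel file.")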
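The prompt asks for a full set of request headers, but the script above only changes the User-Agent, because Selenium has no option for arbitrary headers. One possible way to send more of them is Chrome's DevTools command Network.setExtraHTTPHeaders, reached through the Chrome driver's execute_cdp_cmd method. The sketch below is an assumption-laden illustration, not part of the generated script: the helper name apply_extra_headers and the choice of which headers to skip are hypothetical.

```python
# Headers Chrome sets itself; this sketch skips them (an assumption made
# here to avoid overriding the browser's own values).
BROWSER_MANAGED = {"host", "connection", "accept-encoding"}


def apply_extra_headers(driver, headers):
    """Ask Chrome (via DevTools) to add `headers` to subsequent requests.

    `driver` is expected to be a Selenium Chrome driver, which provides
    execute_cdp_cmd(). Returns the dict of headers actually passed to Chrome.
    """
    extra = {
        name: value
        for name, value in headers.items()
        if name.lower() not in BROWSER_MANAGED
        and not name.lower().startswith("sec-")
    }
    driver.execute_cdp_cmd("Network.enable", {})
    driver.execute_cdp_cmd("Network.setExtraHTTPHeaders", {"headers": extra})
    return extra
```

It would be called once, after webdriver.Chrome(...) and before driver.get(url), with a dict of real header names such as Accept-Language and Referer. Whether the extra headers make any difference to Baidu's anti-scraping checks is not something this sketch verifies.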

Step 3: open Visual Studio Code, create a new .py file, paste the Python code into it, and press F5 to run the program:

Program output:
