This article shows how to crawl review text and star ratings from the App Store with Python, getting past the 500-review limit of the official API. I hope it is a useful reference for anyone facing the same problem.
I had seen App Store review crawlers online before, but because they use the officially provided API, they can fetch at most 500 reviews per app, which falls far short of what data analysis needs. After some digging, I wrote a crawler that can fetch many more reviews.
1 Configuration file (config_api.json)
```json
{
    "max_page": 5,
    "ids": ["id of an app to crawl", "id of another app to crawl"],
    "headers": {
        "User-Agent": "your own",
        "Authorization": "your own"
    },
    "intervals": 2
}
```
First, an explanation of the configuration fields:

max_page: the maximum number of review pages to crawl; each page holds 10 reviews;
ids: the list of app ids to crawl;
headers: the request headers your browser sends when making the request;
intervals: the delay, in seconds, between fetching successive pages.
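Since a missing field would only surface as a KeyError deep inside the crawler, it can help to validate the config right after loading it. A minimal sketch, assuming the four fields listed above (the `load_config` helper is introduced here for illustration and is not part of the original script):

```python
import json

# The fields the crawler expects, per the example config above
REQUIRED_KEYS = {"max_page", "ids", "headers", "intervals"}


def load_config(path="config_api.json"):
    """Load the crawler config and fail early if expected fields are missing."""
    with open(path, "r", encoding="utf-8") as f:
        config = json.load(f)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise KeyError(f"config is missing fields: {sorted(missing)}")
    if not config["ids"]:
        raise ValueError("ids must list at least one app id")
    return config
```

This keeps the error message pointing at the config file instead of at a random line of the crawler.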
2 Code (spider.py)
```python
import os
import csv
import json
import time

import requests

next_url = None
review_path = 'reviews'

if not os.path.exists(review_path):
    os.mkdir(review_path)

with open('config_api.json', 'r') as file:
    config = json.loads(file.read())

pending_queue = config['ids']
max_page = config['max_page']
headers = config['headers']
intervals = config['intervals']

FIELDNAMES = ['id', 'type', 'title', 'userName', 'isEdited', 'review', 'rating', 'date']


# Send the request and return the parsed JSON response (None on HTTP error)
def get_response(app_id, page):
    time.sleep(intervals)
    try:
        url = ('https://amp-api.apps.apple.com/v1/catalog/cn/apps/' + app_id
               + '/reviews?l=zh-Hans-CN&offset=' + str(page * 10)
               + '&platform=web&additionalPlatforms=appletv%2Cipad%2Ciphone%2Cmac')
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        return r.json()
    except requests.exceptions.HTTPError:
        return None


# Parse one page of the response, yielding one dict per review
def parse_response(r):
    global next_url
    next_url = r.get('next')
    for item in r['data']:
        yield {
            'id': item['id'],
            'type': item['type'],
            'title': item['attributes']['title'],
            'userName': item['attributes']['userName'],
            'isEdited': item['attributes']['isEdited'],
            'review': item['attributes']['review'],
            'rating': item['attributes']['rating'],
            'date': item['attributes']['date'],
        }


# Append one review to the app's CSV file, writing the header row if the file is new
def write_to_file(app_id, item):
    path = f'{review_path}/{app_id}.csv'
    new_file = not os.path.exists(path)
    with open(path, 'a', encoding='utf-8-sig', newline='') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=FIELDNAMES)
        if new_file:
            writer.writeheader()
        writer.writerow(item)


# Main loop: crawl each app id until max_page is reached or there is no next page
def main():
    while len(pending_queue):
        cur_id = pending_queue.pop()
        print(f'Started crawling {cur_id}')
        for i in range(0, max_page):
            r = get_response(cur_id, i)
            if r is None:
                print(f'Request for page {i + 1} failed, skipping the rest of {cur_id}')
                break
            print(f'Fetched page {i + 1}')
            for item in parse_response(r):
                write_to_file(cur_id, item)
            print(f'Stored page {i + 1}')
            if not next_url:
                break
        print(f'Finished crawling {cur_id}')


if __name__ == '__main__':
    main()
```
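The 500-review cap does not apply here because this amp-api endpoint pages through reviews with a plain offset query parameter, 10 reviews per page. A small helper illustrating how `get_response` above assembles the URL (the `build_review_url` name and its default parameters are introduced here for illustration; the endpoint and query string are copied from the script):

```python
def build_review_url(app_id, page, country='cn', lang='zh-Hans-CN'):
    """Build the amp-api review URL for one page; each page covers 10 reviews."""
    offset = page * 10
    return (
        f'https://amp-api.apps.apple.com/v1/catalog/{country}/apps/{app_id}'
        f'/reviews?l={lang}&offset={offset}'
        '&platform=web&additionalPlatforms=appletv%2Cipad%2Ciphone%2Cmac'
    )
```

Because the server also returns a `next` link when more pages exist, the crawler keeps incrementing the offset until either `max_page` is reached or `next` disappears from the response.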
3 Result preview
4 Closing remarks

If you have questions or suggestions, feel free to leave a comment; and if this helped you, you can also follow my official account. Thanks.