My Big Data Journey -- Scraping Maoyan Reviews of Avengers: Endgame (复联4)


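Rant: my laptop just ran out of battery and everything I had written was lost, so I get to write it all over again. CSDN, you are all grown up now; it is time you learned to autosave.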

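I went to see it yesterday with two friends, and overall it was pretty good. The only parts that really got the audience laughing were probably the scenes with Hulk.
Wearing two pairs of glasses for a 3D movie is genuinely uncomfortable. On top of that, some major life events came up out of nowhere (ones I really did not want). After the movie my headache kept getting worse, and last night was my first sleepless night.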

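My head still hurts a bit, but the learning has to go on. I was curious what audiences thought of Avengers: Endgame, so today I am scraping its reviews from Maoyan.
The implementation follows. It is for learning only; I don't want to put extra load on their servers.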

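First, open the movie's page on Maoyan in a desktop browser.

Only a handful of reviews show up there. That won't do. The mobile version of the Avengers: Endgame page, however, shows all of them.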

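Chrome is handy here: it can make a desktop browser behave like a mobile one. Press F12 to open DevTools, click the "Toggle device toolbar" icon (boxed in red in the original screenshot), then press F5 to refresh. Twice is fine too.
You can also pick a phone model from the device dropdown; remember to refresh after choosing.

Then keep scrolling down until you find "查看全部 \d+ 条评论" ("View all N reviews") and click it.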

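Keep scrolling after that, and the JSON responses carrying the reviews start to show up in the Network panel.

Next, we just need to find the pattern in the URLs of those JSON responses.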

http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=15&ts=0&type=3
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=15&limit=15&ts=1556790644827&type=3
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=30&limit=15&ts=1556790644827&type=3
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=45&limit=15&ts=1556790644827&type=3
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=60&limit=15&ts=1556790644827&type=3

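See the pattern? Apart from ts, which is 0 only on the first request, the one thing that changes is offset, and it grows by 15 with each URL.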
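
To make the pattern concrete, here is a minimal sketch that rebuilds the URLs listed above from a page number. The names `BASE` and `page_url` are mine, purely for illustration; only the query parameters come from the captured requests.

```python
BASE = 'http://m.maoyan.com/review/v2/comments.json'


def page_url(page, movie_id=248172, limit=15, ts=0):
    """Build the review URL for a zero-based page number (hypothetical helper)."""
    offset = page * limit  # offset grows by `limit` with each page
    return (f'{BASE}?movieId={movie_id}&userId=-1'
            f'&offset={offset}&limit={limit}&ts={ts}&type=3')


if __name__ == '__main__':
    # The first request uses ts=0; later ones reuse the ts the page sent.
    print(page_url(0))
    for page in range(1, 5):
        print(page_url(page, ts=1556790644827))
```

Running it prints the same five URLs as the list above.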

Now for the code. I recommend logging in first and using your own cookies.
FuLian4.py

import json
import time

import requests


class FL4:
    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Mobile Safari/537.36',
            'Referer': 'http://m.maoyan.com/movie/248172/comments?_v_=yes',
            'Connection': 'keep-alive',
            'Cookie': 'lxsdk_cuid=16a77029578c8-09b499b0040059-39395704-1fa400-16a77029579c8; uuid_n_v=v1; iuuid=134B71006C9C11E984F25B6CA47A6EB12DA16CDD3CFA49059091A26926EFF957; webp=true; ci=20%2C%E5%B9%BF%E5%B7%9E; _lx_utm=utm_source%3DBaidu%26utm_medium%3Dorganic; _lxsdk=D41218F06C9A11E991BFF33D07D9D8F114AEFA62DB3048D1A0816CCD72F7EA47; __mta=217338537.1556774819301.1556783927287.1556788162297.7',
            'Host': 'm.maoyan.com'
        }
        self.files = open('FuLian42.txt', 'w', encoding='utf-8')

    def req(self, url):
        # Retry until the request goes through, pausing so the server is not hammered
        while True:
            try:
                response = requests.get(url=url, headers=self.headers)
                time.sleep(2)
                return response
            except requests.exceptions.ConnectionError:
                time.sleep(3)

    def get_json(self, response):
        data = json.loads(response.text).get('data')
        comments = data.get('comments') if data else []
        for comment in comments:
            infos = {
                'userId': comment.get('userId'),  # user ID
                'nick': comment.get('nick'),  # nickname
                'gender': comment.get('gender'),  # gender
                'content': comment.get('content'),  # review text
                'score': comment.get('score'),  # rating
                'time': comment.get('time'),  # review time
                'userLevel': comment.get('userLevel')  # user level
            }
            info = json.dumps(infos, ensure_ascii=False)
            print(info)
            self.files.write(info)
            self.files.write('\n')

    def main(self):
        urls = ['http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset={}&limit=15&ts=1556792832710&type=3'.format(i * 15) for i in range(0, 100)]
        for url in urls:
            print(url)
            response = self.req(url)
            time.sleep(2)
            self.get_json(response)
        self.files.close()


if __name__ == '__main__':
    fl4 = FL4()
    fl4.main()

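The code above is one way to do it, but as soon as you reach about 1000 records, Maoyan stops serving you.

For data analysis, 1000 records are nowhere near enough.
So let's look at the URL and analyze it once more.
Here it is. Besides offset, the URL also carries a ts parameter, which is almost certainly a timestamp.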

http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=15&limit=15&ts=1556844579617&type=3
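
A quick way to inspect the parameters is to split the query string with the standard library. This is only a sketch; `query_params` is a name I made up for the example.

```python
from urllib.parse import parse_qs, urlsplit


def query_params(url):
    """Return the query string of `url` as a flat dict (hypothetical helper)."""
    parsed = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per key; each key appears once here, so keep the first value
    return {key: values[0] for key, values in parsed.items()}


if __name__ == '__main__':
    url = ('http://m.maoyan.com/review/v2/comments.json?movieId=248172'
           '&userId=-1&offset=15&limit=15&ts=1556844579617&type=3')
    for key, value in query_params(url).items():
        print(key, '=', value)
```

The 13-digit ts value has the same shape as a Unix timestamp in milliseconds, which fits the guess above.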

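Looking at a single comment from the output, it has both a startTime and a time field, which at first glance look like timestamps. Running one through an online converter confirms it: they are timestamps, in milliseconds.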

{'avatarUrl': 'https://img.meituan.net/maoyanuser/cf2d33a3e16435e47a3c4c8e69fb22ba4458.jpg', 'buyTicket': False, 'content': '不错，有点小感动，希望还有第五部。', 'gender': 2, 'id': 1065529952, 'imageUrls': [], 'likedByCurrentUser': False, 'major': False, 'movie': {'id': 0, 'sc': 0}, 'movieId': 248172, 'nick': '请勿打扰～', 'replyCount': 0, 'score': 8, 'spoiler': False, 'startTime': '1556844600000', 'tagList': [{'id': 1, 'name': '好评'}, {'id': 4, 'name': '购票'}], 'time': 1556844600000, 'upCount': 0, 'userId': 1649520068, 'userLevel': 2, 'vipType': 0}

Convert the timestamps to dates to take a look:

    def get_json(self, response):
        data = json.loads(response.text).get('data')
        comments = data.get('comments')
        for comment in comments:
            times = comment.get('time')
            # time is in milliseconds, so divide by 1000 to get seconds
            timeArray = time.localtime(times / 1000)
            otherStyleTime = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
            print(otherStyleTime)
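
One caveat: `time.localtime` uses the time zone of whatever machine runs the script. If you want the same output everywhere, a small variation pins the zone explicitly. The sketch below assumes UTC+8 is the zone you want to display (my assumption), and `ms_to_text` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Assumption: show times in UTC+8 regardless of the machine's own time zone.
UTC_PLUS_8 = timezone(timedelta(hours=8))


def ms_to_text(ms, tz=UTC_PLUS_8):
    """Format a millisecond Unix timestamp as 'YYYY-MM-DD HH:MM:SS' in `tz`."""
    return datetime.fromtimestamp(ms / 1000, tz=tz).strftime('%Y-%m-%d %H:%M:%S')


if __name__ == '__main__':
    # The `time` value from the sample comment above
    print(ms_to_text(1556844600000))  # 2019-05-03 08:50:00
```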

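From this we can tell:
1) The backend returns reviews in descending time order, with timestamps rounded to the minute.
2) We cannot know in advance how many reviews come back before the timestamp changes.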

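The approach:
1) Each response contains many timestamps, so use them.
2) Send a request.
3) Record the first timestamp.
4) Record the second timestamp.
5) On reaching a third timestamp, set ts to the second timestamp and rebuild the URL.
6) If a whole response never reaches a third timestamp, keep fetching by increasing offset until one shows up.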
What does that mean? The request log below shows the pattern; see if you can spot it.

http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=0&type=3
time received 1556849940000
time received 1556849880000
time received 1556849880000
time received 1556849880000
time received 1556849880000
time received 1556849880000
time received 1556849880000
time received 1556849880000
time received 1556849880000
time received 1556849880000
time received 1556849880000
time received 1556849880000
time received 1556849820000
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=1556849880000&type=3
time received 1556849820000
time received 1556849820000
time received 1556849820000
time received 1556849760000
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=1556849820000&type=3
time received 1556849760000
time received 1556849760000
time received 1556849760000
time received 1556849760000
time received 1556849760000
time received 1556849760000
time received 1556849700000
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=1556849760000&type=3
time received 1556849760000
time received 1556849700000
time received 1556849700000
time received 1556849700000
time received 1556849640000
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=1556849700000&type=3
time received 1556849640000
time received 1556849640000
time received 1556849640000
time received 1556849640000
time received 1556849640000
time received 1556849640000
time received 1556849640000
time received 1556849580000
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=1556849640000&type=3
time received 1556849640000
time received 1556849580000
time received 1556849580000
time received 1556849580000
time received 1556849580000
time received 1556849580000
time received 1556849520000
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=1556849580000&type=3
time received 1556849520000
time received 1556849520000
time received 1556849400000
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=1556849520000&type=3
time received 1556849400000
time received 1556849400000
time received 1556849400000
time received 1556849400000
time received 1556849400000
time received 1556849340000
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=1556849400000&type=3
time received 1556849340000
time received 1556849340000
time received 1556849340000
time received 1556849340000
time received 1556849340000
time received 1556849340000
time received 1556849280000
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=1556849340000&type=3
time received 1556849280000
time received 1556849280000
time received 1556849280000
time received 1556849220000
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=1556849280000&type=3
time received 1556849220000
time received 1556849220000
time received 1556849160000
http://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=21&ts=1556849220000&type=3
Process finished with exit code -1
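
The rule for choosing the next ts can be tried offline against this log. Below is a minimal sketch of steps 3 to 5; `next_ts` is a hypothetical helper that works on a plain list of timestamps instead of a live response.

```python
def next_ts(times, ts):
    """Return the ts to use for the next request, or None when the response
    never reaches a third distinct timestamp (step 6: raise offset instead).

    times: the `time` values of one response, newest first
    ts:    the ts used for this request (0 on the very first request)
    """
    first = ts      # on later requests the ts we sent counts as the "first" timestamp
    second = None
    for t in times:
        if first == 0:
            first = t            # first request: start from the newest review
        if second is None:
            if t != first:
                second = t       # second distinct timestamp
            continue
        if t != second:
            return second        # third distinct timestamp reached
    return None


if __name__ == '__main__':
    # First request in the log: one review at ...940000, eleven at ...880000, one at ...820000
    first_response = [1556849940000] + [1556849880000] * 11 + [1556849820000]
    print(next_ts(first_response, 0))  # 1556849880000, the ts of the second URL in the log
```

Feeding it the later responses reproduces the ts of each following URL in the log.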

Rebuilding the URL once more, I found that limit can be raised to 21 reviews per request; at 22 it stops working.
Final code

import csv
import json
import time

import requests


class FL4:
    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Mobile Safari/537.36',
            'Referer': 'http://m.maoyan.com/movie/248172/comments?_v_=yes',
            'Connection': 'keep-alive',
            'Cookie': 'lxsdk_cuid=16a77029578c8-09b499b0040059-39395704-1fa400-16a77029579c8; uuid_n_v=v1; iuuid=134B71006C9C11E984F25B6CA47A6EB12DA16CDD3CFA49059091A26926EFF957; webp=true; ci=20%2C%E5%B9%BF%E5%B7%9E; _lx_utm=utm_source%3DBaidu%26utm_medium%3Dorganic; _lxsdk=D41218F06C9A11E991BFF33D07D9D8F114AEFA62DB3048D1A0816CCD72F7EA47; __mta=217338537.1556774819301.1556783927287.1556788162297.7',
            'Host': 'm.maoyan.com'
        }
        self.count = 1
        # Reviews fetched per request; Maoyan allows at most 21
        self.limit = 21
        self.movieId = '248172'
        self.ts = 0
        self.offset = 0

    def req(self):
        # Build the URL from the current offset and ts
        url = ('http://m.maoyan.com/review/v2/comments.json?movieId=' + self.movieId
               + '&userId=-1&offset=' + str(self.offset)
               + '&limit=' + str(self.limit) + '&ts=' + str(self.ts) + '&type=3')
        print(url)
        return url

    def open_url(self, url):
        try:
            response = requests.get(url=url, headers=self.headers)
            time.sleep(2)
            return response
        except requests.exceptions.ConnectionError:
            # Back off and return None; main() retries the same URL
            time.sleep(3)
            return None

    def get_json(self, response):
        ts_duration = self.ts
        res = json.loads(response.text)
        data = res.get('data')
        comments = data.get('comments')
        for comment in comments:
            comment_time = comment['time']
            print('time received', comment_time)
            if self.ts == 0:
                # First request: start from the newest review's timestamp
                self.ts = comment_time
                ts_duration = comment_time
            if comment_time != self.ts and self.ts == ts_duration:
                # Second distinct timestamp in this response
                ts_duration = comment_time
            if comment_time != ts_duration:
                # Third distinct timestamp: restart from the second one
                self.ts = ts_duration
                self.offset = 0
                return self.req()
            # Still on the first or second timestamp: save the review
            infos = {
                'userId': comment.get('userId'),  # user ID
                'nick': comment.get('nick'),  # nickname
                'gender': comment.get('gender'),  # gender
                'content': comment.get('content'),  # review text
                'score': comment.get('score'),  # rating
                'time': comment.get('time'),  # review time
                'userLevel': comment.get('userLevel')  # user level
            }
            info = json.dumps(infos, ensure_ascii=False)
            print(info)
            with open('FL4.txt', 'a', encoding='utf-8') as f:
                f.write(info)
                f.write('\n')
            self.save_csv([infos['userId'], infos['nick'], infos['gender'], infos['content'],
                           infos['score'], infos['time'], infos['userLevel']])
            self.count += 1
        if res['paging']['hasMore']:
            # No third timestamp in this response: page further with the same ts
            self.offset += (self.limit + 9)
            print('offset', self.offset)
            return self.req()
        return None

    def save_csv(self, row):
        # Append one review as a semicolon-separated row
        with open('FL4.csv', 'a', newline='', encoding='utf-8') as c:
            csv.writer(c, delimiter=';').writerow(row)

    def main(self):
        url = self.req()
        # get_json returns None once there is nothing more to fetch
        while url:
            try:
                data = self.open_url(url)
                if data is not None:
                    url = self.get_json(data)
            except Exception as e:
                print('error', e)
                break


if __name__ == '__main__':
    fl4 = FL4()
    fl4.main()
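
One thing the log hints at: the request with ts=1556849760000 starts with a review stamped 1556849760000, a minute the previous request had already saved, so FL4.txt may end up with repeated lines. Here is a minimal sketch for dropping repeats afterwards. It assumes (my assumption, not something verified against Maoyan) that two lines with the same userId, time and content are the same review; `dedupe_lines` is a hypothetical helper.

```python
import json


def dedupe_lines(lines):
    """Parse JSON lines and keep the first copy of each (userId, time, content)."""
    seen = set()
    unique = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        key = (record.get('userId'), record.get('time'), record.get('content'))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


if __name__ == '__main__':
    # Made-up sample lines in the same shape as FL4.txt
    sample = [
        '{"userId": 1, "time": 1556849760000, "content": "good"}',
        '{"userId": 1, "time": 1556849760000, "content": "good"}',
        '{"userId": 2, "time": 1556849760000, "content": "good"}',
    ]
    print(len(dedupe_lines(sample)), 'unique reviews')
```

To run it on the real output, pass the open file object, for example `dedupe_lines(open('FL4.txt', encoding='utf-8'))`.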

Multiprocessing or multithreading would also work here, but that is a topic for another day.
