This article covers a pyspider crawler task: scraping the Babytree (宝宝树) site.
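1 Requirements and analysis
I recently had to scrape product information from the Babytree (宝宝树) site. I assumed it would be simple, but the anti-scraping measures turned out to be fairly strict. After two days of digging, I found that several parameters are encrypted in JS. Fetching data from the site requires a request parameter called constId, and constId is obtained through three network requests. The last request is the one that actually returns it, but it depends on the first two. What matters in all three is the "Param" field in the request headers, as shown in the figure below: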
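Following the Initiator column of these requests in the browser's network panel to the JS source shows that the encryption happens in those JS files.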
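2 Cracking the encryption
- My own skills being limited, I asked two colleagues on the team, Szpilman and 煎饼 (Jianbing), who worked out the encryption process by debugging the JS. Below is the JS encryption code (from the const-id.js file in the site's source):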
// Generate 31 random characters; a "1" is then prepended to form the lid.
function s() {
    for (var i = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ", c = 62, u = [], s = 31, d = 0; d < s; d++)
        u[d] = i["substr"](Math["floor"](Math["random"]() * c), 1);
    return u["join"]("")
}
s()
// Next, read "appKey": "7a0d42b97002353426c47d18f1cc0fbe" from the HTML document.
// Together with the lid, it forms the input of the first encryption.
// The lid is a 32-character string, hence the prepended "1".
// '{"lid": "1hNwgj22HZaf75p8rF97IicQBCRCx9Gz","appKey": "7a0d42b97002353426c47d18f1cc0fbe"}'
var i = '', c, o, u, s, d, f, l, p = 0;
var S = "S0DOZN9bBJyPV-qczRa3oYvhGlUMrdjW7m2CkE5_FuKiTQXnwe6pg8fs4HAtIL1x="
for (a = '{"lid": "1hNwgj22HZaf75p8rF97IicQBCRCx9Gz","appKey": "7a0d42b97002353426c47d18f1cc0fbe"}'; p < a['length']; )
    // charCodeAt() returns the UTF-16 code unit at the given position,
    // an integer between 0 and 65535 (NaN when the position is out of range).
    c = a["charCodeAt"](p++),
    o = a["charCodeAt"](p++),
    u = a["charCodeAt"](p++),
    s = c >> 2,
    d = (c & 3) << 4 | o >> 4,
    f = (o & 15) << 2 | u >> 6,
    l = u & parseInt('77', 8),
    isNaN(o) ? f = l = parseInt('100', 8) : isNaN(u) && (l = parseInt('100', 8)),
    i = i + S["charAt"](s) + S["charAt"](d) + S["charAt"](f) + S["charAt"](l)
// '{_v": "1.42.0.435","ua": "470b4b3af8a1eea1eafd570cb672de33","language": "en-US","cd": 24,"pr": 1,"hc": 4,"res": "1680;1050","ar": "1680;1026","to": -480,"ss": 1,"ls": 1,"ind": 1,"od": 1,"cc": "unknown","np": "Linux x86_64","dnt": "unknown","rp": "9597ec5d235f00b31ac537ef03b028cf","can": "f19bbe07be0ce9deb3b7c6d067f2ba53","web": "fac25db4cf995e91f7b62096e793f568","adb": false,"hll": false,"hlr": false,"hlo": false,"hlb": false,"ts": "0;false;false","jf": "745caf07297ffff67e829c8e9f977188","inet": "10.15.100.114","appKey": "7a0d42b97002353426c47d18f1cc0fbe","lid": "1JpFx0vb3baqOZep3haHLKpXREuhff7V"}'
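The loop above uses the same bit layout as ordinary Base64: every three characters become four 6-bit indices, and index 64 (octal 100) selects the trailing "=" as padding. Only the alphabet S is shuffled. As a minimal sketch, assuming the input is pure ASCII (as in the sample JSON), the same output can be produced with Python's standard `base64` module and a translation table. The name `encode_param` is hypothetical and does not appear in the site's code.

```python
import base64

# Custom alphabet copied from the JS source; index 64 ("=") is the padding symbol.
S = "S0DOZN9bBJyPV-qczRa3oYvhGlUMrdjW7m2CkE5_FuKiTQXnwe6pg8fs4HAtIL1x="

# Standard Base64 alphabet in index order, followed by its padding symbol.
STD = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="

# Map each standard symbol to the symbol at the same index in S.
TABLE = str.maketrans(STD, S)


def encode_param(text):
    """Base64-encode ASCII text, then swap in the shuffled alphabet (hypothetical helper)."""
    standard = base64.b64encode(text.encode("ascii")).decode("ascii")
    return standard.translate(TABLE)


if __name__ == "__main__":
    sample = '{"lid": "1hNwgj22HZaf75p8rF97IicQBCRCx9Gz","appKey": "7a0d42b97002353426c47d18f1cc0fbe"}'
    print(encode_param(sample))
```

Whether the site accepts the result still has to be checked against a Param value captured from a real browser session. This sketch only mirrors the arithmetic of the loop.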
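- Since the crawler is written in Python, the encryption has to be ported to Python. Below is my lightly modified and tested version (the test mainly checks whether the constId returned by the third request, sent with the resulting Param, can actually be used to fetch data):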
import random

def s():
    i = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    n = 31
    c = 62  # len(i)
    b = []
    for each in range(n):
        b.append(i[random.randint(0, c-1)])
    return '1' + ''.join(b)

lib = s()
print(lib)
# lib = "1hNwgj22HZaf75p8rF97IicQBCRCx9Gz"
print(len(lib))

a = '{"lid": "1hNwgj22HZaf75p8rF97IicQBCRCx9Gz","appKey": "7a0d42b97002353426c47d18f1cc0fbe"}'
print(len(a))  # 88
S = "S0DOZN9bBJyPV-qczRa3oYvhGlUMrdjW7m2CkE5_FuKiTQXnwe6pg8fs4HAtIL1x="
print(len(S))

def get_params(aa):
    S = "S0DOZN9bBJyPV-qczRa3oYvhGlUMrdjW7m2CkE5_FuKiTQXnwe6pg8fs4HAtIL1x="
    n = len(aa)
    param = ''
    for i in range(0, n, 3):
        c = ord(aa[i])
        if i+1 < n:
            o = ord(aa[i+
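The Python listing breaks off partway through `get_params`. The following is a minimal sketch of how the function could be completed by following the JS loop step by step. It is a reconstruction from the JS code, not the author's original Python, and it assumes the input contains only characters with code points below 256, so that every index stays within S.

```python
S = "S0DOZN9bBJyPV-qczRa3oYvhGlUMrdjW7m2CkE5_FuKiTQXnwe6pg8fs4HAtIL1x="


def get_params(aa):
    """Encode `aa` three characters at a time with alphabet S, mirroring the JS loop."""
    n = len(aa)
    param = ''
    for i in range(0, n, 3):
        c = ord(aa[i])
        # Past the end of the string JS yields NaN, which behaves as 0 in
        # bit operations; None marks that case here.
        o = ord(aa[i + 1]) if i + 1 < n else None
        u = ord(aa[i + 2]) if i + 2 < n else None
        s = c >> 2
        d = (c & 3) << 4 | (o or 0) >> 4
        f = ((o or 0) & 15) << 2 | (u or 0) >> 6
        l = (u or 0) & 0o77
        if o is None:
            f = l = 64  # S[64] is "=", the padding symbol
        elif u is None:
            l = 64
        param += S[s] + S[d] + S[f] + S[l]
    return param


if __name__ == "__main__":
    a = '{"lid": "1hNwgj22HZaf75p8rF97IicQBCRCx9Gz","appKey": "7a0d42b97002353426c47d18f1cc0fbe"}'
    print(get_params(a))
```

For the 88-character sample string, this produces 120 characters (30 groups of four) ending in two padding symbols, since 88 leaves a remainder of 1 when divided by 3.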