Accessing a Website Behind the Knownsec Jiasule (加速乐) CDN with a Headless Crawler


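Requesting the site's home page directly with requests.get returns HTTP status code 521, and the response body is JavaScript rather than HTML. This is the Jiasule anti-crawler mechanism: before serving a page, the CDN checks whether the client's cookies are valid. If they are not, it responds with status 521 and a piece of JavaScript, and also sets a cookie through Set-Cookie. When a browser executes the returned JavaScript, a second cookie is generated. The server returns the real page content only when both cookies are sent back together.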
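
The handshake described above can be sketched without a browser. The following is a minimal illustration, not the CDN's documented behavior: `is_challenge` and `build_cookie_header` are hypothetical helper names, and the `__jsluid` cookie name in the usage lines is an assumption (this article itself only names `__jsl_clearance`).

```python
CLEARANCE_COOKIE = "__jsl_clearance"  # cookie name used by the demo in this article


def is_challenge(status_code, body):
    """Return True when a response looks like the 521 JavaScript challenge."""
    return status_code == 521 and "<script" in body.lower()


def build_cookie_header(set_cookie_header, clearance_value):
    """Combine the server-set cookie with the JS-computed clearance cookie.

    Only the leading name=value pair of the Set-Cookie header is kept;
    attributes such as Path or Max-Age are not sent back to the server.
    """
    server_pair = set_cookie_header.split(";", 1)[0].strip()
    return f"{server_pair}; {CLEARANCE_COOKIE}={clearance_value}"


# Usage with made-up values (the cookie name __jsluid is an assumption):
header = build_cookie_header(
    "__jsluid=abc123; max-age=31536000; path=/; HttpOnly",
    "1554.0|0|xyz",
)
print(header)  # __jsluid=abc123; __jsl_clearance=1554.0|0|xyz
```

The resulting string is what would go into the Cookie request header of the follow-up request.
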
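The demo I tried is below. If the browser already holds the cookie, the page is requested with it; if not, the cookie is computed first and then the page is requested.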

import execjs
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# Before starting ChromeDriver, set Chrome's experimental option excludeSwitches to ['enable-automation'] to help counter WebDriver detection
chrome_options.add_experimental_option('excludeSwitches', ['enable-automation'])
chrome_options.add_argument('--headless')
# chrome_options.add_argument(pro1)
chrome_options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')  # disable the sandbox
chrome_options.add_argument('--disable-setuid-sandbox')
# chrome_options.add_argument('--single-process')  # run in a single process
# chrome_options.add_argument('--process-per-tab')  # one process per tab
# chrome_options.add_argument('--process-per-site')  # one process per site
# chrome_options.add_argument('--in-process-plugins')  # do not run plugins in a separate process
chrome_options.add_argument('--disable-popup-blocking')  # disable the pop-up blocker
chrome_options.add_argument('--disable-images')  # disable images
chrome_options.add_argument('--blink-settings=imagesEnabled=false')
chrome_options.add_argument('--incognito')  # start in incognito mode
chrome_options.add_argument('--lang=zh-CN')  # set the language to Simplified Chinese
chrome_options.add_argument('--hide-scrollbars')
chrome_options.add_argument('--disable-bundled-ppapi-flash')
chrome_options.add_argument('--mute-audio')
chrome_options.add_argument('lang=zh_CN.UTF-8')
# chrome_options.add_extension(r'C:\lhcis\lh_spider_service\website_check\hdmbdioamgdkppmocchpkjhbpfmpjiei-3.0.1-Crx4Chrome.com.crx')
# chrome_options.add_argument('--disable-extensions')
# chrome_options.add_argument('--disable-plugins')
# Raw string keeps the backslashes in the Windows path literal.
# executable_path matches the Selenium 3 era API; newer Selenium releases expect a Service object instead.
DRIVER = webdriver.Chrome(executable_path=r"C:\lhcis\lh_spider_service\website_check\chromedriver.exe", options=chrome_options)
DRIVER.get("http://www.")
# Look for the __jsl_clearance cookie among the cookies the browser already holds
cookie_list = DRIVER.get_cookies()
cookie_value_dict = None
for i in cookie_list:
    if i.get('name') == '__jsl_clearance':
        cookie_value_dict = i
if cookie_value_dict:
    # Cookie already present: send it along and reload the page
    DRIVER.add_cookie(cookie_value_dict)
    DRIVER.get("http://www.")
    print(DRIVER.page_source)
else:
    # No cookie yet: strip the HTML wrapper and turn the challenge script into a function
    js_str = DRIVER.page_source
    js_code1 = js_str.replace("<html><head>", "")
    js_code1 = js_code1.rstrip('\n')
    js_code1 = js_code1.replace('</script>', '')
    js_code1 = js_code1.replace('<script>', '')
    index = js_code1.rfind('}')
    js_code1 = js_code1[0:index + 1]
    js_code1 = 'function getCookie() {' + js_code1 + '}'
    js_code1 = js_code1.replace('eval', 'return')  # return the generated code instead of evaluating it
    js_code2 = execjs.compile(js_code1)
    code = js_code2.call('getCookie')
    # Keep only the document.cookie assignment and return its value
    code = 'var a' + code.split('document.cookie')[1].split("Path=/;'")[0] + "Path=/;';return a;"
    code = 'window = {}; \n' + code
    js_final = "function getClearance(){" + code + "};"
    ctx = execjs.compile(js_final)
    jsl_clearance = ctx.call('getClearance')
    # Split on the first '=' only, so a value that itself contains '=' is not truncated
    jsl_cle = jsl_clearance.split(';')[0].split('=', 1)[1]
    print(f'make cookie: {jsl_cle}')
    DRIVER.add_cookie({'name': '__jsl_clearance', 'value': jsl_cle})
    DRIVER.get("http://www.")
    print(DRIVER.page_source)
DRIVER.quit()
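
The demo pulls the cookie value out of the computed cookie string by position. A slightly more defensive variant, sketched below, looks the cookie up by name and returns None when it is missing. `parse_clearance` and `to_selenium_cookie` are hypothetical helper names, and the sample string is made up for illustration; the real string produced by the challenge script may be shaped differently.

```python
def parse_clearance(cookie_string, name="__jsl_clearance"):
    """Return the value of `name` from a 'k=v; attr=...' cookie string, or None."""
    for part in cookie_string.split(";"):
        # partition splits on the first '=' only, so '=' inside the value survives
        key, sep, value = part.strip().partition("=")
        if sep and key == name:
            return value
    return None


def to_selenium_cookie(value, name="__jsl_clearance"):
    """Build the dict shape that Selenium's add_cookie() accepts."""
    return {"name": name, "value": value, "path": "/"}


# Made-up sample in the same general shape as the string the demo splits
sample = "__jsl_clearance=1554293837.52|0|abc%2Bdef%3D; Max-Age=3600; Path=/;"
value = parse_clearance(sample)
if value is not None:
    cookie = to_selenium_cookie(value)
```

With this in place, the demo's `DRIVER.add_cookie(...)` call could receive `cookie` directly, and a missing cookie shows up as None rather than as an IndexError.
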

Reference: https://segmentfault.com/a/1190000018713681
