Python Crawler for Scraping Amazon France Product Listings

2024-02-07 00:10


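Friends of mine who run Amazon listing businesses spend a lot of time collecting product information, so I set out to write a script to help. What they scrape day to day is mainly product prices, product images, and product descriptions.

Product images are probably the hardest part to get. The full-size product images can be found in the page's JavaScript.

This article is mainly based on a post from the 二爷记 blog: https://blog.csdn.net/minge89/article/details/106417047/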

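1. Getting the product title

 

Taking the <title> tag directly would probably be simpler; here I take the title from the page content instead.

 

HTML title tag of the Amazon product page: <title>Echo Dot (3ème génération), Enceinte connectée avec Alexa, Tissu anthracite: Amazon.fr</title>

Getting the product title: req.xpath('//h1[@id="title"]/span[@id="productTitle"]/text()')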
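As a minimal sketch of the two options mentioned above, the snippet below tries the productTitle span first and falls back to the <title> tag. The sample HTML is a hypothetical, heavily trimmed stand-in for a real product page, and extract_title is an illustrative helper name, not part of the original script.

```python
from lxml import etree

# Hypothetical, trimmed stand-in for a product page (not real Amazon markup in full).
SAMPLE_HTML = """
<html><head><title>Echo Dot (3ème génération), Enceinte connectée avec Alexa, Tissu anthracite: Amazon.fr</title></head>
<body><h1 id="title"><span id="productTitle">
    Echo Dot (3ème génération), Enceinte connectée avec Alexa, Tissu anthracite
</span></h1></body></html>
"""


def extract_title(html):
    """Return the product title, preferring #productTitle over <title>."""
    tree = etree.HTML(html)
    parts = tree.xpath('//h1[@id="title"]/span[@id="productTitle"]/text()')
    if parts:
        return parts[0].strip()
    # Fallback: use <title> and drop the trailing ": Amazon.fr" style suffix.
    page_title = tree.xpath('string(//title)').strip()
    return page_title.rsplit(':', 1)[0].strip()


title = extract_title(SAMPLE_HTML)
print(title)
```

The fallback assumes the site name follows the last colon in <title>, which holds for the example above but may not hold for every page.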
             

2. Getting the product attributes

 

<ul class="a-unordered-list a-nostyle a-button-list a-vertical a-spacing-top-micro">

<li class="a-spacing-small videoCountTemplate aok-hidden"><span class="a-list-item">
<span id="videoCount_template" class="a-size-mini a-color-secondary video-count a-text-bold a-nowrap"> <hza:string id=""></hza:string></span>
</span></li>
<li class="a-spacing-small 360IngressTemplate pos-360 aok-hidden"><span class="a-list-item">
<span class="a-declarative" data-action="thumb-action" data-thumb-action="{}">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-3"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-3-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-3-announce">
<img alt="" src="https://images-na.ssl-images-amazon.com/images/G/08/HomeCustomProduct/360_icon_73x73v2._CB485971279_SS40_FMpng_RI_.png">
</span></span></span>
</span>
</span></li>

<li class="a-spacing-small template"><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-4"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-4-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-4-announce">
<span class="placeHolder"></span>
</span></span></span>
</span></li>
<li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle a-button-selected a-button-focus" id="a-autoid-5"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-5-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-5-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/51sWJTvgBfL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-6"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-6-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-6-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/41hX%2B2Es%2BvL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-7"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-7-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-7-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/51I5TLQy-JL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-8"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-8-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-8-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/51b2EY6IdsL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-9"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-9-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-9-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/41F9DlWvsrL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-10"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-10-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-10-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/51C-rk6qlOL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-11"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-11-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-11-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/41PZZf1xU6L._AC_US40_.jpg">
</span></span></span>
</span></li></ul>

 

First extract all the list items of the image carousel. Note that the class value changes with the product category:

req.xpath('//ul[@class="a-unordered-list a-nostyle a-button-list a-vertical a-spacing-top-micro"]/li')
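Because the full class string varies between categories, an exact @class match can silently return nothing. One possible workaround, sketched here on a hypothetical trimmed copy of the list above, is to match a single class token (imageThumbnail) instead. The helper name thumbnail_urls is an assumption for illustration.

```python
from lxml import etree

# Hypothetical, trimmed version of the thumbnail list shown above.
SAMPLE_HTML = """
<ul class="a-unordered-list a-nostyle a-button-list a-vertical a-spacing-top-micro">
  <li class="a-spacing-small template"><span class="placeHolder"></span></li>
  <li class="a-spacing-small item imageThumbnail a-declarative">
    <img src="https://images-na.ssl-images-amazon.com/images/I/51sWJTvgBfL._AC_US40_.jpg">
  </li>
  <li class="a-spacing-small item imageThumbnail a-declarative">
    <img src="https://images-na.ssl-images-amazon.com/images/I/51I5TLQy-JL._AC_US40_.jpg">
  </li>
</ul>
"""


def thumbnail_urls(html):
    """Collect thumbnail image URLs from <li> items carrying the imageThumbnail class."""
    tree = etree.HTML(html)
    # Match one whitespace-delimited class token instead of the whole class string.
    return tree.xpath(
        '//li[contains(concat(" ", normalize-space(@class), " "), " imageThumbnail ")]'
        '//img/@src'
    )


urls = thumbnail_urls(SAMPLE_HTML)
print(urls)
```

The template and placeholder items are skipped because they do not carry the imageThumbnail token.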
             

Getting the product color attribute

<ul class="a-unordered-list a-nostyle a-button-list a-declarative a-button-toggle-group a-horizontal a-spacing-top-micro swatches swatchesSquare imageSwatches" role="radiogroup" data-action="a-button-group" data-a-button-group="{&quot;name&quot;:&quot;twister_color_name&quot;}">

<li id="color_name_0" title="Cliquez pour sélectionner Tissu anthracite" data-defaultasin="B07PHPXHQS" data-dp-url="" class="swatchAvailable"><span class="a-list-item">
<div class="tooltip">
<span class="a-declarative" data-action="swatchthumb-action" data-swatchthumb-action="{&quot;dimIndex&quot;:1,&quot;dimValueIndex&quot;:0}">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-12" aria-checked="false"><span class="a-button-inner"><button class="a-button-text" type="button" id="a-autoid-12-announce">

<span class="xoverlay"></span>
<div class="">
<div class="">
<img src="https://m.media-amazon.com/images/I/61sD09wyFML._SS36_.jpg" alt="Tissu anthracite" style="height:36px; width:36px" class="imgSwatch">
</div>

<div class=" " style="">

</div>

</div>


</button></span></span>
</span>
</div>

</span></li>

<li id="color_name_1" title="Cliquez pour sélectionner Tissu prune" data-defaultasin="B07WLTKTXY" data-dp-url="/dp/B07WLTKTXY/ref=twister_B07H61CQCM?_encoding=UTF8&amp;psc=1" class="swatchSelect"><span class="a-list-item">
<div class="tooltip">
<span class="a-declarative" data-action="swatchthumb-action" data-swatchthumb-action="{&quot;dimIndex&quot;:1,&quot;dimValueIndex&quot;:1}">
<span class="a-button a-button-thumbnail a-button-toggle a-button-selected" id="a-autoid-13" aria-checked="true"><span class="a-button-inner"><button class="a-button-text" type="button" id="a-autoid-13-announce">

<span class="xoverlay"></span>
<div class="">
<div class="">
<img src="https://m.media-amazon.com/images/I/61mROAfn-NL._SS36_.jpg" alt="Tissu prune" style="height:36px; width:36px" class="imgSwatch">
</div>

<div class=" " style="">
</div>
</div>
</button></span></span>
</span>
</div>
</span></li>

<li id="color_name_2" title="Cliquez pour sélectionner Tissu sable" data-defaultasin="B07PDHSPXT" data-dp-url="/dp/B07PDHSPXT/ref=twister_B07H61CQCM?_encoding=UTF8&amp;psc=1" class="swatchAvailable"><span class="a-list-item">
<div class="tooltip">
<span class="a-declarative" data-action="swatchthumb-action" data-swatchthumb-action="{&quot;dimIndex&quot;:1,&quot;dimValueIndex&quot;:2}">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-14" aria-checked="false"><span class="a-button-inner"><button class="a-button-text" type="button" id="a-autoid-14-announce">

<span class="xoverlay"></span>
<div class="">
<div class="">
<img src="https://m.media-amazon.com/images/I/61FlVonHYyL._SS36_.jpg" alt="Tissu sable" style="height:36px; width:36px" class="imgSwatch">
</div>
<div class=" " style="">
</div>
</div>
</button></span></span>
</span>
</div>
</span></li>

</ul>

 

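The extracted values are then lightly formatted. The li ids are color_name_0, color_name_1 and so on, so the XPath matches on the id prefix, and the color names are read from the swatch images' alt attributes:

productColors=req.xpath('//li[starts-with(@id,"color_name_")]//img/@alt')
productColor=', '.join(productColors)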
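The swatch markup above also carries an ASIN per color in data-defaultasin. As a sketch under that assumption (the sample is a hypothetical, trimmed copy of the list above, and color_variants is an illustrative name), the color name and ASIN can be paired like this:

```python
from lxml import etree

# Hypothetical, trimmed version of the color swatch list shown above.
SAMPLE_HTML = """
<ul>
<li id="color_name_0" title="Cliquez pour sélectionner Tissu anthracite" data-defaultasin="B07PHPXHQS">
  <img alt="Tissu anthracite" class="imgSwatch"></li>
<li id="color_name_1" title="Cliquez pour sélectionner Tissu prune" data-defaultasin="B07WLTKTXY">
  <img alt="Tissu prune" class="imgSwatch"></li>
</ul>
"""


def color_variants(html):
    """Return one {"asin", "color"} dict per color swatch."""
    tree = etree.HTML(html)
    variants = []
    # The ids are color_name_0, color_name_1, ... so match on the prefix.
    for li in tree.xpath('//li[starts-with(@id, "color_name_")]'):
        names = li.xpath('.//img[@class="imgSwatch"]/@alt')
        variants.append({
            "asin": li.get("data-defaultasin"),
            "color": names[0] if names else "",
        })
    return variants


variants = color_variants(SAMPLE_HTML)
print(variants)
```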


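Getting the product images

Finding the image links took a lot of effort: they are written into JavaScript, so the only option was to extract them with regular expressions.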

imgs_text=re.findall(r'ImageBlockATF(.+?)return data;',html,re.S)[0]
imgs=re.findall(r'"large":"(.+?)","main":',imgs_text,re.S)
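To show what the two expressions do, here is a small sketch run against a made-up script fragment. The fragment's layout and the example.invalid URLs are assumptions chosen only so the patterns have something to match; the real page script is far longer and may be structured differently.

```python
import re

# Hypothetical fragment imitating the inline script that holds the image data.
SAMPLE_JS = '''
P.when('A').register("ImageBlockATF", function(A){
    var data = {
        'colorImages': { 'initial': [
            {"hiRes":null,"thumb":"https://example.invalid/a._US40_.jpg","large":"https://example.invalid/a.jpg","main":{}},
            {"hiRes":null,"thumb":"https://example.invalid/b._US40_.jpg","large":"https://example.invalid/b.jpg","main":{}}
        ]}
    };
    return data;
});
'''


def large_image_urls(html):
    """Return the "large" image URLs from the ImageBlockATF script block."""
    # Step 1: isolate the text between "ImageBlockATF" and "return data;".
    blocks = re.findall(r'ImageBlockATF(.+?)return data;', html, re.S)
    if not blocks:
        return []  # block missing, e.g. a CAPTCHA page or a changed layout
    # Step 2: pull every value that sits between "large":" and ","main":
    return re.findall(r'"large":"(.+?)","main":', blocks[0], re.S)


imgs = large_image_urls(SAMPLE_JS)
print(imgs)
```

Returning an empty list when the block is missing avoids the IndexError that indexing [0] directly would raise.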
             

The images include the carousel images and the large images shown on mouse hover

Images on the product detail page

 

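A single page has roughly 30,000+ lines of code. Digging out the data you need takes patient analysis, and the image data is probably the most troublesome part.

 

Source code attached, for reference, study, and discussion only: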

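# -*- coding: utf-8 -*-
# Amazon France product scraper
# 20200524 by WeChat: huguo00289
# https://www.amazon.fr/dp/B07CNJTCBB/ref=twister_B07RVPW2GT?_encoding=UTF8&th=1

import re
import time

import requests
from fake_useragent import UserAgent
from lxml import etree


def ua():
    """Build request headers with a random User-Agent."""
    return {"User-Agent": UserAgent().random}


def get_data(url):
    # The product id is the path segment right after /dp/.
    product_id = re.findall(r'dp/([^/?]+)', url)[0]
    print(f'>>> Product id from the link: {product_id}. Scraping, please wait..')
    response = requests.get(url, headers=ua(), timeout=8)
    time.sleep(2)
    if response.status_code != 200:
        print(f'>>> Request failed with status code {response.status_code}')
        return
    print('>>> Page fetched successfully!')
    html = response.content.decode('utf-8')
    # Keep a copy of the raw page for later inspection.
    with open(f'{product_id}.html', 'w', encoding='utf-8') as f:
        f.write(html)
    req = etree.HTML(html)

    # Product title
    h1 = req.xpath('//h1[@id="title"]/span[@id="productTitle"]/text()')
    print(h1)
    if not h1:
        print('>>> No product title found; the page may be a CAPTCHA page')
        return
    h1 = h1[0].strip()
    print(f'Product title: {h1}')

    # Product description
    productDescriptions = req.xpath('//div[@id="productDescription"]//text()')
    productDescription = ''.join(productDescriptions)
    print(f'Product description: {productDescription}')

    # Product images: the links live in an inline script, so use regular expressions.
    blocks = re.findall(r'ImageBlockATF(.+?)return data;', html, re.S)
    imgs = re.findall(r'"large":"(.+?)","main":', blocks[0], re.S) if blocks else []
    print(imgs)

    text = f'Product title: {h1}\nProduct description: {productDescription}\nProduct images: {imgs}'
    with open(f'{product_id}.txt', 'w', encoding='utf-8') as f:
        f.write(text)
    print(f'>>> Product data saved as {product_id}.txt')

    # Variant (color etc.) list. This is an exact class match, and the class
    # string varies by product category, so it may need adjusting.
    lis = req.xpath('//ul[@class="a-unordered-list a-nostyle a-button-list a-declarative a-button-toggle-group a-horizontal a-spacing-top-micro swatches swatchesSquare"]/li')
    if len(lis) > 1:
        print(f'>>> The product has variants: {len(lis)} in total!')
        spans = req.xpath('//div[@class="twisterTextDiv text"]/span[@class="a-size-base"]/text()')
        print(spans)


if __name__ == '__main__':
    print("Amazon scraper - by WeChat official account: 二爷记")
    print("Bug reports, WeChat: huguo00289")
    url = input("Enter the product URL to scrape, then press Enter: ").strip()

    try:
        get_data(url)
    except Exception as e:
        # An exception object is not a string, so compare against str(e).
        if "port=443" in str(e):
            print("Timed out fetching the page, retrying..")
            get_data(url)
        else:
            print(f"Scraping failed: {e}")
    print("Scraping finished!")
    print("The program will close in 8 s. Bug reports, WeChat: huguo00289")
    time.sleep(8)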
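The script retries once when the exception text mentions port=443. A slightly more explicit alternative, sketched below, retries on specific exception types a fixed number of times. fetch_with_retry and flaky_fetch are hypothetical names introduced here; the fake fetcher stands in for the network call so the sketch runs offline.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=0.0,
                     retry_on=(TimeoutError, ConnectionError)):
    """Call fetch(url), retrying up to `retries` times on the given exceptions."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except retry_on as exc:
            last_error = exc
            print(f"attempt {attempt} failed: {exc}")
            if attempt < retries:
                time.sleep(delay)  # wait before the next attempt
    raise last_error


# Stand-in for a real page fetch: fails twice, then succeeds.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise TimeoutError("simulated timeout")
    return "<html>ok</html>"


result = fetch_with_retry(flaky_fetch, "https://www.amazon.fr/dp/B07CNJTCBB/")
print(result, len(calls))
```

With requests, one would presumably pass retry_on=(requests.exceptions.Timeout, requests.exceptions.ConnectionError) and a non-zero delay, since those are the exception classes requests raises for timeouts and connection failures.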

 

             

 

 

 

Below is reference code for a US Amazon crawler (note that DOMAIN in the code is set to https://www.amazon.de)

 

# -*- coding: utf-8 -*-
"""
File Name:     amzone
Description :
Author :       meng_zhihao
mail :       312141830@qq.com
date:          2019/5/8
"""
# Amazon US (DOMAIN below is set to amazon.de)
import requests,urllib
import datetime
from urllib.parse import quote, unquote
from selenium_operate import ChromeOperate
import re
import time
from crawl_tool_for_py3 import crawlerTool as ct
import os,base64
import xlsxwriter
from PIL import Image
DOMAIN = 'https://www.amazon.de'

HEADERS = { 'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_1_1 like Mac OS X) AppleWebKit/602.2.14 (KHTML, like Gecko) Mobile/14B100 MicroMessenger/6.3.22 NetType/WIFI Language/zh_CN'
            }
se = requests.session()

def img_resize(infile,outfile):
    im = Image.open(infile)
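    # (x, y) = im.size  # read image size
    x_s = 120  # fixed output width in pixels
    y_s = 160  # fixed output height in pixels (aspect ratio is not preserved)
    # Image.ANTIALIAS was removed in Pillow 10; Image.LANCZOS is the same filter.
    out = im.resize((x_s, y_s), Image.LANCZOS)  # resize image with high-quality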
    out.save(outfile)


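def gen_xls(item_infos):
    os.makedirs('img', exist_ok=True)  # resized thumbnails are written to img/
    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    book = xlsxwriter.Workbook('amazon%s.xlsx'%timestamp)
    worksheet = book.add_worksheet('demo')
    worksheet.write_row(0,0, ['Keyword','Rank','Product image','Price','Category','Description','Product link'])
    worksheet.set_column('A:D', 15) # one column-width unit is roughly 8 px; one row-height unit is roughly 1.37 px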
    worksheet.set_column('C:C', 20)
    worksheet.set_column('B:B', 10)
    worksheet.set_column('F:F', 50)
    for i in range(len(item_infos)):
        col = i+1
        try:
            item_info = item_infos[i]
            row =   [item_info['keyword'],item_info['rank'],'',item_info['price'],item_info['cat'],item_info['descriptions'],item_info['item_url']]
            worksheet.write_row(col,0, row)
            worksheet.set_row(col, 120)
            if 'item_pic_base64' in item_info:
                item_pic_base64 = item_info["item_pic_base64"]
                try:
                    if 'https:' in item_pic_base64:
                        data = ct.get(item_pic_base64)
                    else:
                        data = base64.b64decode(item_pic_base64)
                    with open('test.png', 'wb') as f:
                        f.write(data)
                    img_resize('test.png', 'img/tmp%s.png'%i)
                    worksheet.insert_image( col,2, 'img/tmp%s.png'%i) # file names must be unique
                except Exception as e:
                    print(str(e))
        except Exception as e:
            print(str(e))
    print('Results written: %s' % len(item_infos))  # avoids a NameError on col when the list is empty
    book.close()


def extractor_page(page): # parse a product page
    item_info = {"descriptions":""}
    descriptions = ct.getXpath('//div[@id="productDescription"]/p/text()',page)
    if not descriptions:
        descriptions = ct.getXpath( '//div[@id="aplus"]/div//p//text()', page)
    descriptions= ''.join([description.strip() for description in descriptions])
    item_info["descriptions"] = descriptions
    item_pic_base64 = ct.getXpath1( '//div[@id="imgTagWrapperId"]/img/@src', page).split('base64,')[-1]
    item_info["item_pic_base64"] = item_pic_base64
    price = ct.getXpath1( '//span[@id="priceblock_ourprice"]/text()', page)
    item_info["price"] = price
    cats =  ct.getXpath( '//div[@id="wayfinding-breadcrumbs_container"]//a/text()', page)
    item_info["cat"] = '/'.join([cat.strip() for cat in cats])
    for k in item_info:
        print(k)
    return item_info

if __name__ == '__main__':
    #start_url = 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json?since_id=-1&count=15&category=105'
    csv_rows=[]
    cookie = {}
    item_infos = []
    cop = ChromeOperate(executable_path=r'chromedriver.exe')
    cop.open(DOMAIN)
    with open('keywords.txt','r') as keyword_file:
        for line in keyword_file:
            line = line.strip()
            if not line:
                continue
            urls = [DOMAIN+'/s?k=%s&ref=nb_sb_noss_2'%quote(line),
                    # 'https://www.amazon.com/s?k=%s&ref=nb_sb_noss_2&page=2 ' % quote(line)
                    ]
            rank = 0
            for url in urls:
                # HEADERS.update({"Referer":url,"User-Agent":random.choice(USER_AGENT_POOL)})
                cop.open(url)
                page = cop.open_source()
                item_urls = ct.getXpath('//div[@class="sg-row"]//div[@class="sg-col-inner"]//h2/a/@href',page)
                if not item_urls:
                    print(page)
                for item_url in item_urls:
                    rank += 1
                    try:
                        if not 'qid' in item_url:
                            continue
                        else:
                            item_url = DOMAIN+item_url
                            cop.open(item_url)
                            page = cop.driver.page_source
                            if 'Kindle Edition' in page:
                                continue
                            item_info = extractor_page(page)
                            if 'Type the characters you see' in page  :
                                print('IP blocked (CAPTCHA page)',url)
                                time.sleep(10)
                                # print page
                                break
                            item_info['keyword'] = line
                            item_info['rank'] = rank
                            item_info['item_url'] = item_url.split('?')[0]
                            item_infos.append(item_info)
                    except Exception as e:
                        print(str(e))
    gen_xls(item_infos)
    cop.quit()
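In extractor_page, the main image src is split on 'base64,' because the src may be an inline data: URI rather than a URL. A minimal stdlib sketch of that distinction follows; decode_image_src is a hypothetical helper name, and the sample bytes are fake, not a real image.

```python
import base64


def decode_image_src(src):
    """Return raw image bytes for a base64 data: URI, or None for a normal URL."""
    marker = "base64,"
    if src.startswith("data:") and marker in src:
        # Everything after the first "base64," is the encoded payload.
        return base64.b64decode(src.split(marker, 1)[1])
    return None  # caller should download the URL instead


# Build a fake data: URI so the sketch is self-contained.
payload = b"\x89PNG fake bytes"
sample_src = "data:image/png;base64," + base64.b64encode(payload).decode("ascii")

decoded = decode_image_src(sample_src)
remote = decode_image_src("https://images-na.ssl-images-amazon.com/images/I/51sWJTvgBfL._AC_US40_.jpg")
print(decoded == payload, remote)
```

Returning None for plain URLs makes the download branch explicit, instead of relying on a substring check such as 'https:' in the value.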
 

 

————————————————
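Copyright notice: This article is an original work by CSDN blogger 「二爷记」, licensed under CC 4.0 BY-SA. Reposts must include the original source link and this notice.
Original link: https://blog.csdn.net/minge89/article/details/106417047/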
