Scraping Amazon France Product Data with a Python Crawler

2024-02-07 00:10


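Friends of mine who run bulk-listing stores on Amazon spend a lot of time collecting product information by hand, so I set out to write a script to help. What they scrape day to day is mainly product prices, product images, and product descriptions.

Product images are probably the hardest part to obtain. The full-size product images can be pulled out of the JavaScript embedded in the page.

This post is mainly based on an article from the 二爷记 blog: https://blog.csdn.net/minge89/article/details/106417047/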

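1. Getting the product title

Taking the page's <title> directly would probably be simpler; here I take the title from the page body instead.

The HTML title tag of the Amazon product page: <title>Echo Dot (3ème génération), Enceinte connectée avec Alexa, Tissu anthracite: Amazon.fr</title>

Getting the product title: req.xpath('//h1[@id="title"]/span[@id="productTitle"]/text()')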
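As a minimal sketch of the two options mentioned above (the page <title> versus the productTitle span), the snippet below runs both XPath queries against a trimmed-down stand-in for a product page. SAMPLE_HTML and the first_text helper are illustrative names that do not come from the original script; real pages are far larger and may differ in structure.

```python
from lxml import etree

# Trimmed-down stand-in for a product page (hypothetical sample, not a real response).
SAMPLE_HTML = """
<html><head><title>Echo Dot (3ème génération), Enceinte connectée avec Alexa, Tissu anthracite: Amazon.fr</title></head>
<body><h1 id="title"><span id="productTitle">
    Echo Dot (3ème génération), Enceinte connectée avec Alexa, Tissu anthracite
</span></h1></body></html>
"""


def first_text(tree, xpath_expr):
    """Return the first stripped text match for xpath_expr, or None if nothing matches."""
    matches = tree.xpath(xpath_expr)
    return matches[0].strip() if matches else None


tree = etree.HTML(SAMPLE_HTML)
page_title = first_text(tree, '//title/text()')
product_title = first_text(tree, '//h1[@id="title"]/span[@id="productTitle"]/text()')
print(page_title)
print(product_title)
```

Returning None instead of indexing [0] directly keeps the script from crashing when a page (for example a CAPTCHA page) has no title element.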
             

2. Getting the product attributes

 

<ul class="a-unordered-list a-nostyle a-button-list a-vertical a-spacing-top-micro">

<li class="a-spacing-small videoCountTemplate aok-hidden"><span class="a-list-item">
<span id="videoCount_template" class="a-size-mini a-color-secondary video-count a-text-bold a-nowrap"> <hza:string id=""></hza:string></span>
</span></li>
<li class="a-spacing-small 360IngressTemplate pos-360 aok-hidden"><span class="a-list-item">
<span class="a-declarative" data-action="thumb-action" data-thumb-action="{}">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-3"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-3-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-3-announce">
<img alt="" src="https://images-na.ssl-images-amazon.com/images/G/08/HomeCustomProduct/360_icon_73x73v2._CB485971279_SS40_FMpng_RI_.png">
</span></span></span>
</span>
</span></li>

<li class="a-spacing-small template"><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-4"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-4-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-4-announce">
<span class="placeHolder"></span>
</span></span></span>
</span></li>
<li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle a-button-selected a-button-focus" id="a-autoid-5"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-5-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-5-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/51sWJTvgBfL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-6"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-6-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-6-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/41hX%2B2Es%2BvL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-7"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-7-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-7-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/51I5TLQy-JL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-8"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-8-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-8-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/51b2EY6IdsL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-9"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-9-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-9-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/41F9DlWvsrL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-10"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-10-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-10-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/51C-rk6qlOL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-11"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-11-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-11-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/41PZZf1xU6L._AC_US40_.jpg">
</span></span></span>
</span></li></ul>

 

First extract all the list items of the image carousel. The class value varies with the product category:

req.xpath('//ul[@class="a-unordered-list a-nostyle a-button-list a-vertical a-spacing-top-micro"]/li')
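Because the full class string varies between product categories, an exact @class match is fragile. The sketch below is one possible alternative, not the original author's method: it matches list items on the single class token imageThumbnail (taken from the markup above) and collects the thumbnail URLs. SAMPLE_HTML is a shortened copy of that markup, and thumbnail_urls is a hypothetical helper name.

```python
from lxml import etree

# Shortened copy of the carousel markup shown above.
SAMPLE_HTML = """
<ul class="a-unordered-list a-nostyle a-button-list a-vertical a-spacing-top-micro">
  <li class="a-spacing-small template"><span class="placeHolder"></span></li>
  <li class="a-spacing-small item imageThumbnail a-declarative">
    <img src="https://images-na.ssl-images-amazon.com/images/I/51sWJTvgBfL._AC_US40_.jpg">
  </li>
  <li class="a-spacing-small item imageThumbnail a-declarative">
    <img src="https://images-na.ssl-images-amazon.com/images/I/51I5TLQy-JL._AC_US40_.jpg">
  </li>
</ul>
"""


def thumbnail_urls(tree):
    """Collect img src values from list items carrying the imageThumbnail class token."""
    # Pad the class attribute with spaces so only whole tokens match.
    expr = ('//li[contains(concat(" ", normalize-space(@class), " "), " imageThumbnail ")]'
            '//img/@src')
    return [str(url) for url in tree.xpath(expr)]


tree = etree.HTML(SAMPLE_HTML)
urls = thumbnail_urls(tree)
print(urls)
```

Template and placeholder items are skipped because they do not carry the imageThumbnail token.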
             

Getting the product color attributes

<ul class="a-unordered-list a-nostyle a-button-list a-declarative a-button-toggle-group a-horizontal a-spacing-top-micro swatches swatchesSquare imageSwatches" role="radiogroup" data-action="a-button-group" data-a-button-group="{&quot;name&quot;:&quot;twister_color_name&quot;}">

<li id="color_name_0" title="Cliquez pour sélectionner Tissu anthracite" data-defaultasin="B07PHPXHQS" data-dp-url="" class="swatchAvailable"><span class="a-list-item">
<div class="tooltip">
<span class="a-declarative" data-action="swatchthumb-action" data-swatchthumb-action="{&quot;dimIndex&quot;:1,&quot;dimValueIndex&quot;:0}">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-12" aria-checked="false"><span class="a-button-inner"><button class="a-button-text" type="button" id="a-autoid-12-announce">

<span class="xoverlay"></span>
<div class="">
<div class="">
<img src="https://m.media-amazon.com/images/I/61sD09wyFML._SS36_.jpg" alt="Tissu anthracite" style="height:36px; width:36px" class="imgSwatch">
</div>

<div class=" " style="">

</div>

</div>


</button></span></span>
</span>
</div>

</span></li>

<li id="color_name_1" title="Cliquez pour sélectionner Tissu prune" data-defaultasin="B07WLTKTXY" data-dp-url="/dp/B07WLTKTXY/ref=twister_B07H61CQCM?_encoding=UTF8&amp;psc=1" class="swatchSelect"><span class="a-list-item">
<div class="tooltip">
<span class="a-declarative" data-action="swatchthumb-action" data-swatchthumb-action="{&quot;dimIndex&quot;:1,&quot;dimValueIndex&quot;:1}">
<span class="a-button a-button-thumbnail a-button-toggle a-button-selected" id="a-autoid-13" aria-checked="true"><span class="a-button-inner"><button class="a-button-text" type="button" id="a-autoid-13-announce">

<span class="xoverlay"></span>
<div class="">
<div class="">
<img src="https://m.media-amazon.com/images/I/61mROAfn-NL._SS36_.jpg" alt="Tissu prune" style="height:36px; width:36px" class="imgSwatch">
</div>

<div class=" " style="">
</div>
</div>
</button></span></span>
</span>
</div>
</span></li>

<li id="color_name_2" title="Cliquez pour sélectionner Tissu sable" data-defaultasin="B07PDHSPXT" data-dp-url="/dp/B07PDHSPXT/ref=twister_B07H61CQCM?_encoding=UTF8&amp;psc=1" class="swatchAvailable"><span class="a-list-item">
<div class="tooltip">
<span class="a-declarative" data-action="swatchthumb-action" data-swatchthumb-action="{&quot;dimIndex&quot;:1,&quot;dimValueIndex&quot;:2}">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-14" aria-checked="false"><span class="a-button-inner"><button class="a-button-text" type="button" id="a-autoid-14-announce">

<span class="xoverlay"></span>
<div class="">
<div class="">
<img src="https://m.media-amazon.com/images/I/61FlVonHYyL._SS36_.jpg" alt="Tissu sable" style="height:36px; width:36px" class="imgSwatch">
</div>
<div class=" " style="">
</div>
</div>
</button></span></span>
</span>
</div>
</span></li>

</ul>

 

The result gets some simple formatting:

productColors=req.xpath('//li[starts-with(@id,"color_name_")]//text()')
productColor=''.join(productColors)
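In the swatch markup shown above, the color names appear in the img alt attributes and the variant IDs in data-defaultasin, rather than in text nodes. The following sketch reads both, under the assumption that other products use the same markup. SAMPLE_HTML is a shortened copy of the markup above, and color_variants is an illustrative helper name.

```python
from lxml import etree

# Shortened copy of the color swatch markup shown above.
SAMPLE_HTML = """
<ul role="radiogroup">
  <li id="color_name_0" title="Cliquez pour sélectionner Tissu anthracite" data-defaultasin="B07PHPXHQS">
    <img src="https://m.media-amazon.com/images/I/61sD09wyFML._SS36_.jpg" alt="Tissu anthracite" class="imgSwatch">
  </li>
  <li id="color_name_1" title="Cliquez pour sélectionner Tissu prune" data-defaultasin="B07WLTKTXY">
    <img src="https://m.media-amazon.com/images/I/61mROAfn-NL._SS36_.jpg" alt="Tissu prune" class="imgSwatch">
  </li>
</ul>
"""


def color_variants(tree):
    """Return one dict per color swatch with its color name and variant ASIN."""
    variants = []
    # The ids are color_name_0, color_name_1, ..., so match on the prefix.
    for li in tree.xpath('//li[starts-with(@id, "color_name_")]'):
        names = li.xpath('.//img[@class="imgSwatch"]/@alt')
        variants.append({
            "color": names[0].strip() if names else None,
            "asin": li.get("data-defaultasin"),
        })
    return variants


tree = etree.HTML(SAMPLE_HTML)
variants = color_variants(tree)
print(variants)
```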


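Getting the product images

Finding the image links took a lot of effort: they are written into the JavaScript, so the only option is to pull them out with regular expressions.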

imgs_text=re.findall(r'ImageBlockATF(.+?)return data;',html,re.S)[0]
imgs=re.findall(r'"large":"(.+?)","main":',imgs_text,re.S)
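To show how the two regular expressions cooperate, here is a small sketch run against a simplified, made-up script fragment. SAMPLE_SCRIPT only imitates the rough shape of the ImageBlockATF block (the example.com URLs are placeholders), and extract_large_images is a hypothetical wrapper that also guards against pages where the block is missing.

```python
import re

# Simplified, made-up fragment imitating the shape of the embedded script.
SAMPLE_SCRIPT = '''
P.when('A').register("ImageBlockATF", function(A){
    var data = {
        'colorImages': { 'initial': [
            {"hiRes":null,"thumb":"https://example.com/I/aaa._AC_US40_.jpg","large":"https://example.com/I/aaa.jpg","main":{"https://example.com/I/aaa._AC_SX355_.jpg":[355,355]}},
            {"hiRes":null,"thumb":"https://example.com/I/bbb._AC_US40_.jpg","large":"https://example.com/I/bbb.jpg","main":{"https://example.com/I/bbb._AC_SX355_.jpg":[355,355]}}
        ]},
    };
    return data;
});
'''


def extract_large_images(html):
    """Return the "large" image URLs found inside the ImageBlockATF script block."""
    # Step 1: isolate the script block so the second regex cannot match elsewhere.
    blocks = re.findall(r'ImageBlockATF(.+?)return data;', html, re.S)
    if not blocks:
        return []  # block missing (layout change, CAPTCHA page, ...)
    # Step 2: capture each value sitting between "large":" and ","main":
    return re.findall(r'"large":"(.+?)","main":', blocks[0], re.S)


images = extract_large_images(SAMPLE_SCRIPT)
print(images)
```

Indexing the first result with [0] directly, as in the two lines above, raises IndexError when the block is absent; returning an empty list makes that case easier to handle.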
             

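There are two kinds of images: the carousel images and the large images shown on mouse hover.

Images on the product detail page

A single page has more than 30,000 lines of code, so digging out the data you need takes patient analysis. The image data is probably the most troublesome part.

Source code attached, for reference, study, and discussion only: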

# -*- coding: utf-8 -*-
# Amazon France product scraper
# 20200524 by WeChat: huguo00289
# https://www.amazon.fr/dp/B07CNJTCBB/ref=twister_B07RVPW2GT?_encoding=UTF8&th=1

import re
import time

import requests
from fake_useragent import UserAgent
from lxml import etree


def ua():
    # Build request headers with a random User-Agent
    ua = UserAgent()
    headers = {"User-Agent": ua.random}
    return headers


def get_data(url):
    # The product ID is the path segment that follows "dp/"
    product_id = re.findall(r'dp/(.+?)/', url, re.S)[0]
    print(f'>>> Product ID from the link: {product_id}, scraping, please wait..')
    response = requests.get(url, headers=ua(), timeout=8)
    time.sleep(2)
    if response.status_code == 200:
        print(">>> Page fetched successfully!")
        html = response.content.decode('utf-8')
        # Keep a copy of the raw page for later analysis
        with open(f'{product_id}.html', 'w', encoding='utf-8') as f:
            f.write(html)
        req = etree.HTML(html)
        h1 = req.xpath('//h1[@id="title"]/span[@id="productTitle"]/text()')
        print(h1)
        h1 = h1[0].strip()
        print(f'Product title: {h1}')
        productDescriptions = req.xpath('//div[@id="productDescription"]//text()')
        productDescription = ''.join(productDescriptions)
        print(f'Product description: {productDescription}')
        # Image links live in the ImageBlockATF script block
        imgs_text = re.findall(r'ImageBlockATF(.+?)return data;', html, re.S)[0]
        imgs = re.findall(r'"large":"(.+?)","main":', imgs_text, re.S)
        print(imgs)
        text = f'Product title: {h1}\nProduct description: {productDescription}\nProduct images: {imgs}'
        with open(f'{product_id}.txt', 'w', encoding='utf-8') as f:
            f.write(text)
        print(f">>> Product data saved as {product_id}.txt")
        lis = req.xpath('//ul[@class="a-unordered-list a-nostyle a-button-list a-declarative a-button-toggle-group a-horizontal a-spacing-top-micro swatches swatchesSquare"]/li')
        if len(lis) > 1:
            print(f">>> The product has variant attributes, {len(lis)} variants in total!")
            spans = req.xpath('//div[@class="twisterTextDiv text"]/span[@class="a-size-base"]/text()')
            print(spans)


if __name__ == '__main__':
    print("Amazon scraping tool - by WeChat official account: 二爷记")
    print("Bug reports: WeChat huguo00289")
    url = input("Enter the URL to scrape, then press Enter to run: ")

    try:
        get_data(url)
    except Exception as e:
        # An exception is not a string, so convert it before searching
        if "port=443" in str(e):
            print("Timed out fetching the page, retrying..")
            get_data(url)
        else:
            print(f"Scraping failed: {e}")
    print("Scraping finished!")
    print("The program will close automatically in 8s. Bug reports: WeChat huguo00289")
    time.sleep(8)
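The script above retries once when the connection times out, and that second attempt is not protected by its own try block. A bounded retry loop is one possible alternative. The sketch below is not part of the original source: call_with_retry and its parameters are hypothetical names, and the flaky function only simulates a request that fails twice before succeeding.

```python
import time


def call_with_retry(func, retries=3, delay=0.0, retry_on=(Exception,)):
    """Call func() up to `retries` times; re-raise the last error if every attempt fails."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return func()
        except retry_on as e:
            last_error = e
            print(f"attempt {attempt} failed: {e}")
            if attempt < retries:
                time.sleep(delay)  # pause before the next attempt
    raise last_error


# Simulated request that times out twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = call_with_retry(flaky, retries=3)
print(result)
```

With the scraper above, the call could look like call_with_retry(lambda: get_data(url), retries=3, delay=2, retry_on=(requests.exceptions.RequestException,)), since the timeout and connection errors raised by requests derive from that class.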

 

             

 

 

 

Below is reference code for a US Amazon crawler

 

# -*- coding: utf-8 -*-
"""
File Name:     amzone
Description :
Author :       meng_zhihao
mail :       312141830@qq.com
date:          2019/5/8
"""
# Amazon US
import requests,urllib
import datetime
from urllib.parse import quote, unquote
from selenium_operate import ChromeOperate
import re
import time
from crawl_tool_for_py3 import crawlerTool as ct
import os,base64
import xlsxwriter
from PIL import Image
DOMAIN = 'https://www.amazon.de'  # note: this points at the German site; for the US site use 'https://www.amazon.com'

HEADERS = { 'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_1_1 like Mac OS X) AppleWebKit/602.2.14 (KHTML, like Gecko) Mobile/14B100 MicroMessenger/6.3.22 NetType/WIFI Language/zh_CN'
            }
se = requests.session()

def img_resize(infile,outfile):
    im = Image.open(infile)
    # (x, y) = im.size  # read image size
    x_s = 120  # fixed target width
    y_s = 160  # fixed target height (the aspect ratio is not preserved)
    out = im.resize((x_s, y_s), Image.LANCZOS)  # high-quality resampling; Image.ANTIALIAS was removed in Pillow 10
    out.save(outfile)


def gen_xls(item_infos):
    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    book = xlsxwriter.Workbook('amazon%s.xlsx'%timestamp)
    worksheet = book.add_worksheet('demo')
    worksheet.write_row(0,0, ['Keyword','Rank','Image','Price','Category','Description','URL'])
    worksheet.set_column('A:D', 15) # one column-width unit is about 8 pixels; one row-height unit is about 1.37 pixels
    worksheet.set_column('C:C', 20)
    worksheet.set_column('B:B', 10)
    worksheet.set_column('F:F', 50)
    os.makedirs('img', exist_ok=True)  # resized thumbnails are written to this folder
    for i in range(len(item_infos)):
        row_idx = i+1  # row 0 holds the header
        try:
            item_info = item_infos[i]
            row =   [item_info['keyword'],item_info['rank'],'',item_info['price'],item_info['cat'],item_info['descriptions'],item_info['item_url']]
            worksheet.write_row(row_idx,0, row)
            worksheet.set_row(row_idx, 120)
            if 'item_pic_base64' in item_info:
                item_pic_base64 = item_info["item_pic_base64"]
                try:
                    # The src is either a regular URL or inline base64 data
                    if 'https:' in item_pic_base64:
                        data = ct.get(item_pic_base64)
                    else:
                        data = base64.b64decode(item_pic_base64)
                    with open('test.png', 'wb') as f:
                        f.write(data)
                    img_resize('test.png', 'img/tmp%s.png'%i)
                    worksheet.insert_image( row_idx,2, 'img/tmp%s.png'%i) # each image needs a distinct file name
                except Exception as e:
                    print(str(e))
        except Exception as e:
            print(str(e))
    print('Number of results written: %s'%len(item_infos))
    book.close()


def extractor_page(page): # parse a product page
    item_info = {"descriptions":""}
    descriptions = ct.getXpath('//div[@id="productDescription"]/p/text()',page)
    if not descriptions:
        descriptions = ct.getXpath( '//div[@id="aplus"]/div//p//text()', page)
    descriptions= ''.join([description.strip() for description in descriptions])
    item_info["descriptions"] = descriptions
    item_pic_base64 = ct.getXpath1( '//div[@id="imgTagWrapperId"]/img/@src', page).split('base64,')[-1]
    item_info["item_pic_base64"] = item_pic_base64
    price = ct.getXpath1( '//span[@id="priceblock_ourprice"]/text()', page)
    item_info["price"] = price
    cats =  ct.getXpath( '//div[@id="wayfinding-breadcrumbs_container"]//a/text()', page)
    item_info["cat"] = '/'.join([cat.strip() for cat in cats])
    for k in item_info:
        print(k)
    return item_info

if __name__ == '__main__':
    #start_url = 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json?since_id=-1&count=15&category=105'
    csv_rows=[]
    cookie = {}
    item_infos = []
    cop = ChromeOperate(executable_path=r'chromedriver.exe')
    cop.open(DOMAIN)
    with open('keywords.txt','r') as keyword_file:
        for line in keyword_file:
            line = line.strip()
            if not line:
                continue
            urls = [DOMAIN+'/s?k=%s&ref=nb_sb_noss_2'%quote(line),
                    # 'https://www.amazon.com/s?k=%s&ref=nb_sb_noss_2&page=2 ' % quote(line)
                    ]
            rank = 0
            for url in urls:
                # HEADERS.update({"Referer":url,"User-Agent":random.choice(USER_AGENT_POOL)})
                cop.open(url)
                page = cop.open_source()
                item_urls = ct.getXpath('//div[@class="sg-row"]//div[@class="sg-col-inner"]//h2/a/@href',page)
                if not item_urls:
                    print(page)
                for item_url in item_urls:
                    rank += 1
                    try:
                        if 'qid' not in item_url:
                            continue
                        else:
                            item_url = DOMAIN+item_url
                            cop.open(item_url)
                            page = cop.driver.page_source
                            # Check for the CAPTCHA page before parsing: it has no product data
                            if 'Type the characters you see' in page  :
                                print('IP blocked',url)
                                time.sleep(10)
                                # print page
                                break
                            if 'Kindle Edition' in page:
                                continue
                            item_info = extractor_page(page)
                            item_info['keyword'] = line
                            item_info['rank'] = rank
                            item_info['item_url'] = item_url.split('?')[0]
                            item_infos.append(item_info)
                    except Exception as e:
                        print(str(e))
    gen_xls(item_infos)
    cop.quit()
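The listing above treats the main image src as either a regular URL or inline base64 data. As a small sketch of that branch in isolation, the helper below decodes a base64 data URI and returns None for anything else, leaving the download to the caller. decode_inline_image is an illustrative name, not part of the original code, and the sample payload is just the bytes b"hello".

```python
import base64


def decode_inline_image(src):
    """Return the decoded bytes if src is a base64 data URI, otherwise None."""
    if src.startswith("data:") and "base64," in src:
        # Everything after the first "base64," marker is the encoded payload.
        return base64.b64decode(src.split("base64,", 1)[1])
    return None  # a regular URL: the caller should download it instead


payload = base64.b64encode(b"hello").decode("ascii")
inline = decode_inline_image("data:image/png;base64," + payload)
remote = decode_inline_image("https://m.media-amazon.com/images/I/61sD09wyFML._SS36_.jpg")
print(inline, remote)
```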
 

 

————————————————
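Copyright notice: this is an original article by CSDN blogger 「二爷记」, licensed under CC 4.0 BY-SA. Reposts must include the original source link and this notice.
Original link: https://blog.csdn.net/minge89/article/details/106417047/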




http://www.chinasem.cn/article/685966
