Scraping Amazon France Product Data with a Python Crawler

2024-02-07 00:10


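Watching friends who run bulk-listing Amazon stores spend a lot of time collecting product information, I set out to write a script to help. Day to day they mainly scrape product prices, product images, and product descriptions.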

Product images are probably the hardest part to get. The full-size product images can be found in the page's JavaScript.

This article is mainly based on a post from the 二爷记 blog: https://blog.csdn.net/minge89/article/details/106417047/

1. Getting the product title

 

Taking the page <title> directly would probably be simpler; here I take the title from the page content instead.

 

HTML title tag of the Amazon product page: <title>Echo Dot (3ème génération), Enceinte connectée avec Alexa, Tissu anthracite: Amazon.fr</title>

Getting the product title: req.xpath('//h1[@id="title"]/span[@id="productTitle"]/text()')
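As noted above, reading the page <title> is the simpler route. A minimal sketch of that fallback, using only the standard library. The helper name title_from_title_tag and the assumption that the page title ends with a ": Amazon.fr" suffix (as in the sample above) are illustrative and not taken from the original script:

```python
import html
import re


def title_from_title_tag(page_html, suffix=": Amazon.fr"):
    """Return the text of <title> with the store suffix removed, or None."""
    match = re.search(r"<title[^>]*>(.*?)</title>", page_html, re.S | re.I)
    if not match:
        return None
    # Decode entities such as &amp; and trim surrounding whitespace
    title = html.unescape(match.group(1)).strip()
    if title.endswith(suffix):
        title = title[: -len(suffix)].strip()
    return title


sample = ("<html><head><title>Echo Dot (3ème génération), Enceinte connectée "
          "avec Alexa, Tissu anthracite: Amazon.fr</title></head></html>")
print(title_from_title_tag(sample))
```

This could serve as a fallback when the productTitle XPath returns an empty list, for example on a page with a different layout.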
             

2. Getting product attributes

 

<ul class="a-unordered-list a-nostyle a-button-list a-vertical a-spacing-top-micro">

<li class="a-spacing-small videoCountTemplate aok-hidden"><span class="a-list-item">
<span id="videoCount_template" class="a-size-mini a-color-secondary video-count a-text-bold a-nowrap"> <hza:string id=""></hza:string></span>
</span></li>
<li class="a-spacing-small 360IngressTemplate pos-360 aok-hidden"><span class="a-list-item">
<span class="a-declarative" data-action="thumb-action" data-thumb-action="{}">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-3"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-3-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-3-announce">
<img alt="" src="https://images-na.ssl-images-amazon.com/images/G/08/HomeCustomProduct/360_icon_73x73v2._CB485971279_SS40_FMpng_RI_.png">
</span></span></span>
</span>
</span></li>

<li class="a-spacing-small template"><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-4"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-4-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-4-announce">
<span class="placeHolder"></span>
</span></span></span>
</span></li>
<li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle a-button-selected a-button-focus" id="a-autoid-5"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-5-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-5-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/51sWJTvgBfL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-6"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-6-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-6-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/41hX%2B2Es%2BvL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-7"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-7-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-7-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/51I5TLQy-JL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-8"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-8-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-8-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/51b2EY6IdsL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-9"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-9-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-9-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/41F9DlWvsrL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-10"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-10-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-10-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/51C-rk6qlOL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-11"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-11-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-11-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/41PZZf1xU6L._AC_US40_.jpg">
</span></span></span>
</span></li></ul>

 

First extract all of the carousel image list items; the class value varies with the product category:

req.xpath('//ul[@class="a-unordered-list a-nostyle a-button-list a-vertical a-spacing-top-micro"]/li')
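Because the full class string changes between product categories, matching on a single class token is likely more robust than matching the whole attribute. A minimal sketch, using the standard library's html.parser in place of lxml so that it runs without extra dependencies. The class name ThumbnailCollector is hypothetical, and the assumption that every carousel item carries the imageThumbnail token comes only from the sample markup above:

```python
from html.parser import HTMLParser


class ThumbnailCollector(HTMLParser):
    """Collect <img src> values found inside <li class="... imageThumbnail ...">."""

    def __init__(self):
        super().__init__()
        self.in_thumbnail = False
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            # Compare against individual class tokens, not the whole string
            classes = (attrs.get("class") or "").split()
            self.in_thumbnail = "imageThumbnail" in classes
        elif tag == "img" and self.in_thumbnail and attrs.get("src"):
            self.srcs.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_thumbnail = False


snippet = """
<ul class="a-unordered-list a-nostyle a-button-list a-vertical">
<li class="a-spacing-small 360IngressTemplate pos-360 aok-hidden">
<img alt="" src="https://images-na.ssl-images-amazon.com/images/G/08/HomeCustomProduct/360_icon_73x73v2._CB485971279_SS40_FMpng_RI_.png"></li>
<li class="a-spacing-small item imageThumbnail a-declarative">
<img src="https://images-na.ssl-images-amazon.com/images/I/51sWJTvgBfL._AC_US40_.jpg"></li>
<li class="a-spacing-small item imageThumbnail a-declarative">
<img src="https://images-na.ssl-images-amazon.com/images/I/41hX%2B2Es%2BvL._AC_US40_.jpg"></li>
</ul>
"""

collector = ThumbnailCollector()
collector.feed(snippet)
print(collector.srcs)
```

With lxml, a comparable XPath would be //li[contains(@class, "imageThumbnail")]//img/@src, which skips the hidden template items such as the 360 icon.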
             

Getting the product color attribute

<ul class="a-unordered-list a-nostyle a-button-list a-declarative a-button-toggle-group a-horizontal a-spacing-top-micro swatches swatchesSquare imageSwatches" role="radiogroup" data-action="a-button-group" data-a-button-group="{&quot;name&quot;:&quot;twister_color_name&quot;}">

<li id="color_name_0" title="Cliquez pour sélectionner Tissu anthracite" data-defaultasin="B07PHPXHQS" data-dp-url="" class="swatchAvailable"><span class="a-list-item">
<div class="tooltip">
<span class="a-declarative" data-action="swatchthumb-action" data-swatchthumb-action="{&quot;dimIndex&quot;:1,&quot;dimValueIndex&quot;:0}">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-12" aria-checked="false"><span class="a-button-inner"><button class="a-button-text" type="button" id="a-autoid-12-announce">

<span class="xoverlay"></span>
<div class="">
<div class="">
<img src="https://m.media-amazon.com/images/I/61sD09wyFML._SS36_.jpg" alt="Tissu anthracite" style="height:36px; width:36px" class="imgSwatch">
</div>

<div class=" " style="">

</div>

</div>


</button></span></span>
</span>
</div>

</span></li>

<li id="color_name_1" title="Cliquez pour sélectionner Tissu prune" data-defaultasin="B07WLTKTXY" data-dp-url="/dp/B07WLTKTXY/ref=twister_B07H61CQCM?_encoding=UTF8&amp;psc=1" class="swatchSelect"><span class="a-list-item">
<div class="tooltip">
<span class="a-declarative" data-action="swatchthumb-action" data-swatchthumb-action="{&quot;dimIndex&quot;:1,&quot;dimValueIndex&quot;:1}">
<span class="a-button a-button-thumbnail a-button-toggle a-button-selected" id="a-autoid-13" aria-checked="true"><span class="a-button-inner"><button class="a-button-text" type="button" id="a-autoid-13-announce">

<span class="xoverlay"></span>
<div class="">
<div class="">
<img src="https://m.media-amazon.com/images/I/61mROAfn-NL._SS36_.jpg" alt="Tissu prune" style="height:36px; width:36px" class="imgSwatch">
</div>

<div class=" " style="">
</div>
</div>
</button></span></span>
</span>
</div>
</span></li>

<li id="color_name_2" title="Cliquez pour sélectionner Tissu sable" data-defaultasin="B07PDHSPXT" data-dp-url="/dp/B07PDHSPXT/ref=twister_B07H61CQCM?_encoding=UTF8&amp;psc=1" class="swatchAvailable"><span class="a-list-item">
<div class="tooltip">
<span class="a-declarative" data-action="swatchthumb-action" data-swatchthumb-action="{&quot;dimIndex&quot;:1,&quot;dimValueIndex&quot;:2}">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-14" aria-checked="false"><span class="a-button-inner"><button class="a-button-text" type="button" id="a-autoid-14-announce">

<span class="xoverlay"></span>
<div class="">
<div class="">
<img src="https://m.media-amazon.com/images/I/61FlVonHYyL._SS36_.jpg" alt="Tissu sable" style="height:36px; width:36px" class="imgSwatch">
</div>
<div class=" " style="">
</div>
</div>
</button></span></span>
</span>
</div>
</span></li>

</ul>

 

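With some simple formatting applied:

# Match every <li> whose id starts with "color_name_" and read the color name from the swatch image's alt text
productColors=req.xpath('//li[starts-with(@id,"color_name_")]//img/@alt')
productColor=','.join(productColors)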
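In the sample markup above, each swatch <li> carries the variant's ASIN in data-defaultasin and the color name in the alt attribute of its <img>. A minimal sketch that pairs the two, again with the standard-library html.parser rather than lxml. SwatchCollector is a hypothetical name and the attribute layout is assumed to match the sample:

```python
from html.parser import HTMLParser


class SwatchCollector(HTMLParser):
    """Collect (color name, ASIN) pairs from <li id="color_name_N"> swatches."""

    def __init__(self):
        super().__init__()
        self.current_asin = None  # ASIN of the swatch <li> being parsed
        self.colors = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            if (attrs.get("id") or "").startswith("color_name_"):
                self.current_asin = attrs.get("data-defaultasin")
            else:
                self.current_asin = None
        elif tag == "img" and self.current_asin and attrs.get("alt"):
            self.colors.append((attrs["alt"], self.current_asin))

    def handle_endtag(self, tag):
        if tag == "li":
            self.current_asin = None


snippet = """
<li id="color_name_0" title="Cliquez pour sélectionner Tissu anthracite" data-defaultasin="B07PHPXHQS" class="swatchAvailable">
<img src="https://m.media-amazon.com/images/I/61sD09wyFML._SS36_.jpg" alt="Tissu anthracite" class="imgSwatch"></li>
<li id="color_name_1" title="Cliquez pour sélectionner Tissu prune" data-defaultasin="B07WLTKTXY" class="swatchSelect">
<img src="https://m.media-amazon.com/images/I/61mROAfn-NL._SS36_.jpg" alt="Tissu prune" class="imgSwatch"></li>
"""

collector = SwatchCollector()
collector.feed(snippet)
print(collector.colors)
```

Keeping the ASIN next to the color name makes it possible to build a /dp/<ASIN> link for each variant later.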


Getting product images

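Finding the image links took the most effort: they are written into the page's JavaScript, so the only option is to extract them with regular expressions.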

imgs_text=re.findall(r'ImageBlockATF(.+?)return data;',html,re.S)[0]
imgs=re.findall(r'"large":"(.+?)","main":',imgs_text,re.S)
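The two regular expressions work in stages: the first isolates the ImageBlockATF script body, and the second pulls each "large" URL out of it. A minimal sketch on a hand-written, heavily simplified stand-in for that script. The sample_html string and the function name extract_large_images are illustrative assumptions, and real pages may use a different layout. The function also returns an empty list instead of raising IndexError when the block is missing:

```python
import re


def extract_large_images(page_html):
    """Return the "large" image URLs from the ImageBlockATF script, if any."""
    blocks = re.findall(r'ImageBlockATF(.+?)return data;', page_html, re.S)
    if not blocks:
        return []  # block not found, e.g. the page layout is different
    return re.findall(r'"large":"(.+?)","main":', blocks[0], re.S)


# Simplified, hypothetical stand-in for the inline script on a product page
sample_html = """
<script>
P.when('A').register("ImageBlockATF", function(A){
  var data = {'colorImages': {'initial': [
    {"hiRes":null,"thumb":"https://example.com/a._US40_.jpg","large":"https://example.com/a.jpg","main":{}},
    {"hiRes":null,"thumb":"https://example.com/b._US40_.jpg","large":"https://example.com/b.jpg","main":{}}
  ]}};
  return data;
});
</script>
"""

print(extract_large_images(sample_html))
print(extract_large_images("<html></html>"))
```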
             

The images include the carousel images and the large images shown on mouse hover.

Images on the product detail page.

 

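A single page has more than 30,000 lines of code; digging out the data you need takes patient analysis, and the image data is probably the most troublesome part.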

 

Source code attached, for reference, study, and discussion only:

# -*- coding: utf-8 -*-
# Amazon France product scraper
# 20200524 by WeChat: huguo00289
# https://www.amazon.fr/dp/B07CNJTCBB/ref=twister_B07RVPW2GT?_encoding=UTF8&th=1

import re
import time

import requests
from fake_useragent import UserAgent
from lxml import etree


def ua():
    # Build request headers with a random User-Agent
    ua = UserAgent()
    headers = {"User-Agent": ua.random}
    return headers


def get_data(url):
    # The product id (ASIN) is the path segment between "dp/" and the next "/"
    id = re.findall(r'dp/(.+?)/', url, re.S)[0]
    print(f'>>> Product id from the link: {id}, scraping, please wait..')
    response = requests.get(url, headers=ua(), timeout=8)
    time.sleep(2)
    if response.status_code == 200:
        print(">>> Page fetched successfully!")
        html = response.content.decode('utf-8')
        # Save the raw page so it can be inspected offline
        with open(f'{id}.html', 'w', encoding='utf-8') as f:
            f.write(html)
        req = etree.HTML(html)
        # Product title
        h1 = req.xpath('//h1[@id="title"]/span[@id="productTitle"]/text()')
        print(h1)
        h1 = h1[0].strip()
        print(f'Product title: {h1}')
        # Product description
        productDescriptions = req.xpath('//div[@id="productDescription"]//text()')
        productDescription = ''.join(productDescriptions)
        print(f'Product description: {productDescription}')
        # Image URLs are embedded in JavaScript, so extract them with regexes
        imgs_text = re.findall(r'ImageBlockATF(.+?)return data;', html, re.S)[0]
        imgs = re.findall(r'"large":"(.+?)","main":', imgs_text, re.S)
        print(imgs)
        text = f'Product title: {h1}\nProduct description: {productDescription}\nProduct images: {imgs}'
        with open(f'{id}.txt', 'w', encoding='utf-8') as f:
            f.write(text)
        print(f">>> Product data saved as {id}.txt")
        # Variant swatches; the class string varies with the product category
        lis = req.xpath('//ul[@class="a-unordered-list a-nostyle a-button-list a-declarative a-button-toggle-group a-horizontal a-spacing-top-micro swatches swatchesSquare"]/li')
        if len(lis) > 1:
            print(f">>> The product has variants: {len(lis)} in total!")
            spans = req.xpath('//div[@class="twisterTextDiv text"]/span[@class="a-size-base"]/text()')
            print(spans)


if __name__ == '__main__':
    print("Amazon scraping tool - by WeChat official account: 二爷记")
    print("Bug reports: WeChat huguo00289")
    url = input("Enter the URL to scrape and press Enter to run: ")

    try:
        get_data(url)
    except Exception as e:
        # e is an exception object, so test its string form
        if "port=443" in str(e):
            print("Timed out fetching the page, retrying..")
            get_data(url)
    print("Scraping finished!")
    print("The program will close in 8s. Bug reports: WeChat huguo00289")
    time.sleep(8)
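Two fragile spots in the script above are the product id regex, which needs a "/" after the id, and the single retry on timeout. A minimal sketch of sturdier helpers. The names extract_asin and call_with_retry are hypothetical, the pattern assumes the usual 10-character alphanumeric ASIN form, and the fetch function is passed in so the example runs without network access:

```python
import re
import time


def extract_asin(url):
    """Return the 10-character id after /dp/, or None if absent."""
    match = re.search(r"/dp/([A-Z0-9]{10})(?:[/?]|$)", url)
    return match.group(1) if match else None


def call_with_retry(func, attempts=3, delay=0.0, retry_on=(Exception,)):
    """Call func() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on as exc:
            if attempt == attempts:
                raise  # give up: re-raise the last error
            print(f"attempt {attempt} failed ({exc}), retrying..")
            time.sleep(delay)


# Stand-in for a flaky network call: fails twice, then succeeds
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "<html>ok</html>"


print(extract_asin("https://www.amazon.fr/dp/B07CNJTCBB/ref=twister_B07RVPW2GT?_encoding=UTF8&th=1"))
print(call_with_retry(flaky_fetch))
```

When used with requests, retry_on could be narrowed to requests.exceptions.RequestException so that programming errors are not retried.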

 

             

 

 

 

Below is reference code for a US Amazon crawler:

 

# -*- coding: utf-8 -*-
"""
File Name:     amzone
Description :
Author :       meng_zhihao
mail :       312141830@qq.com
date:          2019/5/8
"""
# Amazon US (note: DOMAIN below is set to amazon.de)
import requests,urllib
import datetime
from urllib.parse import quote, unquote
from selenium_operate import ChromeOperate
import re
import time
from crawl_tool_for_py3 import crawlerTool as ct
import os,base64
import xlsxwriter
from PIL import Image
DOMAIN = 'https://www.amazon.de'

HEADERS = { 'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_1_1 like Mac OS X) AppleWebKit/602.2.14 (KHTML, like Gecko) Mobile/14B100 MicroMessenger/6.3.22 NetType/WIFI Language/zh_CN'
            }
se = requests.session()

def img_resize(infile,outfile):
    im = Image.open(infile)
    # (x, y) = im.size  # read image size
    x_s = 120  # target width
    y_s = 160  # target height (fixed, not derived from the width)
    out = im.resize((x_s, y_s), Image.LANCZOS)  # high-quality resampling; Image.ANTIALIAS was removed in Pillow 10
    out.save(outfile)


def gen_xls(item_infos):
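    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    os.makedirs('img', exist_ok=True)  # resized thumbnails are written to img/ below
    book = xlsxwriter.Workbook('amazon%s.xlsx'%timestamp)
    worksheet = book.add_worksheet('demo')
    worksheet.write_row(0,0, ['Keyword','Rank','Product image','Price','Category','Description','Product URL'])
    worksheet.set_column('A:D', 15) # one column-width unit is roughly 8 pixels; one row-height unit is roughly 1.37 pixels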
    worksheet.set_column('C:C', 20)
    worksheet.set_column('B:B', 10)
    worksheet.set_column('F:F', 50)
    for i in range(len(item_infos)):
        col = i+1
        try:
            item_info = item_infos[i]
            row =   [item_info['keyword'],item_info['rank'],'',item_info['price'],item_info['cat'],item_info['descriptions'],item_info['item_url']]
            worksheet.write_row(col,0, row)
            worksheet.set_row(col, 120)
            if 'item_pic_base64' in item_info:
                item_pic_base64 = item_info["item_pic_base64"]
                try:
                    if 'https:' in item_pic_base64:
                        data = ct.get(item_pic_base64)
                    else:
                        data = base64.b64decode(item_pic_base64)
                    with open('test.png', 'wb') as f:
                        f.write(data)
                    img_resize('test.png', 'img/tmp%s.png'%i)
                    worksheet.insert_image( col,2, 'img/tmp%s.png'%i) # file names must be distinct
                except Exception as e:
                    print(str(e))
        except Exception as e:
            print(str(e))
    print('Number of results written: %s' % len(item_infos))
    book.close()


def extractor_page(page): # parse a product page
    item_info = {"descriptions":""}
    descriptions = ct.getXpath('//div[@id="productDescription"]/p/text()',page)
    if not descriptions:
        descriptions = ct.getXpath( '//div[@id="aplus"]/div//p//text()', page)
    descriptions= ''.join([description.strip() for description in descriptions])
    item_info["descriptions"] = descriptions
    item_pic_base64 = ct.getXpath1( '//div[@id="imgTagWrapperId"]/img/@src', page).split('base64,')[-1]
    item_info["item_pic_base64"] = item_pic_base64
    price = ct.getXpath1( '//span[@id="priceblock_ourprice"]/text()', page)
    item_info["price"] = price
    cats =  ct.getXpath( '//div[@id="wayfinding-breadcrumbs_container"]//a/text()', page)
    item_info["cat"] = '/'.join([cat.strip() for cat in cats])
    for k in item_info:
        print(k)
    return item_info

if __name__ == '__main__':
    #start_url = 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json?since_id=-1&count=15&category=105'
    csv_rows=[]
    cookie = {}
    item_infos = []
    cop = ChromeOperate(executable_path=r'chromedriver.exe')
    cop.open(DOMAIN)
    with open('keywords.txt','r') as keyword_file:
        for line in keyword_file:
            line = line.strip()
            if not line:
                continue
            urls = [DOMAIN+'/s?k=%s&ref=nb_sb_noss_2'%quote(line),
                    # 'https://www.amazon.com/s?k=%s&ref=nb_sb_noss_2&page=2 ' % quote(line)
                    ]
            rank = 0
            for url in urls:
                # HEADERS.update({"Referer":url,"User-Agent":random.choice(USER_AGENT_POOL)})
                cop.open(url)
                page = cop.open_source()
                item_urls = ct.getXpath('//div[@class="sg-row"]//div[@class="sg-col-inner"]//h2/a/@href',page)
                if not item_urls:
                    print(page)
                for item_url in item_urls:
                    rank += 1
                    try:
                        if not 'qid' in item_url:
                            continue
                        else:
                            item_url = DOMAIN+item_url
                            cop.open(item_url)
                            page = cop.driver.page_source
                            if 'Kindle Edition' in page:
                                continue
                            item_info = extractor_page(page)
                            if 'Type the characters you see' in page  :
                                print('IP has been blocked',url)
                                time.sleep(10)
                                # print page
                                break
                            item_info['keyword'] = line
                            item_info['rank'] = rank
                            item_info['item_url'] = item_url.split('?')[0]
                            item_infos.append(item_info)
                    except Exception as e:
                        print(str(e))
    gen_xls(item_infos)
    cop.quit()
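In the script above, the src of the main image is either an inline base64 data URI or a plain https: URL, and gen_xls branches on that. A minimal sketch of that branch as a standalone helper. image_bytes_from_src is a hypothetical name, and downloading is left to a caller-supplied function so the example stays offline:

```python
import base64


def image_bytes_from_src(src, download=None):
    """Return raw image bytes for a data URI or, via `download`, for a URL."""
    if src.startswith(("http://", "https://")):
        if download is None:
            raise ValueError("a download function is required for URLs")
        return download(src)
    # data:image/png;base64,AAAA... -> keep only the payload after "base64,"
    payload = src.split("base64,")[-1]
    return base64.b64decode(payload)


# The 8-byte PNG signature, encoded as a data URI for illustration
png_signature = b"\x89PNG\r\n\x1a\n"
data_uri = "data:image/png;base64," + base64.b64encode(png_signature).decode("ascii")

print(image_bytes_from_src(data_uri) == png_signature)
print(image_bytes_from_src("https://example.com/a.png", download=lambda u: b"fake"))
```

Checking the URL prefix rather than searching for 'https:' anywhere in the string avoids misclassifying a base64 payload.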
 

 

————————————————
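Copyright notice: This is an original article by CSDN blogger 「二爷记」, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/minge89/article/details/106417047/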




http://www.chinasem.cn/article/685966
