[Web Scraping] Scrapy Feed Exports


[Original link] https://doc.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-exports

Feed exports

New in version 0.10.

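One of the most frequently required features when implementing scrapers is being able to store the scraped data properly. Quite often, that means generating an “export file” containing the scraped data (commonly called an “export feed”) to be consumed by other systems.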

Scrapy provides this functionality out of the box with the Feed Exports, which allow you to generate a feed from the scraped items, using multiple serialization formats and storage backends.

Serialization formats

Feed exports use the Item exporters to serialize the scraped data. These formats are supported out of the box:

JSON
JSON lines
CSV
XML

You can also extend the supported formats through the FEED_EXPORTERS setting.

JSON

FEED_FORMAT: json
Exporter used: JsonItemExporter
See this warning if you’re using JSON with large feeds.

JSON lines

FEED_FORMAT: jsonlines
Exporter used: JsonLinesItemExporter
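
To make the difference between the two JSON-based formats concrete, here is a minimal sketch that uses only Python's standard json module. It does not call Scrapy's exporters; the variable names are made up for illustration. It shows why JSON lines is easier to handle for large feeds: each line is a complete document that can be parsed on its own.

```python
import json

items = [{"id": 1}, {"id": 2}]

# "json" style: the whole feed is one array, so a reader needs the full document.
as_json = json.dumps(items)

# "jsonlines" style: one self-contained JSON object per line.
as_json_lines = "".join(json.dumps(item) + "\n" for item in items)

# A consumer can process a JSON lines feed one line at a time.
parsed = [json.loads(line) for line in as_json_lines.splitlines()]

print(as_json)
print(as_json_lines, end="")
```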

CSV

FEED_FORMAT: csv
Exporter used: CsvItemExporter
To specify the columns to export and their order, use FEED_EXPORT_FIELDS. Other feed exporters can also use this option, but it is important for CSV because, unlike many other export formats, CSV uses a fixed header.

XML

FEED_FORMAT: xml
Exporter used: XmlItemExporter

Pickle

FEED_FORMAT: pickle
Exporter used: PickleItemExporter

Marshal

FEED_FORMAT: marshal
Exporter used: MarshalItemExporter

Storages

When using the feed exports, you define where to store the feed with a URI (through the FEED_URI setting). The feed exports support multiple storage backend types, which are selected by the URI scheme.

The storage backends supported out of the box are:

Local filesystem
FTP
S3 (requires botocore or boto)
Standard output

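Some storage backends may be unavailable if the required external libraries are not installed. For example, the S3 backend is only available if the botocore or boto library is installed (Scrapy supports boto only on Python 2).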

Storage URI parameters

The storage URI can also contain parameters that get replaced when the feed is being created. These parameters are:

%(time)s - gets replaced by a timestamp when the feed is being created
%(name)s - gets replaced by the spider name

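Any other named parameter gets replaced by the spider attribute of the same name. For example, %(site_id)s would get replaced by the spider.site_id attribute at the moment the feed is being created.

Here are some examples to illustrate: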

Store in FTP using one directory per spider:
ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json
Store in S3 using one directory per spider:
s3://mybucket/scraping/feeds/%(name)s/%(time)s.json
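
The substitution described above behaves like Python's %-style formatting with a dictionary. The following is a rough sketch of that idea, not Scrapy's actual code: DemoSpider, render_feed_uri, and the exact timestamp layout are assumptions made for illustration.

```python
from datetime import datetime, timezone


class DemoSpider:
    """Hypothetical stand-in for a spider; only its attributes matter here."""
    name = "books"
    site_id = 42


def render_feed_uri(template, spider, now=None):
    """Fill %(time)s, %(name)s and other spider attributes into a URI template."""
    now = now or datetime.now(timezone.utc)
    # Expose public spider attributes so %(name)s, %(site_id)s, etc. resolve.
    params = {k: getattr(spider, k) for k in dir(spider) if not k.startswith("_")}
    # Assumed timestamp layout: ISO 8601 with colons replaced for file-name safety.
    stamp = now.replace(microsecond=0, tzinfo=None).isoformat()
    params["time"] = stamp.replace(":", "-")
    return template % params


fixed_time = datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
uri = render_feed_uri(
    "s3://mybucket/scraping/feeds/%(name)s/%(site_id)s/%(time)s.json",
    DemoSpider(),
    fixed_time,
)
print(uri)
```

A template that names a parameter the spider does not define would raise a KeyError in this sketch, so every placeholder needs a matching spider attribute.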

Storage backends

Local filesystem

The feeds are stored in the local filesystem.

URI scheme: file
Example URI: file:///tmp/export.csv
Required external libraries: none

Note that for the local filesystem storage (only) you can omit the scheme and use an absolute path like /tmp/export.csv. This only works on Unix systems.

FTP

The feeds are stored on an FTP server.

URI scheme: ftp
Example URI: ftp://user:pass@ftp.example.com/path/to/export.csv
Required external libraries: none

S3

The feeds are stored on Amazon S3.

URI scheme: s3
Example URIs:
s3://mybucket/path/to/export.csv
s3://aws_key:aws_secret@mybucket/path/to/export.csv
Required external libraries: botocore (Python 2 and Python 3) or boto (Python 2 only)

The AWS credentials can be passed as user/password in the URI, or they can be passed through the following settings:

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
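
As a sketch of the settings-based approach, a settings.py could look roughly like the fragment below. Reading the keys from environment variables is an assumption made here to keep secrets out of the URI and out of source control; the bucket name and path are the placeholders from the example URIs above.

```python
# settings.py (sketch)
import os

# Placeholder bucket and path taken from the example URI above.
FEED_URI = "s3://mybucket/path/to/export.csv"
FEED_FORMAT = "csv"

# Assumption: the credentials are provided through environment variables.
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY")
```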

Standard output

The feeds are written to the standard output of the Scrapy process.

URI scheme: stdout
Example URI: stdout:
Required external libraries: none

Settings

These are the settings used for configuring the feed exports:

FEED_URI (mandatory)
FEED_FORMAT
FEED_STORAGES
FEED_EXPORTERS
FEED_STORE_EMPTY
FEED_EXPORT_ENCODING
FEED_EXPORT_FIELDS
FEED_EXPORT_INDENT

FEED_URI

Default: None

The URI of the export feed. See Storage backends for supported URI schemes.

This setting is required for enabling the feed exports.

FEED_FORMAT

The serialization format to be used for the feed. See Serialization formats for possible values.
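
Putting the two settings together, a minimal configuration might look like the sketch below. It assumes the Scrapy version this page documents, where FEED_URI and FEED_FORMAT are the settings that enable feed exports; the output path is only an example.

```python
# settings.py (sketch)

# Where to store the feed: local file, one file per spider and run.
FEED_URI = "file:///tmp/%(name)s-%(time)s.jl"

# How to serialize the items: one JSON object per line.
FEED_FORMAT = "jsonlines"
```

The same two values can usually be supplied per run on the command line with the -s NAME=VALUE option of scrapy crawl instead of being hard-coded in the project settings.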

FEED_EXPORT_ENCODING

Default: None

The encoding to be used for the feed.

If unset or set to None (default), UTF-8 is used for everything except JSON output, which uses safe numeric encoding (\uXXXX sequences) for historic reasons.

Use utf-8 if you want UTF-8 for JSON too.
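
The effect on JSON output can be illustrated with the standard json module alone. This is only an analogy for what the setting changes, not Scrapy's exporter code: the default corresponds to escaped output, and utf-8 corresponds to keeping the characters as they are.

```python
import json

item = {"title": "café"}

# Comparable to the default (None): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(item)

# Comparable to FEED_EXPORT_ENCODING = "utf-8": characters are written as-is.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)
```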

FEED_EXPORT_FIELDS

Default: None

A list of fields to export, optional. Example: FEED_EXPORT_FIELDS = ["foo", "bar", "baz"].

Use the FEED_EXPORT_FIELDS option to define which fields to export and their order.

When FEED_EXPORT_FIELDS is empty or None (default), Scrapy uses the fields defined in the dicts or Item subclasses that the spider yields.

If an exporter requires a fixed set of fields (this is the case for the CSV export format) and FEED_EXPORT_FIELDS is empty or None, then Scrapy tries to infer the field names from the exported data. Currently it uses the field names of the first item.
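
To see why a fixed field list matters for CSV, consider this small sketch built on the standard csv module. It is not Scrapy's CsvItemExporter; it only mimics the idea that the header and the column order come from the configured list rather than from whatever keys each item happens to carry.

```python
import csv
import io

FEED_EXPORT_FIELDS = ["foo", "bar", "baz"]

items = [
    {"baz": 3, "foo": 1, "bar": 2},  # key order differs from the header
    {"foo": 4, "bar": 5},            # "baz" is missing, so its cell stays empty
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=FEED_EXPORT_FIELDS, restval="", lineterminator="\n"
)
writer.writeheader()
writer.writerows(items)

output = buffer.getvalue()
print(output, end="")
```

Without an explicit list, the header would have to be guessed from the first item, and any field that only appears in later items would have no column.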

FEED_EXPORT_INDENT

Default: 0

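The number of spaces used to indent the output at each level. If FEED_EXPORT_INDENT is a non-negative integer, array elements and object members are pretty-printed with that indent level. An indent level of 0 (the default), or a negative value, puts each item on a new line. None selects the most compact representation.

Currently this is implemented only by JsonItemExporter and XmlItemExporter, i.e. when you are exporting to .json or .xml.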
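
The pretty-printing behaviour resembles the indent argument of the standard json module. The sketch below uses that module directly for a single item and is only meant to show what an indent level of 4 looks like next to the compact form; Scrapy's exporters may differ in details such as how items are separated.

```python
import json

item = {"name": "foo", "tags": ["a", "b"]}

# Most compact form: everything on one line.
compact = json.dumps(item, indent=None)

# Indent level 4: each nesting level is indented by four spaces.
pretty = json.dumps(item, indent=4)

print(compact)
print(pretty)
```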

FEED_STORE_EMPTY

Default: False

Whether to export empty feeds (i.e. feeds with no items).

FEED_STORAGES

Default: {}

A dict containing additional feed storage backends supported by your project. The keys are URI schemes and the values are paths to storage classes.

FEED_STORAGES_BASE

Default:

{
    '': 'scrapy.extensions.feedexport.FileFeedStorage',
    'file': 'scrapy.extensions.feedexport.FileFeedStorage',
    'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
    's3': 'scrapy.extensions.feedexport.S3FeedStorage',
    'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
}

A dict containing the built-in feed storage backends supported by Scrapy. You can disable any of these backends by assigning None to their URI scheme in FEED_STORAGES. E.g., to disable the built-in FTP storage backend (without replacement), place this in your settings.py:

FEED_STORAGES = {
    'ftp': None,
}

FEED_EXPORTERS

Default: {}

A dict containing additional exporters supported by your project. The keys are serialization formats and the values are paths to Item exporter classes.

FEED_EXPORTERS_BASE

Default:

{
    'json': 'scrapy.exporters.JsonItemExporter',
    'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
    'jl': 'scrapy.exporters.JsonLinesItemExporter',
    'csv': 'scrapy.exporters.CsvItemExporter',
    'xml': 'scrapy.exporters.XmlItemExporter',
    'marshal': 'scrapy.exporters.MarshalItemExporter',
    'pickle': 'scrapy.exporters.PickleItemExporter',
}

A dict containing the built-in feed exporters supported by Scrapy. You can disable any of these exporters by assigning None to their serialization format in FEED_EXPORTERS. E.g., to disable the built-in CSV exporter (without replacement), place this in your settings.py: