This article introduces Scrapy Feed Exports (for web scraping); we hope it offers a useful reference for developers solving related programming problems. Follow along and learn with us!
[Original link] https://doc.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-exports
Feed exports
New in version 0.10.
One of the most frequently required features when implementing scrapers is being able to store the scraped data properly, and quite often that means generating an "export file" with the scraped data (commonly called an "export feed") to be consumed by other systems.
Scrapy provides this functionality out of the box with the Feed Exports, which let you generate a feed with the scraped items, using multiple serialization formats and storage backends.
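For instance, a minimal configuration sketch might look like this (the output path is a hypothetical example):

# settings.py
FEED_URI = 'file:///tmp/items.json'  # where to store the feed (see Storage backends below)
FEED_FORMAT = 'json'                 # how to serialize it (see Serialization formats below)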
Serialization formats
The feed exports use Item exporters to serialize the scraped data. These formats are supported out of the box:
- JSON
- JSON lines
- CSV
- XML
But you can also extend the supported formats through the FEED_EXPORTERS setting (see below).
JSON
- FEED_FORMAT: json
- Exporter used: JsonItemExporter
- See this warning if you’re using JSON with large feeds.
JSON lines
- FEED_FORMAT: jsonlines
- Exporter used: JsonLinesItemExporter
CSV
- FEED_FORMAT: csv
- Exporter used: CsvItemExporter
- To specify the columns to export and their order, use FEED_EXPORT_FIELDS. Other feed exporters can also use this option, but it is important for CSV because unlike many other export formats CSV uses a fixed header.
XML
- FEED_FORMAT: xml
- Exporter used: XmlItemExporter
Pickle
- FEED_FORMAT: pickle
- Exporter used: PickleItemExporter
Marshal
- FEED_FORMAT: marshal
- Exporter used: MarshalItemExporter
Storages
When using the feed exports, you define where to store the feed using a URI (through the FEED_URI setting). The feed exports support multiple storage backend types, which are defined by the URI scheme.
The storage backends supported out of the box are:
- Local filesystem
- FTP
- S3 (requires botocore or boto)
- Standard output
Some storage backends may be unavailable if the required external libraries are not installed. For example, the S3 backend is only available if the botocore or boto library is installed (Scrapy supports boto only on Python 2).
Storage URI parameters
The storage URI can also contain parameters that get replaced when the feed is being created. These parameters are:
- %(time)s - gets replaced by a timestamp when the feed is being created
- %(name)s - gets replaced by the spider name
Any other named parameter gets replaced by the spider attribute of the same name. For example, %(site_id)s would get replaced by the spider.site_id attribute the moment the feed is being created (see the spider sketch after the examples below).
Here are some examples to illustrate:
- Store in FTP using one directory per spider:
ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json
- Store in S3 using one directory per spider:
s3://mybucket/scraping/feeds/%(name)s/%(time)s.json
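Tying the parameters together, here is a minimal spider sketch (the spider name, URL, and site_id value are hypothetical):

import scrapy

class MySiteSpider(scrapy.Spider):
    name = 'mysite'
    start_urls = ['http://example.com']
    site_id = 42  # hypothetical custom attribute, referenced by %(site_id)s below

    custom_settings = {
        # %(name)s -> 'mysite', %(site_id)s -> 42, %(time)s -> creation timestamp
        'FEED_URI': 'file:///tmp/feeds/%(name)s/%(site_id)s-%(time)s.json',
        'FEED_FORMAT': 'json',
    }

    def parse(self, response):
        yield {'url': response.url}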
存储后端
Local filesystem
The feeds are stored in the local filesystem.
- URI scheme: file
- Example URI: file:///tmp/export.csv
- Required external libraries: none
Note that for the local filesystem storage (only) you can omit the scheme and use an absolute path like /tmp/export.csv. This only works on Unix systems.
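For a quick local export you can also pass the output file on the command line with the -o option, which sets FEED_URI and infers the format from the file extension (the spider name here is hypothetical):

scrapy crawl myspider -o /tmp/export.csv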
FTP
The feeds are stored on an FTP server.
- URI scheme: ftp
- Example URI: ftp://user:pass@ftp.example.com/path/to/export.csv
- Required external libraries: none
S3
The feeds are stored on Amazon S3.
- URI scheme: s3
- Example URIs:
s3://mybucket/path/to/export.csv
s3://aws_key:aws_secret@mybucket/path/to/export.csv
- Required external libraries: botocore (Python 2 and Python 3) or boto (Python 2 only)
The AWS credentials can be passed as user/password in the URI, or they can be passed through the following settings:
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
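For example, to keep the credentials out of the URI (the key values below are AWS's documented placeholders, not real credentials):

# settings.py
AWS_ACCESS_KEY_ID = 'AKIAIOSFODNN7EXAMPLE'
AWS_SECRET_ACCESS_KEY = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
FEED_URI = 's3://mybucket/path/to/export.csv'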
Standard output
The feeds are written to the standard output of the Scrapy process.
- URI scheme: stdout
- Example URI: stdout:
- Required external libraries: none
Settings
These are the settings used for configuring the feed exports:
- FEED_URI (mandatory)
- FEED_FORMAT
- FEED_STORAGES
- FEED_EXPORTERS
- FEED_STORE_EMPTY
- FEED_EXPORT_ENCODING
- FEED_EXPORT_FIELDS
- FEED_EXPORT_INDENT
FEED_URI
Default: None
The URI of the export feed. See Storage backends for supported URI schemes.
This setting is required for enabling the feed exports.
FEED_FORMAT
The serialization format to be used for the feed. See Serialization formats for possible values.
FEED_EXPORT_ENCODING
Default: None
The encoding to be used for the feed.
If unset or set to None (the default), UTF-8 is used for everything except JSON output, which uses safe numeric encoding (\uXXXX sequences) for historic reasons.
Use utf-8 if you want UTF-8 for JSON too.
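For example, to get plain UTF-8 in JSON feeds as well:

# settings.py
FEED_EXPORT_ENCODING = 'utf-8'
# without this, a JSON feed escapes non-ASCII text,
# e.g. writing "caf\u00e9" instead of "café"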
FEED_EXPORT_FIELDS
Default: None
A list of fields to export, optional. Example: FEED_EXPORT_FIELDS = ["foo", "bar", "baz"].
Use the FEED_EXPORT_FIELDS option to define the fields to export and their order.
When FEED_EXPORT_FIELDS is empty or None (the default), Scrapy uses the fields defined in the dicts or Item subclasses the spider is yielding.
If an exporter requires a fixed set of fields (this is the case for the CSV export format) and FEED_EXPORT_FIELDS is empty or None, then Scrapy tries to infer the field names from the exported data - currently it uses the field names from the first item.
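A short sketch (the field names are hypothetical):

# settings.py
FEED_EXPORT_FIELDS = ['title', 'price', 'url']
# a CSV feed now always uses the header "title,price,url", in that order,
# even if individual items omit some of these fields or carry extra keys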
FEED_EXPORT_INDENT
Default: 0
Amount of spaces used to indent the output on each level. If FEED_EXPORT_INDENT is a non-negative integer, then array elements and object members will be pretty-printed with that indent level. An indent level of 0 (the default), or negative, will put each item on a new line. None selects the most compact representation.
Currently implemented only by JsonItemExporter and XmlItemExporter, i.e. when you are exporting to .json or .xml.
FEED_STORE_EMPTY
Default: False
Whether to export empty feeds (i.e. feeds with no items).
FEED_STORAGES
Default: {}
A dict containing additional feed storage backends supported by your project. The keys are URI schemes and the values are paths to storage classes.
FEED_STORAGES_BASE
Default:
{
    '': 'scrapy.extensions.feedexport.FileFeedStorage',
    'file': 'scrapy.extensions.feedexport.FileFeedStorage',
    'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
    's3': 'scrapy.extensions.feedexport.S3FeedStorage',
    'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
}
A dict containing the built-in feed storage backends supported by Scrapy. You can disable any of these backends by assigning None
to their URI scheme in FEED_STORAGES
. E.g., to disable the built-in FTP storage backend (without replacement), place this in your settings.py
:
FEED_STORAGES = {'ftp': None, }
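Conversely, to register your own backend, map a new URI scheme to a storage class. Below is a minimal sketch, assuming the open()/store() interface that Scrapy's built-in backends follow; the scheme name, module path, and class are hypothetical:

# myproject/storages.py (hypothetical module)
from io import BytesIO

class DummyFeedStorage(object):
    """Collect the feed in memory and discard it; for illustration only."""

    def __init__(self, uri):
        self.uri = uri  # the FEED_URI this storage was created for

    def open(self, spider):
        # return a file-like object; the exporter writes the feed into it
        return BytesIO()

    def store(self, file):
        # called once the feed is complete; a real backend would persist it
        file.close()

# settings.py
FEED_STORAGES = {'dummy': 'myproject.storages.DummyFeedStorage'}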
FEED_EXPORTERS
Default: {}
A dict containing additional exporters supported by your project. The keys are serialization formats and the values are paths to Item exporter classes.
FEED_EXPORTERS_BASE
Default:
{
    'json': 'scrapy.exporters.JsonItemExporter',
    'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
    'jl': 'scrapy.exporters.JsonLinesItemExporter',
    'csv': 'scrapy.exporters.CsvItemExporter',
    'xml': 'scrapy.exporters.XmlItemExporter',
    'marshal': 'scrapy.exporters.MarshalItemExporter',
    'pickle': 'scrapy.exporters.PickleItemExporter',
}
A dict containing the built-in feed exporters supported by Scrapy. You can disable any of these exporters by assigning None
to their serialization format in FEED_EXPORTERS
. E.g., to disable the built-in CSV exporter (without replacement), place this in your settings.py
:
FEED_EXPORTERS = {'csv': None, }
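Similarly, to add a format of your own, point a new key at an Item exporter class. A minimal sketch subclassing BaseItemExporter (the format name and module path are hypothetical):

# myproject/exporters.py (hypothetical module)
from scrapy.exporters import BaseItemExporter

class TsvItemExporter(BaseItemExporter):
    """Write one tab-separated line per item; for illustration only."""

    def __init__(self, file, **kwargs):
        self._configure(kwargs)
        self.file = file  # binary file-like object provided by the feed storage

    def export_item(self, item):
        values = [str(v) for v in dict(item).values()]
        self.file.write(('\t'.join(values) + '\n').encode('utf-8'))

# settings.py
FEED_EXPORTERS = {'tsv': 'myproject.exporters.TsvItemExporter'}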
That concludes this article on Scrapy Feed Exports; we hope it helps our fellow programmers!