Scrapy + MySQL: Crawling the Douban Movie Top 250

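Honestly, I don't know why, but ask anyone who has written a crawler, whether as a hobby or as a beginner just getting started, and they will tell you they practiced on Douban. It has reached the point where you can hardly claim to have done web scraping if you have never crawled Douban. Anyway, on to the topic.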

I. System Environment

Python version: 2.7.12 (64-bit)

Scrapy version: 1.4.0

MySQL version: 5.6.35 (64-bit)

OS version: Windows 10 (64-bit)

MySQLdb version: MySQL-python-1.2.3.win-amd64-py2.7 (64-bit)

IDE: PyCharm 2016.3.3 (64-bit)

II. Installing the MySQL Database

2.1 Installing MySQLdb

OK, by this point MySQL itself has been installed successfully, so the next step is to install MySQLdb.

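2.2 What Is MySQLdb?

MySQLdb is the interface Python uses to connect to a MySQL database. It implements the Python Database API Specification v2.0 and is built on top of the MySQL C API; put simply, it plays a role similar to JDBC in Java.

2.3 How to Install MySQLdb

You currently have two options:

1. Install a precompiled build (strongly recommended).

2. Download the source from the official site and compile it yourself (whether this goes smoothly depends a lot on luck; if you enjoy tinkering, give it a try, but it is not covered here, so search for instructions yourself).

OK, we will go with the first option. Download page: http://www.codegood.com/downloads. Pick the build that matches your system, double-click the downloaded installer, change the install path if you like, and click Next through to the end.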

[Figure: image.png]

2.4 Verifying the MySQLdb Installation

Open cmd, run python, then enter import MySQLdb. If no error is reported, MySQLdb was installed successfully.

[Figure: image.png]

2.5 How to Use MySQLdb
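Because MySQLdb follows the Python DB-API 2.0, its basic workflow is: connect, get a cursor, execute, commit, close. The following is a minimal sketch of that workflow, not the author's code. It uses the standard-library sqlite3 module as a stand-in so that it runs without a MySQL server, and the table demo_movie and its columns are made up for illustration. With MySQLdb, the connection would instead come from MySQLdb.connect(host=..., user=..., passwd=..., db=..., charset='utf8'), and the parameter placeholder would be %s rather than ?.

```python
import sqlite3

# Stand-in for MySQLdb.connect(...): DB-API 2.0 drivers share these steps.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
try:
    # Hypothetical table used only for this illustration.
    cursor.execute("CREATE TABLE demo_movie (title TEXT, score REAL)")
    # Parameterized query: sqlite3 uses "?", MySQLdb uses "%s".
    cursor.execute(
        "INSERT INTO demo_movie (title, score) VALUES (?, ?)",
        ("Movie A", 9.6),
    )
    # Make the insert permanent.
    conn.commit()
    cursor.execute("SELECT title, score FROM demo_movie")
    rows = cursor.fetchall()
except Exception:
    # Undo any partial work before re-raising.
    conn.rollback()
    raise
finally:
    # Always release the cursor and the connection.
    cursor.close()
    conn.close()

print(rows)
```

Passing values as a separate tuple, rather than formatting them into the SQL string, lets the driver handle quoting and avoids SQL injection.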

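2.6 Getting Familiar with XPath

When scraping web pages, the most common task is extracting data from the HTML source. Several existing libraries can do this:

BeautifulSoup: a web-page parsing library that is very popular among programmers. It builds a Python object based on the structure of the HTML and copes reasonably well with bad markup, but it has one drawback: it is slow.

lxml: an XML parsing library (it can also parse HTML) with a Pythonic API based on ElementTree. lxml is not part of the Python standard library.

XPath: the XML Path Language, a language for addressing parts of an XML document (XML being a subset of the Standard Generalized Markup Language). XPath is based on the tree structure of XML, distinguishes node types such as element, attribute, and text nodes, and provides the ability to locate nodes in that tree.

Scrapy has its own mechanism for extracting data. These are called selectors, because they "select" certain parts of the HTML document using XPath or CSS expressions.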
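To make the selector idea concrete, here is a minimal sketch of XPath-style selection. It is only an illustration under stated assumptions: the HTML snippet is a made-up, well-formed fragment shaped like the Top 250 list, and the code uses the standard-library xml.etree.ElementTree, which supports only a limited XPath subset (there is no text() step, so .text is read from the element instead), so that it runs without Scrapy installed. In a spider, equivalent expressions would be passed to response.xpath().

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment that mimics the list structure.
HTML = """
<html><body>
<ol class="grid_view">
  <li><div class="item">
    <div class="pic"><em>1</em></div>
    <div class="hd"><a href="#"><span>Movie A</span><span>Alias A</span></a></div>
  </div></li>
  <li><div class="item">
    <div class="pic"><em>2</em></div>
    <div class="hd"><a href="#"><span>Movie B</span><span>Alias B</span></a></div>
  </div></li>
</ol>
</body></html>
"""

root = ET.fromstring(HTML.strip())

rows = []
# Select every <li> directly under the <ol class="grid_view"> element.
for li in root.findall(".//ol[@class='grid_view']/li"):
    # Paths starting with ".//" are evaluated relative to the current <li>.
    rank = li.find(".//div[@class='pic']/em").text
    # span[1] picks the first <span> child (XPath positions start at 1).
    title = li.find(".//div[@class='hd']/a/span[1]").text
    rows.append((rank, title))

print(rows)
```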

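OK, with this groundwork done, we can start writing the crawler. We use the Douban Movie Top 250 as the example: https://movie.douban.com/top250

III. Writing the Crawler

First, open this address in Chrome or Firefox and analyze the page's HTML structure; press F12 to open the developer tools and inspect the page source. The analysis shows that all the information we want to extract is wrapped in the ol element whose class attribute is grid_view. That fixes the parsing scope: treat this ol as the outer boundary and locate everything else inside it.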

[Figure: image.png]

I won't go into the details here; let's go straight to the code:

The complete code has been uploaded to GitHub at git@github.com:hu1991die/douan_movie_spider.git. Forks and clones are welcome!

1. DoubanMovieTop250Spider.py

# encoding: utf-8
'''
@author: feizi
@file: DoubanMovieTop250Spider.py
@Software: PyCharm
@desc:
'''
import re

from scrapy import Request
from scrapy.spiders import Spider

from douan_movie_spider.items import DouanMovieItem


class DoubanMovieTop250Spider(Spider):
    name = 'douban_movie_top250'

    def start_requests(self):
        url = 'https://movie.douban.com/top250'
        yield Request(url)

    def parse(self, response):
        movieList = response.xpath('//ol[@class="grid_view"]/li')
        for movie in movieList:
            # Create a fresh item per movie so yielded items do not share state
            item = DouanMovieItem()
            # Rank
            rank = movie.xpath('.//div[@class="pic"]/em/text()').extract_first()
            # Cover image
            cover = movie.xpath('.//div[@class="pic"]/a/img/@src').extract_first()
            # Title
            title = movie.xpath('.//div[@class="hd"]/a/span[1]/text()').extract_first()
            # Score
            score = movie.xpath('.//div[@class="star"]/span[@class="rating_num"]/text()').extract_first()
            # Number of ratings (None instead of an IndexError when nothing matches)
            comment_num = movie.xpath('.//div[@class="star"]/span[4]/text()').re_first(r'(\d+)')
            # Quote
            quote = movie.xpath('.//p[@class="quote"]/span[@class="inq"]/text()').extract_first()
            # Release year, region and genres; default to '' so they are always defined
            years, region, types = '', '', ''
            briefList = movie.xpath('.//div[@class="bd"]/p/text()').extract()
            if len(briefList) > 1:
                # Split on '/'
                briefs = re.split(r'/', briefList[1])
                # Genres
                types = re.compile(u'([\u4e00-\u9fa5].*)').findall(briefs[len(briefs) - 1])[0]
                # Region
                region = re.compile(u'([\u4e00-\u9fa5]+)').findall(briefs[len(briefs) - 2])[0]
                if len(briefs) <= 3:
                    # Release year
                    years = re.compile(r'(\d+)').findall(briefs[len(briefs) - 3])[0]
                else:
                    # Several release years: take the first number from each part
                    years = ''
                    for brief in briefs:
                        if hasNumber(brief):
                            years = years + re.compile(r'(\d+)').findall(brief)[0] + ","
                    print(years)
                if types:
                    # Replace spaces with ","
                    types = types.replace(" ", ",")
            print(rank, cover, title, score, comment_num, quote, years, region, types)
            item['rank'] = rank
            item['cover'] = cover
            item['title'] = title
            item['score'] = score
            item['comment_num'] = comment_num
            item['quote'] = quote
            item['years'] = years
            item['region'] = region
            item['types'] = types
            yield item

        # Get the URL of the next page
        next_url = response.xpath('//span[@class="next"]/a/@href').extract_first()
        if next_url:
            # The href is relative (a query string), so resolve it against the current URL
            next_url = response.urljoin(next_url)
            yield Request(next_url)


def hasNumber(text):
    # True if the string contains at least one digit
    return bool(re.search(r'\d+', text))
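The year/region/genre extraction in parse() is easier to follow in isolation. Below is a minimal standalone sketch of the same split-and-regex steps, in a slightly simplified variant (it looks for years only in the parts before the region and joins them without a trailing comma). It assumes the second text node of the bd paragraph looks like '1994 / 美国 / 犯罪 剧情'; the function name parse_brief and the sample string are made up for illustration, and the real Douban markup may differ.

```python
# -*- coding: utf-8 -*-
import re

CJK_RUN = re.compile(u'[\u4e00-\u9fa5]+')      # first run of Chinese characters
CJK_TO_END = re.compile(u'[\u4e00-\u9fa5].*')  # first Chinese character to end of line
DIGITS = re.compile(r'\d+')                    # first run of digits


def parse_brief(line):
    """Split a 'year / region / genres' line into three strings."""
    parts = line.split('/')
    if len(parts) < 3:
        # Not enough fields to interpret; return empty values.
        return '', '', ''
    genres_match = CJK_TO_END.search(parts[-1])
    region_match = CJK_RUN.search(parts[-2])
    # Every part before the region may carry a year (some films list several).
    year_matches = (DIGITS.search(part) for part in parts[:-2])
    years = [m.group() for m in year_matches if m]
    genres = ''
    if genres_match:
        # Genres are space-separated on the page; store them comma-separated.
        genres = genres_match.group().strip().replace(' ', ',')
    region = region_match.group() if region_match else ''
    return ','.join(years), region, genres


print(parse_brief(u'1994 / 美国 / 犯罪 剧情'))
```

Returning empty strings instead of indexing into an empty match list keeps one oddly formatted entry from aborting the whole page.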

2. items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


# Movie item
class DouanMovieItem(scrapy.Item):
    # Rank
    rank = scrapy.Field()
    # Cover image
    cover = scrapy.Field()
    # Title
    title = scrapy.Field()
    # Score
    score = scrapy.Field()
    # Number of ratings
    comment_num = scrapy.Field()
    # Quote
    quote = scrapy.Field()
    # Release year
    years = scrapy.Field()
    # Release region
    region = scrapy.Field()
    # Genres
    types = scrapy.Field()

3. pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import MySQLdb
from scrapy.exceptions import DropItem

from douan_movie_spider.items import DouanMovieItem


# Open a database connection
def getDbConn():
    conn = MySQLdb.Connect(
        host='127.0.0.1',
        port=3306,
        user='root',
        passwd='123456',
        db='testdb',
        charset='utf8'
    )
    return conn


# Release database resources
def closeConn(cursor, conn):
    # Close the cursor
    if cursor:
        cursor.close()
    # Close the database connection
    if conn:
        conn.close()


class DouanMovieSpiderPipeline(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        # Drop movies whose title has already been seen
        if item['title'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(item['title'])
        if isinstance(item, DouanMovieItem):
            self.insert(item)
        # Always return the item so later pipelines and exporters still receive it
        return item

    def insert(self, item):
        # Initialise both names so the except/finally blocks never hit a NameError
        conn = None
        cursor = None
        try:
            # Open a database connection
            conn = getDbConn()
            # Get a cursor
            cursor = conn.cursor()
            # Insert the row; `rank` is backtick-quoted because RANK is a reserved word in MySQL 8.0
            sql = "INSERT INTO db_movie(`rank`, cover, title, score, comment_num, quote, years, region, types) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)"
            params = (item['rank'], item['cover'], item['title'], item['score'], item['comment_num'], item['quote'], item['years'], item['region'], item['types'])
            cursor.execute(sql, params)
            # Commit the transaction
            conn.commit()
        except Exception as e:
            # Roll back the transaction
            if conn:
                conn.rollback()
            print('except: %s' % e)
        finally:
            # Close the cursor and the database connection
            closeConn(cursor, conn)
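The pipeline above opens and closes a database connection for every item. Scrapy also calls open_spider and close_spider on pipeline objects, so one alternative is to hold a single connection for the whole crawl. The following is a sketch of that idea under stated assumptions, not the author's implementation: it swaps in the standard-library sqlite3 module so that it runs without a MySQL server, stores only two of the nine fields, and the class name SingleConnectionPipeline is invented for illustration. DropItem falls back to a local stand-in class when Scrapy is not installed.

```python
import sqlite3

try:
    from scrapy.exceptions import DropItem
except ImportError:  # stand-in so the sketch runs without Scrapy
    class DropItem(Exception):
        pass


class SingleConnectionPipeline(object):
    """Hypothetical pipeline: one connection per crawl, duplicates dropped."""

    def open_spider(self, spider):
        # With MySQLdb this would be MySQLdb.connect(...) with your own credentials.
        self.conn = sqlite3.connect(":memory:")
        # Reduced two-column table, only for this illustration.
        self.conn.execute("CREATE TABLE db_movie (title TEXT, score TEXT)")
        self.titles_seen = set()

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        if item["title"] in self.titles_seen:
            raise DropItem("Duplicate item found: %s" % item["title"])
        self.titles_seen.add(item["title"])
        try:
            # sqlite3 uses "?" placeholders; MySQLdb uses "%s".
            self.conn.execute(
                "INSERT INTO db_movie (title, score) VALUES (?, ?)",
                (item["title"], item["score"]),
            )
            self.conn.commit()
        except sqlite3.Error:
            self.conn.rollback()
            raise
        return item


# Drive the pipeline by hand, the way Scrapy would during a crawl.
pipeline = SingleConnectionPipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "Movie A", "score": "9.6"}, spider=None)
try:
    pipeline.process_item({"title": "Movie A", "score": "9.6"}, spider=None)
    duplicate_dropped = False
except DropItem:
    duplicate_dropped = True
stored = pipeline.conn.execute("SELECT title, score FROM db_movie").fetchall()
pipeline.close_spider(spider=None)
```

For 250 rows the per-item connection cost is small, so this is mainly a matter of habit; on larger crawls, reusing the connection avoids repeated handshakes with the database server.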

4. main.py

# encoding: utf-8
'''
@author: feizi
@file: main.py
@Software: PyCharm
@desc:
'''
from scrapy import cmdline

name = "douban_movie_top250"
# cmd = "scrapy crawl {0} -o douban.csv".format(name)
cmd = "scrapy crawl {0}".format(name)
cmdline.execute(cmd.split())

5. settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for douan_movie_spider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douan_movie_spider'

SPIDER_MODULES = ['douan_movie_spider.spiders']
NEWSPIDER_MODULE = 'douan_movie_spider.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3013.3 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'douan_movie_spider.middlewares.DouanMovieSpiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'douan_movie_spider.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douan_movie_spider.pipelines.DouanMovieSpiderPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

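One thing to note: to reduce the chance of the crawler being banned, we can set USER_AGENT.

Press F12 again, look at the Request Headers, find the User-Agent value, and copy it into the settings file. Of course, this is only a simple approach; more sophisticated strategies such as IP pools and User-Agent pools are not covered here, so look them up yourself.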

[Figure: image.png]

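IV. Running the Crawler

[Figure: image.png]

V. Saving the Results

[Figure: image.png]

VI. Simple Data Visualization

Finally, here is a look at some simple visualizations of the collected data.

6.1 Top 10 by Score

[Figure: image.png]

6.2 Title Word Cloud

[Figure: image.png]

6.3 Quote Word Cloud

[Figure: image.png]

6.4 Top 10 by Number of Comments

[Figure: image.png]

6.5 Number of Films Released per Year

[Figure: image.png]

6.6 Release Regions

[Figure: image.png]

6.7 Film Genres Summary

[Figure: image.png]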
