Web Scraper Series (1): Scraping Advisor Contact Details from the BUPT Institute of Network Technology
Contents
- Idea and Purpose
- Environment
- Code
- Screenshot Walkthrough
- Results
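Idea and Purpose
Hello to everyone preparing for the postgraduate entrance exam. Because of this year's pandemic, the second-round interviews for the 2020 exam were postponed to May, and the cut-off score for Computer Science and Technology at BUPT's Institute of Network Technology rose by 12 points to 312. Candidates who have to seek a transfer to another programme therefore face a problem: contacting advisors in advance. The institute lists its advisors under the research centres they belong to, but at application time you have no idea which centre a given advisor is in. That led to an idea: wouldn't it be much easier to pull the advisors' contact details straight from the institute's official website? (PS: with the contact details in hand, one could go on to automate emailing advisors, haha, no more being an emotionless email-sending machine.)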
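Environment
Windows
Python 3.7
PyCharm
Library | Purpose |
---|---|
urllib | URL parsing and joining (part of the Python standard library) |
requests | HTTP requests |
scrapy | Extracting information with XPath expressions (installing scrapy is a bit involved; other libraries can be used instead) |
xlsxwriter | Saving the results to an Excel workbook (a txt file would also work) |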
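The script relies on `urllib.parse.urljoin` to turn the links found on a page into absolute URLs. Below is a minimal sketch of that step using only the standard library. The helper name `to_absolute` is made up for illustration; `DOMAIN` and the sample path are the ones that appear in the script.

```python
from urllib.parse import urljoin

# Base URL as defined in the script.
DOMAIN = 'https://int.bupt.edu.cn'


def to_absolute(href, base=DOMAIN):
    """Resolve an href (relative or absolute) against the site root.

    Hypothetical helper, not part of the original script.
    """
    return urljoin(base, href)


# A root-relative link, a plain relative link, and an already absolute link.
print(to_absolute('/content/content.php?p=6_28_82'))
print(to_absolute('content/content.php?p=6_28_82'))
print(to_absolute('https://example.com/page'))
```

The first two calls resolve to the same address under `DOMAIN`, and an absolute link is returned unchanged, so the scraper does not need to know which form the site uses.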
Code
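#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
@Author: tzy
@File: 爬取北邮网研院导师信息.py
@Time: 2020/5/8 12:37
@Motto: Without small steps one cannot travel a thousand miles; without small streams there can be no rivers and seas. A rewarding life in programming takes persistent accumulation!
References
Python: a problem encountered when appending dicts to a list https://blog.csdn.net/Sheldomcooper/article/details/82258006
[Python by example] Working with Excel via Python's xlsxwriter module, including writing data, setting styles and inserting images https://blog.csdn.net/woshiyigerenlaide/article/details/103976391
"""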
import re
from urllib import parse

import requests
from scrapy import Selector
import xlsxwriter

# Base URL, used below to build absolute URLs
DOMAIN = 'https://int.bupt.edu.cn'


# Collect the research centres advisors belong to; results end up in teacherList
def getTeacherCentre(url, teacherList):
    # The fields of the advisor currently being processed are collected in teacherDict
    teacherDict = dict()
    # Fetch the page
    res_text = requests.get(url).text
    # Parse it with XPath expressions
    sel = Selector(text=res_text)
    divs = sel.xpath('//div[@class="padl20 ovhi clear"]/div')
    for div in divs:
        # Link to the centre's page and the centre's name
        centre_url = div.xpath('./a[1]/@href').extract()[0]
        centre_url = parse.urljoin(DOMAIN, centre_url)
        centre_name = div.xpath('./a[1]/text()').extract()[0]
        teacherDict['centre_url'] = centre_url
        teacherDict['centre_name'] = centre_name
        print('centre_url:{}'.format(centre_url))
        print('centre_name:{}'.format(centre_name))
        getTeacherName(centre_url, teacherList, teacherDict)
        # break  # uncomment to stop after the first centre while testing


# Collect the advisors' names; results end up in teacherList
def getTeacherName(url, teacherList, teacherDict):
    res_text = requests.get(url).text
    sel = Selector(text=res_text)
    divs = sel.xpath('//div[@class="content padtb10"]/div')
    for div in divs:
        # The values may be missing, so start from placeholders ('空' means "empty")
        teacher_url = '空'
        teacher_name = '空'
        try:
            teacher_url = div.xpath('./a[2]/@href').extract()[0]
            # Build an absolute URL
            teacher_url = parse.urljoin(DOMAIN, teacher_url)
            teacher_name = div.xpath('./a[2]/text()').extract()[0]
            # Drop the job title that follows the name
            teacher_name = re.search(r'(\S.+?)\s', teacher_name).group(1)
            # Remove whitespace inside the name
            teacher_name = re.sub(r'\s', '', teacher_name)
        except Exception as e:
            # Some advisor entries are marked up differently from the rest. Part of
            # that variation is handled here; it could also be skipped entirely and
            # filled in by hand afterwards.
            print('=' * 20 + 'error' + '=' * 20)
            print(e)
            hrefs = div.xpath('./a/@href').extract()
            if not hrefs:
                # No link at all in this div, so there is no page to follow
                continue
            teacher_url = parse.urljoin(DOMAIN, hrefs[0])
            texts = div.xpath('./a/text()').extract()
            # The text list can be empty, and the pattern may not match
            match = re.search(r'(\S.+?)\s', texts[0]) if texts else None
            if match is not None:
                teacher_name = re.sub(r'\s', '', match.group(1))
            else:
                # Still a different layout: give up ('无' means "none") and fill in by hand
                teacher_name = '无'
        print('teacher_url:{}'.format(teacher_url))
        print('teacher_name:{}'.format(teacher_name))
        teacherDict['name'] = teacher_name
        teacherDict['url'] = teacher_url
        getTeacherInfo(teacher_url, teacherList, teacherDict)
        # break  # uncomment to stop after the first advisor while testing


# Collect the advisors' email addresses; results end up in teacherList
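def getTeacherInfo(url, teacherList, teacherDict):
    res_text = requests.get(url).text
    sel = Selector(text=res_text)
    tbody = sel.xpath('//tbody/tr')
    email = 'null'
    # Only the email address is extracted for now; other fields can come later
    for tr in tbody:
        email_name = tr.xpath('./td[1]//text()').extract()
        # Join into one string before testing: the label may read "Email:", "email" or "电子邮件"
        email_name = ''.join(email_name)
        if 'mail' in email_name or '邮件' in email_name:
            email = tr.xpath('./td[2]/p//text()').extract()
            email = ''.join(email).strip()
    print('email: {}'.format(email))
    teacherDict['email'] = email
    # All required fields are now present, so store the record.
    # Append a copy: teacherDict is reused for every advisor, and appending the
    # dict itself would leave every list entry pointing at the same object.
    # A shallow copy is enough because all values are strings.
    teacherList.append(teacherDict.copy())


# Save the information to an Excel workbook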
# All records are collected first and saved in one go, since the data set is small
def saveTeacherInfo(teacherList):
    workbook = xlsxwriter.Workbook('./teacher1.xlsx')
    worksheet = workbook.add_worksheet('teacher')
    heading = ['centre_name', 'centre_url', 'name', 'url', 'email']
    worksheet.write_row('A1', heading)
    # Row 0 holds the heading, so data rows start at 1
    for row, teacher in enumerate(teacherList, start=1):
        worksheet.write(row, 0, teacher['centre_name'])
        worksheet.write(row, 1, teacher['centre_url'])
        worksheet.write(row, 2, teacher['name'])
        worksheet.write(row, 3, teacher['url'])
        worksheet.write(row, 4, teacher['email'])
    workbook.close()
    print('Data saved')
    # I did not bother prettifying the sheet (e.g. auto-fitting column widths);
    # the data is there, which is what matters.


def main():
    teacherList = []
    # teacherDict = {}
    url = 'https://int.bupt.edu.cn/list/list.php?p=6_28_1'
    getTeacherCentre(url, teacherList)
    # The commented-out lines below were used to test the functions individually
    # url = 'https://int.bupt.edu.cn/content/content.php?p=6_28_82'
    # getTeacherName(url, teacherList, teacherDict)
    # url = 'https://int.bupt.edu.cn/content/content.php?p=6_16_114'
    # getTeacherInfo(url, teacherList, teacherDict)
    print('teacherList:{}'.format(teacherList))
    saveTeacherInfo(teacherList)


if __name__ == '__main__':
    main()
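The script appends `teacherDict.copy()` rather than `teacherDict` itself, which is the pitfall covered by the first reference in the docstring. Here is a small standalone sketch of why the copy matters. The function names and the sample names `'A'`, `'B'`, `'C'` are made up for illustration.

```python
def collect_without_copy(names):
    """Reuse one dict and append it directly (the buggy variant)."""
    record = {}
    rows = []
    for name in names:
        record['name'] = name
        rows.append(record)  # every entry refers to the same dict object
    return rows


def collect_with_copy(names):
    """Reuse one dict but append a snapshot of it each time."""
    record = {}
    rows = []
    for name in names:
        record['name'] = name
        rows.append(record.copy())  # shallow copy: fine for string values
    return rows


names = ['A', 'B', 'C']
print(collect_without_copy(names))  # all three entries show the last name
print(collect_with_copy(names))     # one entry per name
```

`dict.copy()` is a shallow copy. That is sufficient here because every value in the record is an immutable string; nested lists or dicts would need `copy.deepcopy` instead.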
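Screenshot Walkthrough
This is the crawler's entry URL, where the list of centres that advisors belong to is fetched.
Each advisor's name is taken from this page.
Each advisor's contact details are taken from this page.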
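The two text-matching steps in the script can be tried out without touching the network. The sketch below isolates the name clean-up regex and the email-label test. The helper names `clean_name` and `is_email_label` are hypothetical, and the sample strings are invented placeholders rather than text copied from the site.

```python
import re


def clean_name(link_text):
    """Return the name in front of the job title, with inner spaces removed.

    Uses the same pattern as the script: a non-space character, then as few
    characters as possible, up to the next whitespace. Returns None when the
    text contains no whitespace after the name.
    """
    match = re.search(r'(\S.+?)\s', link_text)
    if match is None:
        return None
    return re.sub(r'\s', '', match.group(1))


def is_email_label(label):
    """Mirror the script's check for an email row label (case-sensitive)."""
    return 'mail' in label or '邮件' in label


# Placeholder inputs, assumed to resemble "name + title" link text.
print(clean_name('张三 教授 '))
print(clean_name('张 三 教授'))
print(clean_name('张三'))

for label in ['Email:', 'email', '电子邮件', '电话']:
    print(label, is_email_label(label))
```

Because the pattern needs at least two characters before the whitespace, a two-character name written with a space in the middle is still captured whole and then collapsed by the `re.sub` call.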
Results
Console output in PyCharm
Output as an Excel workbook
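The environment table notes that the results could be saved as plain text instead of Excel. One possible way to do that, sketched with the standard library's `csv` module, is shown below. This is an alternative to `saveTeacherInfo`, not what the script does; `write_csv` is a hypothetical name and the sample record is invented.

```python
import csv
import io

# Same column order as the heading row written by saveTeacherInfo.
FIELDS = ['centre_name', 'centre_url', 'name', 'url', 'email']


def write_csv(teacher_list, stream):
    """Write the collected records to any text stream as CSV."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(teacher_list)


# Invented sample record, for illustration only.
sample = [{
    'centre_name': 'Example Centre',
    'centre_url': 'https://example.com/centre',
    'name': '张三',
    'url': 'https://example.com/teacher',
    'email': 'zhangsan@example.com',
}]

buffer = io.StringIO()
write_csv(sample, buffer)
print(buffer.getvalue())
```

To write a real file, pass something like `open('teacher1.csv', 'w', newline='', encoding='utf-8-sig')` as the stream. `newline=''` is what the `csv` module expects, and the `utf-8-sig` encoding helps Excel display the Chinese names correctly.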
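PS:
1. This is my first blog post, so please go easy on it; suggestions are welcome and we can improve together.
2. The posts I consulted are linked in the code, so they are not repeated here.
3. The inserted screenshots are masked. All of this information is publicly visible on the official site, but I was a little worried, so I masked it anyway (would leaving it unmasked cause any problems?).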