Python Web Scraping: Beautiful Soup Introductory Study Notes and Hands-On Code


Getting Started with the Beautiful Soup Library

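Study Notes and Unit Summary
Installing Beautiful Soup
Quick Installation Test
Basic Elements of Beautiful Soup
- Importing Beautiful Soup
- The BeautifulSoup Class
- Tag
- Tag name
- Tag attrs (attributes)
- NavigableString of a Tag
- Comment of a Tag
Traversing HTML Content with bs4
- Downward traversal of the tag tree
- Upward traversal of the tag tree
- Sibling traversal of the tag tree
Formatted HTML Output with bs4
- The prettify() method
- Encoding in bs4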

Study Notes and Unit Summary

Installing Beautiful Soup

https://www.crummy.com/software/BeautifulSoup/

On Windows, open cmd via "Run as administrator" and
run: pip install beautifulsoup4
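
To confirm from Python that the install succeeded, a minimal sketch might look like the following. It uses only the standard library; `is_installed` is a hypothetical helper name chosen for illustration, not part of bs4.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the named top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# The pip package is called beautifulsoup4, but the importable module is bs4.
print("bs4 installed:", is_installed("bs4"))
```

If this prints False, the install most likely went into a different Python environment than the one running the script.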

Quick Installation Test

First, use the Requests library to fetch the source of demo.html:

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo = r.text

Next, import the BeautifulSoup library and parse the page:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,'html.parser')
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>','html.parser')
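
If the demo URL is unreachable, the same check can be run offline. The sketch below embeds a shortened copy of the demo page as a string (an assumption made for illustration; the real page has more content) and confirms that parsing works.

```python
from bs4 import BeautifulSoup

# A shortened copy of the demo page, embedded so no network access is needed.
demo = (
    "<html><head><title>This is a python demo page</title></head>"
    '<body><p class="title"><b>The demo python introduces several python courses.</b></p>'
    "</body></html>"
)

soup = BeautifulSoup(demo, "html.parser")
title_text = soup.title.string  # text inside the <title> tag
print(title_text)
```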

Basic Elements of Beautiful Soup

Beautiful Soup is a library for parsing, traversing, and maintaining a "tag tree".

Importing Beautiful Soup

from bs4 import BeautifulSoup
import bs4

The BeautifulSoup Class

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<html>data</html>","html.parser")
>>> soup2 = BeautifulSoup(open("D://demo.html"),"html.parser")
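
When parsing from a file, it is usually safer to open it in a `with` block so the handle is closed afterwards. The following is a sketch under that assumption; the temporary file stands in for a local demo.html and its contents are invented for the example.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Write a small HTML file to a temporary location (a stand-in for a local demo.html).
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", encoding="utf-8", delete=False
) as f:
    f.write("<html><head><title>demo</title></head><body><p>data</p></body></html>")
    path = f.name

# BeautifulSoup accepts an open file object as well as a string.
# The file is read while the soup is built, so it can be closed right after.
with open(path, encoding="utf-8") as fh:
    soup2 = BeautifulSoup(fh, "html.parser")

os.remove(path)
print(soup2.p.string)
```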

Tag

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.title
<title>This is a python demo page</title>
>>> tag = soup.a
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
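
Two details of `soup.<tag>` access are easy to miss: it returns only the first matching tag in document order, and it returns None when no such tag exists. A small sketch with an invented two-link snippet:

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a></p>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a   # only the first <a> in document order
missing = soup.table  # there is no <table>, so this is None

print(first_link["id"])
print(missing)
```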

Tag name

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.a.name
'a'
>>> soup.a.parent.name
'p'
>>> soup.a.parent.parent.name
'body'

Tag attrs (attributes)

A tag can have zero or more attributes; they are stored in a dictionary.

>>> tag = soup.a
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>
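
Besides `tag.attrs[...]`, a Tag can be indexed directly, and `.get()` avoids a KeyError for attributes that may be absent. A brief sketch, reusing the first link from the demo page:

```python
from bs4 import BeautifulSoup

html = '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
tag = BeautifulSoup(html, "html.parser").a

href = tag["href"]        # shorthand for tag.attrs["href"]
classes = tag["class"]    # class can hold several values, so it is a list
title = tag.get("title")  # .get() returns None instead of raising KeyError

print(href, classes, title)
```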

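NavigableString of a Tag

The .string attribute can reach across several levels of nesting: soup.p.string below returns the text held by the nested <b> tag.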

>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
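
The reach of `.string` has a limit worth knowing: when a tag has more than one child, `.string` is None. The sketch below uses a shortened version of the demo page's two paragraphs to show both cases.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python courses: <a>Basic Python</a> and <a>Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# One child (<b>), which itself has one string child: .string drills through.
nested = soup.p.string

# The second paragraph mixes text and tags, so there is no single string.
course_p = soup.a.parent
ambiguous = course_p.string

print(nested)
print(ambiguous)
```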

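Comment of a Tag

Comment is a special type: a subclass of NavigableString that holds the text of an HTML comment, with the <!-- and --> markers stripped.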

>>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>","html.parser")
>>> newsoup.b.string
'This is a comment'
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>
>>> newsoup.p.string
'This is not a comment'
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>
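
Because a comment's `.string` looks like ordinary text, code that extracts text usually needs a type check to tell the two apart. A minimal sketch; `is_comment` is a hypothetical helper name:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser"
)


def is_comment(node) -> bool:
    """Return True if the node is an HTML comment rather than visible text."""
    return isinstance(node, Comment)


b_is_comment = is_comment(newsoup.b.string)
p_is_comment = is_comment(newsoup.p.string)
print(b_is_comment, p_is_comment)
```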

Traversing HTML Content with bs4

Downward traversal of the tag tree

>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>
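
`.contents` returns a list that includes whitespace strings such as '\n'. For iteration, bs4 also provides `.children` (direct children) and `.descendants` (all nested nodes). The sketch below filters both down to tags only, using a small invented document:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<body>\n<p class="title"><b>bold</b></p>\n<p class="course">text</p>\n</body>'
soup = BeautifulSoup(html, "html.parser")

# Direct children only; the '\n' strings between tags are skipped by the type check.
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]

# Every tag at any depth below <body>, in document order.
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(len(soup.body.contents), child_tags, descendant_tags)
```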

Upward traversal of the tag tree

>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.title.parent
<head><title>This is a python demo page</title></head>
>>> soup.html.parent
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.parent
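
`soup.parent` prints nothing above because it is None: the BeautifulSoup object is the root of the tree. To walk all the way up from a tag, the `.parents` iterator can be used, as in this sketch over a small invented document:

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a>Basic Python</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .parents yields each ancestor in turn, ending with the BeautifulSoup object,
# whose name is the special string '[document]'.
ancestor_names = [parent.name for parent in soup.a.parents]

top = soup.parent  # the root has no parent

print(ancestor_names)
print(top)
```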

Sibling traversal of the tag tree

>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.a.next_sibling
' and '
>>> soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
>>> soup.a.previous_sibling.previous_sibling
>>> soup.a.parent
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
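
As the session above shows, siblings can be strings as well as tags, and stepping past the first or last sibling yields None. To visit all siblings in one pass, `.next_siblings` and `.previous_siblings` can be iterated. A sketch with an invented one-paragraph snippet:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p>Courses: <a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
soup = BeautifulSoup(html, "html.parser")


def describe(node) -> str:
    """Show tags by name and text nodes by their content."""
    return node.name if isinstance(node, Tag) else str(node)


following = [describe(s) for s in soup.a.next_siblings]
preceding = [describe(s) for s in soup.a.previous_siblings]

print(following)
print(preceding)
```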

Formatted HTML Output with bs4

The prettify() method

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> demo
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.prettify()
'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n    Basic Python\n   </a>\n   and\n   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

>>> print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>
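
prettify() returns an ordinary str, so its result can be split, saved, or compared like any other text. The sketch below contrasts it with `str(soup)`, which keeps the markup on one line; the tiny input document is invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>hi</b></p>", "html.parser")

compact = str(soup)       # the markup as a single line
pretty = soup.prettify()  # one node per line, indented one space per level

print(repr(compact))
for line in pretty.splitlines():
    print(line)
```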

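Encoding in bs4

bs4 converts any HTML input to Unicode internally and encodes output as UTF-8 by default, so non-ASCII text such as Chinese is handled without extra steps.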

>>> soup = BeautifulSoup("<p>中文</p>","html.parser")
>>> soup.p.string
'中文'
>>> print(soup.p.prettify())
<p>
 中文
</p>
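
The conversion is easier to see when the input is bytes in a non-UTF-8 encoding. The following sketch assumes a page encoded as GBK (an assumption chosen for illustration) and passes `from_encoding` so bs4 does not have to guess:

```python
from bs4 import BeautifulSoup

# Bytes in GBK, as a page served with a legacy Chinese encoding might be.
raw = "<p>中文</p>".encode("gbk")

# from_encoding tells bs4 how to decode the bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string                # a Unicode str, whatever the input encoding was
utf8_bytes = soup.p.encode("utf-8")  # serialize the tag back to UTF-8 bytes

print(text)
print(utf8_bytes)
```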
