Python Spider Basics 4: The BeautifulSoup Module

0 Preface

BeautifulSoup is a Python library for parsing HTML and XML documents. It is commonly used for web scraping and is a cornerstone of Python crawlers. It turns a complex HTML document into a tree of Python objects, such as tags, navigable strings, and comments.

Reference:

https://geek-docs.com/python/python-tutorial/python-beautifulsoup.html

Installing the beautifulsoup4 package (which provides the BeautifulSoup library)

Standard installation

pip install beautifulsoup4

Faster download from mainland China (Tsinghua mirror)

pip install beautifulsoup4 -i https://pypi.tuna.tsinghua.edu.cn/simple

Parsers for HTML

Generally speaking, BeautifulSoup defaults to the html.parser parser, which requires no extra installation. Several other parsers are supported, but using one of them means installing the corresponding module, such as lxml.

Standard installation

pip install lxml

Faster download from mainland China (Tsinghua mirror)

pip install lxml -i https://pypi.tuna.tsinghua.edu.cn/simple
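Before moving on, here is a minimal sketch (not from the original tutorial; the HTML fragment is made up) of one practical difference between parsers: html.parser leaves a bare fragment as-is, while lxml normalizes it.

```python
from bs4 import BeautifulSoup

# A bare fragment with no <html> or <body> wrapper.
fragment = "<li>Solaris</li>"

# html.parser ships with Python and keeps the fragment as-is.
soup = BeautifulSoup(fragment, "html.parser")
print(str(soup))  # <li>Solaris</li>

# With lxml installed, BeautifulSoup(fragment, "lxml") would instead
# wrap the fragment in <html><body>...</body></html>.
```

For well-formed full documents like the index.html below, all parsers give essentially the same tree, so the choice rarely matters while learning.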

The sample file

The examples in this article use the following file:

index.html

<!DOCTYPE html>
<html>
<head>
<title>Header</title>
<meta charset="utf-8">
</head>

<body>
<h2>Operating systems</h2>

<ul id="mylist" style="width:150px">
<li>Solaris</li>
<li>FreeBSD</li>
<li>Debian</li>
<li>NetBSD</li>
<li>Windows</li>
</ul>

<p>
FreeBSD is an advanced computer operating system used to
power modern servers, desktops, and embedded platforms.
</p>

<p>
Debian is a Unix-like computer operating system that is
composed entirely of free software.
</p>

</body>
</html>

1 A simple BeautifulSoup example

This simple example is mainly meant to walk you through the basic workflow.


from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()  # open index.html and read its contents with the read() method

soup = BeautifulSoup(contents, 'lxml')  # create the BeautifulSoup object; the HTML data is passed to the constructor, and the second argument selects the parser

print(soup.h2)
print(soup.head)
print(soup.li)
<h2>Operating systems</h2>
<head>
<title>Header</title>
<meta charset="utf-8"/>
</head>
<li>Solaris</li>

Here we used a local HTML file. While learning, we will mostly stick with local files, but in the later hands-on projects we will usually scrape pages from the web, for which you will also need the requests library. That is a topic for later; for now let's focus on the basics of BeautifulSoup. To help you see the connection, here is a quick look at what scraping a web page looks like.

from bs4 import BeautifulSoup
import requests as req

resp = req.get("http://www.something.com")

soup = BeautifulSoup(resp.text, 'lxml')

print(soup.h2)
print(soup.head)
print(soup.li)

2 BeautifulSoup elements

2.1 BeautifulSoup tags, names, and text

A tag's name attribute gives its name, and its text attribute gives its text content.

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

print("Html: {}\nName: {}\nText: {}".format(soup.title, soup.title.name, soup.title.text))
Html: <title>Header</title>
Name: title
Text: Header

2.2 Traversing tags

The recursiveChildGenerator() method lets us walk the whole HTML document. (In recent bs4 releases this name is a legacy alias; the descendants property, covered in section 3.2, does the same job.)


from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

for n in soup.recursiveChildGenerator():
    if n.name:
        print(n.name)
html
head
title
meta
body
h2
ul
li
li
li
li
li
p
p
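recursiveChildGenerator() is the older spelling; newer bs4 code usually reaches for the descendants property, which yields the same nodes. A self-contained sketch (the inline HTML here is made up for brevity):

```python
from bs4 import BeautifulSoup

html = ("<html><head><title>T</title></head>"
        "<body><h2>H</h2><p>text</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

# descendants yields text nodes too, so we filter on .name
# exactly as the loop above does.
names = [n.name for n in soup.descendants if n.name]
print(names)  # ['html', 'head', 'title', 'body', 'h2', 'p']
```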

3 Element relationships

3.1 Children

The children attribute gives a tag's direct children.

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

root = soup.html
root1 = soup.body

root_childs = [e.name for e in root.children if e.name is not None]  # collect the names of the html tag's children into a list
root_childs1 = [e.name for e in root1.children if e.name is not None]  # collect the names of the body tag's children into a list

print(root_childs)
print(root_childs1)
['head', 'body']
['h2', 'ul', 'p', 'p']

The children of the html tag are head and body, while body's children are h2, ul, p, p. That is the parent-child relationship in a nutshell: "my child's child is not my child."

The overall hierarchy looks like this:

html
├── head
│   ├── title
│   └── meta
└── body
    ├── h2
    ├── ul (five li)
    └── p, p

3.2 Descendants

The descendants attribute gives all of a tag's descendants.

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

root = soup.html
root1 = soup.body

root_childs = [e.name for e in root.descendants if e.name is not None]  # collect the names of the html tag's descendants into a list
root_childs1 = [e.name for e in root1.descendants if e.name is not None]  # collect the names of the body tag's descendants into a list

print(root_childs)
print(root_childs1)
['head', 'title', 'meta', 'body', 'h2', 'ul', 'li', 'li', 'li', 'li', 'li', 'p', 'p']
['h2', 'ul', 'li', 'li', 'li', 'li', 'li', 'p', 'p']
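Closely related to children is the contents attribute: contents is a real list you can index, while children is an iterator over the same nodes. A quick sketch with a made-up fragment:

```python
from bs4 import BeautifulSoup

html = "<ul><li>a</li><li>b</li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

# contents supports indexing; children must be iterated.
print(ul.contents[0])                 # <li>a</li>
print([c.name for c in ul.children])  # ['li', 'li']
```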

4 Output

4.1 Pretty-printed output

The prettify() method returns a formatted Unicode rendering of the HTML/XML, with each tag on its own line.

(Honestly, I am not a fan of this output style; it does not convey the document's nesting very well.)

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

print(soup.prettify())
<!DOCTYPE html>
<html>
<head>
<title>
Header
</title>
<meta charset="utf-8"/>
</head>
<body>
<h2>
Operating systems
</h2>
<ul id="mylist" style="width:150px">
<li>
Solaris
</li>
<li>
FreeBSD
</li>
<li>
Debian
</li>
<li>
NetBSD
</li>
<li>
Windows
</li>
</ul>
<p>
FreeBSD is an advanced computer operating system used to
power modern servers, desktops, and embedded platforms.
</p>
<p>
Debian is a Unix-like computer operating system that is
composed entirely of free software.
</p>
</body>
</html>

4.2 The get_text() method

If you only want the text contained in a tag, call get_text(). It gathers all the text content in the tag, including the text of every descendant.

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

text_in_body = soup.body.get_text()

print(text_in_body)
Operating systems
Solaris
FreeBSD
Debian
NetBSD
Windows
FreeBSD is an advanced computer operating system used to
power modern servers, desktops, and embedded platforms.


Debian is a Unix-like computer operating system that is
composed entirely of free software.
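get_text() also accepts a separator and a strip flag, which often give tidier output than the raw concatenation above. A small sketch with a made-up fragment:

```python
from bs4 import BeautifulSoup

html = "<ul> <li>Solaris</li> <li>FreeBSD</li> </ul>"
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text piece and drops the empty ones;
# the separator is inserted between the remaining pieces.
print(soup.get_text(", ", strip=True))  # Solaris, FreeBSD
```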

5 Searching

5.1 Searching by id

The find() method can look up elements in various ways, including by element id.

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

n = soup.find("ul", id='mylist')
# n = soup.find("ul", attrs={'id': 'mylist'})  # another way to do the same thing
print(n)
<ul id="mylist" style="width:150px">
<li>Solaris</li>
<li>FreeBSD</li>
<li>Debian</li>
<li>NetBSD</li>
<li>Windows</li>
</ul>
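find() can filter on other attributes the same way. One wrinkle worth knowing: class is a reserved word in Python, so bs4 spells the keyword argument class_. A sketch with a made-up fragment:

```python
from bs4 import BeautifulSoup

html = '<p class="note">hi</p><p>plain</p>'
soup = BeautifulSoup(html, "html.parser")

# class_ avoids the clash with Python's class keyword;
# attrs={'class': 'note'} would work as well.
tag = soup.find("p", class_="note")
print(tag.text)  # hi
```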

5.2 Finding all matching tags

The find_all() method finds every element that meets the given criteria.

5.2.1 Searching by a single name

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

for n in soup.find_all("li"):
    print("{0}: {1}".format(n.name, n.text))
li: Solaris
li: FreeBSD
li: Debian
li: NetBSD
li: Windows

5.2.2 Searching by multiple names

Pass find_all() a list of the element names to search for.

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

tags = soup.find_all(['h2', 'p'])

for n in tags:
    print(" ".join(n.text.split()))
    # Think about why " ".join(n.text.split()) is used here.
    # Try n.text first, then n.text.split(), then " ".join(n.text.split()),
    # and see what each one prints.
Operating systems
FreeBSD is an advanced computer operating system used to power modern servers, desktops, and embedded platforms.
Debian is a Unix-like computer operating system that is composed entirely of free software.

5.2.3 Using a function

find_all() can also take a function that decides which elements to return.

I was a little confused when I first learned this. If you are too, the following pseudocode of find_all()'s logic for this case may help.

def find_all(self, func):
    results = []                     # store the matching tags
    for tag in self.all_tags:        # iterate over every tag
        if callable(func):           # check that func is callable
            if func(tag):            # call func with the current tag
                results.append(tag)  # if it returns True, keep the tag
    return results                   # return the matching tags

Example:

from bs4 import BeautifulSoup

def myfunc(x):
    return x.is_empty_element

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'html.parser')

tags = soup.find_all(myfunc)
for tag in tags:
    print(tag)
<meta charset="utf-8"/>

5.2.4 Using a regular expression

Elements can also be found with regular expressions.

from bs4 import BeautifulSoup
import re

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'html.parser')

patt1 = re.compile('BSD')

strings = soup.find_all(string=patt1)

for string in strings:
    print(" ".join(string.split()))
FreeBSD
NetBSD
FreeBSD is an advanced computer operating system used to power modern servers, desktops, and embedded platforms.
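A compiled pattern can match tag names as well as strings, which is handy for tag families like h1 through h6. A sketch with a made-up fragment:

```python
from bs4 import BeautifulSoup
import re

html = "<h1>A</h1><h2>B</h2><p>C</p>"
soup = BeautifulSoup(html, "html.parser")

# When find_all() receives a pattern, it matches it against tag names.
headers = [t.name for t in soup.find_all(re.compile(r"^h\d$"))]
print(headers)  # ['h1', 'h2']
```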

6 BeautifulSoup CSS selectors

The select() and select_one() methods let us find elements using CSS selectors.

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

print(soup.select("li:nth-of-type(3)"))
[<li>Debian</li>]
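The difference between the two methods, sketched on a made-up fragment: select() always returns a list, while select_one() returns the first match or None.

```python
from bs4 import BeautifulSoup

html = '<ul id="mylist"><li>a</li><li>b</li></ul><p>c</p>'
soup = BeautifulSoup(html, "html.parser")

print([t.text for t in soup.select("#mylist li")])  # ['a', 'b']
print(soup.select_one("p").text)                    # c
print(soup.select_one("div"))                       # None
```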

7 Adding, inserting, replacing, and removing elements

7.1 Appending elements

The append() method adds a new tag to the HTML document.

It is added at the end of the specified element.

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

newtag = soup.new_tag('li')  # first, create a new tag with the new_tag() method
newtag.string = 'OpenBSD'

ultag = soup.ul  # get a reference to the ul tag

ultag.append(newtag)  # append the newly created tag to the ul tag

print(ultag.prettify())  # print the ul tag in a tidy format
<ul id="mylist" style="width:150px">
<li>
Solaris
</li>
<li>
FreeBSD
</li>
<li>
Debian
</li>
<li>
NetBSD
</li>
<li>
Windows
</li>
<li>
OpenBSD
</li>
</ul>

7.2 Inserting elements

The insert() method inserts an element at a given position.

In the tutorial I followed, position 0 inserts at the head of the target tag, position 1 after the first element, position 2 after the second, and so on. But in my own tests (Python 3.11), positions 0 and 1 both insert at the head, positions 2 and 3 after the first element, positions 4 and 5 after the second, and so on. Check the behavior in your own environment.

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

newtag = soup.new_tag('li')  # first, create a new tag with the new_tag() method
newtag.string = 'OpenBSD'

ultag = soup.ul  # get a reference to the ul tag

ultag.insert(7, newtag)  # insert the new tag at position 7 within the ul tag

print(ultag.prettify())  # print the ul tag in a tidy format
<ul id="mylist" style="width:150px">
<li>
Solaris
</li>
<li>
FreeBSD
</li>
<li>
Debian
</li>
<li>
OpenBSD
</li>
<li>
NetBSD
</li>
<li>
Windows
</li>
</ul>
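The index quirk described above has a likely concrete cause: in pretty-printed HTML, the newlines between tags are NavigableString children of the ul tag, and insert() counts them too. A sketch with a made-up fragment:

```python
from bs4 import BeautifulSoup

# The newlines between tags are text-node children of <ul>.
html = "<ul>\n<li>a</li>\n<li>b</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

print(len(ul.contents))  # 5: '\n', <li>, '\n', <li>, '\n'

newtag = soup.new_tag("li")
newtag.string = "c"
ul.insert(2, newtag)  # index 2 lands right after the first <li>

print([li.text for li in ul.find_all("li")])  # ['a', 'c', 'b']
```

With the whitespace nodes interleaved, two consecutive indices often map to the same visual position, matching the even/odd pairing observed above.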

7.3 Replacing text

The replace_with() method replaces a node's content.

When several nodes match, only the first one found is replaced.

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

tag = soup.find(string='Solaris')  # string= supersedes the older text= argument; this returns the text node itself

tag.replace_with('OpenBSD')

print(soup.prettify())
<!DOCTYPE html>
<html>
<head>
<title>
Header
</title>
<meta charset="utf-8"/>
</head>
<body>
<h2>
Operating systems
</h2>
<ul id="mylist" style="width:150px">
<li>
OpenBSD
</li>
<li>
FreeBSD
</li>
<li>
Debian
</li>
<li>
NetBSD
</li>
<li>
Windows
</li>
</ul>
<p>
FreeBSD is an advanced computer operating system used to
power modern servers, desktops, and embedded platforms.
</p>
<p>
Debian is a Unix-like computer operating system that is
composed entirely of free software.
</p>
</body>
</html>
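To replace every occurrence rather than just the first, loop over find_all(string=...), which returns all matching text nodes. A sketch with a made-up fragment:

```python
from bs4 import BeautifulSoup

html = "<ul><li>BSD</li><li>BSD</li><li>Linux</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list, so replacing while looping is safe.
for s in soup.find_all(string="BSD"):
    s.replace_with("OpenBSD")

print([li.text for li in soup.find_all("li")])  # ['OpenBSD', 'OpenBSD', 'Linux']
```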

7.4 Removing elements

The decompose() method removes an element and destroys it.

Note: select the element with select_one() or find("li", string="..."). Using find(string="...") on its own is trickier, because it returns a string object rather than a tag object, and str has no decompose() method.

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

# tag = soup.select_one("li:nth-of-type(2)")
tag = soup.find("li", string="Solaris")  # returns the <li> tag containing "Solaris"

tag.decompose()

print(soup.body.prettify())
<body>
<h2>
Operating systems
</h2>
<ul id="mylist" style="width:150px">
<li>
FreeBSD
</li>
<li>
Debian
</li>
<li>
NetBSD
</li>
<li>
Windows
</li>
</ul>
<p>
FreeBSD is an advanced computer operating system used to
power modern servers, desktops, and embedded platforms.
</p>
<p>
Debian is a Unix-like computer operating system that is
composed entirely of free software.
</p>
</body>
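A close relative of decompose() is extract(): it also removes the element, but returns it instead of destroying it, so the tag can be reinserted elsewhere. A sketch with a made-up fragment:

```python
from bs4 import BeautifulSoup

html = "<ul><li>a</li><li>b</li></ul>"
soup = BeautifulSoup(html, "html.parser")

saved = soup.li.extract()  # removed from the tree but still usable
print(saved)                                    # <li>a</li>
print([li.text for li in soup.find_all("li")])  # ['b']
```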