Python Spider Basics 4: The BeautifulSoup Module

0 Preface

BeautifulSoup is a Python library for parsing HTML and XML documents. It is commonly used for web scraping and is a cornerstone of Python crawlers. It turns a complex HTML document into a tree of Python objects, such as tags, navigable strings, and comments.

Reference:

https://geek-docs.com/python/python-tutorial/python-beautifulsoup.html

Installing the beautifulsoup4 package (which provides the BeautifulSoup library)

Standard installation

pip install beautifulsoup4

Faster download from mainland China (Tsinghua mirror)

pip install beautifulsoup4 -i https://pypi.tuna.tsinghua.edu.cn/simple

Parsers for HTML

Generally speaking, BeautifulSoup defaults to the html.parser parser, which requires no extra installation. Several other parsers are supported, but using one of them means installing the corresponding module, such as lxml.

Standard installation

pip install lxml

Faster download from mainland China (Tsinghua mirror)

pip install lxml -i https://pypi.tuna.tsinghua.edu.cn/simple
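Before moving on, here is a minimal sketch (not from the original tutorial; the HTML fragment is made up) of one practical difference between parsers: html.parser leaves a bare fragment as-is, while lxml normalizes it.

```python
from bs4 import BeautifulSoup

# A bare fragment with no <html> or <body> wrapper.
fragment = "<li>Solaris</li>"

# html.parser ships with Python and keeps the fragment as-is.
soup = BeautifulSoup(fragment, "html.parser")
print(str(soup))  # <li>Solaris</li>

# With lxml installed, BeautifulSoup(fragment, "lxml") would instead
# wrap the fragment in <html><body>...</body></html>.
```

For well-formed full documents like the index.html below, all parsers give essentially the same tree, so the choice rarely matters while learning.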

The sample file

The examples in this article use the following file:

index.html

<!DOCTYPE html>
<html>
<head>
<title>Header</title>
<meta charset="utf-8">
</head>

<body>
<h2>Operating systems</h2>

<ul id="mylist" style="width:150px">
<li>Solaris</li>
<li>FreeBSD</li>
<li>Debian</li>
<li>NetBSD</li>
<li>Windows</li>
</ul>

<p>
FreeBSD is an advanced computer operating system used to
power modern servers, desktops, and embedded platforms.
</p>

<p>
Debian is a Unix-like computer operating system that is
composed entirely of free software.
</p>

</body>
</html>

1 A simple BeautifulSoup example

This simple example is mainly meant to walk you through the basic workflow.


from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()  # open index.html and read its contents with the read() method

soup = BeautifulSoup(contents, 'lxml')  # create the BeautifulSoup object; the HTML data is passed to the constructor, and the second argument selects the parser

print(soup.h2)
print(soup.head)
print(soup.li)
<h2>Operating systems</h2>
<head>
<title>Header</title>
<meta charset="utf-8"/>
</head>
<li>Solaris</li>

Here we used a local HTML file. While learning, we will mostly stick with local files, but in the later hands-on projects we will usually scrape pages from the web, for which you will also need the requests library. That is a topic for later; for now let's focus on the basics of BeautifulSoup. To help you see the connection, here is a quick look at what scraping a web page looks like.

from bs4 import BeautifulSoup
import requests as req

resp = req.get("http://www.something.com")

soup = BeautifulSoup(resp.text, 'lxml')

print(soup.h2)
print(soup.head)
print(soup.li)

2 BeautifulSoup elements

2.1 BeautifulSoup tags, names, and text

A tag's name attribute gives its name, and its text attribute gives its text content.

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

print("Html: {}\nName: {}\nText: {}".format(soup.title, soup.title.name, soup.title.text))
Html: <title>Header</title>
Name: title
Text: Header

2.2 Traversing tags

The recursiveChildGenerator() method lets us walk the whole HTML document. (In recent bs4 releases this name is a legacy alias; the descendants property, covered in section 3.2, does the same job.)


from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

for n in soup.recursiveChildGenerator():
    if n.name:
        print(n.name)
html
head
title
meta
body
h2
ul
li
li
li
li
li
p
p
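recursiveChildGenerator() is the older spelling; newer bs4 code usually reaches for the descendants property, which yields the same nodes. A self-contained sketch (the inline HTML here is made up for brevity):

```python
from bs4 import BeautifulSoup

html = ("<html><head><title>T</title></head>"
        "<body><h2>H</h2><p>text</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

# descendants yields text nodes too, so we filter on .name
# exactly as the loop above does.
names = [n.name for n in soup.descendants if n.name]
print(names)  # ['html', 'head', 'title', 'body', 'h2', 'p']
```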

3 Element relationships

3.1 Children

The children attribute gives a tag's direct children.

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

root = soup.html
root1 = soup.body

root_childs = [e.name for e in root.children if e.name is not None]  # collect the names of the html tag's children into a list
root_childs1 = [e.name for e in root1.children if e.name is not None]  # collect the names of the body tag's children into a list

print(root_childs)
print(root_childs1)
['head', 'body']
['h2', 'ul', 'p', 'p']

The children of the html tag are head and body, while body's children are h2, ul, p, p. That is the parent-child relationship in a nutshell: "my child's child is not my child."

The overall hierarchy looks like this:

html
├── head
│   ├── title
│   └── meta
└── body
    ├── h2
    ├── ul (five li)
    └── p, p

3.2 Descendants

The descendants attribute gives all of a tag's descendants.

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

root = soup.html
root1 = soup.body

root_childs = [e.name for e in root.descendants if e.name is not None]  # collect the names of the html tag's descendants into a list
root_childs1 = [e.name for e in root1.descendants if e.name is not None]  # collect the names of the body tag's descendants into a list

print(root_childs)
print(root_childs1)
['head', 'title', 'meta', 'body', 'h2', 'ul', 'li', 'li', 'li', 'li', 'li', 'p', 'p']
['h2', 'ul', 'li', 'li', 'li', 'li', 'li', 'p', 'p']
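Closely related to children is the contents attribute: contents is a real list you can index, while children is an iterator over the same nodes. A quick sketch with a made-up fragment:

```python
from bs4 import BeautifulSoup

html = "<ul><li>a</li><li>b</li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

# contents supports indexing; children must be iterated.
print(ul.contents[0])                 # <li>a</li>
print([c.name for c in ul.children])  # ['li', 'li']
```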

4 Output

4.1 Pretty-printed output

The prettify() method returns a formatted Unicode rendering of the HTML/XML, with each tag on its own line.

(Honestly, I am not a fan of this output style; it does not convey the document's nesting very well.)

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

print(soup.prettify())
<!DOCTYPE html>
<html>
<head>
<title>
Header
</title>
<meta charset="utf-8"/>
</head>
<body>
<h2>
Operating systems
</h2>
<ul id="mylist" style="width:150px">
<li>
Solaris
</li>
<li>
FreeBSD
</li>
<li>
Debian
</li>
<li>
NetBSD
</li>
<li>
Windows
</li>
</ul>
<p>
FreeBSD is an advanced computer operating system used to
power modern servers, desktops, and embedded platforms.
</p>
<p>
Debian is a Unix-like computer operating system that is
composed entirely of free software.
</p>
</body>
</html>

4.2 The get_text() method

If you only want the text contained in a tag, call get_text(). It gathers all the text content in the tag, including the text of every descendant.

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

text_in_body = soup.body.get_text()

print(text_in_body)
Operating systems
Solaris
FreeBSD
Debian
NetBSD
Windows
FreeBSD is an advanced computer operating system used to
power modern servers, desktops, and embedded platforms.


Debian is a Unix-like computer operating system that is
composed entirely of free software.
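get_text() also accepts a separator and a strip flag, which often give tidier output than the raw concatenation above. A small sketch with a made-up fragment:

```python
from bs4 import BeautifulSoup

html = "<ul> <li>Solaris</li> <li>FreeBSD</li> </ul>"
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text piece and drops the empty ones;
# the separator is inserted between the remaining pieces.
print(soup.get_text(", ", strip=True))  # Solaris, FreeBSD
```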

5 Searching

5.1 Searching by id

The find() method can look up elements in various ways, including by element id.

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

n = soup.find("ul", id='mylist')
# n = soup.find("ul", attrs={'id': 'mylist'})  # another way to do the same thing
print(n)
<ul id="mylist" style="width:150px">
<li>Solaris</li>
<li>FreeBSD</li>
<li>Debian</li>
<li>NetBSD</li>
<li>Windows</li>
</ul>
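find() can filter on other attributes the same way. One wrinkle worth knowing: class is a reserved word in Python, so bs4 spells the keyword argument class_. A sketch with a made-up fragment:

```python
from bs4 import BeautifulSoup

html = '<p class="note">hi</p><p>plain</p>'
soup = BeautifulSoup(html, "html.parser")

# class_ avoids the clash with Python's class keyword;
# attrs={'class': 'note'} would work as well.
tag = soup.find("p", class_="note")
print(tag.text)  # hi
```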

5.2 Finding all matching tags

The find_all() method finds every element that meets the given criteria.

5.2.1 Searching by a single name

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

for n in soup.find_all("li"):
    print("{0}: {1}".format(n.name, n.text))
li: Solaris
li: FreeBSD
li: Debian
li: NetBSD
li: Windows

5.2.2 Searching by multiple names

Pass find_all() a list of the element names to search for.

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

tags = soup.find_all(['h2', 'p'])

for n in tags:
    print(" ".join(n.text.split()))
    # Think about why " ".join(n.text.split()) is used here.
    # Try n.text first, then n.text.split(), then " ".join(n.text.split()),
    # and see what each one prints.
Operating systems
FreeBSD is an advanced computer operating system used to power modern servers, desktops, and embedded platforms.
Debian is a Unix-like computer operating system that is composed entirely of free software.

5.2.3 Using a function

find_all() can also take a function that decides which elements to return.

I was a little confused when I first learned this. If you are too, the following pseudocode of find_all()'s logic for this case may help.

def find_all(self, func):
    results = []                     # store the matching tags
    for tag in self.all_tags:        # iterate over every tag
        if callable(func):           # check that func is callable
            if func(tag):            # call func with the current tag
                results.append(tag)  # if it returns True, keep the tag
    return results                   # return the matching tags

Example:

from bs4 import BeautifulSoup

def myfunc(x):
    return x.is_empty_element

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'html.parser')

tags = soup.find_all(myfunc)
for tag in tags:
    print(tag)
<meta charset="utf-8"/>

5.2.4 Using a regular expression

Elements can also be found with regular expressions.

from bs4 import BeautifulSoup
import re

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'html.parser')

patt1 = re.compile('BSD')

strings = soup.find_all(string=patt1)

for string in strings:
    print(" ".join(string.split()))
FreeBSD
NetBSD
FreeBSD is an advanced computer operating system used to power modern servers, desktops, and embedded platforms.
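A compiled pattern can match tag names as well as strings, which is handy for tag families like h1 through h6. A sketch with a made-up fragment:

```python
from bs4 import BeautifulSoup
import re

html = "<h1>A</h1><h2>B</h2><p>C</p>"
soup = BeautifulSoup(html, "html.parser")

# When find_all() receives a pattern, it matches it against tag names.
headers = [t.name for t in soup.find_all(re.compile(r"^h\d$"))]
print(headers)  # ['h1', 'h2']
```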

6 BeautifulSoup CSS selectors

The select() and select_one() methods let us find elements using CSS selectors.

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

print(soup.select("li:nth-of-type(3)"))
[<li>Debian</li>]
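The difference between the two methods, sketched on a made-up fragment: select() always returns a list, while select_one() returns the first match or None.

```python
from bs4 import BeautifulSoup

html = '<ul id="mylist"><li>a</li><li>b</li></ul><p>c</p>'
soup = BeautifulSoup(html, "html.parser")

print([t.text for t in soup.select("#mylist li")])  # ['a', 'b']
print(soup.select_one("p").text)                    # c
print(soup.select_one("div"))                       # None
```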

7 Adding, inserting, replacing, and removing elements

7.1 Appending elements

The append() method adds a new tag to the HTML document.

It is added at the end of the specified element.

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

newtag = soup.new_tag('li')  # first, create a new tag with the new_tag() method
newtag.string = 'OpenBSD'

ultag = soup.ul  # get a reference to the ul tag

ultag.append(newtag)  # append the newly created tag to the ul tag

print(ultag.prettify())  # print the ul tag in a tidy format
<ul id="mylist" style="width:150px">
<li>
Solaris
</li>
<li>
FreeBSD
</li>
<li>
Debian
</li>
<li>
NetBSD
</li>
<li>
Windows
</li>
<li>
OpenBSD
</li>
</ul>

7.2 Inserting elements

The insert() method inserts an element at a given position.

In the tutorial I followed, position 0 inserts at the head of the target tag, position 1 after the first element, position 2 after the second, and so on. But in my own tests (Python 3.11), positions 0 and 1 both insert at the head, positions 2 and 3 after the first element, positions 4 and 5 after the second, and so on. Check the behavior in your own environment.

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

newtag = soup.new_tag('li')  # first, create a new tag with the new_tag() method
newtag.string = 'OpenBSD'

ultag = soup.ul  # get a reference to the ul tag

ultag.insert(7, newtag)  # insert the new tag at position 7 within the ul tag

print(ultag.prettify())  # print the ul tag in a tidy format
<ul id="mylist" style="width:150px">
<li>
Solaris
</li>
<li>
FreeBSD
</li>
<li>
Debian
</li>
<li>
OpenBSD
</li>
<li>
NetBSD
</li>
<li>
Windows
</li>
</ul>
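The index quirk described above has a likely concrete cause: in pretty-printed HTML, the newlines between tags are NavigableString children of the ul tag, and insert() counts them too. A sketch with a made-up fragment:

```python
from bs4 import BeautifulSoup

# The newlines between tags are text-node children of <ul>.
html = "<ul>\n<li>a</li>\n<li>b</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

print(len(ul.contents))  # 5: '\n', <li>, '\n', <li>, '\n'

newtag = soup.new_tag("li")
newtag.string = "c"
ul.insert(2, newtag)  # index 2 lands right after the first <li>

print([li.text for li in ul.find_all("li")])  # ['a', 'c', 'b']
```

With the whitespace nodes interleaved, two consecutive indices often map to the same visual position, matching the even/odd pairing observed above.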

7.3 Replacing text

The replace_with() method replaces a node's content.

When several nodes match, only the first one found is replaced.

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

tag = soup.find(string='Solaris')  # string= supersedes the older text= argument; this returns the text node itself

tag.replace_with('OpenBSD')

print(soup.prettify())
<!DOCTYPE html>
<html>
<head>
<title>
Header
</title>
<meta charset="utf-8"/>
</head>
<body>
<h2>
Operating systems
</h2>
<ul id="mylist" style="width:150px">
<li>
OpenBSD
</li>
<li>
FreeBSD
</li>
<li>
Debian
</li>
<li>
NetBSD
</li>
<li>
Windows
</li>
</ul>
<p>
FreeBSD is an advanced computer operating system used to
power modern servers, desktops, and embedded platforms.
</p>
<p>
Debian is a Unix-like computer operating system that is
composed entirely of free software.
</p>
</body>
</html>
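To replace every occurrence rather than just the first, loop over find_all(string=...), which returns all matching text nodes. A sketch with a made-up fragment:

```python
from bs4 import BeautifulSoup

html = "<ul><li>BSD</li><li>BSD</li><li>Linux</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list, so replacing while looping is safe.
for s in soup.find_all(string="BSD"):
    s.replace_with("OpenBSD")

print([li.text for li in soup.find_all("li")])  # ['OpenBSD', 'OpenBSD', 'Linux']
```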

7.4 Removing elements

The decompose() method removes an element and destroys it.

Note: select the element with select_one() or find("li", string="..."). Using find(string="...") on its own is trickier, because it returns a string object rather than a tag object, and str has no decompose() method.

from bs4 import BeautifulSoup

with open("index.html", "r") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

# tag = soup.select_one("li:nth-of-type(2)")
tag = soup.find("li", string="Solaris")  # returns the <li> tag containing "Solaris"

tag.decompose()

print(soup.body.prettify())
<body>
<h2>
Operating systems
</h2>
<ul id="mylist" style="width:150px">
<li>
FreeBSD
</li>
<li>
Debian
</li>
<li>
NetBSD
</li>
<li>
Windows
</li>
</ul>
<p>
FreeBSD is an advanced computer operating system used to
power modern servers, desktops, and embedded platforms.
</p>
<p>
Debian is a Unix-like computer operating system that is
composed entirely of free software.
</p>
</body>
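A close relative of decompose() is extract(): it also removes the element, but returns it instead of destroying it, so the tag can be reinserted elsewhere. A sketch with a made-up fragment:

```python
from bs4 import BeautifulSoup

html = "<ul><li>a</li><li>b</li></ul>"
soup = BeautifulSoup(html, "html.parser")

saved = soup.li.extract()  # removed from the tree but still usable
print(saved)                                    # <li>a</li>
print([li.text for li in soup.find_all("li")])  # ['b']
```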