---
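- oeasy Python 0532
- This is the oeasy systematic Python tutorial. It starts from the basics and goes step by step: solid, complete, with nothing skipped. If you are willing to put in the time, you can truly learn it.
This tutorial is also published at:

     Personal site: `https://oeasy.org`
     Lanqiao Cloud Course: `https://www.lanqiao.cn/courses/3584`
     GitHub: `https://github.com/overmind1980/oeasy-python-tutorial`
     Gitee: `https://gitee.com/overmind1980/oeasypython`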
---

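Syntax: generating an etree from HTML

Recap

Two lessons ago we
- converted a string into an etree node
Last time we
- sent a request with requests
- got a response back
and used the response as the source
- to generate an etree node

But what if the page's encoding
- is not utf-8
- but gb2312? 🤔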

Prepare the environment again

Start nginx

sudo service nginx start
sudo service nginx status
firefox http://localhost &

Confirm in the browser
- that a web service is running on localhost

Prepare the page

The page

Edit the gb2312 document

vi gb.html

Write the page

<html>
  <head>
    <title>gb2312格式</title>
  </head>
  <body>
	我的格式并不是utf-8，而是gb2312。
  </body>
</html>

Set the encoding

Check the encoding

:set fileencoding?

No encoding has been set yet

so the file uses the default, utf-8

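Save

:w

After saving, the small plus sign disappears

Get ready to change the encoding

Set the encoding

Note that there are no spaces around the equals sign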

:set fileencoding=gb2312

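After setting the encoding
- the small plus sign appears again
- because the bytes that will be written to disk have changed

Save and quit

:wq

Back in the shell
- ready to view the page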
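
If vi is not at hand, a similar file can be produced from Python. This is only a sketch of an alternative, not the method used in this lesson: it writes the page into a temporary directory (the names `out_dir` and `gb_path` are illustrative) by encoding the text as gb2312, then reads the raw bytes back to check them.

```python
import tempfile
from pathlib import Path

# Same page as above; the Chinese text is what makes the encoding matter.
PAGE = """<html>
  <head>
    <title>gb2312格式</title>
  </head>
  <body>
    我的格式并不是utf-8，而是gb2312。
  </body>
</html>
"""

# Hypothetical output location: a throwaway directory instead of the cwd.
out_dir = Path(tempfile.mkdtemp())
gb_path = out_dir / "gb.html"

# Encoding to gb2312 plays the role of vi's `:set fileencoding=gb2312`.
gb_path.write_bytes(PAGE.encode("gb2312"))

raw = gb_path.read_bytes()
print(len(raw), "bytes on disk")
print(raw.decode("gb2312") == PAGE)  # the bytes round-trip as gb2312
```

The resulting file would still need to be copied into the web server root, as in the next step.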

View the file

Copy gb.html to
- the web server's root directory

sudo cp gb.html /usr/share/nginx/html 
sudo service nginx start 
firefox http://localhost/gb.html &

The browser displays the page correctly

How can we tell what encoding this file uses?

Try crawling it

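import requests
from lxml import etree

# Fetch the page; response.content is the raw, undecoded byte stream
response = requests.get("http://localhost/gb.html")
b_html = response.content
# Build the etree straight from the bytes, with no encoding hint yet
et_html = etree.HTML(b_html)
print(b_html)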

Each Chinese character is encoded as two bytes

- 2 bytes per character corresponds to gb2312
- 3 bytes per character corresponds to utf-8
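
The byte counts can be checked without a web server. A minimal sketch, using the two Chinese characters from the page title as sample text:

```python
text = "格式"  # two characters taken from the page title

gb_bytes = text.encode("gb2312")
utf8_bytes = text.encode("utf-8")

print(gb_bytes, len(gb_bytes))      # 2 bytes per character
print(utf8_bytes, len(utf8_bytes))  # 3 bytes per character
```

The same text yields different byte sequences under the two encodings, which is why the decoder has to know which one was used.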

Generate directly

Try decoding directly

print(et_html[0][0].text)

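Decoding the byte stream directly
- uses the default decoding (utf-8 is assumed), not gb2312
- so the result is garbled

Cause of the garbling

Try decoding by hand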

b"\xb8\xf1".decode("gb2312")
b"\xca\xbd".decode("gb2312")

The decoding result
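
To see why the codec matters, the sketch below decodes the same gb2312 bytes three ways. The sample text and the choice of latin-1 as the "wrong but permissive" codec are assumptions made for illustration.

```python
gb_bytes = "格式".encode("gb2312")

# Right codec: the original text comes back.
print(gb_bytes.decode("gb2312"))

# Wrong codec, strict mode: utf-8 rejects these bytes outright.
try:
    gb_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    print("utf-8 failed:", err.reason)

# Wrong codec that accepts any byte: no error, just garbled text.
mojibake = gb_bytes.decode("latin-1")
print(mojibake)
```

A permissive wrong codec is the more dangerous case, because nothing fails and the garbled text flows onward silently.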

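How should the byte stream
- be decoded correctly?

Ask

Ask an AI

We need to configure the HTML parser
- and have it decode with gb2312

Set the decoding format

Set the parser when parsing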

import requests
from lxml import etree

response = requests.get("http://localhost/gb.html")
b_html = response.content
# Tell the parser to decode the bytes as gb2312
parser = etree.HTMLParser(encoding="gb2312")
et_html = etree.HTML(b_html, parser)
# et_html[1] is the body element
print(et_html[1].text)

The decoding format is now set to gb2312
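
The same technique can be tried without nginx or requests. In this sketch the bytes are built in memory as a stand-in for `response.content`; the short sample page is an assumption, and lxml must be installed.

```python
from lxml import etree

# Stand-in for response.content: a page encoded as gb2312, with no meta tag.
b_html = (
    "<html><head><title>gb2312格式</title></head>"
    "<body>正文</body></html>"
).encode("gb2312")

# Tell the parser how to decode the bytes before it builds the tree.
parser = etree.HTMLParser(encoding="gb2312")
et_html = etree.HTML(b_html, parser)

title = et_html.findtext(".//title")
print(title)
```

Looking the title up by path with `findtext` avoids depending on the position of each child node, which index access such as `et_html[0][0]` does.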

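Can we let the browser and the crawler
- know for certain
- which encoding the page actually uses?

Modify the page

vi gb.html

Add to the page file
- a metadata setting as the third line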

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
    <title>gb2312格式</title>
  </head>
  <body>
	我的格式并不是utf-8，而是gb2312。
  </body>
</html>

The added third line is a meta (metadata) tag
- it states explicitly
- that the page file is encoded as gb2312

:wq

Save and return to the shell

View

sudo cp gb.html /usr/share/nginx/html
firefox http://localhost/gb.html &

Copy and view
- check the page source
- with view source

Decode

Because the page's head
- contains the meta tag
- content="text/html; charset=gb2312"

import requests
from lxml import etree
response = requests.get("http://localhost/gb.html")
b_html = response.content
et_html = etree.HTML(b_html)
print(et_html[1].text)

When lxml parses the HTML
- there is no need to set the decoding manually
- the crawler decodes as gb2312 automatically
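
Because the meta tag is plain ASCII, it can be read before the rest of the bytes are decoded. The sketch below is a deliberately simplified illustration of that idea using a regular expression. It is not how lxml works internally, and `sniff_charset` is a hypothetical helper name.

```python
import re

# Matches charset=... in the raw bytes; the tag itself is plain ASCII.
CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_-]+)', re.IGNORECASE)

def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset named near the top of the markup, or `default`."""
    match = CHARSET_RE.search(raw[:1024])
    return match.group(1).decode("ascii").lower() if match else default

html = (
    "<html><head>"
    '<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />'
    "<title>gb2312格式</title></head><body></body></html>"
)
raw = html.encode("gb2312")

charset = sniff_charset(raw)
print(charset)
print(raw.decode(charset))
```

The sketch trusts the declaration completely, so it gives the wrong answer whenever the meta tag and the real file encoding disagree.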

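Metadata in the head
- is not page content itself
- but it matters a great deal
What if the meta tag
- declares the encoding as gb2312
- while the file is actually encoded as utf-8?

Set the encoding

Keep the charset in the meta tag as gb2312
- and set the file's own encoding to utf-8 (in vi, :set fileencoding=utf-8)

Save and return to the shell

Overwrite the file

When you run into garbled text
- go to the parser
- and set the decoding format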

sudo cp gb.html /usr/share/nginx/html 
firefox http://localhost/gb.html &

The browser is badly fooled

What happens if a crawler fetches the page?

Crawl

import requests
from lxml import etree
response = requests.get("http://localhost/gb.html")
b_html = response.content
et_html = etree.HTML(b_html)
print(et_html[1].text)

Decoding the byte stream fails

What should we do?

Set the encoding

Decode as utf-8

parser = etree.HTMLParser(encoding="utf-8")
et_html = etree.HTML(b_html, parser)
print(et_html[1].text)

Decoding succeeds
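
When the declared charset may be wrong, one possible strategy is to try strict utf-8 first and fall back to the declaration only if that fails. The helper below, `decode_html`, is a hypothetical sketch of that strategy and not part of lxml or requests.

```python
def decode_html(raw: bytes, declared: str) -> str:
    """Decode page bytes, preferring strict utf-8 over the declared charset."""
    # utf-8 is self-validating: bytes from another encoding rarely pass a strict decode.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Otherwise trust the declaration, replacing anything it cannot decode.
    return raw.decode(declared, errors="replace")

page = "gb2312格式"
print(decode_html(page.encode("utf-8"), declared="gb2312"))   # mislabeled utf-8 file
print(decode_html(page.encode("gb2312"), declared="gb2312"))  # honestly labeled file
```

Both calls recover the original text. The order matters: decoding utf-8 bytes with a wrong declared codec can succeed silently and produce garbled text, whereas a strict utf-8 decode of non-utf-8 bytes usually raises.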

Summary

This time we looked at encoding settings
- once the response delivers the page's byte stream
- it can be decoded with a specified encoding

Encoding	Typical use
UTF-8	Most widely used
ASCII	English
ISO-8859-1	Latin alphabet
GBK	Simplified and Traditional Chinese
BIG5	Traditional Chinese
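
The table can be probed from Python by checking which codecs are able to encode a given sample. A small sketch; `can_encode` is a hypothetical helper and the sample text is an arbitrary choice.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

sample = "中文"
for name in ["utf-8", "ascii", "iso-8859-1", "gbk", "big5"]:
    print(f"{name:12} {can_encode(sample, name)}")
```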

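The etree can now be generated without trouble
- but how do we quickly locate the nodes we want to scrape? 🤔
More next time 👋

This article is from the oeasy Python systematic tutorial.
To learn Python completely and solidly,
search for oeasy.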
