Commit 21d9d6e

Protocol Buffer Team authored and Logofile committed
This update includes the following:
* Publishes the "Proto Serialization is not Canonical" topic
* Updates the documentation publishing script to accommodate the move to Node 20 on GitHub

PiperOrigin-RevId: 623514390
Change-Id: I80fa75dc9fcd7bc3d51906653008979adbc0ec31
1 parent 61d503a commit 21d9d6e

2 files changed: +81 -4 lines changed

.github/workflows/gh-pages.yml

Lines changed: 8 additions & 4 deletions
@@ -13,25 +13,29 @@ jobs:
   deploy:
     runs-on: ubuntu-22.04
     steps:
-      - uses: actions/checkout@v3
+      - uses: actions/checkout@v4
         with:
           submodules: true # Fetch Hugo themes (true OR recursive)
           fetch-depth: 0 # Fetch all history for .GitInfo and .Lastmod
 
+      - uses: actions/setup-node@v4
+        with:
+          node-version: 20
+
       - name: Setup Hugo
-        uses: peaceiris/actions-hugo@v2
+        uses: peaceiris/actions-hugo@v3
         with:
           hugo-version: 'latest'
           extended: true
-
+
       - name: Install Dependencies
         run: npm install autoprefixer postcss postcss-cli
 
       - name: Build
         run: hugo --minify
 
       - name: Deploy
-        uses: peaceiris/actions-gh-pages@v3
+        uses: peaceiris/actions-gh-pages@v4
         if: github.ref == 'refs/heads/main'
         with:
           github_token: ${{ secrets.GITHUB_TOKEN }}
Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
++++
+title = "Proto Serialization Is Not Canonical"
+weight = 88
+description = "Explains how serialization works and why it is not canonical."
+type = "docs"
++++
+
+<!--*
+# Document freshness: For more information, see go/fresh-source.
+freshness: { owner: 'haberman' reviewed: '2024-02-05' }
+*-->
+
+Many people want a serialized proto to canonically represent the contents of
+that proto. Use cases include:
+
+* using a serialized proto as a key in a hash table
+* taking a fingerprint or checksum of a serialized proto
+* comparing serialized payloads as a way of checking message equality
+
+Unfortunately, *protobuf serialization is not (and cannot be) canonical*. There
+are a few notable exceptions, such as MapReduce, but in general you should
+think of proto serialization as unstable. This page explains why.
+
+## Deterministic is not Canonical
+
+Deterministic serialization is not canonical. The serializer can generate
+different output for many reasons, including but not limited to the following
+variations:
+
+1. The protobuf schema changes in any way.
+1. The application being built changes in any way.
+1. The binary is built with different flags (e.g., opt vs. debug).
+1. The protobuf library is updated.
+
+This means that hashes of serialized protos are fragile and not stable across
+time or space.
+
+There are many reasons why the serialized output can change. The above list is
+not exhaustive. Some of them are inherent difficulties in the problem space that
+would make it inefficient or impossible to guarantee canonical serialization
+even if we wanted to. Others are things we intentionally leave undefined to
+allow for optimization opportunities.
+
+## Inherent Barriers to Stable Serialization
+
+Protobuf objects preserve unknown fields to provide forward and backward
+compatibility. Unknown fields cannot be canonically serialized:
+
+1. Unknown fields can't distinguish between bytes and sub-messages, as both
+   have the same wire type. This makes it impossible to canonicalize messages
+   stored in the unknown field set. If we were going to canonicalize, we would
+   need to recurse into unknown submessages to sort their fields by field
+   number, but we don't have enough information to do this.
+1. Unknown fields are always serialized after known fields, for efficiency. But
+   canonical serialization would require interleaving unknown fields with known
+   fields by field number. This would cause efficiency and code size overheads
+   for everybody, even people who do not use the feature.
+
+## Things Intentionally Left Undefined
+
+Even if canonical serialization were feasible (that is, if we could solve the
+unknown field problem), we intentionally leave serialization order undefined to
+allow for more optimization opportunities:
+
+1. If we can prove a field is never used in a binary, we can remove it from the
+   schema completely and process it as an unknown field. This saves substantial
+   code size and CPU cycles.
+2. There may be opportunities to optimize by serializing vectors of the same
+   field together, even though this would break field number order.
+
+To leave room for optimizations like this, we want to intentionally scramble
+field order in some configurations, so that applications do not inappropriately
+depend on field order.
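
The two central claims of the page added in this commit can be illustrated with a small sketch. It hand-encodes the protobuf wire format in plain Python instead of calling the protobuf library, so every helper name (`varint`, `len_field`, `parse`, `digest`) and the message layout (field 1 as a varint, field 2 as a string, field 3 as a length-delimited value) are assumptions made for illustration only, not part of the commit. The sketch first shows that the same fields written in two different orders decode identically but hash differently, and then that a length-delimited payload gives no hint whether it holds raw bytes or a sub-message.

```python
import hashlib

VARINT, LEN = 0, 2  # protobuf wire types used in this sketch


def varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def read_varint(data: bytes, pos: int) -> tuple[int, int]:
    """Decode a varint starting at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def varint_field(num: int, value: int) -> bytes:
    return varint((num << 3) | VARINT) + varint(value)


def len_field(num: int, payload: bytes) -> bytes:
    return varint((num << 3) | LEN) + varint(len(payload)) + payload


def parse(data: bytes) -> dict:
    """Parse top-level fields in whatever order they appear (types 0 and 2 only)."""
    fields = {}
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        num, wire_type = key >> 3, key & 7
        if wire_type == VARINT:
            value, pos = read_varint(data, pos)
        elif wire_type == LEN:
            size, pos = read_varint(data, pos)
            value = data[pos:pos + size]
            pos += size
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        fields[num] = value
    return fields


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Hypothetical message: field 1 = 150, field 2 = "hi".
in_order = varint_field(1, 150) + len_field(2, b"hi")
reordered = len_field(2, b"hi") + varint_field(1, 150)

print(in_order.hex(), reordered.hex())          # two different byte strings
print(parse(in_order) == parse(reordered))      # True: same logical content
print(digest(in_order) == digest(reordered))    # False: the hashes differ

# Field 3 is length-delimited. Without a schema, its payload could be three
# opaque bytes or a sub-message whose field 1 is 150; both readings are valid.
opaque = len_field(3, bytes([0x08, 0x96, 0x01]))
payload = parse(opaque)[3]
print(payload, parse(payload))
```

Because a decoder has to accept either ordering, comparing or hashing the serialized bytes says nothing reliable about whether two messages are equal, and the last two lines show why an unknown length-delimited field cannot be safely recursed into and sorted.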
