
Commit 23618d8

docs(unixfs): clarify PBNode field order and streaming parser implications
dag-pb serializes Links before Data on the wire, but the UnixFS spec did not document this or its impact on implementations.

- note after PBNode schema: field order is stricter than intuitive protobuf convention, decoders MUST accept both, encoders SHOULD use Links-before-Data per IPIP-499 profiles
- warning in dag-pb Types section: streaming parsers cannot determine node type until after all links are read
- test vectors: wire order annotations for directory and HAMT fixtures
- appendix: historical context and Robustness Principle guidance
- dag-pb spec reference updated to Wayback Machine snapshot
1 parent 25a3fc2 commit 23618d8

1 file changed: +71 -1 lines changed
src/unixfs.md

Lines changed: 71 additions & 1 deletion
@@ -135,6 +135,27 @@ message PBNode {
 }
 ```
 
+:::note
+
+The `PBNode` definition above lists `Links` (field 2) before `Data` (field 1).
+This field order is stricter than the intuitive protobuf convention of
+serializing fields by field number.
+
+Decoders MUST accept both field orderings, as existing IPFS data contains
+blocks encoded in either order.
+
+Encoders that want to be compliant with the `unixfs-v0-2015` and
+`unixfs-v1-2025` profiles from
+[IPIP-499](https://specs.ipfs.tech/ipips/ipip-0499/) SHOULD produce `Links`
+before `Data`, matching the [`dag-pb`][ipld-dag-pb] wire encoding order used
+by those profiles. A future IPIP introducing new profiles MAY adopt a
+different field order.
+
+See the "Protobuf Strictness" section of the [`dag-pb` spec][ipld-dag-pb]
+for the full set of encoding constraints.
+
+:::
+
 After decoding the node, we obtain a `PBNode`. This `PBNode` contains a field
 `Data` that contains the bytes that require the second decoding. This will also be
 a protobuf message specified in the UnixFSV1 format:
@@ -180,6 +201,23 @@ it is implied that the `PBNode.Data` field is protobuf-encoded.
 A `dag-pb` UnixFS node supports different types, which are defined in
 `decode(PBNode.Data).Type`. Every type is handled differently.
 
+:::warning
+
+**Streaming parser consideration:** In the [`dag-pb`][ipld-dag-pb] encoding
+order required by [IPIP-499](https://specs.ipfs.tech/ipips/ipip-0499/)
+profiles, all `PBNode.Links` entries are serialized before `PBNode.Data`.
+Since `DataType` (which determines how to interpret the node and its links) is
+encoded inside `PBNode.Data`, a streaming or incremental protobuf parser cannot
+determine the node type until after all links have been read.
+
+This affects implementations that attempt to interpret links during parsing:
+In particular, a streaming parser cannot determine whether link `Name` fields
+carry [HAMT hex-prefixed bucket indices](#hamt-structure-and-parameters) or
+plain [directory entry names](#dag-pb-directory) without first buffering all
+links.
+
+:::
+
 ### `dag-pb` `File`
 
 A :dfn[File] is a container over an arbitrary sized amount of bytes. Files are either
@@ -851,6 +889,7 @@ Test vectors for UnixFS directory structures, progressing from simple flat direc
 ```
 - Purpose: Directory listing, link sorting, deduplication (ascii.txt and ascii-copy.txt share same CID)
 - Validation: Links sorted lexicographically by Name, each has valid Tsize
+- Wire order: `Links`(x4) then `Data` ([`dag-pb`][ipld-dag-pb] field order per [IPIP-499](https://specs.ipfs.tech/ipips/ipip-0499/) profiles)
 
 ### Nested Directories
 
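Reviewer sketch, not part of the commit: the "Wire order" annotation added to the fixtures could be checked mechanically with something like the Python below. The helper names `read_varint` and `wire_order` are hypothetical, and the function only walks the top-level `PBNode` fields (field 1 `Data`, field 2 `Links`, both length-delimited) without validating their contents.

```python
from __future__ import annotations

from itertools import groupby


def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one protobuf varint at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def wire_order(block: bytes) -> list[tuple[str, int]]:
    """Summarize top-level PBNode fields as runs, e.g. [("Links", 4), ("Data", 1)]."""
    names = {1: "Data", 2: "Links"}
    seen = []
    pos = 0
    while pos < len(block):
        tag, pos = read_varint(block, pos)
        field_number, wire_type = tag >> 3, tag & 7
        if wire_type != 2 or field_number not in names:
            raise ValueError(
                f"unexpected PBNode field {field_number} (wire type {wire_type})"
            )
        length, pos = read_varint(block, pos)
        if pos + length > len(block):
            raise ValueError("truncated field payload")
        pos += length  # skip the payload; only the field order matters here
        seen.append(names[field_number])
    return [(name, len(list(run))) for name, run in groupby(seen)]
```

For a block laid out as the annotation describes, the summary would be a run of four `Links` followed by one `Data`; a block written in field-number order would report `Data` first.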
@@ -956,6 +995,7 @@ Test vectors for UnixFS directory structures, progressing from simple flat direc
 - Fanout field = 256
 - Link Names in HAMT have 2-character hex prefix (hash buckets)
 - Can retrieve any file by name through hash bucket calculation
+- Wire order: `Links`(x252) then `Data` ([`dag-pb`][ipld-dag-pb] field order per [IPIP-499](https://specs.ipfs.tech/ipips/ipip-0499/) profiles)
 
 ## Special Cases and Advanced Features
 
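Reviewer sketch, not part of the commit: the streaming-parser warning and the HAMT fixture above suggest a buffer-then-interpret approach, roughly as in the Python below. The names `iter_fields` and `interpret_links` are hypothetical. The sketch assumes `PBLink.Name` is field 2, `Data.Type` is field 1, and a 2-character bucket prefix as in the fanout-256 fixture; other fanout values would need a different `prefix_len`.

```python
from __future__ import annotations

DIRECTORY = 1   # UnixFS DataType.Directory
HAMT_SHARD = 5  # UnixFS DataType.HAMTShard


def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one protobuf varint at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def iter_fields(buf: bytes):
    """Yield (field_number, value) for varint and length-delimited fields."""
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 7
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        yield field_number, value


def interpret_links(block: bytes, prefix_len: int = 2):
    """Buffer link names, then interpret them once the node type is known."""
    buffered_names: list[str] = []
    node_type = None
    for field_number, value in iter_fields(block):
        if field_number == 2:  # PBNode.Links: one encoded PBLink
            name = next((v for f, v in iter_fields(value) if f == 2), b"")
            buffered_names.append(name.decode("utf-8"))
        elif field_number == 1:  # PBNode.Data: encoded UnixFS Data message
            node_type = next((v for f, v in iter_fields(value) if f == 1), None)
    if node_type == HAMT_SHARD:
        # Split each name into (bucket prefix, remainder of the name).
        return node_type, [(n[:prefix_len], n[prefix_len:]) for n in buffered_names]
    # Any other type: names are taken as-is, with no bucket prefix.
    return node_type, [(None, n) for n in buffered_names]
```

Because names are only interpreted after the loop, the result is the same whether `Data` arrives before or after `Links`; the cost is holding every link name in memory until the block has been read.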
@@ -1186,6 +1226,36 @@ Below section explains some of historical decisions. This is not part of specifi
 and is provided here only for extra context.
 :::
 
+## `PBNode` Field Order: Legacy Constraint and Compatibility Guidance
+
+The [`dag-pb`][ipld-dag-pb] encoding order required by
+[IPIP-499](https://specs.ipfs.tech/ipips/ipip-0499/) profiles (`unixfs-v0-2015`
+and `unixfs-v1-2025`) serializes `PBNode.Links` (field 2) before `PBNode.Data`
+(field 1). This is stricter than the intuitive protobuf convention of encoding
+fields by field number.
+
+This ordering is a historical artifact: early protobuf serializers (notably
+the original JavaScript implementation) wrote fields in source declaration
+order rather than field number order. The original `.proto` definition listed
+`Links` before `Data` (while assigning them field numbers 2 and 1
+respectively). Once blocks with this byte ordering were written to the IPFS
+network, the encoding became permanent: changing it would produce different
+CIDs for the same logical content. The [`dag-pb` specification][ipld-dag-pb]
+codified this field order for existing profiles.
+
+Following the [Robustness Principle](https://specs.ipfs.tech/architecture/principles/#robustness),
+implementations writing backward and forward compatible software should be
+conservative in what they produce (use the field order expected by the target
+profile) and liberal in what they accept (decode blocks regardless of field
+order). A future IPIP introducing new profiles may adopt a different field
+order convention.
+
+A practical consequence of the current `Links`-before-`Data` order is that
+streaming protobuf parsers encounter all link entries before `PBNode.Data`.
+For UnixFS, this means the node type (`DataType`) and associated metadata
+(e.g., HAMT `fanout` and `hashType`) are not available until after all links
+have been parsed. See the [`dag-pb` Types](#dag-pb-types) section for details.
+
 ## Design Considerations: Extra Metadata
 
 Metadata support in UnixFSv1.5 has been expanded to increase the number of possible
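Reviewer sketch, not part of the commit: the Robustness Principle guidance in the appendix added by this hunk (produce `Links` before `Data`, accept either order) might look roughly like the Python below. The function names are hypothetical, and links are treated as opaque, already-encoded `PBLink` byte strings, so this is not a complete `dag-pb` codec.

```python
from __future__ import annotations


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one protobuf varint at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def length_delimited(field_number: int, payload: bytes) -> bytes:
    """Encode one length-delimited (wire type 2) field."""
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload


def encode_pbnode(links: list[bytes], data: bytes | None) -> bytes:
    """Conservative producer: every Links entry (field 2), then Data (field 1)."""
    out = b"".join(length_delimited(2, link) for link in links)
    if data is not None:
        out += length_delimited(1, data)
    return out


def decode_pbnode(buf: bytes) -> tuple[list[bytes], bytes | None]:
    """Liberal consumer: accept Links and Data in whichever order they appear."""
    links: list[bytes] = []
    data = None
    pos = 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 7
        if wire_type != 2 or field_number not in (1, 2):
            raise ValueError(f"unexpected PBNode field {field_number}")
        length, pos = decode_varint(buf, pos)
        payload = buf[pos:pos + length]
        if len(payload) != length:
            raise ValueError("truncated field payload")
        pos += length
        if field_number == 2:
            links.append(payload)
        else:
            data = payload
    return links, data
```

Both byte layouts decode to the same logical node here, but the bytes differ, which is why re-encoding an existing block in a different field order would change its CID.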
@@ -1305,4 +1375,4 @@ the fractional part is represented as a 4-byte `fixed32`,
 [multicodec]: https://github.com/multiformats/multicodec
 [multihash]: https://github.com/multiformats/multihash
 [Bitswap]: https://specs.ipfs.tech/bitswap-protocol/
-[ipld-dag-pb]: https://ipld.io/specs/codecs/dag-pb/spec/
+[ipld-dag-pb]: https://web.archive.org/web/20260305020653/https://ipld.io/specs/codecs/dag-pb/spec/
