Reviving `py-ipld-dag`: Direction, Scope & Alignment with Modern IPLD #1211

seetadev · 2026-02-13T14:55:18Z

seetadev
Feb 13, 2026
Maintainer

Hi all,

First, thank you to Rod for unarchiving this repository and for the thoughtful guidance in the thread.

✅ Current Status

The repository has now been unarchived.
The broader Python dependency chain is stabilizing well:
- py-libp2p
- py-multiaddr
- py-cid
All relevant packages have proper releases and are up to date on PyPI.
The release flow across the stack is now significantly cleaner and more coordinated.

However, py-ipld-dag remains the last structural piece in the chain that needs active attention and modernization.

⚠️ Reality Check: The Current State of `py-ipld-dag`

As Rod rightly pointed out:

The codebase is ancient and incomplete
It is out of sync with modern IPLD concepts
It does not reflect the current ecosystem’s architectural direction

Rather than attempting incremental patchwork, we believe this calls for a fresh, scoped, and deliberate redesign.

📚 Ecosystem References We Will Study

Based on Rod’s suggestions, we will deeply study the following before proposing a concrete redesign:

1️⃣ JS Reference Implementation (Primary Ecosystem Anchor)

multiformats/js-multiformats – focal point for modern IPLD pieces
ipld/js-dag-cbor
ipld/js-dag-json
ipld/js-dag-pb

The JS ecosystem appears to be the most cohesive representation of current IPLD design philosophy. Aligning here first ensures conceptual correctness.

2️⃣ Go Reference Model

go-ipld-prime

Rod noted that this model is:

Architecturally different
More complex
Possibly too heavy for our needs

Still, it represents a mature interpretation of IPLD abstractions and will help us understand trade-offs between minimalism and full feature modeling.

🎯 Proposed Direction for Python

Rather than recreating a full IPLD framework immediately, we propose:

Phase 1 – Libp2p-Focused Minimal Core

Scope the implementation narrowly to what py-libp2p actually needs:

CID resolution compatibility
DAG-CBOR / DAG-PB basic handling
Clean integration with py-cid
Clear codec boundary
No speculative abstractions

Keep it:

Lean
Dependency-light
Conceptually aligned with JS multiformats

Phase 2 – Conceptual Alignment

Using:

JS multiformats model
js-dag-* codecs
go-ipld-prime abstractions

We will draft:

A short IPLD Concepts Synthesis Document
A Python-specific design spec
A clear statement of non-goals

Rod also suggested an interesting exercise:

Point AI at these ecosystems and synthesize IPLD concepts + suggest a Python path forward.

We plan to do exactly that — use ecosystem study + synthesis to produce a coherent Python-native interpretation rather than copying one model blindly.

🛠 Governance & Maintenance

Rod mentioned possibly removing direct admin entries and routing management fully through github-mgmt repos. That makes sense for long-term hygiene, and we’re happy to align with whatever governance structure IPLD prefers.

For our part:

We are ready to take active, ongoing maintenance responsibility
We intend to work on this full-time (in focused cycles)
We will open incremental PRs with clear design notes
No large, opaque rewrites

📌 Immediate Next Steps

Study:
- js-multiformats
- js-dag-{cbor,json,pb}
- go-ipld-prime
Draft:
- Python IPLD minimal design spec
- Scope boundaries
Open:
- Initial architecture proposal PR
- Small experimental slice for libp2p needs

🙏 Thank You

Rod — thank you again for:

Unarchiving the repository
Calling out the architectural drift
Pointing us toward the right ecosystem anchors

We’ll return shortly with:

A synthesized design document
A concrete implementation roadmap
The first minimal PR

Looking forward to feedback from the broader IPLD community.

CCing @pacrob, @acul71, @lla-dane, @yashksaini-coder and @itsmoh.

@yashksaini-coder and @acul71 started working on this earlier; I would like them to follow the key pointers shared by Rod.

yashksaini-coder · 2026-02-13T19:48:54Z

yashksaini-coder
Feb 13, 2026

Here's a small update on how outdated it is compared to Go IPLD.

1. Overview

| Aspect | Finding |
| --- | --- |
| Age | ~9+ years (docs from 2013, copyright 2017; design from early IPLD era) |
| Scope | Small: core in `dag/dag.py` (~145 lines), `dag/utils.py` (~10 lines), minimal tests |
| Status | Pre-alpha (0.1.0), incomplete and not aligned with current IPLD |

The project implements a custom Merkle DAG with Node and Link, custom JSON-style serialization, and multihash-based identity. It does not implement the IPLD Data Model, standard codecs (DAG-JSON, DAG-CBOR, DAG-PB), or CIDs.

2. How Outdated vs. Current IPLD

2.1 Data model and links

| This codebase | Current IPLD (ipld.io) |
| --- | --- |
| Links = multihash only (Base58). `Link(name, size, multihash)` | Links = CIDs (Content Identifiers): version + multicodec + multihash; optional multibase in string form. See the CID spec. |
| Custom node shape: `{"data": …, "links": [Link, …]}` | IPLD Data Model: maps, lists, strings, ints, bytes, Link (single kind). No prescribed “data + links” shape; links are first-class Link kind. See the Data Model docs. |
| Hash over ad-hoc JSON serialization | Canonical codecs: DAG-JSON (key sorting, `/` and `bytes` encoding), DAG-CBOR, etc. See the Codecs docs. |

So: links and data model are conceptually outdated — multihash-only links and custom structure, not CID-based IPLD.

2.2 Serialization and codecs

This repo: json.dumps({"data": data, "links": links}) with default serializer.
- Not DAG-JSON (no "/" for links, no "bytes" for bytes, no key ordering).
- links are Link instances → default json.dumps would fail (Link not JSON-serializable); either broken or used with a custom serializer that’s not in the repo.
IPLD: Standard codecs (e.g. DAG-JSON) define canonical encoding (key sort, reserved / and bytes encoding) so the same logical node has a single canonical form and consistent hash.

Verdict: Serialization is non-standard and likely broken as-is; no IPLD codec is implemented.
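The failure mode described above can be reproduced without the repository code. This is a minimal sketch: `Link` here is a hypothetical stand-in class, `link_default` is an illustrative hook, and `"QmExample"` is a placeholder string, none of which come from the repo.

```python
import json


class Link:
    """Hypothetical stand-in for the repo's Link class (illustration only)."""

    def __init__(self, name, size, multihash):
        self._name = name
        self._size = size
        self._multihash = multihash


def link_default(obj):
    """Illustrative `default` hook: turn a Link into plain JSON types."""
    if isinstance(obj, Link):
        return {"name": obj._name, "size": obj._size, "multihash": obj._multihash}
    raise TypeError(f"not JSON-serializable: {type(obj).__name__}")


node = {"data": "hello", "links": [Link("child", 5, "QmExample")]}

try:
    json.dumps(node)  # plain json.dumps does not know how to encode Link
    plain_dumps_failed = False
except TypeError:
    plain_dumps_failed = True

# With an explicit hook the call succeeds, but the result is still ad hoc
# JSON: no "/" link form, no "bytes" form, so it is not DAG-JSON.
encoded = json.dumps(node, default=link_default, sort_keys=True)
```

Even with a hook like this, the hash would be computed over a private format, which is the core of the verdict above.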

2.3 Dependencies and ecosystem

| Dependency | Version in repo | Current / note |
| --- | --- | --- |
| pymultihash | 0.8.2 | Superseded; py-multihash now ~3.x with different API (`encode`/`decode`, `Multihash`, no `digest(..., "base58")`). Multiformats also maintain a broader multiformats Python lib. |
| base58 | 0.2.5 | Very old; current base58 is 2.x. |
| morphys | 1.0 | Legacy Python 2-era helper for unicode/bytes; with Python 3.10+ and strict `requires-python = ">=3.10"` it’s redundant. |
| pyrsistent | 0.13.0 | Old; current pyrsistent is 0.20+. |

So: dependency set is old and some APIs have changed; upgrading will require code changes (especially around multihash and encoding).

2.4 Python and tooling

Classifiers: Python 3.4, 3.5, 3.6 (outdated).
requires-python: >=3.10 (conflicts with classifiers).
Docs: Sphinx from 2013; copyright 2017; Read the Docs/Travis badges in README point to old workflows.
Ruff/mypy: Present and reasonable; bumpversion and towncrier suggest a later pass of tooling was added.

3. Bugs and design issues in the current code

3.1 `Link` has no public properties

Node.__init__ uses link.size (line 30).
Link only has _name, _size, _multihash and no @property for size, name, or multihash.
Effect: sum((link.size for link in self._links), ...) will raise AttributeError unless something else (e.g. a subclass) adds .size.
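One possible shape for the missing accessors, as a sketch only. The attribute names follow the audit above, but the class below is a hypothetical rewrite, not the repository's code, and the multihash values are dummy bytes.

```python
class Link:
    """Hypothetical Link with read-only accessors (not the repository's code)."""

    def __init__(self, name: str, size: int, multihash: bytes) -> None:
        self._name = name
        self._size = size
        self._multihash = multihash

    @property
    def name(self) -> str:
        return self._name

    @property
    def size(self) -> int:
        return self._size

    @property
    def multihash(self) -> bytes:
        return self._multihash


# Dummy multihash bytes: sha2-256 prefix (0x12, 0x20) plus 32 zero bytes.
dummy_multihash = b"\x12\x20" + bytes(32)
links = [Link("a", 10, dummy_multihash), Link("b", 32, dummy_multihash)]

# The kind of expression Node.__init__ relies on now works.
total_size = sum(link.size for link in links)
```

Because the properties have no setters, callers can read `link.size` but cannot reassign it.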

3.2 `Node.remove_link` is wrong and would fail

Line 79: node.name and node.multihash are used in the filter; Node has no .name (only .multihash). The intent is clearly to filter by link identity, so it should use link.name and link.multihash (and Link needs public properties).
Line 78: node.links = [...] — links is a read-only property (returns self._links), so assignment is invalid and would fail at runtime.
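A sketch of how `remove_link` could avoid both problems by returning a new node instead of assigning to a read-only property. The frozen dataclasses and the choice to match on `name` plus `multihash` of the link being removed are assumptions for illustration, not the repository's design.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Link:
    """Hypothetical immutable link (illustration only)."""
    name: str
    size: int
    multihash: bytes


@dataclass(frozen=True)
class Node:
    """Hypothetical immutable node holding a tuple of links."""
    data: bytes
    links: tuple = ()

    def remove_link(self, target: Link) -> "Node":
        # Filter on the link's own identity, not on attributes of the node.
        kept = tuple(
            link
            for link in self.links
            if not (link.name == target.name and link.multihash == target.multihash)
        )
        # Build a new node rather than assigning to self.links.
        return replace(self, links=kept)


first = Link("a", 1, b"\x01")
second = Link("b", 2, b"\x02")
node = Node(b"payload", (first, second))
trimmed = node.remove_link(first)
```

The original `node` keeps both links; only `trimmed` reflects the removal.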

3.3 `Node.create` and serialization

serializer({"data": data, "links": links}) with default json.dumps: Link instances are not JSON-serializable, so this would raise unless a custom serializer is passed (and none is provided in the codebase).
Link.serialize() is defined but empty (pass), and is never used in the main path.

3.4 Node constructor and multihash type

The docstring says multihash should be “either a str or bytes”, but the code only branches on isinstance(multihash, bytes) and otherwise raises, so only bytes are accepted. It then calls base58.b58decode(multihash), and b58decode typically expects a string. The type handling is inconsistent and likely buggy.

3.5 Mutability and copying

add_link calls node.links.append(link) on a copy of the node. Because node.links returns the same list object as node._links, any copy that still shares that list (a shallow copy, for example) passes the append through to the original node’s list if the caller still holds it. The pattern is fragile.
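The aliasing risk is easy to demonstrate in isolation. In this sketch `Node` is a hypothetical stand-in whose `links` property returns the internal list directly, as the audit describes; it is not the repository's class.

```python
import copy


class Node:
    """Hypothetical stand-in: `links` exposes the internal list itself."""

    def __init__(self, links):
        self._links = list(links)

    @property
    def links(self):
        return self._links  # same list object, not a copy


original = Node(["a"])

# A shallow copy shares the _links list with the original.
shallow = copy.copy(original)
shallow.links.append("b")  # also visible through `original`

# A deep copy gets its own list.
deep = copy.deepcopy(original)
deep.links.append("c")  # not visible through `original`
```

Returning a tuple from the property, or building a new list explicitly in `add_link`, would remove the dependence on how the copy was made.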

4. Tests

test_dag.py: Single placeholder test; no real Node/Link or Merkle behavior.
test_utils.py: Exercises node_to_link and asserts on Link’s private attributes (_name, _size, _multihash), which is brittle and underscores the missing public API on Link.

So: test coverage does not validate the core DAG or serialization, and some code paths (e.g. Node.create with default serializer, remove_link, add_link with list mutation) are not exercised.

5. Summary: how outdated and what to do

| Dimension | Severity | Summary |
| --- | --- | --- |
| IPLD alignment | High | Not IPLD: multihash-only links, custom node shape, no CIDs, no standard codecs (DAG-JSON, DAG-CBOR, DAG-PB). |
| Dependencies | High | Old, and multihash/base58 APIs have changed; morphys unnecessary on 3.10+. |
| Correctness | High | Broken or incomplete: Link missing properties, `remove_link` wrong and invalid assignment, default serialization fails, Node multihash type handling inconsistent. |
| Python / tooling | Medium | Classifiers vs requires-python mismatch; docs/CI badges outdated. |
| Tests | Medium | Minimal; don’t cover main DAG logic or failure paths. |

Recommendation: Treat this as a legacy proof-of-concept, not a current IPLD implementation. To bring it toward ipld.io:

Adopt IPLD Data Model and CIDs: Links as CIDs; data as standard IPLD kinds (maps, lists, bytes, links) instead of a fixed {"data", "links"} shape.
Use standard codecs: Implement or use existing DAG-JSON / DAG-CBOR (and optionally DAG-PB) for canonical serialization and hashing.
Refresh dependencies: Move to current multihash/CID libs (e.g. multiformats or py-multihash 3.x), drop morphys, update base58.
Fix and clarify API: Add proper Link (and Node) properties, fix remove_link/mutability, and make serialization explicit and codec-based.
Align metadata and CI: Fix Python version and classifiers, update README/docs and CI to match current practices.

If you want, next steps can be: (a) a short “migration checklist” (ordered list of code and config changes), or (b) a minimal patch set that only fixes the obvious bugs and dependency versions so the existing (non-IPLD) design at least runs.

0 replies

endomorphosis · 2026-02-13T19:59:27Z

endomorphosis
Feb 13, 2026

When I have been implementing things, I have occasionally had issues with the multihash/multiformats/CID libraries having been created at different times, and with programmatically creating CIDs that match the CIDs Kubo produces. I have also noticed that I have to maintain several different versions of protobuf because of package requirements (I forget which package uses the old version of protobuf).

0 replies

endomorphosis · 2026-02-13T20:04:02Z

endomorphosis
Feb 13, 2026

Oh, and a nice-to-have feature would be an IPLD <-> JSON-LD converter as part of the package, so that knowledge graphs (e.g. Neo4j) can be ingested into IPLD automatically and exported back again. This would make it much easier for people to build GraphRAG architectures on content-addressed data.

1 reply

yashksaini-coder Feb 13, 2026

Now that's something interesting and really good

gerceboss · 2026-02-19T08:14:03Z

gerceboss
Feb 19, 2026

@yashksaini-coder are you already working on it? I would like to collaborate.

2 replies

yashksaini-coder Feb 19, 2026

Yes I have a PR up for migrating to modern tools for development
ipld/py-ipld-dag#14

seetadev Feb 20, 2026
Maintainer Author

> Yes I have a PR up for migrating to modern tools for development ipld/py-ipld-dag#14

@gerceboss, @yashksaini-coder: please review the feedback vmx shared in yesterday's call.

The open question is whether a simple migration is enough or a complete shift is needed.

Please review the points shared by Rod on equivalent js and go implementations.

CCing @acul71, who will help you reorganize based on the feedback shared by both Rod and vmx.

Also, ccing @pacrob.

acul71 · 2026-02-21T18:58:03Z

acul71
Feb 21, 2026
Maintainer

It is not decided yet, but we are considering opening a new repo (because of LICENSE problems) and keeping py-ipld-dag alive in the meantime.

Also, I can see from https://ipld.io/docs/ that there are many codecs:
https://ipld.io/specs/codecs/
DAG-CBOR
DAG-COSMOS - Specification for Tendermint and Cosmos as an IPLD Data Structure and the suite of codecs used to convert Tendermint and Cosmos types to and from the IPLD Data Model
DAG-ETH - Specification for Ethereum as an IPLD Data Structure and the suite of codecs used to convert Ethereum types to and from the IPLD Data Model
DAG-JOSE
DAG-JSON
DAG-PB

Which one(s) should we implement, and as separate repos (packages) or all in one?
Right now, maybe DAG-PB and DAG-JSON are the most requested?

1 reply

endomorphosis Feb 22, 2026

I use ipld-dag-pb and ipld-car for IPFS compatibility, and https://pypi.org/project/libipld/ for speed.

IronJam11 · 2026-02-22T12:07:22Z

IronJam11
Feb 22, 2026

Implementation Analysis: Python DAG-CBOR Challenges and Approach

Initial Findings After Ecosystem Study

I spent time going through the resources Rod pointed to — js-multiformats, js-dag-cbor (and its underlying cborg library), js-dag-json, js-dag-pb, the IPLD Data Model spec, the DAG-CBOR spec, and a surface-level read of go-ipld-prime. Here is where I landed and what I think the real difficulties are going to be for a Python implementation.

The JS Model is the Right Anchor, But the Python Tooling Gap is Non-Trivial

The JS ecosystem has a tight coupling between cborg (the CBOR library) and js-dag-cbor. cborg exposes fine-grained options that map almost 1:1 to DAG-CBOR constraints:

allowIndefinite: false
allowNaN: false
allowInfinity: false
rejectDuplicateMapKeys: true
strict: true
float64: true

The codec layer on top is thin — roughly 160 lines — because cborg does the heavy lifting.

The Python CBOR Challenge

Python's best CBOR library is cbor2. It is solid for general-purpose CBOR, but it does not expose equivalent options. There is no way to:

Reject indefinite-length items
Reject duplicate map keys
Detect trailing bytes
Force float64-only encoding

Worse, cbor2 has a canonical=True mode that looks like it would help (it sorts keys correctly), but it also minimizes floats — encoding 1.0 as float16 (3 bytes, prefix 0xf9) instead of float64 (9 bytes, prefix 0xfb). The DAG-CBOR spec requires float64 always. So canonical=True is actively harmful here.
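The size and prefix difference can be checked by hand with the standard library alone. This sketch does not call cbor2; `encode_float64` and `encode_float16` are illustrative helpers that build the two CBOR float encodings directly.

```python
import struct


def encode_float64(value: float) -> bytes:
    """CBOR double-precision float: prefix 0xfb + 8 big-endian bytes."""
    return b"\xfb" + struct.pack(">d", value)


def encode_float16(value: float) -> bytes:
    """CBOR half-precision float: prefix 0xf9 + 2 big-endian bytes."""
    return b"\xf9" + struct.pack(">e", value)


as_f64 = encode_float64(1.0)  # the form DAG-CBOR requires
as_f16 = encode_float16(1.0)  # the form a float-minimizing encoder emits

print(as_f64.hex(), len(as_f64))  # fb3ff0000000000000 9
print(as_f16.hex(), len(as_f16))  # f93c00 3
```

A byte-level test for a DAG-CBOR encoder could compare its output for `1.0` against `as_f64`.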

What This Means for Implementation

A Python DAG-CBOR codec cannot be a thin wrapper around cbor2 the way js-dag-cbor is a thin wrapper around cborg. We would need to:

Pre-sort map keys ourselves (length-first, then lexicographic — RFC 7049 section 3.9 ordering) and encode with canonical=False to preserve float64
Build a strict decoder, likely by subclassing cbor2's pure-Python decoder (cbor2._decoder.CBORDecoder) and overriding decode_map, decode_array, decode_string, and decode_bytestring to reject indefinite-length encoding and duplicate keys
Do post-decode validation for NaN, Infinity, and bignum integers (because cbor2 handles bignum tags 2/3 internally before any hook can intercept them — they silently become large Python ints)
Wrap input in BytesIO and check stream position after decoding to catch trailing bytes
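The key pre-sorting step from the list above can be sketched in a few lines. `dag_cbor_key_order` and `sort_map` are hypothetical helper names, and the sketch assumes the encoder then writes map keys in iteration order.

```python
def dag_cbor_key_order(key: str) -> tuple[int, bytes]:
    """Sort key: encoded length first, then bytewise lexicographic."""
    encoded = key.encode("utf-8")
    return (len(encoded), encoded)


def sort_map(mapping: dict[str, object]) -> dict[str, object]:
    """Return a new dict whose insertion order follows the length-first rule."""
    return {key: mapping[key] for key in sorted(mapping, key=dag_cbor_key_order)}


ordered_keys = list(sort_map({"bb": 1, "a": 2, "c": 3, "aaa": 4}))
print(ordered_keys)  # ['a', 'c', 'bb', 'aaa']
```

Nested maps would need the same treatment recursively before encoding; that part is left out here.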

This is doable but it is real work and it needs careful testing, especially around:

Float encoding verification
Decoder subclassing (cbor2 uses a module-level dispatch table internally that bypasses normal method resolution, so a naive subclass override does nothing)

CID Handling Maps Cleanly

Tag 42 (CID links) is the one area that maps well between JS and Python. cbor2 has:

A default hook for encoding unknown types
A tag_hook for decoding unknown tags

CID encode: Prepend 0x00 (identity multibase prefix), wrap in CBORTag(42, ...)

CID decode: Strip the 0x00, call CID.decode()
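To make the byte layout concrete, here is a hand-rolled sketch of the tag 42 framing that uses only the standard library. It does not use cbor2 or a CID library; `encode_cid_link` and `decode_cid_link` are illustrative names, and the sample CID bytes are assembled manually (CIDv1, dag-cbor `0x71`, sha2-256 `0x12`, 32-byte digest).

```python
import hashlib


def cbor_bytes_header(length: int) -> bytes:
    """Definite-length header for a CBOR byte string (major type 2)."""
    if length < 24:
        return bytes([0x40 | length])
    if length < 0x100:
        return bytes([0x58, length])
    if length < 0x10000:
        return bytes([0x59]) + length.to_bytes(2, "big")
    raise ValueError("CID too long for this sketch")


def encode_cid_link(cid_bytes: bytes) -> bytes:
    """Tag 42 (0xd8 0x2a) wrapping a byte string of 0x00 + binary CID."""
    payload = b"\x00" + cid_bytes
    return b"\xd8\x2a" + cbor_bytes_header(len(payload)) + payload


def decode_cid_link(data: bytes) -> bytes:
    """Inverse of encode_cid_link; returns the binary CID without the 0x00."""
    if data[:2] != b"\xd8\x2a":
        raise ValueError("not a tag 42 item")
    head = data[2]
    if 0x40 <= head < 0x58:
        length, start = head & 0x1F, 3
    elif head == 0x58:
        length, start = data[3], 4
    elif head == 0x59:
        length, start = int.from_bytes(data[3:5], "big"), 5
    else:
        raise ValueError("expected a definite-length byte string")
    if len(data) != start + length:
        raise ValueError("truncated input or trailing bytes")
    payload = data[start:]
    if payload[:1] != b"\x00":
        raise ValueError("missing identity multibase prefix")
    return payload[1:]


digest = hashlib.sha256(b"hello").digest()
sample_cid = bytes([0x01, 0x71, 0x12, 0x20]) + digest  # 36 bytes
framed = encode_cid_link(sample_cid)
```

In a real codec the same framing would come from the CBOR library's tag support, with the CID bytes produced and parsed by the chosen CID package.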

The multiformats library by hashberg-io handles CID creation, parsing, and multihash — it is the Python equivalent of js-multiformats for the foundational layer, and storacha/py-ipld-dag-pb already depends on it, which is a good signal.

On the Question of Which Codecs and How Many Repos

Responding to acul71's question — I think the priority order is clear from what py-libp2p actually needs:

DAG-CBOR (highest priority, general purpose, used everywhere)
DAG-JSON (important for debugging and human-readable interchange)
DAG-PB (already exists via storacha/py-ipld-dag-pb, so integrate rather than rewrite)

Repository Structure

Whether these live in one repo or separate ones: the JS ecosystem uses separate repos (js-dag-cbor, js-dag-json, js-dag-pb), each exporting the same BlockCodec interface (name, code, encode, decode).

For Python, I think starting in one repo with a shared codec protocol and separate codec modules (ipld/codecs/dag_cbor.py, ipld/codecs/dag_json.py) makes sense for the initial phase. We can split later if needed.

The codec protocol itself (the equivalent of JS BlockCodec) is just a Python Protocol with:

name: str
code: int
encode()
decode()
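As a rough sketch, the protocol could look like the following. The method signatures (`Any` in, `bytes` out) and the toy `PlainJsonCodec` are assumptions for illustration; the toy codec uses plain JSON (multicodec `json`, `0x0200`) and is not DAG-JSON.

```python
import json
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class BlockCodec(Protocol):
    """Structural interface: any object with these members is a codec."""

    name: str
    code: int

    def encode(self, value: Any) -> bytes: ...

    def decode(self, data: bytes) -> Any: ...


class PlainJsonCodec:
    """Toy codec used only to exercise the protocol (not DAG-JSON)."""

    name = "json"
    code = 0x0200

    def encode(self, value: Any) -> bytes:
        return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")

    def decode(self, data: bytes) -> Any:
        return json.loads(data.decode("utf-8"))


codec = PlainJsonCodec()
round_tripped = codec.decode(codec.encode({"b": 1, "a": [1, 2]}))
```

Because `Protocol` is structural, codec modules would not need to import or subclass a shared base class, which keeps a later split into separate packages cheap.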

Concrete Blockers and Risks

1. Float Behavior is Subtle and Easy to Get Wrong

The cbor2 float behavior is subtle and easy to get wrong silently. Tests need byte-level verification (check that encoded floats always start with 0xfb, not 0xf9 or 0xfa).

2. Internal Tag Handling

cbor2 handles certain CBOR tags internally (bignums 2/3, datetime 0/1, UUID 37) before any user hook is called. A DAG-CBOR decoder must catch these after the fact by type-checking the decoded output — if you see a datetime or an integer outside uint64 range, it came from a forbidden tag.
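A post-decode check along these lines could look like the sketch below. `validate_decoded` is a hypothetical helper; the integer bounds assume the 64-bit CBOR integer range (-2^64 to 2^64 - 1), and a real codec would also accept its CID link type, which is omitted here.

```python
import math
from datetime import datetime

INT_MIN = -(2**64)     # lowest value a CBOR negative integer can hold
INT_MAX = 2**64 - 1    # highest value a CBOR unsigned integer can hold


def validate_decoded(value) -> None:
    """Raise ValueError if `value` contains anything this sketch disallows."""
    if value is None or isinstance(value, (bool, str, bytes)):
        return
    if isinstance(value, int):
        if not INT_MIN <= value <= INT_MAX:
            raise ValueError("integer outside 64-bit range (bignum tag?)")
        return
    if isinstance(value, float):
        if math.isnan(value) or math.isinf(value):
            raise ValueError("NaN and Infinity are not allowed")
        return
    if isinstance(value, list):
        for item in value:
            validate_decoded(item)
        return
    if isinstance(value, dict):
        for key, item in value.items():
            if not isinstance(key, str):
                raise ValueError("map keys must be strings")
            validate_decoded(item)
        return
    # datetime, UUID, and any other tag-produced type end up here.
    raise ValueError(f"unsupported type: {type(value).__name__}")


def is_valid(value) -> bool:
    """Convenience wrapper that reports the result as a bool."""
    try:
        validate_decoded(value)
    except ValueError:
        return False
    return True


ok = is_valid({"a": [1, 2.5, b"x", None, True]})
bad_datetime = is_valid({"when": datetime(2026, 1, 1)})
```

This runs after decoding, so it costs one extra walk over the value but does not depend on the decoder's internals.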

3. Dependency Choice

The dependency choice between hashberg-io/multiformats and py-cid needs a decision:

multiformats is more comprehensive (CID + multibase + multicodec + multihash in one package) and is already used by storacha/py-ipld-dag-pb
py-cid is maintained under the IPLD org

I lean toward multiformats for practical reasons but this should be a team decision.

4. Clean Break from Old Code

The old dag/dag.py code (Node, Link, custom JSON serialization) shares zero concepts with modern IPLD. As yashksaini-coder's audit showed, it has at least 5 critical bugs and implements none of the current IPLD spec. This is a clean-break situation, not a migration.

Proposed Initial Scope

Aligned with what seetadev outlined as Phase 1 — a minimal core scoped to what py-libp2p needs:

Codec protocol (the interface contract)
DAG-CBOR encode/decode with full spec compliance
Clean CID integration via multiformats
A Block abstraction (value + bytes + cid, mirroring js-multiformats)
Comprehensive tests, including byte-level verification and cross-implementation compatibility checks
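A minimal sketch of what a Block abstraction could look like, under stated assumptions: the field is named `data` here (the JS counterpart calls it `bytes`), the CID is kept as raw bytes rather than a CID object, sha2-256 is the only hash, and plain JSON (multicodec `0x0200`) stands in for a real DAG-CBOR encoder. `make_block` and `varint` are hypothetical helpers.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Callable


def varint(number: int) -> bytes:
    """Unsigned varint (LEB128), as used for CID version and codec fields."""
    out = bytearray()
    while True:
        byte = number & 0x7F
        number >>= 7
        if number:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


@dataclass(frozen=True)
class Block:
    """Decoded value, its encoded bytes, and the CID of those bytes."""
    value: Any
    data: bytes
    cid: bytes


def make_block(value: Any, encode: Callable[[Any], bytes], codec_code: int) -> Block:
    data = encode(value)
    digest = hashlib.sha256(data).digest()
    # Binary CIDv1: version, codec, then multihash (0x12 = sha2-256, length, digest).
    cid = varint(1) + varint(codec_code) + bytes([0x12, len(digest)]) + digest
    return Block(value=value, data=data, cid=cid)


def encode_json(value: Any) -> bytes:
    """Stand-in encoder for the sketch; a real block would use DAG-CBOR."""
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")


block = make_block({"hello": "world"}, encode_json, 0x0200)
```

Keeping the three parts together means a consumer can verify the CID against `data` without re-encoding `value`.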

Next Steps

Happy to start putting together an initial architecture proposal PR once there is alignment on direction.

References Studied:

IPLD Data Model: https://ipld.io/docs/data-model/
DAG-CBOR Spec: https://ipld.io/specs/codecs/dag-cbor/spec/
js-multiformats: https://github.com/multiformats/js-multiformats
js-dag-cbor: https://github.com/ipld/js-dag-cbor
cborg library: https://github.com/rvagg/cborg
go-ipld-prime: https://github.com/ipld/go-ipld-prime
hashberg-io/multiformats: https://github.com/hashberg-io/multiformats
storacha/py-ipld-dag-pb: https://github.com/storacha/py-ipld-dag-pb

CCing @seetadev @acul71 @yashksaini-coder

0 replies
