-
Notifications
You must be signed in to change notification settings - Fork 204
add multicodec #1194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
add multicodec #1194
Conversation
|
@gerceboss : Thank you Nilav. Appreciate the update and the improvements proposed here — this looks like a solid step forward 👍 Inviting @yashksaini-coder, @sumanjeet0012 , and @acul71 to review the changes, especially around the serialization updates and related tests. Appreciate the clarity on the context from #1172 and #1193. Looking forward to feedback and any suggestions before merge. |
9ab1c06 to
552d090
Compare
|
@gerceboss Is the PR ready for review or you are currently working on this ? |
|
Its ready for review @yashksaini-coder , I have left out the serialisation part as it was optional in the discussion so it can be added in another PR |
yashksaini-coder
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Required Fixes:-
- Breaking Changes: Binary format changes not clearly documented
- Fragile Parsing: Reverse indexing and hardcoded offsets
- Limited Hash Support: Assumes SHA2-256 throughout
- Test Coverage: Missing edge cases and backward compatibility tests
- Documentation: Needs updates reflecting new capabilities
- Error Handling: Some paths lack proper exception handling
Some Tests cases that needs to be done-
-
Backward Compatibility
def test_old_cid_format_decoding(): """Verify old CIDs still work after changes""" # Use pre-generated CIDs from old format old_cid = bytes([0x01, 0x55, 0x12, 0x20, ...]) # Should still verify correctly
-
Varint Codec Handling
def test_multibyte_varint_codec(): """Test codecs with multi-byte varint encoding""" # Test codecs > 127 that require multi-byte varint
-
String Codec Input
def test_string_codec(): cid1 = compute_cid_v1(data, "raw") cid2 = compute_cid_v1(data, CODEC_RAW) assert cid1 == cid2
-
Invalid Codec Handling
def test_invalid_codec(): with pytest.raises(ValueError): compute_cid_v1(data, "nonexistent-codec") with pytest.raises(ValueError): compute_cid_v1(data, 0xFFFFFFFF)
|
|
||
| return cid | ||
| # CIDv1 format: <version><codec-varint><multihash> | ||
| return bytes([CID_V1]) + codec_prefixed + multihash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using add_prefix() to get proper varint-encoded codec prefix is a great refactoring, but since codec was previously a single byte, now it's varint-encoded
-
This changes the binary format of CIDv1 which can lead to
⚠️ Breaking Change -
This can impact the Existing CIDs that may not be backward compatible
-
Can you add migration strategy and version testing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compatibility:
- Codecs < 128 (e.g., raw=0x55, dag-pb=0x70): Single-byte variant encoding
means legacy and new formats are identical (backward compatible). - Codecs >= 128: Multi-byte variant encoding means formats are different (breaking).
Migration Strategy:
- Legacy CIDs with codec < 128 continue to work without migration.
- Legacy CIDs with codec >= 128 need recomputation from original data.
- Use some function like
detect_cid_version()to identify format. - Use some function
migrate_legacy_cid()to convert when possible.
If this sounds good to you, I'll go ahead and implement the necessary functions and tests @yashksaini-coder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documentation: Needs updates reflecting new capabilities
can you please guide on this ? where to update and some example ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gerceboss Great to identify this, follow the main discussion post regarding the Multicodec integration #1172
Also I've created a small guide implementation, review and look at it, and see if it can help implement and answer your queries, #1172 (comment)
Keeping other in loop is important as well.
CC: @acul71
| ): | ||
| self.public_key = public_key | ||
| self.payload_type = payload_type | ||
|
|
||
| # Normalise payload_type to a Code instance | ||
| if isinstance(payload_type, bytes): | ||
| codec_name = get_codec(payload_type) | ||
| self.payload_type_code = Code.from_string(codec_name) | ||
| elif isinstance(payload_type, str): | ||
| self.payload_type_code = Code.from_string(payload_type) | ||
| else: | ||
| self.payload_type_code = payload_type | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Location:
Envelope.__init__()codec normalization - Issue: No try-except for
get_codec()orCode.from_string()failures - Recommendation:
try: codec_name = get_codec(payload_type) self.payload_type_code = Code.from_string(codec_name) except Exception as e: raise ValueError(f"Invalid codec: {e}")
| return compute_cid_v1(data, codec) | ||
|
|
||
|
|
||
| def parse_cid_codec(cid: bytes) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Issue:
parse_cid_codec()returns "cidv0" for CIDv0 - Problem: Not a standard multicodec name
- Recommendation: Document or return
Noneor "dag-pb"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see , its better to use DAG_PB.name 👍
libp2p/bitswap/cid.py
Outdated
| if len(cid) >= 4: | ||
| # Return first 4 bytes (version + codec + hash type + hash length) | ||
| return cid[:4] | ||
| # For CIDv1 produced by this module, the structure is: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
⚠️ Critical: This assumes SHA2-256 (32-byte digest)- Hardcoded offset
cid[-33]is fragile for other hash algorithms It may break if codec varint encoding is multi-byte. Add hash algorithm flexibility or explicit validation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently only sha2-256 is supported in py-libp2p and I saw that multihash integration is out of scope for this one #1170
@yashksaini-coder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okh I have seen the referred discussion post, #1170 (comment)
yashksaini-coder
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gerceboss Kindly check the reviewed code areas, and also add missing test cases
|
@gerceboss Have you completed the changes to the PR yet ? |
b5223d6 to
cf0a4a8
Compare
|
Updated with the changes necessary , added the docs as well @yashksaini-coder . One test was failing locally so had to change # Run echo example as server via module so imports resolve correctly
cmd = [sys.executable, "-u", "-m", "examples.echo.echo", "-p", "0"]in |
cf0a4a8 to
e169751
Compare
What was wrong?
Improvement proposed in #1172
Issue #1193
To-Do
@acul71 @seetadev