Addresses #1180 Integrate py-multihash v3 API in Bitswap CID module #1186
base: main
Conversation
…1180 (Priority 1 from discussion libp2p#1170)
Pull request overview
This pull request integrates the py-multihash v3 API into the Bitswap CID module to replace manual byte manipulation with proper library calls, addressing Priority 1 from issue #1180 and discussion #1170.
Changes:
- Replaced manual multihash construction using `hashlib.sha256()` and byte concatenation with `multihash.digest()` and `mh.encode()` API calls
- Refactored the `verify_cid()` function from 55 lines to 30 lines using `multihash.decode()` and `mh.verify()` for cleaner, more maintainable code
- Added 12 new tests covering edge cases (malformed multihash, truncated CIDs, empty CIDs) and 4 performance benchmark tests
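For orientation, here is a rough before/after sketch of the kind of change described above. The helper names are illustrative only, not the PR's actual function names; `multihash.digest()` and `.encode()` are the py-multihash v3 calls named in this PR.

```python
# Illustrative sketch only; helper names are hypothetical, not the PR's code.
import hashlib

import multihash


def sha256_multihash_manual(data: bytes) -> bytes:
    # Old style described in this PR: hash with hashlib, then hand-build the
    # multihash bytes as <code 0x12><length 0x20><digest>.
    digest = hashlib.sha256(data).digest()
    return bytes([0x12, len(digest)]) + digest


def sha256_multihash_v3(data: bytes) -> bytes:
    # New style: let py-multihash v3 build the same bytes.
    return multihash.digest(data, multihash.Func.sha2_256).encode()
```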
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| libp2p/bitswap/cid.py | Refactored CID computation and verification functions to use py-multihash v3 API, eliminating manual byte manipulation and improving error handling |
| tests/core/bitswap/test_cid.py | Added comprehensive compatibility tests for edge cases and performance benchmarks to validate the py-multihash v3 integration |
| newsfragments/1180.feature.rst | Added changelog entry documenting the integration of py-multihash v3 API with backward compatibility guarantee |
libp2p/bitswap/cid.py
Outdated
```python
        match = mh.verify(data)
        logger.debug(f" Verification: {'MATCH' if match else 'MISMATCH'}")
        return match
    except (ValueError, IndexError) as e:
```
Copilot AI commented on Feb 3, 2026
The exception handling here catches ValueError and IndexError, but based on how multihash.decode() is used elsewhere in the codebase (e.g., libp2p/peer/id.py:117 and libp2p/records/pubkey.py:40), it appears the library may raise other exception types as well. Those instances use a broad except Exception clause.
Consider catching a broader set of exceptions or just Exception to ensure all multihash decoding errors are handled gracefully, especially since this is error-handling code where returning False is the appropriate fallback behavior.
Suggested change:

```diff
-    except (ValueError, IndexError) as e:
+    except Exception as e:
```
tests/core/bitswap/test_cid.py
Outdated
```python
        assert elapsed < 0.5, (
            f"CID computation too slow: {elapsed:.3f}s for {iterations} iterations"
        )

        # Log performance for reference
        print(
            f"\nCID computation: {avg_time * 1000:.2f}ms per 1MB "
            f"(total: {elapsed:.3f}s for {iterations} iterations)"
        )

    def test_verification_performance(self):
        """Benchmark CID verification speed."""
        import time

        # Test with 1MB of data
        data = b"x" * (1024 * 1024)
        cid = compute_cid_v1(data)
        iterations = 10

        # Warm up
        for _ in range(2):
            verify_cid(cid, data)

        # Benchmark
        start = time.perf_counter()
        for _ in range(iterations):
            verify_cid(cid, data)
        elapsed = time.perf_counter() - start

        avg_time = elapsed / iterations

        # Should complete 10 iterations of 1MB verification in reasonable time
        # Expected: < 0.5 seconds total (< 50ms per iteration)
        assert elapsed < 0.5, (
            f"CID verification too slow: {elapsed:.3f}s for {iterations} iterations"
        )

        # Log performance for reference
        print(
            f"\nCID verification: {avg_time * 1000:.2f}ms per 1MB "
            f"(total: {elapsed:.3f}s for {iterations} iterations)"
        )

    def test_small_data_performance(self):
        """Benchmark performance with small data (typical use case)."""
        import time

        # Test with small data (1KB)
        data = b"x" * 1024
        iterations = 1000

        # Warm up
        for _ in range(10):
            cid = compute_cid_v1(data)
            verify_cid(cid, data)

        # Benchmark computation
        start = time.perf_counter()
        for _ in range(iterations):
            compute_cid_v1(data)
        comp_elapsed = time.perf_counter() - start

        # Benchmark verification
        cid = compute_cid_v1(data)
        start = time.perf_counter()
        for _ in range(iterations):
            verify_cid(cid, data)
        verify_elapsed = time.perf_counter() - start

        # Should handle 1000 iterations of 1KB quickly
        # Expected: < 0.2 seconds for computation, < 0.2 seconds for verification
        assert comp_elapsed < 0.2, (
            f"Small data computation too slow: {comp_elapsed:.3f}s"
        )
        assert verify_elapsed < 0.2, (
            f"Small data verification too slow: {verify_elapsed:.3f}s"
        )
```
Copilot AI commented on Feb 3, 2026
These performance assertions use fixed time thresholds (0.5 seconds, 0.2 seconds) that may be too strict for CI environments or slower machines. Performance tests with hard time limits can cause flaky test failures in continuous integration systems with variable load.
Consider either:
- Making these thresholds configurable via environment variables
- Significantly increasing the thresholds to be more forgiving (e.g., 2-5x current values)
- Converting these to benchmark tests that log performance without asserting on specific thresholds
- Using relative performance comparisons instead of absolute time limits
This is especially important for the 1MB tests which could be affected by I/O, GC, or system load.
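As one possible direction, here is a hedged sketch of an environment-configurable threshold combined with a benchmark marker. The `CID_PERF_THRESHOLD_SECONDS` variable name and the import path are assumptions, and `@pytest.mark.benchmark` would need to be registered in the project's pytest configuration to avoid unknown-mark warnings.

```python
# Hypothetical variant of the 1MB computation benchmark; not the PR's code.
import os
import time

import pytest

from libp2p.bitswap.cid import compute_cid_v1  # assumed module path

# Loosen or tighten the limit per environment instead of hardcoding 0.5s.
PERF_THRESHOLD = float(os.environ.get("CID_PERF_THRESHOLD_SECONDS", "2.0"))


@pytest.mark.benchmark
def test_computation_performance_configurable():
    data = b"x" * (1024 * 1024)
    iterations = 10
    start = time.perf_counter()
    for _ in range(iterations):
        compute_cid_v1(data)
    elapsed = time.perf_counter() - start
    assert elapsed < PERF_THRESHOLD, (
        f"CID computation too slow: {elapsed:.3f}s for {iterations} iterations"
    )
```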
tests/core/bitswap/test_cid.py
Outdated
```python
    def test_multihash_api_integration(self):
        """Test that py-multihash v3 API is properly integrated."""
        import multihash

        # Test that we can use multihash directly
        data = b"test data"
        mh = multihash.digest(data, multihash.Func.sha2_256)

        # Verify multihash properties
        assert mh.code == 0x12  # SHA-256 code
        assert len(mh.digest) == 32  # SHA-256 produces 32 bytes
        assert mh.verify(data) is True
        assert mh.verify(b"wrong data") is False
```
Copilot AI commented on Feb 3, 2026
This test directly asserts on multihash API properties (like mh.code and mh.verify) which are part of the py-multihash library's interface, not the CID module being tested. While it's good to verify the integration works, this test would be better placed as part of the compatibility tests or removed entirely.
The actual integration testing is already covered by test_cidv0_format_compatibility and test_cidv1_format_compatibility which verify that CIDs computed using the multihash API work correctly with the verify_cid function. This test adds minimal value beyond verifying that py-multihash itself works, which should be the responsibility of that library's own tests.
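For reference, a minimal sketch of the round-trip style of check the reviewer is pointing to. The PR's actual tests are `test_cidv0_format_compatibility` and `test_cidv1_format_compatibility`; this is not their code, and the import path is assumed from the file list above.

```python
from libp2p.bitswap.cid import compute_cid_v1, verify_cid  # assumed module path


def test_cid_round_trip_compatibility():
    data = b"hello bitswap"
    cid = compute_cid_v1(data)    # CID produced via the py-multihash v3 path
    assert verify_cid(cid, data)  # verified through the refactored verify_cid()
    assert not verify_cid(cid, b"tampered payload")
```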
Addresses libp2p#1180 (Priority 3 from discussion libp2p#1170)
…ses libp2p#1180 (Priority 2 from discussion libp2p#1170)
…actoring - Priority 2: DAG streaming capability - Priority 3: Records validation
I have a question about Priority 2 (streaming hash for large files) from the issue. I implemented streaming for single-block files:

```python
# libp2p/bitswap/dag.py:163
with open(file_path, "rb") as f:
    cid = compute_cid_v1_stream(f, codec=CODEC_RAW)
```

Multi-block files still hash each chunk in memory:

```python
# libp2p/bitswap/dag.py:211
for _i, chunk_data in enumerate(chunk_file(file_path, chunk_size)):
    chunk_cid = compute_cid_v1(chunk_data, codec=CODEC_RAW)  # 256KB in memory
```

Should streaming be extended to the per-chunk path as well, or is the current chunking approach sufficient?
@aniruddha1295: This is an excellent and very thorough piece of work — thanks a lot for driving this 👏 Looping in @sumanjeet0012, @yashksaini-coder, and @acul71 for visibility and review as well. Overall, this PR does a great job of addressing #1180 in a clean, well-structured way, and it clearly reflects careful thought around maintainability, correctness, and long-term alignment with the libp2p stack.

What works really well

On the streaming question

Your reasoning makes sense. Applying streaming to single-block files is where the memory benefit is most clear, and the current chunking approach for multi-block files already keeps memory bounded. I'm 👍 on keeping it as-is for now, with the option to extend later if we see real-world demand.

Despite the CI failures at the moment (likely tied to the perf assertions), the core design and implementation here look solid. Once those are addressed, this feels very close to being merge-ready. Thanks again for the high-quality contribution — this meaningfully improves the Bitswap CID path and sets us up well for future work.
Hello @aniruddha1295, thanks for this PR and for integrating the new py-multihash v3 API! The PR is well done, but there are some issues that must be addressed before merging.

Full review below (try feeding this message to Copilot AI and see if it can cope with it; in any case, always check the code for hallucinations) :-)

AI PR Review: PR #1186 — Integrate py-multihash v3 API in Bitswap CID Module

PR: #1186

1. Summary of Changes

This PR replaces manual multihash byte manipulation with py-multihash v3 API calls across three modules, addressing all three priorities outlined in issue #1180 and discussion #1170.

Changes by phase: Phase 1 (Bitswap CID module), Phase 2 (DAG streaming), Phase 3 (Records validation).
Files affected: 5 files (3 source modules, 1 test file, 1 newsfragment)

Breaking changes: None. The public API signatures and behavior are preserved.

2. Branch Sync Status and Merge Conflicts

Branch Sync Status

Merge Conflict Analysis

✅ No merge conflicts detected. The PR branch can be merged cleanly into origin/main.

3. Strengths
4. Issues Found

Critical

C0. All py-multihash v3 APIs fail in CI due to a namespace collision (later resolved by requiring p2pclient>=0.2.1)

C1. Missing type annotation causes mypy failure (BLOCKER)

Major

M1. Newsfragment missing trailing newline (pre-commit failure)

M2. Double file read in the dag.py streaming path (performance regression)

M3. Hash algorithm hardcoded to SHA-256 instead of read from the multihash prefix

M4. Exception handling in verify_cid() is too narrow

Minor

m1. Pyrefly false-positive type errors (informational)

m2. pytest imported inside test functions instead of at the top of the test file

m3. Performance tests use hardcoded time thresholds

m4. test_multihash_api_integration adds little value beyond testing py-multihash itself

m5. Redundant docstring phrasing
5. Security Review
6. Documentation and Examples
7. Newsfragment Requirement
8. Tests and Validation

Local Test Results

All tests pass locally. No regressions detected.

GitHub Actions CI Results

All CI failures trace back to the C0 namespace collision.

New Test Coverage

Good edge-case coverage. Performance tests may cause flaky failures in CI (see m3).

Lint Results
Documentation Build
9. Recommendations for Improvement

Must Fix (Blockers)
Should Fix
Nice to Have
10. Questions for the Author
11. Overall Assessment
Summary: The PR correctly replaces manual byte manipulation with py-multihash v3 API calls, and the core refactoring logic is sound. However, it is blocked by two issues: the C0 namespace collision and the C1 missing type annotation.
- C0: Already fixed (p2pclient>=0.2.1)
- C1: Add BinaryIO type annotation to compute_cid_v1_stream
- M1: Add trailing newline to newsfragment
- M2: Remove streaming from dag.py (performance regression)
- M3: Read hash algorithm from prefix instead of hardcoding SHA-256
- M4: Widen exception handling in verify_cid to Exception
- m2: Move pytest import to top of test file
- m3: Add @pytest.mark.benchmark decorator
- m4: Remove test_multihash_api_integration
- m5: Clean up redundant docstring phrasing
- Update newsfragment to reflect Phase 1 & 3 only

All 43 tests passing locally. make linux-docs passed. Pre-commit pyrefly check fails on 35 errors in unrelated files (examples, libp2p core, tests) not modified in this PR.
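For illustration, a minimal sketch of what the M3/M4 items above amount to (hypothetical helper name and shape, not the actual diff): the hash function is taken from the decoded multihash prefix rather than hardcoded, and any decoding error simply fails verification.

```python
# Sketch only; verify_multihash is a hypothetical stand-in for the PR's verify_cid logic.
import logging

import multihash

logger = logging.getLogger(__name__)


def verify_multihash(mh_bytes: bytes, data: bytes) -> bool:
    try:
        mh = multihash.decode(mh_bytes)   # prefix carries the hash function code
        logger.debug("hash function code: 0x%x", mh.code)
        return mh.verify(data)            # re-hashes data with that function
    except Exception:                     # M4: malformed input -> False, not a crash
        return False
```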
@acul71 @seetadev @sumanjeet0012, I've pushed all 11 fixes from your review (commit …).

Note on commit process: I used … I've opened a discussion to ask about the recommended approach for this situation: https://github.com//discussions/1200#discussion-9451060

All my changes are tested and ready for CI validation. Let me know if you need any clarifications!
Integrate py-multihash v3 API (Complete Implementation)
What was wrong?
Issue #1180
The codebase was using manual byte manipulation for multihash operations and exception-based validation instead of leveraging the py-multihash v3 API that's already available in our dependencies. This made the code harder to maintain and didn't take advantage of the library's built-in validation and error handling.
Based on discussion #1170, this PR addresses all 3 priorities from the issue.
How was it fixed?
Phase 1: Bitswap CID Module
Updated libp2p/bitswap/cid.py:
- Replaced `hashlib.sha256()` + byte construction with `multihash.digest()` and `mh.encode()`
- Refactored `verify_cid()` to use `multihash.decode()` and `mh.verify()` instead of manual byte slicing (reduced from 55 lines to 30 lines)

Added tests in tests/core/bitswap/test_cid.py: 12 edge-case tests (malformed multihash, truncated CIDs, empty CIDs) and 4 performance benchmarks.

Phase 2: DAG Streaming
Updated libp2p/bitswap/cid.py and libp2p/bitswap/dag.py:
- Added `compute_cid_v1_stream()` using `multihash.sum_stream()` for memory-efficient hashing

Note: I applied streaming to single-block files where the benefit is clear. Multi-block files already use chunking (256KB chunks), so streaming each individual chunk would provide minimal benefit. Happy to extend this if you think it would be valuable.
Phase 3: Records Validation
Updated libp2p/records/pubkey.py:
- Replaced exception-based validation with `multihash.is_valid()` to avoid exception overhead
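A minimal sketch of the exception-free validation pattern this refers to (the helper name is hypothetical, and `multihash.is_valid()` is assumed to accept the raw multihash bytes):

```python
# Sketch only; not the PR's pubkey.py code.
import multihash


def decode_multihash_if_valid(mh_bytes: bytes):
    # Structural pre-check avoids try/except around decode() for malformed input.
    if not multihash.is_valid(mh_bytes):
        return None
    return multihash.decode(mh_bytes)
```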