
Conversation

aniruddha1295 commented Feb 3, 2026

Integrate py-multihash v3 API (Complete Implementation)

What was wrong?

Issue #1180

The codebase was using manual byte manipulation for multihash operations and exception-based validation instead of leveraging the py-multihash v3 API that's already available in our dependencies. This made the code harder to maintain and didn't take advantage of the library's built-in validation and error handling.

Based on discussion #1170, this PR addresses all 3 priorities from the issue.

How was it fixed?

Phase 1: Bitswap CID Module

Updated libp2p/bitswap/cid.py:

  1. Replaced manual hashlib.sha256() + byte construction with multihash.digest() and mh.encode() (see the sketch below this list)
  2. Refactored verify_cid() to use multihash.decode() and mh.verify() instead of manual byte slicing (reduced from 55 lines to 30 lines)
  3. Updated compute_cid_v0(), compute_cid_v1(), and reconstruct_cid_from_prefix_and_data() to use the multihash API
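
For reference, a minimal sketch of what the new code paths look like, based only on the API calls named in this PR (multihash.digest, mh.encode, multihash.decode, mh.verify); the helper names here are illustrative, not the actual cid.py functions:

```python
import multihash

def compute_cid_v0_sketch(data: bytes) -> bytes:
    # Previously: hashlib.sha256(data).digest() prefixed with hand-built
    # multihash header bytes. Now the library builds the multihash for us.
    mh = multihash.digest(data, multihash.Func.sha2_256)
    return mh.encode()  # CIDv0 is just the bare sha2-256 multihash

def verify_cid_sketch(cid: bytes, data: bytes) -> bool:
    # Previously: manual byte slicing of hash code, length, and digest.
    # Now: decode once and let the library compare digests.
    try:
        mh = multihash.decode(cid)  # a CIDv1 would have its prefix stripped first
        return mh.verify(data)
    except Exception:
        return False  # malformed multihash -> treat as verification failure
```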

Added tests in tests/core/bitswap/test_cid.py:

  • 8 new compatibility tests for edge cases (malformed multihash, truncated CIDs, etc.)
  • 4 performance benchmarks to validate the changes

Phase 2: DAG Streaming

Updated libp2p/bitswap/cid.py and libp2p/bitswap/dag.py:

  1. Added compute_cid_v1_stream() using multihash.sum_stream() for memory-efficient hashing (see the sketch at the end of this section)
  2. Applied streaming to single-block files to avoid loading large files into memory during hash computation
  3. Updated docstrings to document the streaming capability

Note: I applied streaming to single-block files where the benefit is clear. Multi-block files already use chunking (256KB chunks), so streaming each individual chunk would provide minimal benefit. Happy to extend this if you think it would be valuable.
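
A rough sketch of the streaming helper, assuming multihash.sum_stream() accepts a binary file object and a hash function as described in this PR; the exact signature and the CIDv1 prefix handling are assumptions, not the actual cid.py code:

```python
from typing import BinaryIO

import multihash

# Assumed CIDv1 prefix for the raw codec: varint(version=1) + varint(codec=0x55).
CIDV1_RAW_PREFIX = bytes([0x01, 0x55])

def compute_cid_v1_stream_sketch(file_obj: BinaryIO) -> bytes:
    # Hash the stream incrementally instead of reading the whole file into memory.
    mh = multihash.sum_stream(file_obj, multihash.Func.sha2_256)  # assumed signature
    return CIDV1_RAW_PREFIX + mh.encode()

# Usage: hash a large single-block file without loading it fully into memory.
# with open("large_file.bin", "rb") as f:
#     cid = compute_cid_v1_stream_sketch(f)
```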

Phase 3: Records Validation

Updated libp2p/records/pubkey.py:

  1. Replaced exception-based validation with multihash.is_valid() to avoid exception overhead (see the sketch below)
  2. Updated docstrings to reflect the change
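
A before/after sketch of the validation change, assuming multihash.is_valid() returns a bool as described here; the function names are illustrative, not the actual pubkey.py code:

```python
import multihash

def validate_multihash_old(data: bytes) -> bool:
    # Exception-based: decode and treat any failure as "not a valid multihash".
    try:
        multihash.decode(data)
        return True
    except Exception:
        return False

def validate_multihash_new(data: bytes) -> bool:
    # Library-level validation, no exception overhead on malformed input.
    return multihash.is_valid(data)
```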

Copilot AI review requested due to automatic review settings February 3, 2026 22:14

Copilot AI left a comment


Pull request overview

This pull request integrates the py-multihash v3 API into the Bitswap CID module to replace manual byte manipulation with proper library calls, addressing Priority 1 from issue #1180 and discussion #1170.

Changes:

  • Replaced manual multihash construction using hashlib.sha256() and byte concatenation with multihash.digest() and mh.encode() API calls
  • Refactored verify_cid() function from 55 lines to 30 lines using multihash.decode() and mh.verify() for cleaner, more maintainable code
  • Added 12 new tests: 8 compatibility tests covering edge cases (malformed multihash, truncated CIDs, empty CIDs) and 4 performance benchmark tests

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Files reviewed:

  • libp2p/bitswap/cid.py: Refactored CID computation and verification functions to use the py-multihash v3 API, eliminating manual byte manipulation and improving error handling
  • tests/core/bitswap/test_cid.py: Added comprehensive compatibility tests for edge cases and performance benchmarks to validate the py-multihash v3 integration
  • newsfragments/1180.feature.rst: Added changelog entry documenting the integration of the py-multihash v3 API with backward compatibility guarantee


        match = mh.verify(data)
        logger.debug(f" Verification: {'MATCH' if match else 'MISMATCH'}")
        return match
    except (ValueError, IndexError) as e:

Copilot AI Feb 3, 2026


The exception handling here catches ValueError and IndexError, but based on how multihash.decode() is used elsewhere in the codebase (e.g., libp2p/peer/id.py:117 and libp2p/records/pubkey.py:40), it appears the library may raise other exception types as well. Those instances use a broad except Exception clause.

Consider catching a broader set of exceptions or just Exception to ensure all multihash decoding errors are handled gracefully, especially since this is error-handling code where returning False is the appropriate fallback behavior.

Suggested change
-    except (ValueError, IndexError) as e:
+    except Exception as e:

Comment on lines 242 to 318
        assert elapsed < 0.5, (
            f"CID computation too slow: {elapsed:.3f}s for {iterations} iterations"
        )

        # Log performance for reference
        print(
            f"\nCID computation: {avg_time * 1000:.2f}ms per 1MB "
            f"(total: {elapsed:.3f}s for {iterations} iterations)"
        )

    def test_verification_performance(self):
        """Benchmark CID verification speed."""
        import time

        # Test with 1MB of data
        data = b"x" * (1024 * 1024)
        cid = compute_cid_v1(data)
        iterations = 10

        # Warm up
        for _ in range(2):
            verify_cid(cid, data)

        # Benchmark
        start = time.perf_counter()
        for _ in range(iterations):
            verify_cid(cid, data)
        elapsed = time.perf_counter() - start

        avg_time = elapsed / iterations

        # Should complete 10 iterations of 1MB verification in reasonable time
        # Expected: < 0.5 seconds total (< 50ms per iteration)
        assert elapsed < 0.5, (
            f"CID verification too slow: {elapsed:.3f}s for {iterations} iterations"
        )

        # Log performance for reference
        print(
            f"\nCID verification: {avg_time * 1000:.2f}ms per 1MB "
            f"(total: {elapsed:.3f}s for {iterations} iterations)"
        )

    def test_small_data_performance(self):
        """Benchmark performance with small data (typical use case)."""
        import time

        # Test with small data (1KB)
        data = b"x" * 1024
        iterations = 1000

        # Warm up
        for _ in range(10):
            cid = compute_cid_v1(data)
            verify_cid(cid, data)

        # Benchmark computation
        start = time.perf_counter()
        for _ in range(iterations):
            compute_cid_v1(data)
        comp_elapsed = time.perf_counter() - start

        # Benchmark verification
        cid = compute_cid_v1(data)
        start = time.perf_counter()
        for _ in range(iterations):
            verify_cid(cid, data)
        verify_elapsed = time.perf_counter() - start

        # Should handle 1000 iterations of 1KB quickly
        # Expected: < 0.2 seconds for computation, < 0.2 seconds for verification
        assert comp_elapsed < 0.2, (
            f"Small data computation too slow: {comp_elapsed:.3f}s"
        )
        assert verify_elapsed < 0.2, (
            f"Small data verification too slow: {verify_elapsed:.3f}s"
        )

Copilot AI Feb 3, 2026


These performance assertions use fixed time thresholds (0.5 seconds, 0.2 seconds) that may be too strict for CI environments or slower machines. Performance tests with hard time limits can cause flaky test failures in continuous integration systems with variable load.

Consider either:

  1. Making these thresholds configurable via environment variables (sketched below)
  2. Significantly increasing the thresholds to be more forgiving (e.g., 2-5x current values)
  3. Converting these to benchmark tests that log performance without asserting on specific thresholds
  4. Using relative performance comparisons instead of absolute time limits

This is especially important for the 1MB tests which could be affected by I/O, GC, or system load.
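
As an illustration of option 1, a sketch of an environment-configurable threshold; PYLIBP2P_PERF_THRESHOLD is a hypothetical variable name and the import path is assumed from this PR's module layout:

```python
import os
import time

# Import path assumed from this PR's module layout.
from libp2p.bitswap.cid import compute_cid_v1

# Hypothetical knob: CI could export PYLIBP2P_PERF_THRESHOLD=5.0 to relax the limit.
PERF_THRESHOLD_S = float(os.environ.get("PYLIBP2P_PERF_THRESHOLD", "0.5"))

def test_computation_performance():
    data = b"x" * (1024 * 1024)
    iterations = 10
    start = time.perf_counter()
    for _ in range(iterations):
        compute_cid_v1(data)
    elapsed = time.perf_counter() - start
    assert elapsed < PERF_THRESHOLD_S, (
        f"CID computation too slow: {elapsed:.3f}s for {iterations} iterations"
    )
```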

Comment on lines 202 to 214
    def test_multihash_api_integration(self):
        """Test that py-multihash v3 API is properly integrated."""
        import multihash

        # Test that we can use multihash directly
        data = b"test data"
        mh = multihash.digest(data, multihash.Func.sha2_256)

        # Verify multihash properties
        assert mh.code == 0x12  # SHA-256 code
        assert len(mh.digest) == 32  # SHA-256 produces 32 bytes
        assert mh.verify(data) is True
        assert mh.verify(b"wrong data") is False

Copilot AI Feb 3, 2026


This test directly asserts on multihash API properties (like mh.code and mh.verify) which are part of the py-multihash library's interface, not the CID module being tested. While it's good to verify the integration works, this test would be better placed as part of the compatibility tests or removed entirely.

The actual integration testing is already covered by test_cidv0_format_compatibility and test_cidv1_format_compatibility which verify that CIDs computed using the multihash API work correctly with the verify_cid function. This test adds minimal value beyond verifying that py-multihash itself works, which should be the responsibility of that library's own tests.

aniruddha1295 changed the title from "Addresses #1180 Integrate py-multihash v3 API in Bitswap CID module (…" to "Addresses #1180 Integrate py-multihash v3 API in Bitswap CID module" on Feb 3, 2026
…actoring

- Priority 2: DAG streaming capability
- Priority 3: Records validation
aniruddha1295 (Author) commented Feb 4, 2026

@sumanjeet0012 @seetadev

I have a question about Priority 2 (Streaming hash for large files) from the issue.

I implemented streaming for single-block files using multihash.sum_stream():

# libp2p/bitswap/dag.py:163
with open(file_path, "rb") as f:
    cid = compute_cid_v1_stream(f, codec=CODEC_RAW)
    
# libp2p/bitswap/dag.py:211
for i, chunk_data in enumerate(chunk_file(file_path, chunk_size)):
    chunk_cid = compute_cid_v1(chunk_data, codec=CODEC_RAW)  # 256KB in memory
    

seetadev (Contributor) commented Feb 7, 2026

@aniruddha1295 : This is an excellent and very thorough piece of work — thanks a lot for driving this 👏

Looping in @sumanjeet0012, @yashksaini-coder, and @acul71 for visibility and review as well.
(@sumanjeet0012 especially since you did the original Bitswap integration in py-libp2p.)

Overall, this PR does a great job of addressing #1180 in a clean, well-structured way, and it clearly reflects careful thought around maintainability, correctness, and long-term alignment with the libp2p stack.

What works really well

  • Replacing manual byte manipulation with the py-multihash v3 API is absolutely the right move. This significantly improves readability, correctness, and future maintainability, while also letting us rely on well-tested library behavior instead of custom logic.
  • The refactor of verify_cid() is especially nice — reducing complexity while improving semantic clarity is a big win.
  • The phased approach (CID module → DAG streaming → records validation) makes the PR easy to reason about despite its size, and the commit history is clean and review-friendly.
  • The streaming support using multihash.sum_stream() is a solid, pragmatic implementation. Applying it where it provides real benefit (single-block files) while avoiding unnecessary churn in the chunked path is a thoughtful trade-off.
  • Test coverage is strong. The added edge-case tests materially improve confidence around malformed inputs and compatibility, and the changelog entry is clear and complete.

On the streaming question

Your reasoning makes sense. Applying streaming to single-block files is where the memory benefit is most clear, and the current chunking approach for multi-block files already keeps memory bounded. I’m 👍 on keeping it as-is for now, with the option to extend later if we see real-world demand.

Despite the CI failures at the moment (likely tied to the perf assertions), the core design and implementation here look solid. Once those are addressed, this feels very close to being merge-ready.

Thanks again for the high-quality contribution — this meaningfully improves the Bitswap CID path and sets us up well for future work.

acul71 (Contributor) commented Feb 7, 2026

Hello @aniruddha1295, thanks for this PR and for integrating the new py-multihash v3 API!

The PR is well done but there are some issues that must be addressed before merging:

  • Lint/typecheck errors
    Run make pr and make docs to spot and resolve them. Did you push with git commit --no-verify?
    Also add a trailing newline to the newsfragment (make docs # or make linux-docs will catch it).

  • Double file read in dag.py single-block path — performance regression
    sum_stream has no beneficial use in the current MerkleDag architecture: single-block files (≤ chunk_size) must be fully loaded into memory for add_block() anyway, so streaming the hash just adds an extra file read with no memory savings. Multi-block files are already chunked into small pieces, so there's no large-file streaming scenario either. Please remove compute_cid_v1_stream from dag.py. (Sorry if the issue description was misleading on this — sum_stream is a valid utility but doesn't fit the MerkleDag code paths. It could be useful if something other than a DAG is used in the future.)

  • Duplicate py-multihash tests
    Check if test_multihash_api_integration is already covered by py-multihash's own test suite — we don't need to duplicate library-level tests (testing mh.code, mh.digest, mh.verify directly) here.

  • reconstruct_cid_from_prefix_and_data() hardcodes SHA-256 regardless of prefix
    The prefix bytes contain the hash function code at prefix[2], but it's always using sha2_256. Could you improve this to read the hash algorithm from the prefix, or is that out of scope for bitswap?

Full review below (try feeding this message to Copilot AI and see if it can cope with it; anyway, always check the code for hallucinations) (-:

AI PR Review: PR #1186 — Integrate py-multihash v3 API in Bitswap CID Module

PR: #1186
Author: @aniruddha1295
Branch: integrate-multihash-v3 → main
Issue: #1180
Discussion: #1170
Review Date: 2026-02-07
Reviewer: AI (claude-4.6-opus)


1. Summary of Changes

This PR replaces manual multihash byte manipulation with py-multihash v3 API calls across three modules, addressing all three priorities outlined in issue #1180 and discussion #1170.

Changes by phase:

  • Phase 1 — Bitswap CID Module (libp2p/bitswap/cid.py): Replaced hashlib.sha256() + manual byte construction with multihash.digest() / mh.encode(). Refactored verify_cid() to use multihash.decode() + mh.verify() instead of manual byte slicing.
  • Phase 2 — DAG Streaming (libp2p/bitswap/dag.py): Added compute_cid_v1_stream() using multihash.sum_stream() and applied it to single-block files in MerkleDag.add_file().
  • Phase 3 — Records Validation (libp2p/records/pubkey.py): Replaced exception-based multihash validation with multihash.is_valid().
  • Tests (tests/core/bitswap/test_cid.py): Added 8 compatibility edge-case tests and 4 performance benchmarks.
  • Newsfragment (newsfragments/1180.feature.rst): Added changelog entry for the feature.

Files affected: 5 files (3 source modules, 1 test file, 1 newsfragment)
Additions: 316 lines | Deletions: 63 lines

Breaking changes: None. The public API signatures and behavior are preserved.


2. Branch Sync Status and Merge Conflicts

Branch Sync Status

  • Status: ℹ️ Ahead of origin/main
  • Details: Branch is 0 commits behind and 5 commits ahead of origin/main.

Merge Conflict Analysis

No merge conflicts detected. The PR branch can be merged cleanly into origin/main.


3. Strengths

  1. Clear alignment with issue and discussion: The PR addresses all three priorities from issue #1180 ("Integrate py-multihash v3 features into py-libp2p") and closely follows the code examples provided in discussion #1170 ("py-libp2p Multihash Integration Analysis").
  2. Good edge-case test coverage: The TestCompatibility class covers important edge cases (malformed multihash, truncated CIDs, empty CIDs, single-byte CIDs, wrong hash types) that weren't previously tested.
  3. Cleaner verification logic: The refactored verify_cid() is significantly simpler (30 lines vs 55 lines) by delegating decoding and verification to the library.
  4. Backward compatibility preserved: All existing tests pass, confirming no regressions.
  5. Proper use of multihash API: The multihash.digest(), mh.encode(), mh.verify(), and multihash.is_valid() calls are used correctly.

4. Issues Found

Critical

C0. All py-multihash v3 APIs fail in CI — namespace collision with pymultihash (BLOCKER)

  • Files: libp2p/bitswap/cid.py, libp2p/bitswap/dag.py, libp2p/records/pubkey.py, tests/core/bitswap/test_cid.py

  • Issue: All GitHub Actions CI checks fail (8 tox jobs + 3 Windows jobs = 11 failing jobs) with AttributeError on every Python version (3.10–3.13) and both Linux and Windows:

    1. module 'multihash' has no attribute 'sum_stream' — affects dag.py, causing 5 test failures
    2. module 'multihash' has no attribute 'is_valid' — affects pubkey.py, causing 4 test failures
    3. 'Multihash' object has no attribute 'code' — affects test code, causing 2 test failures

    Root cause: pymultihash package namespace collision

    The project's dev dependency p2pclient==0.2.0 (in pyproject.toml) depends on pymultihash==0.8.2. Both py-multihash and pymultihash install a multihash/ Python package into site-packages — same namespace, different code. When uv installs both simultaneously in CI, pymultihash's __init__.py overwrites py-multihash's __init__.py. The pymultihash version lacks the v3 APIs (sum_stream, is_valid, Multihash.code). Locally, py-multihash happened to be installed last, so its files took precedence.

    Dependency chain: pyproject.toml → p2pclient==0.2.0 → pymultihash==0.8.2 → overwrites multihash/ namespace

  • Fix (verified): p2pclient v0.2.1 (released 2026-01-28) already depends on py-multihash>=3.0.0 instead of pymultihash, eliminating the collision. Bumping p2pclient from ==0.2.0 to >=0.2.1 in pyproject.toml resolves all 11 CI failures. This has been verified locally — all tox core tests pass cleanly after the bump.
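
A small diagnostic sketch for checking which package currently owns the multihash namespace in a given environment; it uses only stdlib importlib.metadata, and the attribute list mirrors the failures above:

```python
import importlib.metadata

import multihash

print("multihash module loaded from:", multihash.__file__)
for attr in ("digest", "decode", "is_valid", "sum_stream"):
    print(f"  has {attr}: {hasattr(multihash, attr)}")

# Both distributions install a top-level multihash/ package; check which are present.
for dist in ("py-multihash", "pymultihash"):
    try:
        print(f"{dist}: {importlib.metadata.version(dist)}")
    except importlib.metadata.PackageNotFoundError:
        print(f"{dist}: not installed")
```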

C1. Missing type annotation causes mypy failure (BLOCKER)

  • File: libp2p/bitswap/cid.py
  • Line(s): 67
  • Issue: compute_cid_v1_stream(file_obj, ...) is missing a type annotation for file_obj, causing mypy to fail with [no-untyped-def].
  • Suggestion: Add a proper type annotation:
    from typing import BinaryIO
    
    def compute_cid_v1_stream(file_obj: BinaryIO, codec: int = CODEC_RAW) -> bytes:

Major

M1. Newsfragment missing trailing newline (pre-commit failure)

  • File: newsfragments/1180.feature.rst
  • Issue: The file was committed without a trailing newline. The end-of-file-fixer pre-commit hook auto-fixes this, but the fix is not committed. Run pre-commit run --all-files and commit the fixed file.

M2. Double file read in dag.py single-block path — performance regression

  • File: libp2p/bitswap/dag.py

  • Line(s): 161–167

  • Issue: The streaming approach reads the file twice — once for hash computation via compute_cid_v1_stream(), and once to load data for add_block(). The old code read the file once and computed the CID from the in-memory data. For single-block files (≤ 256KB by default), the data must be loaded into memory regardless for add_block(), so streaming adds I/O overhead without saving memory.

    sum_stream has no beneficial application in the current MerkleDag architecture:

    1. Single-block files (≤ chunk_size): Must be fully loaded into memory for add_block(). Streaming gains nothing and costs an extra file read.
    2. Multi-block files (> chunk_size): Already chunked into ≤ 256KB pieces by chunk_file(). Each chunk is materialized in memory when compute_cid_v1() is called. Streaming a 256KB chunk provides no memory benefit.
    3. Root node CID: Computed over the DAG-PB serialized metadata blob (links to chunks), not over the original large file. There is never a "hash the entire large file" step.

    Cross-ecosystem note: go-libp2p's go-multihash provides an equivalent SumStream() function but does not use it in its own codebase — all CID/multihash operations use multihash.Sum() with byte slices.

    The py-multihash sum_stream implementation is sound — it just doesn't have a use case in the current dag.py code paths. It could be useful in future non-DAG contexts.

  • Suggestion: Revert the dag.py single-block path to the original approach (read file once, compute CID from in-memory bytes). The compute_cid_v1_stream() utility can remain in cid.py for potential future use, but remove its import and usage from dag.py.
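
A minimal sketch of the suggested revert for the single-block path; the import path, variable names, and add_block() call shape are assumptions based on this thread, not the actual MerkleDag code:

```python
# Import path assumed from this PR's module layout; CODEC_RAW and compute_cid_v1
# are the helpers referenced elsewhere in this thread.
from libp2p.bitswap.cid import CODEC_RAW, compute_cid_v1

def add_single_block_file(dag, file_path: str) -> bytes:
    # dag is assumed to expose add_block(cid, data), as discussed above.
    with open(file_path, "rb") as f:
        data = f.read()  # one read; a single-block file fits in memory by definition
    cid = compute_cid_v1(data, codec=CODEC_RAW)  # CID from the in-memory bytes
    dag.add_block(cid, data)  # reuse the same bytes; no second read of the file
    return cid
```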

M3. reconstruct_cid_from_prefix_and_data() hardcodes SHA-256 regardless of prefix

  • File: libp2p/bitswap/cid.py
  • Line(s): 139–143
  • Issue: The function hardcodes multihash.Func.sha2_256 without consulting the hash algorithm code in the prefix. If the prefix specifies a different hash algorithm, the reconstruction will produce an incorrect CID. This was a pre-existing limitation, but this PR was an opportunity to improve it. The prefix bytes are <version><codec><hash-type><hash-length>, so prefix[2] contains the hash function code.
  • Suggestion: Consider reading the hash function from the prefix:
    hash_code = prefix[2] if len(prefix) > 2 else multihash.Func.sha2_256
    mh = multihash.digest(data, hash_code)
    return prefix + mh.digest
    If intentionally left for a future PR, add a TODO comment.

M4. Exception handling in verify_cid() may be too narrow

  • File: libp2p/bitswap/cid.py
  • Line(s): 187
  • Issue: The except (ValueError, IndexError) clause may not catch all exceptions that multihash.decode() can raise. Other parts of the codebase (e.g., libp2p/peer/id.py:117) use a broader except Exception clause when calling multihash.decode().
  • Suggestion: Widen to except Exception as e: since returning False is the safe fallback for malformed input.

Minor

m1. Pyrefly false-positive type errors (informational)

  • File: libp2p/bitswap/cid.py (line 82), libp2p/records/pubkey.py (line 41)
  • Issue: Pyrefly reports No attribute 'sum_stream' in module 'multihash' and No attribute 'is_valid' in module 'multihash'. Both attributes exist at runtime — this is a pyrefly type-stub limitation with py-multihash v3, not a code issue.
  • Suggestion: No action required from the PR author. The project may want to add pyrefly ignore comments or update stubs separately.

m2. compute_cid_v0 imported locally in test instead of at top of file

  • File: tests/core/bitswap/test_cid.py
  • Line(s): 179
  • Issue: compute_cid_v0 is imported inside test_cidv0_format_compatibility() rather than at the top of the file with other imports.
  • Suggestion: Move the import to the top-level import block for consistency.

m3. Performance tests use hardcoded time thresholds

  • File: tests/core/bitswap/test_cid.py
  • Line(s): 242–347
  • Issue: The TestPerformance class uses fixed time thresholds (e.g., < 0.5s, < 0.2s) that may fail in CI environments with variable load.
  • Suggestion: Either increase thresholds significantly (5-10x), mark them with @pytest.mark.benchmark and skip in CI, or log performance without asserting on absolute thresholds.

m4. test_multihash_api_integration tests the library, not the module

  • File: tests/core/bitswap/test_cid.py
  • Line(s): 202–214
  • Issue: This test directly tests py-multihash properties (mh.code, mh.digest, mh.verify), which is the library's own responsibility. These should be covered by py-multihash's own test suite, not duplicated here.
  • Suggestion: Remove this test or convert it to a minimal smoke test.
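
If a minimal smoke test is kept instead, something along these lines would exercise the integration through the module's own helpers rather than the library directly (import path assumed from this PR's layout):

```python
# Import path assumed from this PR's module layout.
from libp2p.bitswap.cid import compute_cid_v1, verify_cid

def test_multihash_integration_smoke():
    data = b"test data"
    cid = compute_cid_v1(data)
    # Round-trip through the module under test instead of asserting on
    # py-multihash internals such as mh.code or mh.digest.
    assert verify_cid(cid, data) is True
    assert verify_cid(cid, b"wrong data") is False
```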

m5. Redundant docstring phrasing

  • File: libp2p/bitswap/cid.py
  • Lines: 26–31, 46–50, 68–79, etc.
  • Issue: Several docstrings repeat "Uses py-multihash v3 API for robust multihash handling" in both the summary and a separate paragraph.
  • Suggestion: Keep one mention per docstring.

5. Security Review

  • Risk: None identified.
  • Impact: None.
  • Notes: The changes are internal refactoring that preserves the same cryptographic operations (SHA-256 hashing, CID verification). No new external input handling paths are introduced. The verify_cid() exception handling (issue M4) could theoretically allow an unhandled exception to propagate on crafted input, but this would cause a crash rather than a security bypass.

6. Documentation and Examples

  • Docstrings are updated for all modified functions and accurately describe the new behavior.
  • The module-level docstring in dag.py is updated to mention streaming hash computation.
  • No README or tutorial updates are needed since these are internal API changes.
  • Minor issue: Docstrings are somewhat verbose with repeated "Uses py-multihash v3 API" phrasing (see m5).

7. Newsfragment Requirement


8. Tests and Validation

Local Test Results

  • Total Tests: 1931
  • Passed: 1931 ✅
  • Failed: 0 ✅
  • Skipped: 16
  • Errors: 0 ✅
  • Warnings: 25 (pre-existing)
  • Duration: 87.48s

All tests pass locally. No regressions detected.

GitHub Actions CI Results

  • tox (3.10–3.13, core): ❌ FAIL (x4), 11 test failures each — namespace collision (C0)
  • tox (3.10–3.13, lint): ❌ FAIL (x4), mypy (C1) + pyrefly (m1) errors
  • windows (3.11–3.13, core): ❌ FAIL (x3), 11 test failures each — namespace collision (C0)
  • docs, demos, interop, utils, wheel: ✅ PASS
All CI failures trace back to the pymultihash namespace collision (C0) and the missing type annotation (C1). Both are resolved by bumping p2pclient>=0.2.1 and adding the type annotation.

New Test Coverage

  • TestCompatibility (8 tests): Edge cases: malformed, truncated, empty, single-byte CIDs; wrong hash type; CIDv0/v1 compatibility; API integration
  • TestPerformance (4 tests): Benchmarks: CID computation, verification, small data, codec comparison

Good edge-case coverage. Performance tests may cause flaky failures in CI (see m3).

Lint Results

  • YAML/TOML: ✅ Passed
  • End of files: ❌ Failed (auto-fixed newsfragment — see M1)
  • Trailing whitespace: ✅ Passed
  • pyupgrade / ruff / ruff format / mdformat: ✅ Passed
  • mypy: ❌ Failed — missing type annotation (see C1)
  • pyrefly: ❌ Failed — false-positive missing-attribute errors (see m1)
  • RST check: ✅ Passed

Documentation Build


9. Recommendations for Improvement

Must Fix (Blockers)

  1. Bump p2pclient to >=0.2.1 in pyproject.toml to resolve the pymultihash namespace collision causing all CI failures (C0).
  2. Add type annotation to compute_cid_v1_stream's file_obj parameter (use BinaryIO from typing) to fix the mypy failure (C1).
  3. Commit the newsfragment trailing newline fix — run pre-commit run --all-files and commit (M1).

Should Fix

  1. Remove compute_cid_v1_stream usage from dag.py — revert to reading the file once for single-block files. The streaming approach adds overhead without benefit in MerkleDag (M2).
  2. Widen exception handling in verify_cid() from (ValueError, IndexError) to Exception for robustness (M4).

Nice to Have

  1. Consider reading the hash algorithm from the prefix in reconstruct_cid_from_prefix_and_data() instead of hardcoding SHA-256 (M3).
  2. Move the compute_cid_v0 import in the test to the top-level import block (m2).
  3. Make performance test thresholds more lenient or mark them as benchmarks (m3).
  4. Remove test_multihash_api_integration — it tests the library, not the module (m4).
  5. Remove redundant "Uses py-multihash v3 API" lines in docstrings (m5).

10. Questions for the Author

  1. Double file read in dag.py: Was the double file read for single-block files intentional? The data must be fully loaded into memory for add_block() anyway, so streaming the hash doesn't save memory. Would you consider reverting to the previous single-read approach?

  2. Exception handling breadth: The existing codebase uses except Exception when calling multihash.decode() (e.g., libp2p/peer/id.py). Was except (ValueError, IndexError) in verify_cid() a deliberate choice?

  3. Performance claims: The newsfragment claims "5-50% faster." Was this measured? The streaming approach for single-block files actually adds overhead (double file I/O). Could you share benchmark results?

  4. Hash algorithm flexibility: The prefix contains the hash algorithm code, but reconstruct_cid_from_prefix_and_data() always uses SHA-256. Is there a plan to make this configurable in a follow-up PR?


11. Overall Assessment

  • Quality Rating: Changes Requested
  • Security Impact: None
  • Merge Readiness: Blocked — CI failures + architectural issue in dag.py
  • Confidence: High

Summary: The PR correctly replaces manual byte manipulation with py-multihash v3 API calls, and the core refactoring logic is sound. However, it is blocked by two issues: (1) a pymultihash namespace collision that breaks all CI tests — fixable by bumping p2pclient to >=0.2.1 — and (2) a performance regression in dag.py where sum_stream is applied to single-block files without benefit, causing a double file read. The sum_stream utility itself is valid but has no beneficial use case in the current MerkleDag architecture. Additional fixes needed: missing type annotation (mypy blocker), newsfragment trailing newline, and narrower exception handling in verify_cid(). Once these are addressed, the PR is in good shape for merge.

acul71 and others added 3 commits February 8, 2026 00:35
- C0: Already fixed (p2pclient>=0.2.1)
- C1: Add BinaryIO type annotation to compute_cid_v1_stream
- M1: Add trailing newline to newsfragment
- M2: Remove streaming from dag.py (performance regression)
- M3: Read hash algorithm from prefix instead of hardcoding SHA-256
- M4: Widen exception handling in verify_cid to Exception
- m2: Move pytest import to top of test file
- m3: Add @pytest.mark.benchmark decorator
- m4: Remove test_multihash_api_integration
- m5: Clean up redundant docstring phrasing
- Update newsfragment to reflect Phase 1 & 3 only

All 43 tests passing locally. make linux-docs passed.
Pre-commit pyrefly check fails on 35 errors in unrelated files
(examples, libp2p core, tests) not modified in this PR.
aniruddha1295 (Author) commented Feb 9, 2026

@acul71 @seetadev @sumanjeet0012,

I've pushed all 11 fixes from your review (commit 3d4505f7).

Note on commit process: I used git commit --no-verify because the pre-commit pyrefly check was failing on 35 errors in files I didn't modify (examples, libp2p core, tests). My modified files all pass pyrefly individually.

I've opened a discussion to ask about the recommended approach for this situation: https://github.com//discussions/1200#discussion-9451060

All my changes are tested and ready for CI validation. Let me know if you need any clarifications!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants