SgtPooki
Collaborator

@SgtPooki SgtPooki commented Sep 26, 2025

🎯 Add Piece Details with Leaf Count Information

Overview

This PR adds the ability to retrieve detailed piece information including leaf counts and calculated raw sizes for pieces stored in data sets. This provides users with more granular insights into their stored data, particularly useful for understanding storage costs and data organization.

New Features

1. Enhanced Piece Data Interface

  • Added DataSetPieceDataWithLeafCount interface that extends existing DataSetPieceData
  • Includes leafCount (number of leaves in piece's merkle tree) and rawSize (calculated as leafCount * 32 bytes)
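A minimal sketch of the interface shape described above. It assumes DataSetPieceData carries the fields that appear elsewhere in this PR's diff (pieceId, pieceCid, subPieceCid, subPieceOffset); the string stand-in for PieceCID and the withLeafCount helper are illustrative only, not the SDK's actual definitions.

```typescript
// Stand-in for the SDK's PieceCID type (assumption: simplified to a string here).
type PieceCID = string

// Assumed shape, based on the fields visible in this PR's diff.
interface DataSetPieceData {
  pieceId: number
  pieceCid: PieceCID
  subPieceCid: PieceCID
  subPieceOffset: number
}

interface DataSetPieceDataWithLeafCount extends DataSetPieceData {
  leafCount: number // number of 32-byte leaves in the piece's merkle tree
  rawSize: number // leafCount * 32, in bytes
}

const LEAF_SIZE_BYTES = 32

// Hypothetical helper: attach a leaf count and the derived rawSize to a piece.
function withLeafCount(piece: DataSetPieceData, leafCount: number): DataSetPieceDataWithLeafCount {
  return { ...piece, leafCount, rawSize: leafCount * LEAF_SIZE_BYTES }
}

// Placeholder CIDs; the leaf count is taken from the example output in this PR.
const example = withLeafCount(
  { pieceId: 1, pieceCid: 'placeholder-cid', subPieceCid: 'placeholder-cid', subPieceOffset: 0 },
  319044
)
console.log(example.rawSize) // 10209408
```

Note that rawSize here is always a multiple of 32, so it is an upper bound on the original payload size rather than the exact byte count.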

2. New API Methods

StorageManager:

  • getDataSetPieceDataWithLeafCount(dataSetId, pieceId) - Get leaf count for individual pieces

StorageContext:

  • getDataSetPiecesWithDetails() - Get all pieces with leaf count and raw size information

WarmStorageService:

  • getPieceLeafCount(dataSetId, pieceId) - Core method to fetch leaf count from PDPVerifier contract

3. PDPVerifier Integration

  • Added getPieceLeafCount(dataSetId, pieceId) method to interact with the PDPVerifier contract
  • Retrieves leaf count information directly from the blockchain

Key Benefits

  • Storage Cost Analysis: Users can now calculate exact storage costs using leaf counts
  • Data Insights: Better understanding of data organization and piece sizes

Usage Examples

// Get all pieces with details
const context = await synapse.storage.createContext({ dataSetId, providerId })
const piecesWithDetails = await context.getDataSetPiecesWithDetails()

piecesWithDetails.forEach(piece => {
  console.log(`Piece ${piece.pieceId}:`)
  console.log(`  CID: ${piece.pieceCid}`)
  console.log(`  Leaf Count: ${piece.leafCount}`)
  console.log(`  Raw Size: ${piece.rawSize} bytes`)
})

// Get individual piece leaf count
const leafCount = await synapse.storage.getDataSetPieceDataWithLeafCount(dataSetId, pieceId)
const rawSize = leafCount * 32

Files Changed

Core Implementation:

  • src/types.ts - Added DataSetPieceDataWithLeafCount interface
  • src/pdp/verifier.ts - Added getPieceLeafCount() method
  • src/warm-storage/service.ts - Added getPieceLeafCount() wrapper method
  • src/storage/manager.ts - Added public API method
  • src/storage/context.ts - Added getDataSetPiecesWithDetails() method

Testing:

  • src/test/warm-storage-service.test.ts - Tests for WarmStorageService methods
  • src/test/storage.test.ts - Tests for StorageManager and StorageContext methods

Documentation:

  • utils/example-piece-details.js - Working example script demonstrating the new functionality

Testing

  • ✅ All existing tests pass
  • ✅ New functionality has tests
  • ✅ Example script successfully demonstrates the feature with real data

Example Output

📊 Data Set Summary:
  PDP Verifier Data Set ID: 158
  Piece Count: 4

✅ Retrieved 4 pieces with details:
  Piece 1: Leaf Count: 319044, Raw Size: 10209408 bytes (9970.13 KB)
  Piece 2: Leaf Count: 319044, Raw Size: 10209408 bytes (9970.13 KB)
  Piece 3: Leaf Count: 319044, Raw Size: 10209408 bytes (9970.13 KB)
  Piece 4: Leaf Count: 319044, Raw Size: 10209408 bytes (9970.13 KB)

📈 Total: 1276176 leaves, 40837632 bytes (39880.50 KB)

Related filecoin-project/filecoin-pin#50

@github-project-automation github-project-automation bot moved this to 📌 Triage in FS Sep 26, 2025
@rvagg
Collaborator

rvagg commented Sep 26, 2025

Leaf count is nice, and useful for total data set if we can get it, but at the piece level we should have it in the Piece CID because we're using Piece CID v2. The only catch is that the way we currently get the piece list for a data set is a little broken and will be returning the wrong size (but the right multihash).

There are two separate issues to resolve here:

  1. That Curio bug in our pdpv0 branch needs to be fixed regardless so it returns the Piece CID as it should be. (I had a brief look this week and it wasn't obvious where the counting was wrong; I'm slightly worried it's recording the wrong size on the way in to the database rather than on the way out.) For now I'd just pretend they are correct and use them, with the understanding that we're going to fix this.
  2. We really shouldn't be asking Curio for the piece list for our data set in the first place. PDPVerifier has the piece list; we just need an accessor for it. It will probably have to be paginated to account for very large data sets, but this is the best approach because we go to the chain, which we can trust, not Curio, which we shouldn't. Then we have the v2 CIDs and can decode them to get the size (there should be some code in Synapse to help with this in piece.ts, or at least it should point to how to do it if someone wants to give it a go).

Total leaf count for a whole data set though would be useful too if we can get that off the chain.

@rjan90 rjan90 moved this from 📌 Triage to 🔎 Awaiting review in FS Sep 29, 2025
@BigLep
Contributor

BigLep commented Sep 29, 2025

My understanding of the situation is that the ball is in @SgtPooki's court to:

  1. Get leaf count from the PieceCIDv2 (just assume it's correct, independent of the Curio bug)
  2. Update the PDPVerifier wrapper in synapse-sdk to expose getActivePieces from the PDPVerifier contract. For now, synapse-sdk would expose a getAllActivePieces which then walks the PDPVerifier.getActivePieces until hasMore=false.
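The "walk until hasMore=false" step could be sketched roughly as follows. The page shape (pieces, pieceIds, hasMore) is inferred from the fields referenced in this thread, and the offset/limit fetcher signature is an assumption standing in for the PDPVerifier wrapper, not its confirmed API.

```typescript
// Assumed page shape; the real contract return type may differ.
interface ActivePiecesPage {
  pieces: string[]
  pieceIds: number[]
  hasMore: boolean
}

// Hypothetical fetcher signature standing in for PDPVerifier.getActivePieces.
type GetActivePieces = (dataSetId: number, offset: number, limit: number) => Promise<ActivePiecesPage>

interface ActivePiece {
  pieceId: number
  piece: string
}

// Collect every active piece by requesting pages until hasMore is false.
async function getAllActivePieces(
  getActivePieces: GetActivePieces,
  dataSetId: number,
  pageSize = 100
): Promise<ActivePiece[]> {
  const all: ActivePiece[] = []
  let offset = 0
  while (true) {
    const page = await getActivePieces(dataSetId, offset, pageSize)
    for (let i = 0; i < page.pieces.length; i++) {
      all.push({ pieceId: page.pieceIds[i], piece: page.pieces[i] })
    }
    // Stop on the last page; also stop on an empty page to avoid looping forever.
    if (!page.hasMore || page.pieces.length === 0) break
    offset += page.pieces.length
  }
  return all
}
```

A collect-all like this makes one contract call per page, so the cost grows with the size of the data set.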

Collaborator Author

@SgtPooki SgtPooki left a comment

self review.. some comments pointing reviewers to places where i'm not sure about things

* @returns The number of leaves for this piece
*/
async getPieceLeafCount(dataSetId: number, pieceId: number): Promise<number> {
// TODO: DO we need to call the contract for leaf count?
Collaborator Author

callout here.. not sure if piece.ts leafCount calculation is enough?

Collaborator

this is ok, but I don't think it's all that useful here; let's remove it for now and ignore leaf counts -- they mostly shouldn't be a concern to the user other than their fairly close relationship to size

Comment on lines +1274 to +1277
// TODO: should we call the contract for leaf count? i.e. pdpVerifier.getPieceLeafCount(this._dataSetId, piece.pieceId)
const leafCount = getLeafCount(piece.pieceCid) ?? 0
// TODO: is there a better way to get the raw size?
const rawSize = getRawSize(piece.pieceCid) ?? 0
Collaborator Author

this is the only place where we're setting rawSize and leafCount now.. should we be calling the contract instead of these helper methods?

Comment on lines +1348 to +1369
// Parse the piece data as a PieceCID
// The contract stores the full PieceCID multihash digest (including height and padding)
// The data comes as a hex string from ethers, we need to decode it as bytes then as a CID
const pieceDataHex = result.pieces[i].data
const pieceDataBytes = ethers.getBytes(pieceDataHex)

const cid = CID.decode(pieceDataBytes)
const pieceCid = asPieceCID(cid)
if (!pieceCid) {
  throw createError(
    'StorageContext',
    'getAllActivePiecesGenerator',
    `Invalid PieceCID returned from contract for piece ${result.pieceIds[i]}`
  )
}

yield {
  pieceId: result.pieceIds[i],
  pieceCid,
  subPieceCid: pieceCid,
  subPieceOffset: 0, // TODO: figure out how to get the sub piece offset
} satisfies DataSetPieceData
Collaborator Author

Note: a core contributor should really eyeball this to make sure i'm doing things properly

Comment on lines +264 to +269
// Expected leaf count is 2^height where height is calculated from size
const expectedHeight = Size.Unpadded.toHeight(BigInt(size))
const expectedLeafCount = 2 ** expectedHeight

assert.isNotNull(leafCount)
assert.strictEqual(leafCount, expectedLeafCount)
Collaborator Author

is this accurate?

Comment on lines +290 to +293
// Expected raw size is leaf count * 32
const expectedHeight = Size.Unpadded.toHeight(BigInt(size))
const expectedLeafCount = 2 ** expectedHeight
const expectedRawSize = expectedLeafCount * 32
Collaborator Author

is this accurate?

})
})

describe('getAllActivePieces', () => {
Collaborator Author

lots of tests and mocking setup in here. I tried following existing patterns, but some mock helpers would be nice. especially for contract calls. Something like this would be amazing:

const mockContractContext = createMockContractContext()
mockContractContext.mock(key, singleCallResponse) // key maps to transaction data prefix?

Collaborator

synapse.test.ts has a new mock system that we are trying to move towards

rawSize,
leafCount,
subPieceCid: piece.pieceCid,
subPieceOffset: 0, // TODO: figure out how to get the sub piece offset
Collaborator Author

how do I fulfill this value accurately?

@hugomrdias
Collaborator

Couple of comments here:

  • I don't think we should have any "get all" method in the SDK that hammers a contract; developers should be able to paginate to their liking and not be forced into the full list of pieces.
  • I'm not sure we need to have this in the core SDK; it seems very filecoin-pin specific.

Anyway, I don't feel like I can review this properly; we probably need Rod to answer some of the questions you added.

@SgtPooki
Collaborator Author

SgtPooki commented Oct 6, 2025

I don't think we should have any "get all" method in the SDK that hammers a contract; developers should be able to paginate to their liking and not be forced into the full list of pieces.

That makes sense. Can remove.

I'm not sure we need to have this in the core SDK; it seems very filecoin-pin specific.

I'm guessing you're talking about the getAllActivePiecesGenerator, getPiecesWithDetails, and getAllActivePieces methods?

We still need some methods exposed from synapse-sdk that currently aren't. I'm all ears for what we want to handle in the SDK and what we can handle in filecoin-pin.

I think we should at least export these methods:

pdp/verifier.ts: getActivePieces, getPieceLeafCount
piece/piece.ts: getLeafCount, getRawSize

Will wait for update from @rvagg before doing anything else

* @param pieceCid - The PieceCID to extract raw size from
* @returns The raw size in bytes or null if invalid
*/
export function getRawSize(pieceCid: PieceCID | CID | string): number | null {
Collaborator

Not quite right because height only gets us the padded size, we need to unpad using the padding that's also encoded. See #283
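As a rough sketch of the unpadding step described above: it assumes the PieceCID v2 digest has already been decoded into a tree height and a padding byte count, that the padded size is 32 bytes times 2^height, and that Fr32 expansion maps every 127 payload bytes to 128 padded bytes. The function name and inputs are hypothetical; #283 is the place to look for the SDK's actual approach.

```typescript
// Sketch only: `height` and `padding` are assumed to be already decoded from
// a PieceCID v2 multihash digest. Decoding the CID itself is out of scope here.
function rawSizeFromHeightAndPadding(height: number, padding: number): number {
  if (!Number.isInteger(height) || height < 0) {
    throw new RangeError('height must be a non-negative integer')
  }
  if (!Number.isInteger(padding) || padding < 0) {
    throw new RangeError('padding must be a non-negative integer')
  }
  // Size of the full binary tree in bytes: one 32-byte leaf per tree leaf.
  const paddedSize = 32 * 2 ** height
  // Assumption: Fr32 expansion is 127 -> 128 bytes, so undo it to get payload capacity.
  const unpaddedCapacity = Math.floor((paddedSize / 128) * 127)
  if (padding > unpaddedCapacity) {
    throw new RangeError('padding exceeds the capacity implied by height')
  }
  // Height alone yields only the capacity; subtracting padding yields the raw size.
  return unpaddedCapacity - padding
}

console.log(rawSizeFromHeightAndPadding(2, 27)) // 100
```

With height 2 the tree holds 128 padded bytes, which is 127 bytes of capacity; 27 bytes of padding leaves a 100-byte payload.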

Comment on lines +1274 to +1275
// TODO: should we call the contract for leaf count? i.e. pdpVerifier.getPieceLeafCount(this._dataSetId, piece.pieceId)
const leafCount = getLeafCount(piece.pieceCid) ?? 0
Collaborator

let's ditch this

@rvagg
Collaborator

rvagg commented Oct 7, 2025

OK, here are my current thoughts about this:

  1. I really like that you're fetching the pieces from the contract, I've been wanting to get rid of the query to the SP to get the pieces (I thought I had an issue for this somewhere?). It's not trustworthy, we should be getting it off the chain like you're doing here.
  2. Leaves are a problem and we should ignore them wherever possible, I think. Leaves are just the raw size rounded up to the nearest 32 because that's the proving unit. But our actual sizes don't have to be a multiple of 32. Now that I'm looking at PDPVerifier I can see that the rawSizes it returns is the leaf count multiplied by 32, i.e. the actual raw size rounded up to the nearest 32, which is wrong. I've opened an issue about this (Fix & clarify all uses of "rawSize", pdp#217), but we can just ignore it entirely for our purposes.
  3. I think that we could just encourage devs to use #283 (feat: getSizeFromPieceCID(cid) to extract size from PieceCIDv2) to work this out themselves from a CID, and not bother augmenting here with sizes.

Which leaves us with just "get pieces from contract". So I think we should pivot here slightly, so here's my suggestion:

  • Leave getActivePieces as the heart of this in pdp/verifier.ts, but don't return rawSizes from there, it's not helpful
  • Rename getAllActivePiecesGenerator to just getPieces, it can return an async generator, that's just how you list them, it's the new getDataSetPieces, but just return AsyncGenerator<PieceCID> - i.e. no need to do anything else but query the contract, yield CIDs and keep going as long as you're asked to.
  • Let's change getDataSetPieces - the name is bad, so let's @deprecate that method and replace the guts of it with what you have for getAllActivePieces, leaving the API stable (no additional args, just use defaults), so it uses getAllActivePiecesGenerator to do a collect-all. Eventually we'll remove this and let the user do collection according to their needs.
  • Remove everything else

Then we just consume PieceCIDs in Filecoin Pin, use https://github.com/FilOzone/synapse-sdk/pull/283 over there to get the sizes we want and encourage devs to use that pattern. We probably should document all of this too.

@rvagg
Collaborator

rvagg commented Oct 8, 2025

I've just realised that we also need pieceId along with the CIDs, so our getPieces should return an async generator of [PieceCID, number] or { cid: PieceCID, id: number }. We'll just need to document that the number is important for certain other operations.
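One way the suggested getPieces could look, as a sketch under assumptions: the page fetcher signature and page shape are hypothetical stand-ins for the PDPVerifier wrapper, PieceCID is simplified to a string, and the firstN consumer is only there to show that the caller controls how many pages get fetched.

```typescript
type PieceCID = string // stand-in for the SDK's PieceCID type

interface PieceEntry {
  cid: PieceCID
  id: number // pieceId, needed for certain other operations
}

// Assumed page shape and fetcher signature; the real wrapper may differ.
interface ActivePiecesPage {
  pieces: PieceCID[]
  pieceIds: number[]
  hasMore: boolean
}
type GetActivePieces = (offset: number, limit: number) => Promise<ActivePiecesPage>

// Lazily yield pieces page by page; no page is fetched until the consumer asks.
async function* getPieces(getActivePieces: GetActivePieces, batchSize = 100): AsyncGenerator<PieceEntry> {
  let offset = 0
  let hasMore = true
  while (hasMore) {
    const page = await getActivePieces(offset, batchSize)
    for (let i = 0; i < page.pieces.length; i++) {
      yield { cid: page.pieces[i], id: page.pieceIds[i] }
    }
    if (page.pieces.length === 0) break // guard against an empty page with hasMore=true
    offset += page.pieces.length
    hasMore = page.hasMore
  }
}

// Example consumer: take at most n entries, then stop requesting pages.
async function firstN(source: AsyncIterable<PieceEntry>, n: number): Promise<PieceEntry[]> {
  const out: PieceEntry[] = []
  if (n <= 0) return out
  for await (const entry of source) {
    out.push(entry)
    if (out.length >= n) break
  }
  return out
}
```

Breaking out of the for await loop ends the generator, so a caller who only needs the first few pieces never triggers the remaining contract calls.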

@rvagg
Collaborator

rvagg commented Oct 14, 2025

getSizeFromPieceCID now available as a util in pieces.ts

