Crawler Azure Blob Writes Error With block list is invalid #664

@ljones140

What

We are seeing a high number of errors when writing to Azure Blob Storage, failing with "The specified block list is invalid".

This error happens in the ClearlyDefined crawler when it upserts into Azure Blob Storage, in storage/storageDocStore.js.

It uses the Azure streaming upload function.

await blockBlobClient.uploadStream(dataStream, 8 << 20, 5, options)

We run a large number of instances, so "The specified block list is invalid" could be caused by a race condition between uploads, when more than one pod attempts to upload the same package at the same time.

 Pod 1: Processing request A → uploads "go/golang/package-v1.json"
        → Starts uploadStream with blocks: block_00001, block_00002, block_00003...

 Pod 2: Processing SAME request A → uploads "go/golang/package-v1.json"
        → ALSO starts uploadStream with blocks: block_00001, block_00002, block_00003...

 Azure: "Wait, I have TWO sets of block_00001 for the same blob!"
        → "The specified block list is invalid" ❌

Possible fixes

Use upload rather than uploadStream

await blockBlobClient.upload(data, data.length, options)

This sends the blob in a single request instead of splitting it into blocks. We should consider it if the blobs we upload are smaller than 100 MB. We may need to keep using the stream if the blobs are larger.

Obtain Leases when uploading to blobs

The Azure documentation recommends using leases to guard against concurrent-write races (see the Azure docs).

That feels like a bigger change which would require a lot of testing.
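
For reference, a rough sketch of what the lease approach might look like. The helper name uploadWithLease is hypothetical and body is assumed to be a Buffer; getBlobLeaseClient, acquireLease, releaseLease and the conditions.leaseId option come from @azure/storage-blob. Note that a lease can only be acquired on a blob that already exists, and acquireLease rejects while another pod holds the lease, so a real implementation would need handling for first-time writes and for retries.

```javascript
// Hypothetical helper, not existing crawler code.
// Assumes `body` is a Buffer and the blob already exists.
async function uploadWithLease(blockBlobClient, body, options = {}) {
  const leaseClient = blockBlobClient.getBlobLeaseClient()
  // Lease duration is in seconds: 15 to 60, or -1 for an infinite lease.
  await leaseClient.acquireLease(15)
  try {
    // Passing the lease ID means only the lease holder can write the blob.
    return await blockBlobClient.upload(body, body.length, {
      ...options,
      conditions: { ...options.conditions, leaseId: leaseClient.leaseId }
    })
  } finally {
    // Always release so other pods are not blocked until the lease expires.
    await leaseClient.releaseLease()
  }
}
```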

Solution for now

  • Files <100MB: Use upload()
  • Files ≥100MB: Use uploadStream()

And let's see if that alleviates the problem.
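
A minimal sketch of the size-based switch, assuming the document is available in memory as a string or Buffer. The helper name uploadDocument and the limit parameter are hypothetical; the uploadStream arguments are the ones used today. The length passed to upload() is a byte count, which is why the string is converted to a Buffer first rather than relying on the string's length.

```javascript
const { Readable } = require('stream')

// 100 MB threshold proposed in this issue.
const SINGLE_SHOT_LIMIT = 100 * 1024 * 1024

// Hypothetical helper, not existing crawler code.
async function uploadDocument(blockBlobClient, data, options = {}, limit = SINGLE_SHOT_LIMIT) {
  // Normalise to a Buffer so `length` is the size in bytes.
  const body = Buffer.isBuffer(data) ? data : Buffer.from(data)
  if (body.length < limit) {
    // Small blobs: one request, no block list to commit.
    return blockBlobClient.upload(body, body.length, options)
  }
  // Large blobs: keep the current streaming upload (8 MiB buffers, 5 in flight).
  const stream = Readable.from([body], { objectMode: false })
  return blockBlobClient.uploadStream(stream, 8 << 20, 5, options)
}
```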
