## What
We are seeing a high number of errors when writing to Azure Blob Storage: `The specified block list is invalid`.

This error happens in the ClearlyDefined crawler when it upserts into Azure Blob Storage in `storage/storageDocStore.js`, which uses the Azure streaming upload function:

```js
await blockBlobClient.uploadStream(dataStream, 8 << 20, 5, options)
```
We have a high number of instances running, and `The specified block list is invalid` could be caused by a race condition between uploads: more than one pod attempting to upload the same package at the same time.
```text
Pod 1: Processing request A → uploads "go/golang/package-v1.json"
       → Starts uploadStream with blocks: block_00001, block_00002, block_00003...

Pod 2: Processing SAME request A → uploads "go/golang/package-v1.json"
       → ALSO starts uploadStream with blocks: block_00001, block_00002, block_00003...

Azure: "Wait, I have TWO sets of block_00001 for the same blob!"
       → "The specified block list is invalid" ❌
```
## Possible fixes
### Use `upload` rather than `uploadStream`

```js
await blockBlobClient.upload(data, data.length, options)
```

This doesn't chunk the file into blocks. We should consider this when the blobs we are uploading are <= 100 MB. We may need to continue using the stream for larger blobs.
### Obtain leases when uploading to blobs

Azure documentation recommends using leases for concurrent-write race conditions (see the Azure docs). That feels like a bigger change which would require a lot of testing.
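For reference, a lease-guarded upload might look roughly like the sketch below. This is only an illustration: `uploadWithLease` and the 30-second duration are made up here, and acquiring a lease requires the blob to already exist, so the real change would also need a create-if-missing step.

```javascript
// Sketch only: `blockBlobClient` is assumed to be an @azure/storage-blob
// BlockBlobClient for a blob that already exists (a lease can only be
// acquired on an existing blob). Names are illustrative, not our code.
async function uploadWithLease(blockBlobClient, data, options = {}) {
  const leaseClient = blockBlobClient.getBlobLeaseClient()
  // Lease duration must be 15-60 seconds (or -1 for infinite); 30 s is a guess.
  const lease = await leaseClient.acquireLease(30)
  try {
    // Passing the lease id makes Azure reject concurrent writers
    // that don't hold the lease, instead of corrupting the block list.
    return await blockBlobClient.upload(data, Buffer.byteLength(data), {
      ...options,
      conditions: { ...options.conditions, leaseId: lease.leaseId }
    })
  } finally {
    await leaseClient.releaseLease()
  }
}
```

The `finally` matters: if the upload throws and the lease is never released, other pods are locked out until the lease expires.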
## Solution for now

- Files <100 MB: use `upload()`
- Files ≥100 MB: use `uploadStream()`

Let's see if that alleviates the problem.
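A minimal sketch of that branching, with the size check pulled into a pure helper. The `upsertBlob` wrapper and the buffer handling are illustrative, not the crawler's actual code; only the two SDK calls and the 100 MB cutoff come from the proposal above.

```javascript
// 100 MB cutoff from the proposal above; purely a policy choice.
const SINGLE_SHOT_LIMIT = 100 * 1024 * 1024

// Pure helper so the branching logic is testable without Azure.
function shouldUseSingleShotUpload(sizeInBytes) {
  return sizeInBytes < SINGLE_SHOT_LIMIT
}

// Illustrative upsert wrapper; `blockBlobClient` would be an
// @azure/storage-blob BlockBlobClient in the real crawler.
async function upsertBlob(blockBlobClient, data, options) {
  const buffer = Buffer.isBuffer(data) ? data : Buffer.from(data)
  if (shouldUseSingleShotUpload(buffer.length)) {
    // Single-shot Put Blob: no uncommitted block list, so two pods
    // writing the same blob can no longer invalidate each other's blocks.
    return blockBlobClient.upload(buffer, buffer.length, options)
  }
  // Large blobs keep the existing chunked path.
  const { Readable } = require('stream')
  return blockBlobClient.uploadStream(Readable.from(buffer), 8 << 20, 5, options)
}
```

Note this only narrows the window: two pods single-shot-uploading the same blob will last-writer-win rather than error, which is acceptable since both write the same content.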