How Blobfuse2 Works

BlobFuse uses the libfuse (fuse3) library to interface with the Linux FUSE kernel module and performs file system operations through the Azure Storage REST APIs. It maps Azure Blob Storage object names onto a directory-like structure using path conventions, so files can be accessed as if they resided locally. Operations such as mkdir, opendir, readdir, rmdir, open, read, create, write, close, unlink, truncate, stat, and rename are supported. chmod is also supported for accounts with hierarchical namespace (HNS) enabled.
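
To illustrate the path convention, here is a minimal Python sketch that derives one directory level from a flat list of blob names, treating `/` as the separator. It is not BlobFuse's implementation; the function name `list_directory` and the sample blob names are hypothetical and exist only for this example.

```python
def list_directory(blob_names, directory=""):
    """Return (subdirs, files) found directly under `directory`.

    Blob names are flat strings; a "/" inside a name is interpreted
    as a directory separator (an assumption of this sketch).
    """
    prefix = directory.strip("/")
    prefix = prefix + "/" if prefix else ""
    subdirs, files = set(), set()
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition("/")
        if not head:
            continue  # skip the directory's own marker or malformed names
        if sep:
            subdirs.add(head)  # more path components follow: a subdirectory
        else:
            files.add(head)    # last path component: a file
    return sorted(subdirs), sorted(files)


# Hypothetical blob names in a container
blobs = ["data/train/a.csv", "data/train/b.csv", "data/readme.txt", "logs/run1.log"]
print(list_directory(blobs))
print(list_directory(blobs, "data"))
```

Listing the root yields the two inferred directories `data` and `logs`, while listing `data` yields the subdirectory `train` and the file `readme.txt`.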

BlobFuse has two operating modes:

Caching (File Cache)
Streaming (Block Cache)

Caching (File Cache)

In this mode, BlobFuse downloads the entire file from Azure Blob Storage into a local cache directory before making it available to the application. All subsequent reads and writes are served from this local cache until the file is evicted or invalidated. If the file was created or modified, closing its file handles in the application triggers an upload of the file to the storage container. This mode suits workloads that read the same files repeatedly and datasets that fit on the local disk.

Streaming (Block Cache)

Unlike traditional file caching, which downloads the entire file before serving, block cache mode streams data in chunks (blocks) and serves it as it downloads. This is designed for workloads involving large files, such as AI/ML training datasets, genomic sequencing, and HPC simulations.
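
As a rough illustration of block-wise access, the sketch below computes which fixed-size blocks a read of a given offset and length touches. It only demonstrates the arithmetic and is not BlobFuse's implementation; the function name `blocks_for_read` and the 16 MiB block size are assumptions made for this example.

```python
MiB = 1024 * 1024


def blocks_for_read(offset, length, block_size):
    """Return the block indices covering bytes [offset, offset + length)."""
    if offset < 0 or length < 0 or block_size <= 0:
        raise ValueError("offset and length must be >= 0 and block_size > 0")
    if length == 0:
        return range(0)
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return range(first, last + 1)


# A 4 MiB read starting at 30 MiB, with an illustrative 16 MiB block size
print(list(blocks_for_read(30 * MiB, 4 * MiB, 16 * MiB)))
```

Here the read spans the boundary at 32 MiB, so only blocks 1 and 2 are needed instead of the whole file.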

Recommendations for using Block cache:

User applications must check the return code (success or failure) of file system calls such as read, write, close, and flush. If a call returns an error, the application must abort the corresponding operation.
User applications must ensure that there is only one writer at a time for a given file.
When dealing with very large files (in the TiB range), the block size must be configured accordingly. Azure Storage supports at most [50,000 blocks](https://learn.microsoft.com/en-us/rest/api/storageservices/put-block-list?tabs=microsoft-entra-id#remarks) per blob.
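
The block-size recommendation follows from simple arithmetic: the file size divided by the 50,000-block limit gives the smallest block size that still fits the file. The sketch below rounds that value up to a whole MiB. The function name `min_block_size_mib` is hypothetical, and rounding to whole MiB is an assumption of this example, not a documented requirement.

```python
MAX_BLOCKS_PER_BLOB = 50_000  # per-blob block limit cited above
MiB = 1024 ** 2
TiB = 1024 ** 4


def min_block_size_mib(file_size_bytes, max_blocks=MAX_BLOCKS_PER_BLOB):
    """Smallest whole-MiB block size that keeps the file within max_blocks blocks."""
    if file_size_bytes <= 0 or max_blocks <= 0:
        raise ValueError("file_size_bytes and max_blocks must be positive")
    min_bytes = -(-file_size_bytes // max_blocks)  # ceiling division
    return -(-min_bytes // MiB)                    # round up to whole MiB


for size_tib in (1, 5):
    print(size_tib, "TiB ->", min_block_size_mib(size_tib * TiB), "MiB")
```

For example, a 1 TiB file needs blocks of at least 21 MiB, and a 5 TiB file needs at least 105 MiB.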

Block cache should be used with the following caveats:

Concurrent writes to the same file through multiple handles are not checked for data consistency and may lead to incorrect data being written.
A read on a file that is simultaneously being written to by another process or handle will not return the most up-to-date data.
When copying files with trailing null bytes to a Blobfuse2-mounted path using the cp utility, use the --sparse=never parameter to avoid data being trimmed. For example, cp --sparse=never src dest.
In write operations, the data written is persisted (committed) to the Azure Storage container only when the user application calls close, sync, or flush.
Files cannot be modified if they were originally created with a block size different from the one currently configured.
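
Because data is committed only on close, sync, or flush, and those calls can fail, an application should treat an error from any of them as a failed write. The following minimal Python sketch shows one way to do that. In Python a failed system call raises `OSError` rather than returning an error code. The helper name `write_and_commit` is hypothetical, and the demo writes to a temporary local directory rather than a real mount.

```python
import os
import tempfile


def write_and_commit(path, data):
    """Write bytes to path, sync, and close; raise OSError if any step fails."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may write fewer bytes than requested
            view = view[written:]
        os.fsync(fd)  # ask the file system to persist the data
    except OSError:
        try:
            os.close(fd)
        except OSError:
            pass  # the original error is the one worth reporting
        raise
    os.close(fd)  # close can also fail; let that error reach the caller


# Demo against a temporary directory standing in for a mounted path
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "example.bin")
    try:
        write_and_commit(target, b"payload")
    except OSError as err:
        raise SystemExit(f"write failed, aborting: {err}")
    with open(target, "rb") as fh:
        result = fh.read()
print(result)
```

The caller aborts as soon as any step raises, which mirrors the recommendation to check every file system call rather than assume the data reached the container.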

Use this decision tree to choose between File-cache and Streaming with Block cache modes.

Caching can be disabled either at both the kernel and BlobFuse levels or at the kernel level only. Refer to this page for details.
