How Blobfuse2 Works
BlobFuse uses the libfuse (fuse3) library to interface with the Linux FUSE kernel module and performs file system operations through the Azure Storage REST APIs. It maps Azure Blob Storage object names onto a directory-like structure using path conventions, so that blobs can be accessed as if they were local files. The supported operations are mkdir, opendir, readdir, rmdir, open, read, create, write, close, unlink, truncate, stat, and rename. chmod is also supported for accounts with hierarchical namespace (HNS) enabled.
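The path convention described above can be illustrated with a small Python sketch. It is not BlobFuse's actual implementation: the function name `list_directory` and the sample blob names are hypothetical, and the only assumption is that `/` in a blob name acts as the directory separator.

```python
def list_directory(blob_names, prefix=""):
    """Return (subdirs, files) found directly under `prefix`.

    Illustration only: a flat list of blob names is treated as a tree
    by splitting each name on '/'.
    """
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    subdirs, files = set(), set()
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if not rest:
            continue  # the name is the directory marker itself
        head, sep, _ = rest.partition("/")
        if sep:
            subdirs.add(head)  # more path follows, so `head` is a directory
        else:
            files.add(head)    # no separator left, so `head` is a file
    return sorted(subdirs), sorted(files)


# Hypothetical blob names in a flat container namespace.
blobs = ["data/train/a.bin", "data/train/b.bin", "data/readme.txt", "top.txt"]
print(list_directory(blobs))          # (['data'], ['top.txt'])
print(list_directory(blobs, "data"))  # (['train'], ['readme.txt'])
```

The container itself stores only the flat names; the directory view is derived from them at listing time.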
BlobFuse has two operating modes:
- Caching (File Cache)
- Streaming (Block cache)
In caching (file cache) mode, Blobfuse downloads the entire file from Azure Blob Storage into a local cache directory before making it available to the application. All subsequent reads and writes are served from this local cache until the file is evicted or invalidated. If the file was created or modified, closing its file handles in the application triggers the upload of the file to the storage container. This mode suits workloads that read the same files repeatedly, or datasets that fit on the local disk.
Unlike traditional file caching, which downloads the entire file before serving, block cache mode streams data in chunks (blocks) and serves it as it downloads. This is designed for workloads involving large files, such as AI/ML training datasets, genomic sequencing, and HPC simulations.
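To make the idea of block-wise access concrete, the sketch below computes which fixed-size blocks a read request overlaps. It is a generic illustration of chunked reads, not BlobFuse's internal logic; the function name `blocks_for_read` and the 16 MiB block size are assumptions chosen for the example.

```python
MIB = 1024 * 1024


def blocks_for_read(offset, length, block_size):
    """Return the range of block indices that cover [offset, offset + length)."""
    if length <= 0:
        return range(0)  # nothing to read, so no blocks are needed
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return range(first, last + 1)


# A 30 MiB read starting at offset 20 MiB, with a hypothetical 16 MiB block size.
print(list(blocks_for_read(20 * MIB, 30 * MIB, 16 * MIB)))  # [1, 2, 3]
```

Only the blocks in that range have to be fetched to serve the read, which is why this approach scales to files much larger than the local disk.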
- User applications must check the return code (success or failure) of filesystem calls such as read, write, close, and flush. If a call returns an error, the application must abort the corresponding operation.
- User applications must ensure that there is only one writer at a time for a given file.
- When dealing with very large files (in TiB), the block-size must be configured accordingly. Azure Storage supports only [50,000 blocks](https://learn.microsoft.com/en-us/rest/api/storageservices/put-block-list?tabs=microsoft-entra-id#remarks) per blob.
- Concurrent write operations on the same file through multiple handles are not checked for data consistency and may lead to incorrect data being written.
- A read operation on a file that is simultaneously being written to by another process or handle will not return the most up-to-date data.
- When copying files with trailing null bytes to a Blobfuse2 mounted path using the cp utility, use the --sparse=never parameter to avoid the data being trimmed. For example, cp --sparse=never src dest.
- In write operations, the data written is persisted (committed) to the Azure Storage container only when the user application calls close, sync, or flush.
- Files cannot be modified if they were originally created with a block-size different from the one currently configured.
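Two of the points above lend themselves to a short Python sketch: sizing blocks against the 50,000-blocks-per-blob limit, and treating every filesystem error as fatal for the operation. The helper names `min_block_size_mib` and `write_and_commit` are hypothetical, and expressing the block size in whole MiB is an assumption about how it is configured. In Python, a failed `os.write`, `os.fsync`, or `os.close` raises `OSError` rather than returning an error code, so letting the exception propagate is the equivalent of checking the return code.

```python
import os

MAX_BLOCKS_PER_BLOB = 50_000  # Azure Storage limit cited above
MIB = 1024 * 1024


def min_block_size_mib(file_size_bytes):
    """Smallest whole-MiB block size that keeps a file within the block limit."""
    bytes_per_block = -(-file_size_bytes // MAX_BLOCKS_PER_BLOB)  # ceiling division
    return max(1, -(-bytes_per_block // MIB))


def write_and_commit(path, data):
    """Write `data` to `path`, letting any OSError abort the operation."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)  # may write fewer bytes than requested
            view = view[written:]
        os.fsync(fd)  # sync so the data is committed before the handle is closed
    finally:
        os.close(fd)  # an error raised here also propagates to the caller


print(min_block_size_mib(1 * 1024**4))  # 21 (MiB) for a 1 TiB file
print(min_block_size_mib(4 * 1024**4))  # 84 (MiB) for a 4 TiB file
```

The caller should not catch and ignore `OSError` from `write_and_commit`; a failure at any step means the data may not have reached the storage container.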
Use this decision tree to choose between File-cache and Streaming with Block cache modes.
Caching can be disabled either at both the kernel and BlobFuse levels, or at the kernel level only. Refer to this page for details.