-
Hi maintainers, could you please confirm whether my understanding of GraphRAG v2.5.0 is correct? The config and code include a Blob storage backend for pipeline artifacts (e.g., `storage: type: blob`, `BlobPipelineStorage`), but the query CLI/library still appears to expect local Parquet outputs, so the usual pattern would be to download/sync from Blob before querying.

What I'm looking at in the repo:

- Storage backend (Blob): `graphrag/storage/blob_pipeline_storage.py` and `graphrag/storage/factory.py` (Blob storage implementation and factory wiring for `storage.type: blob`).
- Query path (local Parquet expectation): `graphrag/cli/query.py` (e.g., `_resolve_output_files(...)`, `--data-dir` overriding `output.base_dir`), which appears to read Parquet from a local directory.
- Indexing outputs: the default pipeline writes tables as Parquet (e.g., `graphrag/cli/index.py` and `graphrag/index/workflows/v1/subflows/*`).

Questions:

1. Is it the intended behavior today that the query path reads Parquet from a local directory (even if indexing wrote to Blob), meaning practitioners should sync from Blob to local before running queries? (A rough sketch of what that sync step might look like is included below.)
2. For indexing, is `storage: { type: blob, ... }` the supported way to write artifacts to Blob end-to-end, with the caveat that the query step still needs a local copy?
3. If there is a supported way for the query path to read directly from Azure Blob (without a local sync), could you point me to the config or API usage?

I understand I can use Azure AI Search and Cosmos DB, but I am purely looking at Blob at this point; we are at a very early stage of evaluation/design. Many thanks for your reply.
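For context, this is roughly the interim sync step I have in mind. It is only a minimal sketch using `azure-storage-blob`; the connection string environment variable, container name (`mycontainer`), and the `output/` prefix are assumptions and would need to match whatever the indexing run's storage settings actually were. It is not tied to any GraphRAG API.

```python
import os
from azure.storage.blob import BlobServiceClient

# Sketch of the "sync Blob -> local" step before running the query CLI.
# Connection string, container name, and the "output/" prefix are assumptions;
# adjust them to match the storage settings used during indexing.
service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
container = service.get_container_client("mycontainer")

local_dir = "output"  # local directory the query step would read Parquet from
os.makedirs(local_dir, exist_ok=True)

for blob in container.list_blobs(name_starts_with="output/"):
    if not blob.name.endswith(".parquet"):
        continue
    dest = os.path.join(local_dir, os.path.basename(blob.name))
    with open(dest, "wb") as f:
        f.write(container.download_blob(blob.name).readall())
```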
-
If you are using the CLI, the configured storage should be used exactly the same across indexing and query. I just tested with 2.5.0 and it worked for me with no local Parquet files. If you are using the API query methods, those expect to be handed a dataframe, so you'd need to retrieve it from storage and load it into memory first.
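For anyone following along, a minimal sketch of that "retrieve and load into memory" step might look like the following. The container name and blob path are hypothetical placeholders, and the resulting DataFrame would then be handed to whichever graphrag API query function you are calling (exact function signatures are not shown here).

```python
import io
import os

import pandas as pd
from azure.storage.blob import BlobServiceClient

# Sketch: read an indexed Parquet table straight from Blob into a DataFrame,
# which can then be passed to the graphrag API query methods.
# The container name and blob path below are assumptions for illustration.
service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
container = service.get_container_client("mycontainer")

data = container.download_blob("output/entities.parquet").readall()
entities = pd.read_parquet(io.BytesIO(data))
print(entities.head())
```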