-
Hi maintainers, could you please confirm whether my understanding of GraphRAG v2.5.0 is correct? The config and code include a Blob storage backend for pipeline artifacts (e.g., `storage: type: blob`, `BlobPipelineStorage`), but the query CLI/library still appears to expect local Parquet outputs, so the usual pattern would be to download/sync from Blob before querying.

What I'm looking at in the repo:

- Storage backend (Blob): `graphrag/storage/blob_pipeline_storage.py` and `graphrag/storage/factory.py` (Blob storage implementation and factory wiring for `storage.type: blob`).
- Query path (local Parquet expectation): `graphrag/cli/query.py` (e.g., `_resolve_output_files(...)`, `--data-dir` overriding `output.base_dir`), which appears to read Parquet from a local directory.
- Indexing outputs: the default pipeline writes tables as Parquet (e.g., `graphrag/cli/index.py` and `graphrag/index/workflows/v1/subflows/*`).

Questions:

1. Is it the intended behavior today that the query path reads Parquet from a local directory (even if indexing wrote to Blob), meaning practitioners should sync from Blob to local before running queries? (A rough sketch of what that sync step might look like is included below.)
2. For indexing, is `storage: { type: blob, ... }` the supported way to write artifacts to Blob end-to-end, with the caveat that the query step still needs a local copy?
3. If there is a supported way for the query path to read directly from Azure Blob (without a local sync), could you point me to the config or API usage?

I understand I can use Azure AI Search and Cosmos DB, but I am purely looking at Blob at this point; we are at a very early stage of evaluation/design. Many thanks for your reply.
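For context, this is roughly the interim sync step I have in mind. It is only a minimal sketch using `azure-storage-blob`; the connection string environment variable, container name (`mycontainer`), and the `output/` prefix are assumptions and would need to match whatever the indexing run's storage settings actually were. It is not tied to any GraphRAG API.

```python
import os
from azure.storage.blob import BlobServiceClient

# Sketch of the "sync Blob -> local" step before running the query CLI.
# Connection string, container name, and the "output/" prefix are assumptions;
# adjust them to match the storage settings used during indexing.
service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
container = service.get_container_client("mycontainer")

local_dir = "output"  # local directory the query step would read Parquet from
os.makedirs(local_dir, exist_ok=True)

for blob in container.list_blobs(name_starts_with="output/"):
    if not blob.name.endswith(".parquet"):
        continue
    dest = os.path.join(local_dir, os.path.basename(blob.name))
    with open(dest, "wb") as f:
        f.write(container.download_blob(blob.name).readall())
```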
-
If you are using the CLI, the configured storage should be used exactly the same across indexing and query. I just tested with 2.5.0 and it worked for me with no local Parquet files. If you are using the API query methods, those expect to be handed a dataframe, so you'd need to retrieve it from storage and load it into memory first.
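For anyone following along, a minimal sketch of that "retrieve and load into memory" step might look like the following. The container name and blob path are hypothetical placeholders, and the resulting DataFrame would then be handed to whichever graphrag API query function you are calling (exact function signatures are not shown here).

```python
import io
import os

import pandas as pd
from azure.storage.blob import BlobServiceClient

# Sketch: read an indexed Parquet table straight from Blob into a DataFrame,
# which can then be passed to the graphrag API query methods.
# The container name and blob path below are assumptions for illustration.
service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
container = service.get_container_client("mycontainer")

data = container.download_blob("output/entities.parquet").readall()
entities = pd.read_parquet(io.BytesIO(data))
print(entities.head())
```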