Commit 8368b12

KennyZhang1, AlonsoGuevara, and jgbradley1 authored
Add Cosmos DB storage/cache option (#1431)
* added cosmosdb constructor and database methods
* added rest of abstract method headers
* added cosmos db container methods
* implemented has and delete methods
* finished implementing abstract class methods
* integrated class into storage factory
* integrated cosmosdb class into cache factory
* added support for new config file fields
* replaced primary key cosmosdb initialization with connection strings
* modified cosmosdb setter to require json
* Fix non-default emitters
* Format
* Ruff
* ruff
* first successful run of cosmosdb indexing
* removed extraneous container_name setting
* require base_dir to be typed as str
* reverted merged changed from closed branch
* removed nested try statement
* readded initial non-parquet emitter fix
* added basic support for parquet emitter using internal conversions
* merged with main and resolved conflicts
* fixed more merge conflicts
* added cosmosdb functionality to query pipeline
* tested query for cosmosdb
* collapsed cosmosdb schema to use minimal containers and databases
* simplified create_database and create_container functions
* ruff fixes and semversioner
* spellcheck and ci fixes
* updated pyproject toml and lock file
* apply fixes after merge from main
* add temporary comments
* refactor cache factory
* refactored storage factory
* minor formatting
* update dictionary
* fix spellcheck typo
* fix default value
* fix pydantic model defaults
* update pydantic models
* fix init_content
* cleanup how factory passes parameters to file storage
* remove unnecessary output file type
* update pydantic model
* cleanup code
* implemented clear method
* fix merge from main
* add test stub for cosmosdb
* regenerate lock file
* modified set method to collapse parquet rows
* modified get method to collapse parquet rows
* updated has and delete methods and docstrings to adhere to new schema
* added prefix helper function
* replaced delimiter for prefixed id
* verified empty tests are passing
* fix merges from main
* add find test
* update cicd step name
* tested querying for new schema
* resolved errors from merge conflicts
* refactored set method to handle cache in new schema
* refactored get method to handle cache in new schema
* force unique ids to be written to cosmos for nodes
* found bug with has and delete methods
* modified has and delete to work with cache in new schema
* fix the merge from main
* minor typo fixes
* update lock file
* spellcheck fix
* fix init function signature
* minor formatting updates
* remove https protocol
* change localhost to 127.0.0.1 address
* update pytest to use bacj engine
* verified cache tests
* improved speed of has function
* resolved pytest error with find function
* added test for child method
* make container_name variable private as _container_name
* minor variable name fix
* cleanup cosmos pytest and make the cosmosdb storage class operations more efficient
* update cicd to use different cosmosdb emulator
* test with http protocol
* added pytest for clear()
* add longer timeout for cosmosdb emulator startup
* revert http connection back to https
* add comments to cicd code for future dev usage
* set to container and database clients to none upon deletion
* ruff changes
* add comments to cicd code
* removed unneeded None statements and ruff fixes
* more ruff fixes
* Update test_run.py
* remove unnecessary call to delete container
* ruff format updates
* Reverted test_run.py
* fix ruff formatter errors
* cleanup variable names to be more consistent
* remove extra semversioner file
* revert pydantic model changes
* revert pydantic model change
* revert pydantic model change
* re-enable inline formatting rule
* update documentation in dev guide

---------

Co-authored-by: Alonso Guevara <[email protected]>
Co-authored-by: Josh Bradley <[email protected]>
1 parent c1c09ba commit 8368b12

30 files changed, +925 -302 lines changed

.github/workflows/python-integration-tests.yml

Lines changed: 13 additions & 2 deletions
@@ -23,7 +23,7 @@ permissions:

 concurrency:
   group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
-  # Only run the for the latest commit
+  # only run the for the latest commit
   cancel-in-progress: true

 env:
@@ -37,7 +37,7 @@ jobs:
       matrix:
         python-version: ["3.10"]
         os: [ubuntu-latest, windows-latest]
-      fail-fast: false # Continue running all jobs even if one fails
+      fail-fast: false # continue running all jobs even if one fails
     env:
       DEBUG: 1

@@ -84,6 +84,17 @@ jobs:
         id: azuright
         uses: potatoqualitee/[email protected]

+      # For more information on installation/setup of Azure Cosmos DB Emulator
+      # https://learn.microsoft.com/en-us/azure/cosmos-db/how-to-develop-emulator?tabs=docker-linux%2Cpython&pivots=api-nosql
+      # Note: the emulator is only available on Windows runners. It can take longer than the default to initially startup so we increase the default timeout.
+      # If a job fails due to timeout, restarting the cicd job usually resolves the problem.
+      - name: Install Azure Cosmos DB Emulator
+        if: runner.os == 'Windows'
+        run: |
+          Write-Host "Launching Cosmos DB Emulator"
+          Import-Module "$env:ProgramFiles\Azure Cosmos DB Emulator\PSModules\Microsoft.Azure.CosmosDB.Emulator"
+          Start-CosmosDbEmulator -Timeout 500
+
       - name: Integration Test
         run: |
           poetry run poe test_integration
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+{
+    "type": "patch",
+    "description": "Implement cosmosdb storage option for cache and output"
+}

DEVELOPING.md

Lines changed: 0 additions & 1 deletion
@@ -45,7 +45,6 @@ graphrag
 ├── config # configuration management
 ├── index # indexing engine
 | └─ run/run.py # main entrypoint to build an index
-├── llm # generic llm interfaces
 ├── logger # logger module supporting several options
 │   └─ factory.py # └─ main entrypoint to create a logger
 ├── model # data model definitions associated with the knowledge graph

dictionary.txt

Lines changed: 1 addition & 0 deletions
@@ -28,6 +28,7 @@ ints

 # Azure
 abfs
+cosmosdb
 Hnsw
 odata

graphrag/cache/factory.py

Lines changed: 3 additions & 0 deletions
@@ -9,6 +9,7 @@

 from graphrag.config.enums import CacheType
 from graphrag.storage.blob_pipeline_storage import BlobPipelineStorage
+from graphrag.storage.cosmosdb_pipeline_storage import create_cosmosdb_storage
 from graphrag.storage.file_pipeline_storage import FilePipelineStorage

 if TYPE_CHECKING:
@@ -50,6 +51,8 @@ def create_cache(
                 )
             case CacheType.blob:
                 return JsonPipelineCache(BlobPipelineStorage(**kwargs))
+            case CacheType.cosmosdb:
+                return JsonPipelineCache(create_cosmosdb_storage(**kwargs))
             case _:
                 if cache_type in cls.cache_types:
                     return cls.cache_types[cache_type](**kwargs)

graphrag/config/create_graphrag_config.py

Lines changed: 3 additions & 0 deletions
@@ -362,6 +362,7 @@ def hydrate_parallelization_params(
                 storage_account_blob_url=reader.str(Fragment.storage_account_blob_url),
                 container_name=reader.str(Fragment.container_name),
                 base_dir=reader.str(Fragment.base_dir) or defs.CACHE_BASE_DIR,
+                cosmosdb_account_url=reader.str(Fragment.cosmosdb_account_url),
             )
         with (
             reader.envvar_prefix(Section.reporting),
@@ -383,6 +384,7 @@ def hydrate_parallelization_params(
                 storage_account_blob_url=reader.str(Fragment.storage_account_blob_url),
                 container_name=reader.str(Fragment.container_name),
                 base_dir=reader.str(Fragment.base_dir) or defs.STORAGE_BASE_DIR,
+                cosmosdb_account_url=reader.str(Fragment.cosmosdb_account_url),
             )

         with (
@@ -667,6 +669,7 @@ class Fragment(str, Enum):
     concurrent_requests = "CONCURRENT_REQUESTS"
     conn_string = "CONNECTION_STRING"
     container_name = "CONTAINER_NAME"
+    cosmosdb_account_url = "COSMOSDB_ACCOUNT_URL"
     deployment_name = "DEPLOYMENT_NAME"
     description = "DESCRIPTION"
     enabled = "ENABLED"
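
As a rough illustration of how the new `COSMOSDB_ACCOUNT_URL` fragment might be read alongside the existing fields, here is a minimal sketch that uses a plain mapping instead of the project's environment reader. The `GRAPHRAG_CACHE` prefix, the `read_str` helper, and the `"cache"` default are assumptions made for this example; only the fragment names and the `or` fallback for `base_dir` come from the diff.

```python
from __future__ import annotations

from typing import Mapping

# Assumed default; the real value lives in the project's defaults module.
DEFAULT_CACHE_BASE_DIR = "cache"


def read_str(env: Mapping[str, str], prefix: str, key: str) -> str | None:
    """Hypothetical helper: return the prefixed variable, or None if unset or empty."""
    value = env.get(f"{prefix}_{key}")
    return value if value else None


def hydrate_cache_settings(env: Mapping[str, str]) -> dict[str, str | None]:
    """Collect cache settings, falling back to a default base_dir."""
    prefix = "GRAPHRAG_CACHE"  # assumed prefix, for illustration only
    return {
        "container_name": read_str(env, prefix, "CONTAINER_NAME"),
        # Same `value or default` fallback shape as in the hunks above.
        "base_dir": read_str(env, prefix, "BASE_DIR") or DEFAULT_CACHE_BASE_DIR,
        # New optional field: stays None when the variable is not set.
        "cosmosdb_account_url": read_str(env, prefix, "COSMOSDB_ACCOUNT_URL"),
    }
```

In this sketch, passing `os.environ` as `env` would read the process environment; an unset account URL simply comes back as `None`, which matches the optional typing of the new field.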

graphrag/config/enums.py

Lines changed: 4 additions & 0 deletions
@@ -19,6 +19,8 @@ class CacheType(str, Enum):
     """The none cache configuration type."""
     blob = "blob"
     """The blob cache configuration type."""
+    cosmosdb = "cosmosdb"
+    """The cosmosdb cache configuration type"""

     def __repr__(self):
         """Get a string representation."""
@@ -60,6 +62,8 @@ class StorageType(str, Enum):
     """The memory storage type."""
     blob = "blob"
     """The blob storage type."""
+    cosmosdb = "cosmosdb"
+    """The cosmosdb storage type"""

     def __repr__(self):
         """Get a string representation."""
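
For readers unfamiliar with `str`-mixin enums, the following sketch shows why adding a `cosmosdb` member is enough for a settings value such as `type: cosmosdb` to resolve to the enum. The `__repr__` body and the `parse_storage_type` helper are assumptions for illustration; the diff shows only the docstring of `__repr__`.

```python
from enum import Enum


class StorageType(str, Enum):
    """Storage types named in the diff."""

    file = "file"
    memory = "memory"
    blob = "blob"
    cosmosdb = "cosmosdb"

    def __repr__(self) -> str:
        # Assumed body: render the member as its quoted string value.
        return f'"{self.value}"'


def parse_storage_type(raw: str) -> StorageType:
    """Hypothetical helper: normalize user input and look up the member by value."""
    return StorageType(raw.strip().lower())
```

Lookup by value raises `ValueError` for unknown names, so a misspelled type would fail early in this sketch rather than silently falling back to a default.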

graphrag/config/init_content.py

Lines changed: 2 additions & 2 deletions
@@ -63,15 +63,15 @@
 ## connection_string and container_name must be provided

 cache:
-  type: {defs.CACHE_TYPE.value} # or blob
+  type: {defs.CACHE_TYPE.value} # one of [blob, cosmosdb, file]
   base_dir: "{defs.CACHE_BASE_DIR}"

 reporting:
   type: {defs.REPORTING_TYPE.value} # or console, blob
   base_dir: "{defs.REPORTING_BASE_DIR}"

 storage:
-  type: {defs.STORAGE_TYPE.value} # or blob
+  type: {defs.STORAGE_TYPE.value} # one of [blob, cosmosdb, file]
   base_dir: "{defs.STORAGE_BASE_DIR}"

 ## only turn this on if running `graphrag index` with custom settings

graphrag/config/input_models/cache_config_input.py

Lines changed: 1 addition & 0 deletions
@@ -16,3 +16,4 @@ class CacheConfigInput(TypedDict):
     connection_string: NotRequired[str | None]
     container_name: NotRequired[str | None]
     storage_account_blob_url: NotRequired[str | None]
+    cosmosdb_account_url: NotRequired[str | None]

graphrag/config/input_models/storage_config_input.py

Lines changed: 1 addition & 0 deletions
@@ -16,3 +16,4 @@ class StorageConfigInput(TypedDict):
     connection_string: NotRequired[str | None]
     container_name: NotRequired[str | None]
     storage_account_blob_url: NotRequired[str | None]
+    cosmosdb_account_url: NotRequired[str | None]
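
A small sketch of how a `TypedDict` with `NotRequired` fields like the two above behaves at runtime. The class name `StorageConfigSketch`, the `type` key, and the `missing_cosmosdb_fields` check are hypothetical: the diff does not show which fields the Cosmos DB backend actually requires, so the rule below is an assumption chosen only to demonstrate reading optional keys.

```python
from typing import Optional, TypedDict

try:
    from typing import NotRequired  # Python 3.11+
except ImportError:  # older interpreters need the typing_extensions backport
    from typing_extensions import NotRequired


class StorageConfigSketch(TypedDict):
    """Illustrative mirror of the input model; `type` is an assumed key."""

    type: NotRequired[Optional[str]]
    connection_string: NotRequired[Optional[str]]
    container_name: NotRequired[Optional[str]]
    storage_account_blob_url: NotRequired[Optional[str]]
    cosmosdb_account_url: NotRequired[Optional[str]]


def missing_cosmosdb_fields(cfg: StorageConfigSketch) -> list[str]:
    """Hypothetical check: list what a cosmosdb config still lacks."""
    if cfg.get("type") != "cosmosdb":
        return []
    missing = []
    # Assumption: either a connection string or an account URL identifies the account.
    if not (cfg.get("connection_string") or cfg.get("cosmosdb_account_url")):
        missing.append("connection_string or cosmosdb_account_url")
    if not cfg.get("container_name"):
        missing.append("container_name")
    return missing
```

Since every key is `NotRequired`, an empty dict is a valid value of this type, and `dict.get` is the safe way to read a field that may be absent or `None`.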
