
Commit 6bcc0ec

Merge pull request #325 from cloudera/main
Release 1.30.0
2 parents: 313f761 + 613e807


52 files changed: +1416 / -261 lines

.env.example

Lines changed: 25 additions & 1 deletion
```diff
@@ -1,7 +1,19 @@
 AWS_DEFAULT_REGION=us-west-2
 
+# H2 or PostgreSQL (RDS) (H2 is default)
+DB_TYPE=H2
+
+# H2
 DB_URL=jdbc:h2:../databases/rag
 
+# RDS
+# DB_URL= "jdbc:postgresql://<host>:<port>/<database>"
+DB_USERNAME=
+DB_PASSWORD=
+
+# Model Provider
+MODEL_PROVIDER=Bedrock
+
 # CAII
 CAII_DOMAIN=
 
@@ -10,7 +22,7 @@ AZURE_OPENAI_API_KEY=
 AZURE_OPENAI_ENDPOINT=
 OPENAI_API_VERSION=
 
-# QDRANT or OPENSEARCH
+# QDRANT or OPENSEARCH or CHROMADB
 VECTOR_DB_PROVIDER=QDRANT
 
 # OpenSearch
@@ -19,6 +31,18 @@ OPENSEARCH_USERNAME=
 OPENSEARCH_PASSWORD=
 OPENSEARCH_NAMESPACE=
 
+# ChromaDB
+CHROMADB_HOST=http://localhost
+CHROMADB_PORT=8000
+CHROMADB_TOKEN=
+# Tenant and database defaults to the Chroma default values
+CHROMADB_TENANT=
+CHROMADB_DATABASE=
+# If CHROMADB_HOST starts with "https://" and your server uses a private CA,
+# set it to the path of your PEM bundle so Python can verify TLS connections to ChromaDB:
+CHROMADB_SERVER_SSL_CERT_PATH=/absolute/path/to/ca-bundle.pem
+CHROMADB_ENABLE_ANONYMIZED_TELEMETRY=false
+
 # AWS
 AWS_ACCESS_KEY_ID=
 AWS_SECRET_ACCESS_KEY=
```
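As a loose illustration of how these settings might be consumed, the sketch below reads the database, model-provider, and vector-DB variables and validates the vector DB choice. The function name and defaults are hypothetical; this is not the AMP's actual configuration code.

```python
import os

# The three vector stores named in .env.example.
SUPPORTED_VECTOR_DBS = {"QDRANT", "OPENSEARCH", "CHROMADB"}


def read_provider_config() -> dict:
    """Collect the provider settings introduced in .env.example (illustrative only)."""
    vector_db = os.environ.get("VECTOR_DB_PROVIDER", "QDRANT")
    if vector_db not in SUPPORTED_VECTOR_DBS:
        raise ValueError(f"Unsupported VECTOR_DB_PROVIDER: {vector_db}")

    return {
        # H2 is the default; switch DB_TYPE and DB_URL for PostgreSQL (RDS).
        "db_type": os.environ.get("DB_TYPE", "H2"),
        "db_url": os.environ.get("DB_URL", "jdbc:h2:../databases/rag"),
        "model_provider": os.environ.get("MODEL_PROVIDER", "Bedrock"),
        "vector_db_provider": vector_db,
    }


if __name__ == "__main__":
    print(read_provider_config())
```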

.github/workflows/publish_release.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -122,4 +122,4 @@ jobs:
           echo "No changes to commit"
         fi
       env:
-        GITHUB_TOKEN: ${{ github.token }}
+        GITHUB_TOKEN: ${{ github.token }}
```

.gitignore

Lines changed: 1 addition & 0 deletions
```diff
@@ -1,6 +1,7 @@
 .env
 .idea/*
 .vscode/*
+.cursor/*
 !.idea/copyright/
 !.idea/prettier.xml
 !.idea/google-java-format.xml
```

README.md

Lines changed: 47 additions & 9 deletions
````diff
@@ -52,6 +52,32 @@ RAG Studio can utilize the local file system or an S3 bucket for storing documen
 
 S3 will also require providing the AWS credentials for the bucket.
 
+### Vector Database Options
+
+RAG Studio supports Qdrant (default), OpenSearch (Cloudera Semantic Search), and ChromaDB.
+
+- To choose the vector DB, set `VECTOR_DB_PROVIDER` to one of `QDRANT`, `OPENSEARCH`, or `CHROMADB` in your `.env`.
+
+#### ChromaDB Setup
+
+If you select ChromaDB, configure the following environment variables in `.env`:
+
+- `CHROMADB_HOST` - Hostname or URL for ChromaDB. Use `localhost` for local Docker.
+- `CHROMADB_PORT` - Port for ChromaDB (default `8000`). Not required if `CHROMADB_HOST` starts with `https://` and the server infers the port.
+- `CHROMADB_TENANT` - Optional. Defaults to the Chroma default tenant.
+- `CHROMADB_DATABASE` - Optional. Defaults to the Chroma default database.
+- `CHROMADB_TOKEN` - Optional. Include if your Chroma server requires an auth token.
+- `CHROMADB_SERVER_SSL_CERT_PATH` - Optional. Path to a PEM bundle for TLS verification when using HTTPS with a private CA.
+- `CHROMADB_ENABLE_ANONYMIZED_TELEMETRY` - Optional. Enables anonymized telemetry in the ChromaDB client; defaults to `false`.
+
+Notes:
+
+- The local-dev script will automatically start a ChromaDB Docker container when `VECTOR_DB_PROVIDER=CHROMADB` and `CHROMADB_HOST=localhost`, using `CHROMADB_PORT=8000`.
+- ChromaDB collections are automatically namespaced using the tenant and database values to avoid conflicts between different RAG Studio instances.
+- For production deployments, consider using a dedicated ChromaDB server with authentication enabled via `CHROMADB_TOKEN`.
+- When using HTTPS endpoints, ensure your certificate chain is properly configured or provide the CA bundle path via `CHROMADB_SERVER_SSL_CERT_PATH`.
+- Anonymized telemetry is disabled by default. You can enable it by setting `CHROMADB_ENABLE_ANONYMIZED_TELEMETRY=true`.
+
 ### Enhanced Parsing Options:
 
 RAG Studio can optionally enable enhanced parsing by providing the `USE_ENHANCED_PDF_PROCESSING` environment variable. Enabling this will allow RAG Studio to parse images and tables from PDFs. When enabling this feature, we strongly recommend using this with a GPU and at least 16GB of memory.
@@ -82,7 +108,7 @@ This variable can be set from the project settings for the AMP in CML.
 ## Air-gapped Environments
 
 If you are using an air-gapped environment, you will need to whitelist at the minimum the following domains in order to use the AMP.
-There may be other domains that need to be whitelisted depending on your environment and the model service provider you select.
+There may be other domains that need to be whitelisted depending on your environment and the model service provider you select.
 
 - `https://github.com`
 - `https://raw.githubusercontent.com`
@@ -150,17 +176,29 @@ the Node service locally, you can do so by following these steps:
 docker run -p 6333:6333 -p 6334:6334 -v $(pwd)/databases/qdrant_storage:/qdrant/storage:z qdrant/qdrant
 ```
 
+#### To run ChromaDB locally
+
+```
+docker run --name chromadb_dev --rm -d -p 8000:8000 -v $(pwd)/databases/chromadb_storage:/data chromadb/chroma
+```
+
+#### Use ChromaDB with local-dev.sh
+
+- Copy `.env.example` to `.env`.
+- Set `VECTOR_DB_PROVIDER=CHROMADB` in `.env` (defaults assume `CHROMADB_HOST=localhost` and `CHROMADB_PORT=8000`).
+- Run `./local-dev.sh` from the repo root. When `CHROMADB_HOST=localhost`, the script will auto-start a ChromaDB Docker container.
+
 #### Modifying UI in CML
 
-* This is an unsupported workflow, but it is possible to modify the UI code in CML.
+- This is an unsupported workflow, but it is possible to modify the UI code in CML.
 
-- Start a CML Session from a CML Project that has the RAG Studio AMP installed.
-- Open the terminal in the CML Session and navigate to the `ui` directory.
-- Run `source ~/.bashrc` to ensure the Node environment variables are loaded.
-- Install PNPM using `npm install -g pnpm`. Docs on PNPM can be found here: https://pnpm.io/installation#using-npm
-- Run `pnpm install` to install the dependencies.
-- Make your changes to the UI code in the `ui` directory.
-- Run `pnpm build` to build the new UI bundle.
+* Start a CML Session from a CML Project that has the RAG Studio AMP installed.
+* Open the terminal in the CML Session and navigate to the `ui` directory.
+* Run `source ~/.bashrc` to ensure the Node environment variables are loaded.
+* Install PNPM using `npm install -g pnpm`. Docs on PNPM can be found here: https://pnpm.io/installation#using-npm
+* Run `pnpm install` to install the dependencies.
+* Make your changes to the UI code in the `ui` directory.
+* Run `pnpm build` to build the new UI bundle.
 
 ## The Fine Print
 
````
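To make the ChromaDB settings above concrete, here is a minimal, hypothetical Python sketch of connecting to a Chroma server with the documented variables. It uses the public `chromadb.HttpClient` API; the bearer-token header and the telemetry flag are assumptions about a typical deployment, not the AMP's actual client code, and handling of `CHROMADB_SERVER_SSL_CERT_PATH` is omitted.

```python
import os

import chromadb
from chromadb.config import DEFAULT_DATABASE, DEFAULT_TENANT, Settings

host = os.environ.get("CHROMADB_HOST", "localhost")
port = int(os.environ.get("CHROMADB_PORT", "8000"))
token = os.environ.get("CHROMADB_TOKEN", "")

client = chromadb.HttpClient(
    host=host,
    port=port,
    ssl=host.startswith("https://"),
    # Assumption: the server accepts a bearer token; auth setups vary by Chroma version.
    headers={"Authorization": f"Bearer {token}"} if token else None,
    tenant=os.environ.get("CHROMADB_TENANT") or DEFAULT_TENANT,
    database=os.environ.get("CHROMADB_DATABASE") or DEFAULT_DATABASE,
    settings=Settings(anonymized_telemetry=False),
)

print(client.heartbeat())  # simple connectivity check
```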

docs/chat_flow.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -14,7 +14,7 @@ sequenceDiagram
     participant MLflow as MLflow
 
     User->>UI: Enters query
-    UI->>API: POST /sessions/{session_id}/chat
+    UI->>API: POST /sessions/{session_id}/stream-completion
     Note over UI,API: Request includes query and configuration
 
     API->>MetadataApi: GET session metadata
```
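The diagram change reflects the UI calling a streaming completion endpoint. As a hedged illustration only (the base URL, request body, and response framing here are assumptions, not taken from this diff), a client might consume it like this:

```python
import httpx

session_id = 42  # hypothetical session id
base_url = "http://localhost:8081"  # hypothetical API address

# Assumed request body; the actual schema is defined by the llm-service API.
payload = {"query": "What is in my documents?", "configuration": {}}

with httpx.stream(
    "POST",
    f"{base_url}/sessions/{session_id}/stream-completion",
    json=payload,
    timeout=None,
) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if line:
            print(line)  # each line carries a chunk of the streamed completion
```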

llm-service/app/ai/indexing/base.py

Lines changed: 14 additions & 2 deletions
```diff
@@ -1,9 +1,12 @@
+import json
 import logging
 import os
 from abc import abstractmethod
 from dataclasses import dataclass
 from pathlib import Path
-from typing import Dict, Type, Optional
+from typing import Dict, Type, Optional, TypeVar
+
+from llama_index.core.schema import BaseNode
 
 from .readers.base_reader import BaseReader, ReaderConfig
 from .readers.csv import CSVReader
@@ -26,7 +29,6 @@
     ".docx": DocxReader,
     ".pptx": PptxReader,
     ".pptm": PptxReader,
-    ".ppt": PptxReader,
     ".csv": CSVReader,
     ".json": JSONReader,
     ".jpg": ImagesReader,
@@ -40,6 +42,9 @@
 }
 
 
+TNode = TypeVar("TNode", bound=BaseNode)
+
+
 @dataclass
 class NotSupportedFileExtensionError(Exception):
     file_extension: str
@@ -54,6 +59,13 @@ def __init__(
         self.data_source_id = data_source_id
         self.reader_config = reader_config
 
+    @staticmethod
+    def _flatten_metadata(chunk: TNode) -> TNode:
+        for key, value in chunk.metadata.items():
+            if isinstance(value, list) or isinstance(value, dict):
+                chunk.metadata[key] = json.dumps(value)
+        return chunk
+
     @abstractmethod
     def index_file(self, file_path: Path, doc_id: str) -> None:
         pass
```
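The new `_flatten_metadata` helper JSON-encodes list- and dict-valued metadata so vector stores that only accept flat metadata can ingest the nodes. Below is a small standalone sketch of the same transformation applied to a llama-index `TextNode`; it mirrors the helper rather than importing the indexer class.

```python
import json

from llama_index.core.schema import TextNode

node = TextNode(
    text="example chunk",
    metadata={"page_labels": ["1", "2"], "source": {"file": "report.pdf"}},
)

# Same rule as the helper above: serialize non-scalar metadata values to JSON strings.
for key, value in node.metadata.items():
    if isinstance(value, (list, dict)):
        node.metadata[key] = json.dumps(value)

print(node.metadata)
# {'page_labels': '["1", "2"]', 'source': '{"file": "report.pdf"}'}
```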

llm-service/app/ai/indexing/embedding_indexer.py

Lines changed: 14 additions & 2 deletions
```diff
@@ -108,6 +108,12 @@ def index_file(self, file_path: Path, document_id: str) -> None:
         # we're capturing "text".
         converted_chunks: List[BaseNode] = [chunk for chunk in chunk_batch]
 
+        # flatten metadata if vector store has self.flat_metadata
+        if self.chunks_vector_store.flat_metadata:
+            converted_chunks = [
+                self._flatten_metadata(chunk) for chunk in converted_chunks
+            ]
+
         chunks_vector_store = self.chunks_vector_store.llama_vector_store()
         chunks_vector_store.add(converted_chunks)
 
@@ -130,6 +136,12 @@ def _compute_embeddings(
         logger.debug(f"Waiting for {len(futures)} futures")
         for future in as_completed(futures):
             i, batch_embeddings = future.result()
-            for chunk, embedding in zip(batched_chunks[i], batch_embeddings):
+            batch_chunks = batched_chunks[i]
+            if len(batch_chunks) != len(batch_embeddings):
+                raise ValueError(
+                    f"Expected {len(batch_chunks)} embedding vectors for this batch of chunks,"
+                    + f" but got {len(batch_embeddings)} from {self.embedding_model.model_name}"
+                )
+            for chunk, embedding in zip(batch_chunks, batch_embeddings):
                 chunk.embedding = embedding
-            yield batched_chunks[i]
+            yield batch_chunks
```
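The added length check guards against `zip` silently truncating when an embedding provider returns fewer vectors than chunks. A tiny standalone illustration of the failure mode the check prevents (not the service's code):

```python
chunks = ["chunk-a", "chunk-b", "chunk-c"]
embeddings = [[0.1, 0.2], [0.3, 0.4]]  # provider returned one vector too few

# Without a check, zip drops the last chunk silently and it is never embedded.
paired = list(zip(chunks, embeddings))
print(len(paired))  # 2, not 3

# The equivalent of the new guard: fail loudly instead.
if len(chunks) != len(embeddings):
    raise ValueError(f"Expected {len(chunks)} embedding vectors, got {len(embeddings)}")
```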

llm-service/app/ai/indexing/readers/docx.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -52,7 +52,7 @@ def __init__(self, *args: Any, **kwargs: Any) -> None:
     def load_chunks(self, file_path: Path) -> ChunksResult:
         documents = self.inner.load_data(file_path)
         assert len(documents) == 1
-        document = documents[0]
+        document = documents[0]  # single document contains all pages' contents
         document.id_ = self.document_id
 
         document_text = document.text
```

llm-service/app/ai/indexing/readers/pptx.py

Lines changed: 16 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -39,7 +39,6 @@
3939
from pathlib import Path
4040
from typing import Any
4141

42-
from llama_index.core import Document
4342
from llama_index.readers.file import PptxReader as LlamaIndexPptxReader
4443

4544
from .base_reader import BaseReader, ChunksResult
@@ -51,27 +50,25 @@ def __init__(self, *args: Any, **kwargs: Any) -> None:
5150
self.inner = LlamaIndexPptxReader()
5251

5352
def load_chunks(self, file_path: Path) -> ChunksResult:
53+
# TODO: This loop makes a lot of function calls;
54+
# if it's slow, we should try .pdf.PageTracker which consolidates contents to avoid that
55+
ret = ChunksResult()
56+
for i, document in enumerate(self.inner.load_data(file_path)):
57+
document.id_ = self.document_id
5458

55-
documents = self.inner.load_data(file_path)
56-
assert len(documents) == 1
57-
document: Document = documents[0]
58-
document.id_ = self.document_id
59-
60-
document_text = document.text
61-
62-
secrets = self._block_secrets([document_text])
63-
if secrets is not None:
64-
return ChunksResult(secret_types=secrets)
59+
document_text = document.text
6560

66-
ret = ChunksResult()
61+
secrets = self._block_secrets([document_text])
62+
if secrets is not None:
63+
return ChunksResult(secret_types=secrets)
6764

68-
anonymized_text = self._anonymize_pii(document_text)
69-
if anonymized_text is not None:
70-
ret.pii_found = True
71-
document_text = anonymized_text
65+
anonymized_text = self._anonymize_pii(document_text)
66+
if anonymized_text is not None:
67+
ret.pii_found = True
68+
document_text = anonymized_text
7269

73-
document.set_content(document_text)
70+
document.set_content(document_text)
7471

75-
self._add_document_metadata(document, file_path)
76-
ret.chunks = self._chunks_in_document(document)
72+
self._add_document_metadata(document, file_path)
73+
ret.chunks.extend(self._chunks_in_document(document))
7774
return ret

llm-service/app/ai/indexing/summary_indexer.py

Lines changed: 14 additions & 11 deletions
```diff
@@ -70,13 +70,13 @@
 from qdrant_client.http.exceptions import UnexpectedResponse
 
 from app.services import models
+from app.ai.vector_stores.vector_store import VectorStore
 from .base import BaseTextIndexer
 from .readers.base_reader import ReaderConfig, ChunksResult
 from ..vector_stores.vector_store_factory import VectorStoreFactory
-from ...config import settings
+from ...config import settings, ModelSource
 from ...services.metadata_apis import data_sources_metadata_api
-from ...services.models.providers import ModelProvider
-from ...services.models import ModelSource
+from ...services.models.providers import get_provider_class
 
 logger = logging.getLogger(__name__)
 
@@ -102,6 +102,7 @@ def __init__(
         self.splitter = splitter
         self.llm = llm
         self.embedding_model = embedding_model
+        self.summary_vector_store = VectorStoreFactory.for_summaries(data_source_id)
 
     @staticmethod
     def __database_dir(data_source_id: int) -> str:
@@ -133,9 +134,7 @@ def __index_configuration(
         embed_summaries: bool = True,
     ) -> Dict[str, Any]:
         prompt_helper: Optional[PromptHelper] = None
-        model_source: ModelSource = (
-            ModelProvider.get_provider_class().get_model_source()
-        )
+        model_source: ModelSource = get_provider_class().get_model_source()
         if model_source == "CAII":
             # if we're using CAII, let's be conservative and use a small context window to account for mistral's small context
             prompt_helper = PromptHelper(context_window=3000)
@@ -180,19 +179,20 @@ def __summary_indexer(
             return SummaryIndexer.__summary_indexer_with_config(
                 persist_dir=persist_dir,
                 index_configuration=self.__index_kwargs(embed_summaries),
+                summary_vector_store=self.summary_vector_store,
             )
         except (ValueError, FileNotFoundError):
             doc_summary_index = self.__init_summary_store(persist_dir)
         return doc_summary_index
 
     @staticmethod
     def __summary_indexer_with_config(
-        persist_dir: str, index_configuration: Dict[str, Any]
+        persist_dir: str, index_configuration: Dict[str, Any],
+        summary_vector_store: VectorStore,
     ) -> DocumentSummaryIndex:
-        data_source_id: int = index_configuration.get("data_source_id")
         storage_context = SummaryIndexer.create_storage_context(
             persist_dir,
-            VectorStoreFactory.for_summaries(data_source_id).llama_vector_store(),
+            summary_vector_store.llama_vector_store(),
         )
         doc_summary_index: DocumentSummaryIndex = cast(
             DocumentSummaryIndex,
@@ -296,6 +296,8 @@ def index_file(self, file_path: Path, document_id: str) -> None:
         with _write_lock:
             persist_dir = self.__persist_dir()
             summary_store: DocumentSummaryIndex = self.__summary_indexer(persist_dir)
+            if self.summary_vector_store.flat_metadata:
+                nodes = [self._flatten_metadata(node) for node in nodes]
             summary_store.insert_nodes(nodes)
             summary_store.storage_context.persist(persist_dir=persist_dir)
 
@@ -314,7 +316,7 @@ def __update_global_summary_store(
         # and re-index it with the addition/removal.
         global_persist_dir = self.__persist_root_dir()
         global_summary_store = self.__summary_indexer(
-            global_persist_dir, embed_summaries=False
+            global_persist_dir, embed_summaries=False,
         )
         data_source_node = Document(doc_id=str(self.data_source_id))
 
@@ -496,7 +498,8 @@ def delete_data_source_by_id(data_source_id: int) -> None:
                 embed_summaries=False,
             )
             global_summary_store = SummaryIndexer.__summary_indexer_with_config(
-                global_persist_dir, configuration
+                global_persist_dir, configuration,
+                summary_vector_store=vector_store,
             )
         except FileNotFoundError:
             ## global summary store doesn't exist, nothing to do
```
