[Question]: How can I utilize JSONQueryEngine with MongoDB documents ? #8410

Exorcismus · 2023-09-08T13:59:16Z

Question Validation

I have searched both the documentation and discord for an answer.

Question

I am using the code snippet below to read data from MongoDB.

Because my data structure differs from what is expected, I had to combine multiple values into text.

documents = reader.load_data(db_name, collection_name, field_names, query_dict)
index = GPTVectorStoreIndex.from_documents(documents)
index.storage_context.persist()
query_engine = index.as_query_engine(
    similarity_top_k=5,
    service_context=service_context,
    response_synthesizer=response_synthesizer,
    verbose=True,
)

I can see there is a JSON Query Engine, but I can't see an easy way to combine the two.

Is there a best practice for using JSONQueryEngine on Mongo documents directly?

@dosu-bot · 2023-09-08T14:11:30Z

🤖

Hello,

Thanks for your question. If you want to use JSONQueryEngine directly on MongoDB documents, you can use the query method of the MongoDBAtlasVectorSearch class in LlamaIndex. This method takes a VectorStoreQuery object as an argument and returns a VectorStoreQueryResult object.

Here is a relevant code snippet:

def query(self, query: VectorStoreQuery, **kwargs: Any) -> VectorStoreQueryResult:
    knn_beta: Dict[str, Any] = {
        "vector": query.query_embedding,
        "path": self._embedding_key,
        "k": query.similarity_top_k,
    }
    if query.filters:
        knn_beta["filter"] = _to_mongodb_filter(query.filters)

    pipeline = [
        {
            "$search": {
                "index": self._index_name,
                "knnBeta": knn_beta,
            }
        },
        {"$project": {"score": {"$meta": "searchScore"}, self._embedding_key: 0}},
    ]
    cursor = self._collection.aggregate(pipeline)
    ...
    return result

This method constructs a MongoDB aggregation pipeline that uses the $search operator with the knnBeta option to perform a k-nearest neighbors search on the MongoDB collection. The knnBeta option takes a dictionary that specifies the query vector, the path to the embedding field in the MongoDB documents, and the number of similar nodes to return. If there are any filters specified in the VectorStoreQuery object, they are converted to a MongoDB filter dictionary and included in the knnBeta dictionary. The aggregation pipeline also includes a $project stage that excludes the embedding field from the returned documents and includes the search score. The method then executes the aggregation pipeline on the MongoDB collection and processes the results to return a VectorStoreQueryResult object.

You can find more details in the source code of the MongoDBAtlasVectorSearch class in the llama_index/vector_stores/mongodb.py file.

Please let me know if you have any other questions.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

Exorcismus · 2023-09-08T15:45:54Z

@dosu-bot this doesn't seem to address my case. I don't have a MongoDB vector database; I have JSON documents stored in MongoDB, and I need to query them using JSONQueryEngine.

@dosu-bot · 2023-09-08T16:14:50Z

🤖

Hello,

To use the JSONQueryEngine directly on MongoDB documents in the LlamaIndex framework, you would need to convert your MongoDB documents into JSON format and then pass them to the JSONQueryEngine. Here's how you can do it:

from typing import cast

from llama_index.indices.query.schema import QueryBundle
from llama_index.indices.service_context import ServiceContext
from llama_index.indices.struct_store.json_query import JSONQueryEngine, JSONType
from llama_index.response.schema import Response

# Assuming you have a MongoDB document as a dictionary
mongo_doc = {
    "field1": "value1",
    "field2": "value2",
    # ...
}

# The dictionary is used as the JSON value directly; cast only informs the type checker
json_val = cast(JSONType, mongo_doc)

# Define your JSON schema (left empty here as a placeholder)
json_schema = cast(JSONType, {})

# Create a service context with default settings
service_context = ServiceContext.from_defaults()

# Create a JSONQueryEngine instance
query_engine = JSONQueryEngine(
    json_value=json_val,
    json_schema=json_schema,
    service_context=service_context,
    verbose=True,
    synthesize_response=True,
)

# Use the query engine
response = cast(Response, query_engine.query(QueryBundle("your_query_here")))

# The response object now contains the result of the query
print(response.response)

Please note that output_processor is optional: if you do not provide a custom one, the JSONQueryEngine class falls back to its default_output_processor function, which executes the JSON Path query against the JSON value. The JSON schema (json_schema) should be defined according to your specific data; the example above uses an empty dictionary only as a placeholder.

The JSONQueryEngine class in the LlamaIndex framework is used to handle queries on JSON data. It takes in a JSON value, a JSON schema, a service context, an output processor, and some other parameters to initialize. The JSONQueryEngine uses the provided output_processor to process the output of the LLM predictor. The output_processor is a function that takes the LLM output and the JSON value as input and returns a JSONType object.

The JSONQueryEngine also has a synthesize_response parameter. If synthesize_response is set to True, the JSONQueryEngine will use the LLM predictor to synthesize a response using the JSON results. If synthesize_response is set to False, the JSONQueryEngine will return the JSON Path output as the response.
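To make the "JSON Path output" mentioned above more concrete, here is a small stdlib-only sketch of resolving a path against a nested value. It is a simplified stand-in (a list of keys and integer indexes), not the JSONPath implementation the library uses; `resolve_path` and the sample data are hypothetical.

```python
from typing import Any, List, Union

PathPart = Union[str, int]


def resolve_path(data: Any, parts: List[PathPart]) -> Any:
    """Walk a nested dict/list structure one key or index at a time."""
    current = data
    for part in parts:
        # str parts look up dict keys, int parts index into lists
        current = current[part]
    return current


# Hypothetical sample: several documents wrapped in one JSON value.
json_value = {
    "documents": [
        {"name": "alice", "address": {"city": "Cairo"}},
        {"name": "bob", "address": {"city": "Giza"}},
    ]
}

# Roughly what a path such as $.documents[1].address.city would select.
city = resolve_path(json_value, ["documents", 1, "address", "city"])
print(city)
```

With synthesize_response set to False, the engine's response would be raw values of this kind rather than a sentence written by the LLM.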

For more details, you can refer to the source code of the JSONQueryEngine class and its related tests in the LlamaIndex repository.

I hope this helps! Let me know if you have any other questions.

Exorcismus · 2023-09-08T16:21:36Z

@dosu-beta the example you provided assumes I have a single JSON document; in my case it's a couple of thousand documents. Your example also doesn't actually connect to the database and read the data. Show me how to connect to MongoDB and query a collection using JSONQueryEngine.

@dosu-bot · 2023-09-08T16:29:56Z

🤖

Sure, I can provide an example of how to connect to a MongoDB, read data from a specific collection, and use the JSONQueryEngine in the LlamaIndex framework to perform queries on these documents. Here's a Python script that does just that:

# Import necessary libraries
import logging
import sys
import json
from llama_index import SummaryIndex, SimpleMongoReader, ServiceContext
from llama_index.indices.struct_store.json_query import JSONQueryEngine
from IPython.display import Markdown, display
import os

# Set up logging
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# Define MongoDB connection parameters
host = "<host>"
port = "<port>"
db_name = "<db_name>"
collection_name = "<collection_name>"

# Define query parameters
query_dict = {}
field_names = ["text"]

# Create a MongoDB reader and load data
reader = SimpleMongoReader(host, port)
documents = reader.load_data(
    db_name, collection_name, field_names, query_dict=query_dict
)

# Create a SummaryIndex from the documents
index = SummaryIndex.from_documents(documents)

# Create a ServiceContext with default settings
service_context = ServiceContext.from_defaults()

# Create a JSONQueryEngine and perform a query
json_value = json.loads(documents)
json_schema = {}  # Define your JSON schema here
query_engine = JSONQueryEngine(json_value, json_schema, service_context)
response = query_engine.query("<query_text>")

# Display the response
display(Markdown(f"<b>{response}</b>"))

Please replace <host>, <port>, <db_name>, <collection_name>, and <query_text> with your actual MongoDB host, port, database name, collection name, and the query you want to perform, respectively. Also, you need to define your JSON schema in the json_schema variable.
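Since the schema is left empty in the script, one way to get a starting point is to derive a rough schema from a sample document. The following is a minimal stdlib-only sketch under that assumption; `infer_schema` is a hypothetical helper, and its loose JSON-Schema-style output has not been checked against what JSONQueryEngine expects.

```python
from typing import Any, Dict


def infer_schema(value: Any) -> Dict[str, Any]:
    """Return a loose JSON-Schema-style description of a sample value."""
    if isinstance(value, dict):
        return {
            "type": "object",
            "properties": {k: infer_schema(v) for k, v in value.items()},
        }
    if isinstance(value, list):
        # Describe list items using the first element, if there is one.
        items = infer_schema(value[0]) if value else {}
        return {"type": "array", "items": items}
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    if value is None:
        return {"type": "null"}
    # Anything else is described as a string in this sketch.
    return {"type": "string"}


# Hypothetical sample document.
sample = {"name": "alice", "age": 30, "tags": ["a", "b"], "active": True}
json_schema = infer_schema(sample)
print(json_schema)
```

A schema inferred from one document only reflects that document, so fields that are missing or differently typed elsewhere in the collection would need to be added by hand.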

The SimpleMongoReader class in LlamaIndex uses the following parameters and methods for connecting to a MongoDB database and reading data from a collection:

Parameters:

host: The host address of the MongoDB server. It is an optional string parameter.
port: The port number on which the MongoDB server is running. It is an optional integer parameter.
uri: The MongoDB connection string URI. It is an optional string parameter.
max_docs: The maximum number of documents to load from the MongoDB collection. It is an integer parameter with a default value of 1000.

Methods:

__init__: This method is used to initialize the SimpleMongoReader object. It imports the pymongo package and creates a MongoClient instance using either the provided host and port or the uri. It also sets the max_docs attribute.
load_data: This method is used to load data from a MongoDB collection. It takes the following parameters:
- db_name: The name of the MongoDB database.
- collection_name: The name of the MongoDB collection.
- field_names: A list of field names to be concatenated from each MongoDB document. The default value is ["text"].
- query_dict: An optional dictionary to filter the documents in the MongoDB collection. The default value is None.

The load_data method connects to the specified MongoDB database and collection, performs a find operation with the provided query (if any), concatenates the specified fields from each document, and returns a list of Document objects. The content of each Document object can be accessed using the get_content method.
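The concatenation step described above can be pictured with a small stand-alone sketch. It does not use the real reader: `concat_fields` is a hypothetical helper that mirrors the described behavior (with an added `str()` call so non-string values do not fail), shown mainly to make clear that the resulting text no longer carries the document's JSON structure.

```python
from typing import Any, Dict, List


def concat_fields(item: Dict[str, Any], field_names: List[str]) -> str:
    """Join the selected fields of one Mongo-style document into one string."""
    text = ""
    for field_name in field_names:
        if field_name not in item:
            raise ValueError(f"`{field_name}` field not found in Mongo document.")
        # str() is an assumption of this sketch, not taken from the reader.
        text += str(item[field_name])
    return text


# Hypothetical document as a plain dict.
item = {"_id": 1, "title": "Intro", "body": " to MongoDB"}
text = concat_fields(item, ["title", "body"])
print(text)
```

Because only flat text comes out, the Document objects are suited to text indexes, while JSONQueryEngine needs the original nested values.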

For more details, you can refer to the MongoDemo notebook and the json_query.py file in the LlamaIndex repository.

Exorcismus · 2023-09-09T16:04:41Z

@dosu-bot I am now getting an error with

json_value = json.loads(documents)
  File "/usr/local/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not list

@dosu-bot · 2023-09-09T16:11:13Z

🤖

Hello,

The error you're encountering is due to the fact that the json.loads() function is expecting a string, bytes, or bytearray object, but it's receiving a list. This is because the load_data function of the SimpleMongoReader class in LlamaIndex returns a list of Document objects, not a JSON string.
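The failure is easy to reproduce without MongoDB. The sketch below uses a made-up list called `rows` in place of the loaded documents and shows that json.loads rejects a list, while serializing with json.dumps first gives it the string it expects.

```python
import json

# Stand-in for data that is already a Python list, not a JSON string.
rows = [{"name": "alice"}, {"name": "bob"}]

try:
    json.loads(rows)  # type: ignore[arg-type]
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

# Serializing first gives json.loads a string to parse.
round_tripped = json.loads(json.dumps(rows))

print(error_message)
print(round_tripped == rows)
```

Data that is already a list of dictionaries does not need json.loads at all; the round trip is shown only to illustrate what the function accepts.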

To resolve this issue, you don't need to convert the documents into a JSON object. Instead, you can directly use the list of Document objects returned by the load_data function. Here's how you can modify your code:

# Create a MongoDB reader and load data
reader = SimpleMongoReader(host, port)
documents = reader.load_data(
    db_name, collection_name, field_names, query_dict=query_dict
)

# Now, 'documents' is a list of Document objects. You can iterate over this list and access the data of each document.
for document in documents:
    print(document.text)  # or any other operation you want to perform on the document

In this code, documents is a list of Document objects. You can iterate over this list and access the data of each document using the .text attribute.

I hope this helps! Let me know if you have any other questions.

Exorcismus · 2023-09-09T16:48:26Z

@dosu-beta is it a best practice to loop over all my documents? My goal is to use an LLM on my MongoDB JSON documents.

@dosu-bot · 2023-09-09T16:53:57Z

🤖

Hello,

The error you're encountering is due to the fact that json.loads() expects a string, bytes, or bytearray object, but it's receiving a list. This is because the load_data method of the SimpleMongoReader class returns a list of Document objects, not a JSON string.

To use the Language Model (LLM) on MongoDB JSON documents in the LlamaIndex framework, you should first convert the BSON documents into JSON. This can be done using the bson.json_util module in the pymongo library, which provides tools for working with MongoDB BSON documents.

Here is an example of how you can modify the load_data method in the SimpleMongoReader class to handle MongoDB BSON documents:

from typing import Dict, List, Optional

from bson import json_util
from llama_index import Document

def load_data(
    self,
    db_name: str,
    collection_name: str,
    field_names: List[str] = ["text"],
    query_dict: Optional[Dict] = None,
) -> List[Document]:
    documents = []
    db = self.client[db_name]
    if query_dict is None:
        cursor = db[collection_name].find()
    else:
        cursor = db[collection_name].find(query_dict)

    for item in cursor:
        text = ""
        for field_name in field_names:
            if field_name not in item:
                raise ValueError(
                    f"`{field_name}` field not found in Mongo document."
                )
            # Serialize the BSON value to a JSON string instead of appending it as-is
            text += json_util.dumps(item[field_name])

        documents.append(Document(text=text))

    return documents

This modification uses bson.json_util.dumps to convert the BSON field into a JSON string, which can then be processed by the rest of the method as before.
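For comparison, the same idea can be sketched with the standard library only: json.dumps with a `default` hook that turns values JSON cannot represent (dates, or ids such as ObjectId) into strings. `to_json_text` and `FakeObjectId` are hypothetical names used so the sketch runs without pymongo; bson.json_util.dumps emits MongoDB Extended JSON, so its output will look different.

```python
import json
from datetime import datetime
from typing import Any


class FakeObjectId:
    """Hypothetical stand-in for bson.ObjectId so the sketch runs without pymongo."""

    def __init__(self, value: str) -> None:
        self.value = value

    def __str__(self) -> str:
        return self.value


def _fallback(obj: Any) -> str:
    # json.dumps calls this only for values it cannot serialize itself.
    if isinstance(obj, datetime):
        return obj.isoformat()
    return str(obj)


def to_json_text(doc: dict) -> str:
    """Serialize one Mongo-style document to a JSON string."""
    return json.dumps(doc, default=_fallback, sort_keys=True)


doc = {
    "_id": FakeObjectId("650000000000000000000001"),
    "created": datetime(2023, 9, 8),
    "n": 1,
}
json_text = to_json_text(doc)
print(json_text)
```

Parsing `json_text` back with json.loads yields plain dictionaries, which is the kind of value a JSON query engine can work with.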

As for your concern about efficiency, this approach should be reasonably efficient even for large numbers of documents. The bson.json_util.dumps function is quite fast, and the time complexity of the load_data method is linear with respect to the number of documents.

I hope this helps! Let me know if you have any other questions.

lalitbyteiq · 2024-01-24T09:47:16Z

@Exorcismus did you find a solution for this?
