-
We use structured data queries through a Pandas DataFrame query pipeline and have seen promising results. However, our data tables are huge, containing millions of rows. What are the best practices for handling these production pipelines? Should we continue with the same method, or would it be more effective to convert our structured data into unstructured embeddings? Any suggestions would be greatly appreciated.
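(For context, a pipeline like the one described might look roughly like the sketch below. This is not the poster's actual code; it assumes the `PandasQueryEngine` from the `llama-index-experimental` package, a default LLM configured via environment variables, and a hypothetical CSV file. Import paths may differ across LlamaIndex versions.)

```python
import pandas as pd
from llama_index.experimental.query_engine import PandasQueryEngine

# Hypothetical table; in production this would hold millions of rows.
df = pd.read_csv("sales.csv")

# The engine asks an LLM to translate a natural-language question into a
# pandas expression, then executes that expression against the DataFrame.
query_engine = PandasQueryEngine(df=df, verbose=True)

response = query_engine.query("What is the total revenue per region?")
print(response)
```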
-
For handling large data tables via the Pandas query pipeline, LlamaIndex recommends specific optimizations such as adjusting chunk sizes, using hybrid search, and applying metadata filters [3].

Regarding whether to continue with the Pandas query pipeline or convert your structured data into unstructured embeddings, it depends on your specific use case: LlamaIndex provides built-in functionality for converting structured data into unstructured embeddings. By following these practices, you can effectively manage and query large datasets with the Pandas query pipeline, or transition to unstructured embeddings based on your needs.
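As a rough illustration of the embeddings path and the optimizations named above, here is a minimal sketch. `Settings`, `VectorStoreIndex`, and the metadata-filter classes are standard LlamaIndex APIs, but the file path, column names, and filter values are made up, and a real pipeline would batch the row conversion and use a production vector store.

```python
import pandas as pd
from llama_index.core import Document, Settings, VectorStoreIndex
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Smaller chunks keep each embedded node focused when rows are verbose.
Settings.chunk_size = 512

df = pd.read_csv("transactions.csv")  # hypothetical table with many rows

# Convert each row into an unstructured Document, carrying key columns as
# metadata so they can be used for filtering at query time.
documents = [
    Document(
        text=", ".join(f"{col}: {row[col]}" for col in df.columns),
        metadata={"region": row["region"], "year": int(row["year"])},
    )
    for _, row in df.iterrows()
]

index = VectorStoreIndex.from_documents(documents)

# Metadata filters cut the search space before vector retrieval, which matters
# at the millions-of-rows scale. Hybrid (keyword + vector) search additionally
# requires a vector store backend that supports it.
filters = MetadataFilters(filters=[ExactMatchFilter(key="region", value="EMEA")])
query_engine = index.as_query_engine(filters=filters, similarity_top_k=5)
print(query_engine.query("What was the total revenue for EMEA in 2023?"))
```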
-
@dosu thank you. But my question is: what are the best practices for LlamaIndex to read millions of records in one shot and return better query results?
-
@dosu, thank you again. This is my Pandas query pipeline code. How do I incorporate your snippet into my code? Sorry, I am new and learning fast :-) My code:
-
To incorporate the LlamaIndex snippet into your existing Pandas query pipeline code, you can integrate the necessary parts of the snippet directly. Here's how you can modify your code:
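Since neither the earlier snippet nor the poster's code appears in this thread, the following is only a generic sketch of how the two pieces might be wired together: keep the existing DataFrame-based engine for precise aggregations, and add an embedding-based engine (built from the same rows, as in the earlier example) for semantic questions. The file path is a placeholder and `rows_to_documents` is a hypothetical helper.

```python
import pandas as pd
from llama_index.core import Document, Settings, VectorStoreIndex
from llama_index.experimental.query_engine import PandasQueryEngine

Settings.chunk_size = 512  # from the earlier recommendations

df = pd.read_csv("your_data.csv")  # placeholder for your real data source


def rows_to_documents(frame: pd.DataFrame) -> list[Document]:
    """Hypothetical helper: turn each row into a Document for embedding."""
    return [
        Document(text=", ".join(f"{col}: {row[col]}" for col in frame.columns))
        for _, row in frame.iterrows()
    ]


# Existing pipeline: exact, pandas-level answers (sums, group-bys, counts).
pandas_engine = PandasQueryEngine(df=df, verbose=True)

# Added piece: embedding-based engine for fuzzier, semantic questions.
vector_engine = VectorStoreIndex.from_documents(rows_to_documents(df)).as_query_engine()

print(pandas_engine.query("What is the total revenue per region?"))
print(vector_engine.query("Summarize the unusual transactions this quarter."))
```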