Commit 7316747

Authored by prakriti-solankey, praveshkumar1988, kartikpersistent, vasanthasaikalluri, and aashipandya
Staging (#361)
* Remove unused library and commented code * Issue fixed * 224 color mismatch in graph viz model (#225) * count changes * added legend count * bloom url changes * lint changes * removal of console --------- Co-authored-by: kartikpersistent <[email protected]> * Modified retrieval query (#226) * Manage file status (#227) * manage status of processing file * Remove progress bar from Generate Graph Document button * 224 color mismatch in graph viz model (#225) * count changes * added legend count * bloom url changes * lint changes * removal of console --------- Co-authored-by: kartikpersistent <[email protected]> * Modified retrieval query (#226) * Convert KNN score value string to Float --------- Co-authored-by: Prakriti Solankey <[email protected]> Co-authored-by: kartikpersistent <[email protected]> Co-authored-by: vasanthasaikalluri <[email protected]> * Chatbot optimization (#230) * Optimised and cleaned Chatbot Integration * modified chat integration functions * bug changes (#231) * batch queries and relationship count correction (#232) * batch queries and relationship count correction * status should not be processing * 'url_changes' (#235) * Color mismatch in graph viz model (#233) * count changes * added legend count * bloom url changes * lint changes * removal of console * 'colour' * 'color' --------- Co-authored-by: kartikpersistent <[email protected]> * lint fixes * Create schema endpoint to get labels and relationtypes * source link fixes * Handle exception when youtube Api unable to fetch transcript youtube_transcript_api._errors.TranscriptsDisabled * configured backend status based the ENV Variable (#246) * configured backend status based the ENV Variable * removed the connection status check in PROD enviournment * Requirement split gcs and s3 icons on the page (#247) * separated S3 and GCS * resolved the conflicts * Update error message in response * dev env * Chatbot optimization (#250) * Optimised and cleaned Chatbot Integration * modified chat integration functions * Modified max_tokens and min_score * Modified prompt and added error message * Modified Prompt and error message * 245 bug chatbot UI (#252) * fixed chatbot aspect ratio/width issue * fixed chat bot ui issue * 'hoverchanges' (#254) * added settings panel for relationship type and node label selection (#234) * added settings panel for relationship type and node label selection * added checkbox for fetching existing scehma * integrated /schema api * added dependency in the useCallback * usercredentials payload fix * Accept param in Extract API to filter graph to allowedNode and allowedRealationship * CHange param type in extract * Issue fixed * integrated extract api * updated string as list for allowednodes and allowedrelations * removed button on settings * format fixes * Added baseEntityLabel as True --------- Co-authored-by: Pravesh Kumar <[email protected]> Co-authored-by: aashipandya <[email protected]> * Handle File status for long time (#256) * format fixes * fixed failed status bug * Fixed list.split issue in allowed nodes * Issue fixed * Updated check of empty allowed nodes and allowed relations list (#258) * added settings panel for relationship type and node label selection * added checkbox for fetching existing scehma * integrated /schema api * added dependency in the useCallback * usercredentials payload fix * Accept param in Extract API to filter graph to allowedNode and allowedRealationship * CHange param type in extract * Issue fixed * integrated extract api * updated string as list for 
allowednodes and allowedrelations * check for empty list of nodes and relations --------- Co-authored-by: kartikpersistent <[email protected]> Co-authored-by: Pravesh Kumar <[email protected]> * Removed wrong commit * Updated condition for allowed nodes relations (#265) * added settings panel for relationship type and node label selection * added checkbox for fetching existing scehma * integrated /schema api * added dependency in the useCallback * usercredentials payload fix * Accept param in Extract API to filter graph to allowedNode and allowedRealationship * CHange param type in extract * Issue fixed * integrated extract api * updated string as list for allowednodes and allowedrelations * check for empty list of nodes and relations * condition updated * removed frontend changes --------- Co-authored-by: kartikpersistent <[email protected]> Co-authored-by: Pravesh Kumar <[email protected]> * changed the checkbox to button (#266) * Adding link to Aura on the connection modal (#263) * Remove node id title changes (#264) * Remove the id and type changes from the nodes as that makes them incompatible with the relationships * common function for saving nodes and relations to graph --------- Co-authored-by: aashipandya <[email protected]> * fixed the legend container height issue (#267) * added supported files description (#268) * fixed legends gap issue * format fixes * parameter should be none not str (#269) * Chatbot latency optimization (#270) * Added graph Object and Modified Retrieval query * Added Database parameter to API * Modified Database parameter * added connect in place of submit ,added connect to neo4j aura in place of connect to neo4j (#271) * added connect in place of submit added connect to neo4j aura inplace of connect to neo4j * added open graph with bloom * removed the Aura as it can connect with any neo4j db * label colour fix (#273) * removed default Person and Works AT for allowed nodes and relationship types * changed the Wikipedia input label * removed unused constants * wikipedia whitespaces fix * wikipedia url and youtube white spaces error (#280) * urgent fix (#281) * Info in the chat response (#282) * Added graph Object and Modified Retrieval query * Added Database parameter to API * Modified Database parameter * Added info parameter to output * reestablished the sse on page refresh to sync the processing status (#285) * UI bugs/features (#284) * disabled the use existing schema on no node labels * added docs Icon * decreased the alert window in the success scenario * added trim for inputs for white space handling in the youtube wikipedia gcs * Time estimation alert for large files (#287) * reestablished the sse on page refresh to sync the processing status * added the time estimation message for large files * showing alert only once * delete api for removing documents (#290) * Show connection uri (#291) * added Connection URI * UI updated * removed duplicate useEffect * Backend queries (#257) * created backend queries for graph * Modified username parameter * Added GET request * Modified exceptions * 'frontendHandling' * removed session id parameter * doc_limit * 'type_changes' * 'nameChanges' * graph viz ui * legend renamed * renamed * removed import * removed duplicate useEffect --------- Co-authored-by: kartikpersistent <[email protected]> Co-authored-by: Prakriti Solankey <[email protected]> Co-authored-by: Prakriti Solankey <[email protected]> * Create Local-to-global-genAI_GraphRAG_V1 (#292) created summary of some papers and designed the complete flow _v1 
* Delete list of documents from db (#293) * delete api for removing documents * Added list of documents for deletion * Update exception to track Json_payload * Delete api (#296) * delete api for removing documents * Added list of documents for deletion * added delete functionality --------- Co-authored-by: aashipandya <[email protected]> * Delete api (#298) * delete api for removing documents * Added list of documents for deletion * added delete functionality * changed the message and disabled the delete files if there is no selected files * format fixes --------- Co-authored-by: aashipandya <[email protected]> * removed duplicate variables * css change * upgraded the nvl package * removed duplicate delete button * closing the event source on failed condition * nvl issue 261 - private package (#299) * Fix issue #261 #261 * Fix issue #261 #261 --------- Co-authored-by: kartikpersistent <[email protected]> * Delete with entities switch (#300) * added delete entities switch * added the hover message on checkboxes * changed query for deletion of files * changed the font size the confimation message --------- Co-authored-by: aashipandya <[email protected]> * docker changes * disabled the checkbox when File status is uploading or processing * Added Cloud logging library for strucred logs * replaced switch with checkbox * removed unused imports * spell mistake * removed the cancel button on delete popup modal * bug_fix_labels_mismatch_count * deletion scenarios * fixed / trailing bug in s3 bucket url * Create Local_to_global poc v1.1 (#303) V1.1 the extension of local to global V1,where the each element are analyzed and described in deteail.At the end conclusion is made based on the analysis in the paper.Other features and optimisation to improve the robustness of the sytem is under investigation. 
* Switch frontend port in docker-compose to 8080 to match with the frontend Dockerfile (#305) * Add in Each api google log struct * Implemented polling for status update (#309) * Implemented polling for status update * status updation for large files * added example env in the frontend * updated the readme with frontend env info * readme changes * readme updates * setting up failed status * Chatbot info icon (#297) * Added Info to the chat response * UI changes * Modified chat response * added entities to response info * modified entities in response info * Modified entities response count in info * clearhistory * chatbot * typeCheck * state management * chatbot-ui-overflow * css_changes --------- Co-authored-by: vasanthasaikalluri <[email protected]> * ellipsis * dockerfile * Failed status update fix (#315) * removed Failed status update on failure of servers side event * Update .gitignore * url spell fix * Msenechal/issue295 (#314) * Removed triton package from requirements.txt * Fixed Google Cloud logging + some docker ENV overwritten * Removed ENV print logs * delete local file in case processing failed (#316) * table-css * added placement for tooltip * DEV to STAGING (#324) * Remove unused library and commented code * Issue fixed * 224 color mismatch in graph viz model (#225) * count changes * added legend count * bloom url changes * lint changes * removal of console --------- Co-authored-by: kartikpersistent <[email protected]> * Modified retrieval query (#226) * Manage file status (#227) * manage status of processing file * Remove progress bar from Generate Graph Document button * 224 color mismatch in graph viz model (#225) * count changes * added legend count * bloom url changes * lint changes * removal of console --------- Co-authored-by: kartikpersistent <[email protected]> * Modified retrieval query (#226) * Convert KNN score value string to Float --------- Co-authored-by: Prakriti Solankey <[email protected]> Co-authored-by: kartikpersistent <[email protected]> Co-authored-by: vasanthasaikalluri <[email protected]> * Chatbot optimization (#230) * Optimised and cleaned Chatbot Integration * modified chat integration functions * bug changes (#231) * batch queries and relationship count correction (#232) * batch queries and relationship count correction * status should not be processing * 'url_changes' (#235) * Color mismatch in graph viz model (#233) * count changes * added legend count * bloom url changes * lint changes * removal of console * 'colour' * 'color' --------- Co-authored-by: kartikpersistent <[email protected]> * lint fixes * Create schema endpoint to get labels and relationtypes * source link fixes * Handle exception when youtube Api unable to fetch transcript youtube_transcript_api._errors.TranscriptsDisabled * configured backend status based the ENV Variable (#246) * configured backend status based the ENV Variable * removed the connection status check in PROD enviournment * Requirement split gcs and s3 icons on the page (#247) * separated S3 and GCS * resolved the conflicts * Update error message in response * dev env * Chatbot optimization (#250) * Optimised and cleaned Chatbot Integration * modified chat integration functions * Modified max_tokens and min_score * Modified prompt and added error message * Modified Prompt and error message * 245 bug chatbot UI (#252) * fixed chatbot aspect ratio/width issue * fixed chat bot ui issue * 'hoverchanges' (#254) * added settings panel for relationship type and node label selection (#234) * added settings panel for 
relationship type and node label selection * added checkbox for fetching existing scehma * integrated /schema api * added dependency in the useCallback * usercredentials payload fix * Accept param in Extract API to filter graph to allowedNode and allowedRealationship * CHange param type in extract * Issue fixed * integrated extract api * updated string as list for allowednodes and allowedrelations * removed button on settings * format fixes * Added baseEntityLabel as True --------- Co-authored-by: Pravesh Kumar <[email protected]> Co-authored-by: aashipandya <[email protected]> * Handle File status for long time (#256) * format fixes * fixed failed status bug * Fixed list.split issue in allowed nodes * Issue fixed * Updated check of empty allowed nodes and allowed relations list (#258) * added settings panel for relationship type and node label selection * added checkbox for fetching existing scehma * integrated /schema api * added dependency in the useCallback * usercredentials payload fix * Accept param in Extract API to filter graph to allowedNode and allowedRealationship * CHange param type in extract * Issue fixed * integrated extract api * updated string as list for allowednodes and allowedrelations * check for empty list of nodes and relations --------- Co-authored-by: kartikpersistent <[email protected]> Co-authored-by: Pravesh Kumar <[email protected]> * Removed wrong commit * Updated condition for allowed nodes relations (#265) * added settings panel for relationship type and node label selection * added checkbox for fetching existing scehma * integrated /schema api * added dependency in the useCallback * usercredentials payload fix * Accept param in Extract API to filter graph to allowedNode and allowedRealationship * CHange param type in extract * Issue fixed * integrated extract api * updated string as list for allowednodes and allowedrelations * check for empty list of nodes and relations * condition updated * removed frontend changes --------- Co-authored-by: kartikpersistent <[email protected]> Co-authored-by: Pravesh Kumar <[email protected]> * changed the checkbox to button (#266) * Adding link to Aura on the connection modal (#263) * Remove node id title changes (#264) * Remove the id and type changes from the nodes as that makes them incompatible with the relationships * common function for saving nodes and relations to graph --------- Co-authored-by: aashipandya <[email protected]> * fixed the legend container height issue (#267) * added supported files description (#268) * fixed legends gap issue * format fixes * parameter should be none not str (#269) * Chatbot latency optimization (#270) * Added graph Object and Modified Retrieval query * Added Database parameter to API * Modified Database parameter * added connect in place of submit ,added connect to neo4j aura in place of connect to neo4j (#271) * added connect in place of submit added connect to neo4j aura inplace of connect to neo4j * added open graph with bloom * removed the Aura as it can connect with any neo4j db * label colour fix (#273) * removed default Person and Works AT for allowed nodes and relationship types * changed the Wikipedia input label * removed unused constants * wikipedia whitespaces fix * wikipedia url and youtube white spaces error (#280) * urgent fix (#281) * Info in the chat response (#282) * Added graph Object and Modified Retrieval query * Added Database parameter to API * Modified Database parameter * Added info parameter to output * reestablished the sse on page refresh to sync the 
processing status (#285) * UI bugs/features (#284) * disabled the use existing schema on no node labels * added docs Icon * decreased the alert window in the success scenario * added trim for inputs for white space handling in the youtube wikipedia gcs * Time estimation alert for large files (#287) * reestablished the sse on page refresh to sync the processing status * added the time estimation message for large files * showing alert only once * delete api for removing documents (#290) * Show connection uri (#291) * added Connection URI * UI updated * removed duplicate useEffect * Backend queries (#257) * created backend queries for graph * Modified username parameter * Added GET request * Modified exceptions * 'frontendHandling' * removed session id parameter * doc_limit * 'type_changes' * 'nameChanges' * graph viz ui * legend renamed * renamed * removed import * removed duplicate useEffect --------- Co-authored-by: kartikpersistent <[email protected]> Co-authored-by: Prakriti Solankey <[email protected]> Co-authored-by: Prakriti Solankey <[email protected]> * Delete list of documents from db (#293) * delete api for removing documents * Added list of documents for deletion * Update exception to track Json_payload * Delete api (#296) * delete api for removing documents * Added list of documents for deletion * added delete functionality --------- Co-authored-by: aashipandya <[email protected]> * Delete api (#298) * delete api for removing documents * Added list of documents for deletion * added delete functionality * changed the message and disabled the delete files if there is no selected files * format fixes --------- Co-authored-by: aashipandya <[email protected]> * removed duplicate variables * css change * upgraded the nvl package * removed duplicate delete button * closing the event source on failed condition * nvl issue 261 - private package (#299) * Fix issue #261 #261 * Fix issue #261 #261 --------- Co-authored-by: kartikpersistent <[email protected]> * Delete with entities switch (#300) * added delete entities switch * added the hover message on checkboxes * changed query for deletion of files * changed the font size the confimation message --------- Co-authored-by: aashipandya <[email protected]> * docker changes * disabled the checkbox when File status is uploading or processing * Added Cloud logging library for strucred logs * replaced switch with checkbox * removed unused imports * spell mistake * removed the cancel button on delete popup modal * bug_fix_labels_mismatch_count * deletion scenarios * fixed / trailing bug in s3 bucket url * Switch frontend port in docker-compose to 8080 to match with the frontend Dockerfile (#305) * Add in Each api google log struct * Implemented polling for status update (#309) * Implemented polling for status update * status updation for large files * added example env in the frontend * updated the readme with frontend env info * readme changes * readme updates * setting up failed status * Chatbot info icon (#297) * Added Info to the chat response * UI changes * Modified chat response * added entities to response info * modified entities in response info * Modified entities response count in info * clearhistory * chatbot * typeCheck * state management * chatbot-ui-overflow * css_changes --------- Co-authored-by: vasanthasaikalluri <[email protected]> * ellipsis * dockerfile * Failed status update fix (#315) * removed Failed status update on failure of servers side event * Update .gitignore * url spell fix * Msenechal/issue295 (#314) * Removed 
triton package from requirements.txt * Fixed Google Cloud logging + some docker ENV overwritten * Removed ENV print logs * delete local file in case processing failed (#316) * table-css * added placement for tooltip --------- Co-authored-by: Pravesh Kumar <[email protected]> Co-authored-by: kartikpersistent <[email protected]> Co-authored-by: vasanthasaikalluri <[email protected]> Co-authored-by: aashipandya <[email protected]> Co-authored-by: Morgan Senechal <[email protected]> Co-authored-by: Michael Hunger <[email protected]> * s3 url fix * s3 url fix * Added gpt 4o model and fix for gcs bucket (#329) * Added gpt 4o model and fix for gcs bucket * OpenAI GPT 4o model label changes --------- Co-authored-by: kartikpersistent <[email protected]> * added drop zone icon * Dev (#330) * s3 url fix * Added gpt 4o model and fix for gcs bucket (#329) * Added gpt 4o model and fix for gcs bucket * OpenAI GPT 4o model label changes --------- Co-authored-by: kartikpersistent <[email protected]> * added drop zone icon --------- Co-authored-by: aashipandya <[email protected]> * removed cloud icon from button * removed cloud icon from button * exponential backoff implementation * Create RAPTOR_RECURSIVE ABSTRACTIVE PROCESSING v1 Tree based DB approach * Update RAPTOR_RECURSIVE ABSTRACTIVE PROCESSING v1 (#334) Tree based DB * Drag the legends panel similar to workspace (#335) * resize-legends * css change * lint * Create Data_Analysis (#339) This is created for neaw experint on data processing and analysis * added driver config (#341) * added driver config * remove score.py * Debug config in graph (#342) * added driver config * remove score.py * added user agent in env * Modified chatbot for increased performance and chat history issues (#340) * 321 documents selection for processing and graph visualization (#343) * added multi select extraction * added support for multiple documents * integrated api * format fixes * conditional rendering of limit input * Graph Query : Added Doc Limit parameter * showing the selected files length * handled generating graph using multiselect * added the count to respective buttons * removed doclimit * removed the doc limit * fixed inspectedname issue * format fixes * added toottip message on show graph message --------- Co-authored-by: vasanthasaikalluri <[email protected]> * Add files via upload * Page number of pdf (#347) * added page number to chunk node * page_number for only local file upload * connect fix (#349) * Remove neo4j.debug watch and added , refresh_schema=False, sanitize=True * Update graph chunk processed (#358) * update graph after fixed number of chunk processed * update node_count based on no of chunks processed * Update graph after spefic number of chunks * removed the large file check * added missing dependency --------- Co-authored-by: kartikpersistent <[email protected]> * Support for url parameters (#357) * added the helper method * integrated the URL search params without password * integrated password for url params * added password * removed unused code * format fixes * format fixes and removed console logs * DEV to STAGING (#360) * s3 url fix * Added gpt 4o model and fix for gcs bucket (#329) * Added gpt 4o model and fix for gcs bucket * OpenAI GPT 4o model label changes --------- Co-authored-by: kartikpersistent <[email protected]> * added drop zone icon * removed cloud icon from button * exponential backoff implementation * Drag the legends panel similar to workspace (#335) * resize-legends * css change * lint * added driver config (#341) * 
added driver config * remove score.py * Debug config in graph (#342) * added driver config * remove score.py * added user agent in env * Modified chatbot for increased performance and chat history issues (#340) * 321 documents selection for processing and graph visualization (#343) * added multi select extraction * added support for multiple documents * integrated api * format fixes * conditional rendering of limit input * Graph Query : Added Doc Limit parameter * showing the selected files length * handled generating graph using multiselect * added the count to respective buttons * removed doclimit * removed the doc limit * fixed inspectedname issue * format fixes * added toottip message on show graph message --------- Co-authored-by: vasanthasaikalluri <[email protected]> * Page number of pdf (#347) * added page number to chunk node * page_number for only local file upload * connect fix (#349) * Remove neo4j.debug watch and added , refresh_schema=False, sanitize=True * Update graph chunk processed (#358) * update graph after fixed number of chunk processed * update node_count based on no of chunks processed * Update graph after spefic number of chunks * removed the large file check * added missing dependency --------- Co-authored-by: kartikpersistent <[email protected]> * Support for url parameters (#357) * added the helper method * integrated the URL search params without password * integrated password for url params * added password * removed unused code * format fixes * format fixes and removed console logs --------- Co-authored-by: kartikpersistent <[email protected]> Co-authored-by: aashipandya <[email protected]> Co-authored-by: Prakriti Solankey <[email protected]> Co-authored-by: vasanthasaikalluri <[email protected]> * vite version upgradtion * Closing neo4j connection and common llm initialization (#364) * Closing neo4j connection and common llm initialization * updated version of neo4j * "," included graph view issue fixed * Logging merged file path * removed the dotty animation * logging removed merges * Added uvicorn wrokers in Docker file and Issue fixed for delete file. 
* Persisting the node label ,rel label values from locastorage (#363) * Persisting the node label ,rel label values from locastorage * format fixes * restricted the alert foe only large files * added filesize * Added gunicorn in docker * Gcs auth login (#310) * changes for gcloud auth login backend using client json * passing project id as parameter to gcs bucket * gcloud auth apis * creted source node for gcs bucket files * added google auth * added google auth login * commented token request temporary * node backend for refresh token * ignore changes * access token from frontend * Integrated the google auth login flow * clearing the project id after success or failure * added project id in scan response * added project for the extract api * added error messages * bucket name check * check for bucket exist * message fixes * showing the alert messages in snackbar * added client id in example env --------- Co-authored-by: kartikpersistent <[email protected]> * url params fix * time fix alert in seconds * Wikipedia source to accept all valid urls (#371) * docker file changes * Graph view from info model (#369) * info modal * info modal * Added chunk entities and modified chat response * added page numbers * api changes * changes for backend type * linting * added contsants.py * Modified sources in chatbot * Modified sources in chatbot * format changes * Graph view from chat info * Format changes * icon tooltip * li css changes * type , format fixes * node changes * refactoring code --------- Co-authored-by: vasanthasaikalluri <[email protected]> Co-authored-by: kartikpersistent <[email protected]> * file name in chunk property (#377) * Cancelled processing job (#374) * User can cancel the running job process * Change the status as Cancelled * Add processed chunk in source node to processing progress bar on UI * added button for cancelling processing job * Disabled state updation * status processed_chunk progress * extra comma fix * stopping the sse on cancelled status * processing progress * yarn lint fix and format fixes * progress bar UI fixes * Fixed issue of status when user immediately cancelled the job --------- Co-authored-by: Pravesh Kumar <[email protected]> * Wikipedia to accept all ids and multiple languages (#376) * Wikipedia source to accept all valid urls * wikipedia url to support multiple languages * integrated wiki langauge param for extract api --------- Co-authored-by: kartikpersistent <[email protected]> * changed the Disabled check for view grap and db url fixed * format and lint fixes * added loading state for Use existing schema * Add timeout in docker for gunicorn workers * Add cancel icon to info popup (#384) * Info Modal Changes * css changes * removed document status --------- Co-authored-by: Pravesh Kumar <[email protected]> Co-authored-by: kartikpersistent <[email protected]> Co-authored-by: vasanthasaikalluri <[email protected]> Co-authored-by: aashipandya <[email protected]> Co-authored-by: Morgan Senechal <[email protected]> Co-authored-by: Michael Hunger <[email protected]> Co-authored-by: ManjuPatel1 <[email protected]>
1 parent 6fa7f11 commit 7316747

74 files changed: +2679 / -705 lines

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -165,3 +165,5 @@ google-cloud-cli-469.0.0-linux-x86_64.tar.gz
 /data/llm-experiments-387609-c73d512ca3b1.json
 /backend/src/merged_files
 /backend/src/chunks
+/backend/merged_files
+google-cloud-cli-476.0.0-linux-x86_64.tar.gz
Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
Graph DB Connectors and GenAI Integrations POC_v1

"This is version v1, where the main content is taken from Paper 1 below and some other content is added from other papers and blogs."

Paper 1: From Local to Global: A Graph RAG Approach to Query-Focused Summarization

"A Python-based implementation of both global and local Graph RAG approaches is forthcoming at https://aka.ms/graphrag."

The Graph RAG approach uses the natural modularity of graphs to partition data for global summarization. It uses an LLM to build a graph-based text index in two stages:
1. Derive an entity knowledge graph from the source documents.
2. Pre-generate community summaries for all groups of closely related entities.

It can answer questions such as "What are the main themes in the dataset?", which is an inherently query-focused summarization (QFS) task. The Graph RAG approach improves question answering over private text corpora and scales with both the generality of user questions and the quantity of source text to be indexed. Graph RAG leads to substantial improvements in both the comprehensiveness and diversity of generated answers.

Community descriptions provide complete coverage of the underlying graph index and the input documents it represents. Query-focused summarization of an entire corpus is then made possible using a map-reduce approach: first using each community summary to answer the query independently and in parallel, then summarizing all relevant partial answers into a final global answer.

Figure 1: Graph RAG pipeline using an LLM-derived graph index of source document text

I. Data Ingestion:
1. Documents/chunks/text preprocessing:
To reduce document size and improve latency, use text summarization for heavy documents or multi-document inputs with the steps below:
Step 1: LLM (use a specific LLM embedding to summarize documents).
Step 2: Knowledge graph to reduce size, with entities, relationships, and their properties as subgraphs.
Note: The above steps can be followed bidirectionally.

2. Create the vector DB/embedding/indexing with an LLM embedding.

II. Vector Embedding/Indexing Storage
Generate a KG from the embeddings and store it in a graph DB, or store the embeddings in FAISS/Pinecone, to improve latency and accuracy.
or
Both methods can be combined (KG + vector embedding) and stored in the DB to handle both structured and unstructured data.

Generate four community levels (C0, C1, C2, C3) of Graph RAG summaries from the embeddings/KG of the document or multi-document corpus using a text-summarization map-reduce approach.
C0: Uses root-level community summaries (fewest in number) to answer user queries.
C1: Uses high-level community summaries to answer queries. These are sub-communities of C0, if present, otherwise C0 communities projected down.
C2: Uses intermediate-level community summaries to answer queries. These are sub-communities of C1, if present, otherwise C1 communities projected down.
C3: Uses low-level community summaries (greatest in number) to answer queries. These are sub-communities of C2, if present, otherwise C2 communities projected down.

Figure 2.1: Communities' Summary        Figure 2.2: Communities Graph

Figure 3: Summarized Community Graph

III. Chat Response/Architecture:
Approaches: multi-hop RAG, memory-based response, head-to-head measures.
Head-to-head measures that can be used as performance metrics with an LLM evaluator are as follows:
• Comprehensiveness: How much detail does the answer provide to cover all aspects and details of the question?
• Diversity: How varied and rich is the answer in providing different perspectives and insights on the question?

For a given community level (Figs. 2.1, 2.2 & 3), the global answer to any user query is generated as follows (a minimal sketch follows the list):

• Prepare community summaries. Community summaries are randomly shuffled and divided into chunks of pre-specified token size. This ensures relevant information is distributed across chunks, rather than concentrated (and potentially lost) in a single context window.

• Map community answers. Generate intermediate answers in parallel, one for each chunk. The LLM is also asked to generate a score between 0 and 100 indicating how helpful the generated answer is in answering the target question. Answers with score 0 are filtered out.

• Reduce to global answer. Intermediate community answers are sorted in descending order of helpfulness score and iteratively added into a new context window until the token limit is reached. This final context is used to generate the global answer returned to the user.
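A minimal Python sketch of this prepare/map/reduce flow, assuming a generic llm(prompt) callable and a whitespace word count as a stand-in for a real tokenizer; the prompt wording and token budgets are illustrative assumptions, not the paper's implementation:

import random
import re

def global_answer(question, community_summaries, llm, chunk_tokens=4000, context_tokens=8000):
    # Map-reduce query-focused summarization over community summaries (sketch).
    n_tokens = lambda text: len(text.split())  # crude stand-in for a real tokenizer

    # Prepare: shuffle summaries and pack them into chunks of a pre-specified token size.
    summaries = list(community_summaries)
    random.shuffle(summaries)
    chunks, current, used = [], [], 0
    for summary in summaries:
        if current and used + n_tokens(summary) > chunk_tokens:
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(summary)
        used += n_tokens(summary)
    if current:
        chunks.append("\n".join(current))

    # Map: one intermediate answer per chunk, each rated 0-100 for helpfulness.
    partials = []
    for chunk in chunks:
        reply = llm("Context:\n" + chunk + "\n\nQuestion: " + question
                    + "\nAnswer, then rate the answer's helpfulness 0-100 as 'SCORE: <n>'.")
        match = re.search(r"SCORE:\s*(\d+)", reply)
        score = int(match.group(1)) if match else 0
        if score > 0:  # answers scored 0 are filtered out
            partials.append((score, re.sub(r"SCORE:\s*\d+", "", reply).strip()))

    # Reduce: most helpful answers first, filling a new context window up to the token limit.
    context, used = [], 0
    for score, answer in sorted(partials, reverse=True):
        if used + n_tokens(answer) > context_tokens:
            break
        context.append(answer)
        used += n_tokens(answer)
    return llm("Combine these partial answers into one global answer to: " + question
               + "\n\n" + "\n---\n".join(context))

In practice the helpfulness score would come from a structured output format and the token counts from a real tokenizer; the shape of the flow is what matters here.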
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
From Local to Global V1.1

"Paper 1: From Local to Global: A Graph RAG Approach to Query-Focused Summarization"

1. Graph RAG Approach & Pipeline (Figure 1_v1):
1. Source document → text chunks (token size: 600-2400).
2. Text chunks → element instances:
• To tailor extraction to the document's domain for in-context learning, use a multipart LLM prompt to identify and extract instances of graph nodes and edges, including source and target, from each chunk of the source document.
i. Text chunk → multipart LLM prompt (tailored to the document's domain for in-context learning) → find all entities (name, type, description) and relationships (including source and target) → identify instances for nodes and edges.
ii. Abstractive summary / generate tuples: ((subject/object entities, name, type, description of entities/relationships/claims), (all entities, name, type, description of entities/relationships/claims)).
Note: A secondary extraction prompt is also supported for any additional covariates associated with the extracted node instances; it extracts claims linked to detected entities, including the subject, object, type, description, source text span, and start and end dates. To make sure no entities are missed, the multistage LLM prompt asks whether many entities were missed (YES/NO) and, if so, gleans the remainder, as sketched below.
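A minimal sketch of that multi-stage extraction loop, assuming a generic llm(prompt) callable; the prompt wording, the tuple format, and the max_gleanings limit are illustrative assumptions rather than the paper's exact prompts:

def extract_elements(chunk, llm, max_gleanings=2):
    # Extract entity and relationship instances from one text chunk, with gleaning rounds.
    prompt = ("Identify all entities in the text as (name, type, description) tuples and all "
              "relationships as (source, target, description) tuples.\n\nText:\n" + chunk)
    extractions = [llm(prompt)]

    # Gleaning: ask whether many entities were missed; if YES, request the remainder.
    # (A real implementation would pass the prior exchange back to the model as history.)
    for _ in range(max_gleanings):
        check = llm("Were MANY entities missed in the last extraction? Answer YES or NO.")
        if not check.strip().upper().startswith("YES"):
            break
        extractions.append(llm("Add the entities and relationships that were missed, "
                               "using the same tuple format."))
    return "\n".join(extractions)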
3. Element instances → element summaries:
• Element instances / abstractive summaries (tuples) → LLM to summarize → semantic instance-level element summary.
• Converting all such instance-level summaries into single blocks of descriptive text for each graph element (i.e., entity node, relationship edge, and claim covariate) requires a further round of LLM summarization over matching groups of instances, i.e.:
Instance-level summary → LLM vector embedding → KNN/cosine-similarity search to find homogeneous summary clusters → LLM to summarize the homogeneous clusters → single-block summary of similar instances (the elements'/homogeneous clusters' summary).
Note: A potential concern at this stage is that the LLM may not consistently extract references to the same entity in the same text format, resulting in duplicate entity elements and thus duplicate nodes in the entity graph. However, since all closely related "communities" of entities will be detected and summarized in the following step, and given that LLMs can recognize the common entity behind multiple name variations, there should be sufficient connectivity from all variations to a shared set of closely related entities.

4. Element summaries → graph communities:
i. Indexed element summaries of homogeneous clusters → neo4j → homogeneous weighted undirected graph.
ii. Homogeneous weighted undirected graph → graph community detection algorithm (hierarchical community structure) → partition the graph into communities of nodes.
Note: This recovers the hierarchical community structure of large-scale graphs efficiently. Each level of this hierarchy provides a community partition that covers the nodes of the graph in a mutually exclusive, collectively exhaustive way, enabling divide-and-conquer global summarization (a rough sketch of this step follows).
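The sketch below partitions a weighted undirected entity graph with the Leiden algorithm using the python-igraph and leidenalg packages; the library choice and the flat, single-level partition are assumptions made for illustration (the paper uses a hierarchical Leiden, and in this POC the graph itself would live in neo4j):

import igraph as ig
import leidenalg

def detect_communities(weighted_edges):
    # weighted_edges: iterable of (source_entity, target_entity, weight) tuples.
    edges = list(weighted_edges)
    g = ig.Graph.TupleList([(s, t) for s, t, _ in edges], directed=False)
    g.es["weight"] = [w for _, _, w in edges]
    partition = leidenalg.find_partition(g, leidenalg.ModularityVertexPartition, weights="weight")
    # Map each entity name to the id of the community it was assigned to.
    return {g.vs[v]["name"]: cid for cid, members in enumerate(partition) for v in members}

Each community's member entities and edges then feed the summarization step described next.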
5. Graph communities → community summaries:
• Graph communities → Leiden hierarchy method → community summaries (summaries over the globally summarized graph).
Graph-based communities are used to generate the community summaries. These summaries are independently useful for understanding the global structure and semantics of the dataset, and may themselves be used to make sense of a corpus in the absence of a question. For example, a user may scan through community summaries at one level looking for general themes of interest, then follow links to the reports at the lower level that provide more details for each of the subtopics.
• Leaf-level communities. The element summaries of a leaf-level community (nodes, edges, covariates) are prioritized and then iteratively added to the LLM context window until the token limit is reached. The prioritization is as follows: for each community edge, in decreasing order of combined source and target node degree (i.e., overall prominence), add descriptions of the source node, target node, linked covariates, and the edge itself (see the sketch after this list).
• Higher-level communities. If all element summaries fit within the token limit of the context window, proceed as for leaf-level communities and summarize all element summaries within the community. Otherwise, rank sub-communities in decreasing order of element summary tokens and iteratively substitute sub-community summaries (shorter) for their associated element summaries (longer) until the content fits within the context window.
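A small sketch of that leaf-level prioritization, with covariates omitted for brevity; the data shapes (a degree map and a description map) and the whitespace token count are assumptions made for illustration:

def leaf_community_context(community_edges, node_degree, description, token_limit=8000):
    # community_edges: (source, target) pairs inside one leaf-level community.
    # node_degree:     node name -> degree in the full entity graph.
    # description:     element (node name or edge pair) -> element summary text.
    n_tokens = lambda text: len(text.split())  # stand-in for a real tokenizer
    context, used, seen = [], 0, set()

    # Most prominent edges first: decreasing combined source + target node degree.
    ranked = sorted(community_edges,
                    key=lambda e: node_degree[e[0]] + node_degree[e[1]], reverse=True)
    for source, target in ranked:
        for element in (source, target, (source, target)):  # node, node, then the edge itself
            text = description.get(element, "")
            if not text or element in seen:
                continue
            if used + n_tokens(text) > token_limit:          # context window is full
                return "\n".join(context)
            context.append(text)
            used += n_tokens(text)
            seen.add(element)
    return "\n".join(context)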
6. Community summaries → community answers → global answers:
a. For a given community level, the global answer to any user query is generated as follows:
Divide the randomly shuffled community summaries into chunks (prepare community summaries) → generate answers from each chunk in parallel (map community answers) → reduce to a global answer.
• Prepare community summaries: Community summaries are randomly shuffled and divided into chunks of pre-specified token size. This ensures relevant information is distributed across chunks, rather than concentrated (and potentially lost) in a single context window.
• Map community answers: Generate intermediate answers in parallel, one for each chunk. The LLM is also asked to generate a score between 0 and 100 indicating how helpful the generated answer is in answering the target question. Answers with score 0 are filtered out.
• Reduce to global answer: Intermediate community answers are sorted in descending order of helpfulness score and iteratively added into a new context window until the token limit is reached. This final context is used to generate the global answer returned to the user.
• Communities Comparison
Six conditions are compared, including Graph RAG using four levels of graph communities (C0, C1, C2, C3), a text summarization method applying the same map-reduce approach directly to source texts (TS), and a naive "semantic search" RAG approach (SS):
a) C0: Uses root-level community summaries (fewest in number) to answer user queries.
b) C1: Uses high-level community summaries to answer queries. These are sub-communities of C0, if present, otherwise C0 communities projected down.
c) C2: Uses intermediate-level community summaries to answer queries. These are sub-communities of C1, if present, otherwise C1 communities projected down.
d) C3: Uses low-level community summaries (greatest in number) to answer queries. These are sub-communities of C2, if present, otherwise C2 communities projected down.
e) TS: The same map-reduce pipeline, except that source texts (rather than community summaries) are shuffled and chunked for the map-reduce summarization stages.
f) SS: An implementation of naive RAG in which text chunks are retrieved and added to the available context window until the specified token limit is reached.

The size of the context window and the prompts used for answer generation are the same across all six conditions (except for minor modifications to reference styles to match the types of context).

Conclusion: Trade-offs of building a graph index. The graph-index approach achieves the best head-to-head results against the other methods, but in many cases the graph-free approach to global summarization of source texts performed competitively. The real-world decision about whether to invest in building a graph index depends on multiple factors, including the compute budget, the expected number of lifetime queries per dataset, and the value obtained from other aspects of the graph index (including the generic community summaries and the use of other graph-related RAG approaches).
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
RAPTOR: RECURSIVE ABSTRACTIVE PROCESSING FOR TREE-ORGANIZED RETRIEVAL v1:

Source code: The source code for RAPTOR will be publicly available at https://github.com/parthsarthi03/raptor.

Step 1. Document → chunks of 100 tokens (to preserve contextual and semantic coherence, if a sentence would exceed the 100-token boundary the whole sentence moves to the next chunk rather than being cut mid-sentence) → clustered → summarized (GPT-3.5-turbo) → re-embedded (SBERT).
Leaf nodes hold two values: the chunk and its SBERT embedding.
Step 2. Repeat Step 1 until further clustering becomes infeasible, resulting in a structured, multi-layered tree representation of the original documents (a condensed sketch of this build loop follows Step 3).
Note: Scalability of both build time and token expenditure is shown below.

Step 3. For querying within the tree, two distinct strategies are used: tree traversal and collapsed tree. The tree traversal method traverses the tree layer by layer, pruning and selecting the most relevant nodes at each level. The collapsed tree method evaluates nodes collectively across all layers to find the most relevant ones.
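A condensed sketch of the build loop, assuming the sentence-transformers package for SBERT embeddings, a summarize(texts) LLM callable, and a cluster(embeddings) callable that returns groups of node indices (the paper's UMAP + GMM clustering is sketched under "Clustering Algorithm" below); the model name and the upstream 100-token chunking are assumptions:

from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in SBERT model

def build_raptor_tree(chunks, summarize, cluster, max_layers=5):
    # Layer 0: leaf nodes hold the raw 100-token chunks and their SBERT embeddings.
    layer = [{"text": c, "embedding": e} for c, e in zip(chunks, embedder.encode(chunks))]
    tree = [layer]
    for _ in range(max_layers):
        groups = cluster([node["embedding"] for node in layer])
        if len(groups) <= 1:  # further clustering is infeasible: stop
            break
        summaries = [summarize([layer[i]["text"] for i in group]) for group in groups]
        layer = [{"text": s, "embedding": e, "children": list(g)}
                 for s, e, g in zip(summaries, embedder.encode(summaries), groups)]
        tree.append(layer)  # each new layer holds summaries of clusters of the layer below
    return tree             # list of layers, leaves first and root layer last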
Clustering Algorithm:
GMM: It offers both flexibility and a probabilistic framework, where nodes can belong to multiple clusters without requiring a fixed number of clusters. This flexibility is essential because individual text segments often contain information relevant to various topics, thereby warranting their inclusion in multiple summaries.
The high dimensionality of vector embeddings presents a challenge for traditional GMMs, as distance metrics may behave poorly when used to measure similarity in high-dimensional spaces. To mitigate this, we employ Uniform Manifold Approximation and Projection (UMAP), a manifold learning technique for dimensionality reduction. The number-of-nearest-neighbors parameter, n_neighbors, in UMAP determines the balance between the preservation of local and global structure. Our algorithm varies n_neighbors to create a hierarchical clustering structure: it first identifies global clusters and then performs local clustering within these global clusters. This two-step clustering process captures a broad spectrum of relationships among the text data, from broad themes to specific details.
Should a local cluster's combined context ever exceed the summarization model's token threshold, our algorithm recursively applies clustering within the cluster, ensuring that the context remains within the token threshold.
In GMM, the number of parameters k is a function of the dimensionality of the input vectors and the number of clusters.
With the optimal number of clusters determined by BIC, the Expectation-Maximization algorithm is then used to estimate the GMM parameters, namely the means, covariances, and mixture weights. While the Gaussian assumption in GMMs may not perfectly align with the nature of text data, which often exhibits a sparse and skewed distribution, our empirical observations suggest that it offers an effective model for our purpose. We run an ablation comparing GMM clustering with summarizing contiguous chunks and provide details.
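A minimal single-level sketch of this clustering step, assuming the umap-learn and scikit-learn packages; the global/local two-step, the recursion on oversized clusters, and the specific parameter values are omitted or assumed for brevity:

import numpy as np
import umap
from sklearn.mixture import GaussianMixture

def gmm_cluster(embeddings, max_clusters=50, threshold=0.1, random_state=0):
    # Reduce dimensionality with UMAP before fitting the GMM.
    X = umap.UMAP(n_neighbors=10, n_components=10, metric="cosine",
                  random_state=random_state).fit_transform(np.asarray(embeddings))

    # Pick the number of clusters by minimizing the Bayesian Information Criterion (BIC).
    candidates = range(1, min(max_clusters, len(X)) + 1)
    bics = [GaussianMixture(n_components=k, random_state=random_state).fit(X).bic(X)
            for k in candidates]
    best_k = list(candidates)[int(np.argmin(bics))]

    # Soft assignment: a node joins every cluster whose posterior probability exceeds the
    # threshold, so a text segment can appear in more than one summary.
    gmm = GaussianMixture(n_components=best_k, random_state=random_state).fit(X)
    probabilities = gmm.predict_proba(X)
    return [list(np.where(probabilities[:, k] > threshold)[0]) for k in range(best_k)]

The returned index groups plug directly into the cluster callable of the build sketch above.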
Querying:

Tree traversal method: This method first selects the top-k most relevant root nodes based on their cosine similarity to the query embedding. The children of these selected nodes are considered at the next layer, and the top-k nodes are selected from this pool, again based on their cosine similarity to the query vector. This process is repeated until we reach the leaf nodes. Finally, the text from all selected nodes is concatenated to form the retrieved context (a short sketch follows the numbered steps).
1. Start at the root layer of the RAPTOR tree. Compute the cosine similarity between the query embedding and the embeddings of all nodes present at this initial layer.
2. Choose the top-k nodes based on the highest cosine similarity scores, forming the set S1.
3. Proceed to the child nodes of the elements in set S1. Compute the cosine similarity between the query vector and the vector embeddings of these child nodes.
4. Select the top-k child nodes with the highest cosine similarity scores to the query, forming the set S2.
5. Continue this process recursively for d layers, producing sets S1, S2, ..., Sd.
6. Concatenate sets S1 through Sd to assemble the relevant context to the query.
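The steps above, expressed as a short sketch over the tree produced by build_raptor_tree; the top-k value, the numpy-based cosine similarity, and the child bookkeeping are illustrative assumptions:

import numpy as np

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def tree_traversal(tree, query_embedding, top_k=5):
    # Walk from the root layer (last in the list) down to the leaves, keeping top-k per layer.
    selected_text = []
    candidates = list(range(len(tree[-1])))               # every node in the root layer
    for depth in range(len(tree) - 1, -1, -1):
        layer = tree[depth]
        ranked = sorted(candidates,
                        key=lambda i: cosine(layer[i]["embedding"], query_embedding),
                        reverse=True)[:top_k]              # the set S_d for this layer
        selected_text.extend(layer[i]["text"] for i in ranked)
        if depth > 0:                                      # children are indices into the layer below
            candidates = sorted({c for i in ranked for c in layer[i].get("children", [])})
    return "\n".join(selected_text)                        # S1..Sd concatenated as the context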
Collapsed tree method: It searches for relevant information by considering all nodes in the tree simultaneously (a matching sketch follows these steps).
1. First, collapse the entire RAPTOR tree into a single layer. This new set of nodes, denoted as C, contains nodes from every layer of the original tree.
2. Next, calculate the cosine similarity between the query embedding and the embeddings of all nodes present in the collapsed set C.
3. Finally, pick the top-k nodes that have the highest cosine similarity scores with the query. Keep adding nodes to the result set until you reach a predefined maximum number of tokens.
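A matching sketch of the collapsed-tree strategy, again over the build_raptor_tree output; the 2000-token default mirrors the setting reported below, and the whitespace token count is a stand-in for a real tokenizer:

import numpy as np

def collapsed_tree(tree, query_embedding, max_tokens=2000):
    # Collapse all layers into one pool and keep the most similar nodes up to a token budget.
    q = np.asarray(query_embedding)
    sim = lambda v: float(np.asarray(v) @ q / (np.linalg.norm(v) * np.linalg.norm(q) + 1e-12))
    n_tokens = lambda text: len(text.split())              # stand-in for a real tokenizer

    pool = [node for layer in tree for node in layer]       # the collapsed set C
    pool.sort(key=lambda node: sim(node["embedding"]), reverse=True)
    context, used = [], 0
    for node in pool:
        cost = n_tokens(node["text"])
        if used + cost > max_tokens:                         # stop at the token budget
            break
        context.append(node["text"])
        used += cost
    return "\n".join(context)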
Figure 3: Comparison of querying methods. Results on 20 stories from the QASPER dataset using tree traversal with different top-k values, and collapsed tree with different context lengths. Collapsed tree with 2000 tokens produces the best results, so this querying strategy is used for the main results.
CONCLUSION
RAPTOR is a novel tree-based retrieval system that augments the parametric knowledge of large language models with contextual information at various levels of abstraction. By employing recursive clustering and summarization techniques, RAPTOR creates a hierarchical tree structure that is capable of synthesizing information across various sections of the retrieval corpora. During the query phase, RAPTOR leverages this tree structure for more effective retrieval. RAPTOR not only outperforms traditional retrieval methods but also sets new performance benchmarks on several question-answering tasks.

POC_Documents/V1/figure.2,3.jpg (83.2 KB)

POC_Documents/V1/figure.4.jpg (133 KB)

POC_Experiments/Data_Analysis

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@

backend/Dockerfile

Lines changed: 2 additions & 1 deletion
@@ -10,5 +10,6 @@ RUN apt-get update \
 && export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH \
 && pip install --no-cache-dir --upgrade -r /code/requirements.txt
 
-CMD ["uvicorn", "score:app", "--host", "0.0.0.0", "--port", "8000"]
+# CMD ["uvicorn", "score:app", "--host", "0.0.0.0", "--port", "8000","--workers", "4"]
+CMD ["gunicorn", "score:app","--workers","4","--worker-class","uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:8000", "--timeout", "300"]

backend/example.env

Lines changed: 3 additions & 1 deletion
@@ -19,4 +19,6 @@ NUMBER_OF_CHUNKS_TO_COMBINE = ""
 GEMINI_ENABLED = True|False
 # Enable Google Cloud logs (default is True)
 GCP_LOG_METRICS_ENABLED = True|False
-NEO4J_USER_AGENT = ""
+UPDATE_GRAPH_CHUNKS_PROCESSED = 20
+NEO4J_USER_AGENT = ""
+UPDATE_GRAPH_CHUNKS_PROCESSED = 20

backend/requirements.txt

Lines changed: 4 additions & 1 deletion
@@ -38,6 +38,7 @@ frozenlist==1.4.1
 fsspec==2024.2.0
 google-api-core==2.18.0
 google-auth==2.29.0
+google_auth_oauthlib
 google-cloud-aiplatform
 google-cloud-bigquery==3.19.0
 google-cloud-core==2.4.1
@@ -87,7 +88,7 @@ matplotlib==3.7.2
 mpmath==1.3.0
 multidict==6.0.5
 mypy-extensions==1.0.0
-neo4j==5.18.0
+neo4j==5.20.0
 networkx==3.2.1
 nltk==3.8.1
 numpy==1.26.4
@@ -139,6 +140,7 @@ sniffio==1.3.1
 soupsieve==2.5
 SQLAlchemy==2.0.28
 starlette==0.36.3
+starlette-session
 sympy==1.12
 tabulate==0.9.0
 tenacity==8.2.3
@@ -158,6 +160,7 @@ unstructured-inference
 unstructured.pytesseract
 urllib3
 uvicorn
+gunicorn
 wikipedia==1.4.0
 wrapt==1.16.0
 yarl==1.9.4
