Skip to content

Commit d327368

Browse files
praveshkumar1988aashipandyaabhishekkumar-27kartikpersistentvasanthasaikalluri
authored
Staging to main (#467)
* Dev (#433) * Integration_qa test (#375) * Test IntegrationQA added * update test cases * update test * update node count assertions * test changes * update changes * modification test * Code refatctor test cases * Handle allowedlist issue in test * test changes * update test * test case execution * test chatbot updates * test case update file * added file --------- Co-authored-by: Pravesh Kumar <[email protected]> * recent merges * pdf deletion due to out of diskspace * fixed status blank issue * Rendering the file name instead of link for gcs and s3 sources in the info modal * Convert is_cancelled value from string to bool * added the default page size * Issue fixed Processed chunked as 0 when file re-process again * Youtube timestamps (#386) * Wikipedia source to accept all valid urls * wikipedia url to support multiple languages * integrated wiki langauge param for extract api * Youtube video timestamps --------- Co-authored-by: kartikpersistent <[email protected]> * groq llm integration backend (#286) * groq llm integration backend * groq and description in node properties * added groq in options --------- Co-authored-by: kartikpersistent <[email protected]> * offset in chunks (#389) * page number in gcs loader (#393) * added youtube timestamps (#392) * chat pop up button (#387) * expand * minimize-icon * css changes * chat history * chatbot wider Side Nav * expand icon * chatbot UI * Delete * merge fixes * code suggestions --------- Co-authored-by: kartikpersistent <[email protected]> * chunks create before extraction using is_pre_process variable (#383) * chunks create before extraction using is_pre_process variable * Return total pages for Model * update requirement.txt * total pages on uplaod API * added the Confirmation Dialog * added the selected files into the confirmation modal * format and lint fixes * added the stop watch image * fileselection on alert dialog * Add timeout in docker for gunicorn workers * Add cancel icon to info popup (#384) * Info Modal Changes * css changes * recent merges * Integration_qa test (#375) * Test IntegrationQA added * update test cases * update test * update node count assertions * test changes * update changes * modification test * Code refatctor test cases * Handle allowedlist issue in test * test changes * update test * test case execution * test chatbot updates * test case update file * added file --------- Co-authored-by: Pravesh Kumar <[email protected]> * fixed status blank issue * Rendering the file name instead of link for gcs and s3 sources in the info modal * added the default page size * Convert is_cancelled value from string to bool * Issue fixed Processed chunked as 0 when file re-process again * Youtube timestamps (#386) * Wikipedia source to accept all valid urls * wikipedia url to support multiple languages * integrated wiki langauge param for extract api * Youtube video timestamps --------- Co-authored-by: kartikpersistent <[email protected]> * groq llm integration backend (#286) * groq llm integration backend * groq and description in node properties * added groq in options --------- Co-authored-by: kartikpersistent <[email protected]> * Save Total Pages in DB * Added total Pages * file selection when we didn't select anything from Main table * added the danger icon only for large files * added the overflow for more files and file selection for all new files * moved the interface to types * added the icon accoroding to the source * set total page for wiki and youtube * h3 heading * merge * updated the alert on basis if total pages * deleted chunks * polling based on total pages * isNan check * large file based on file size for s3 and gcs * file source in server side event * time calculation based on chunks for gcs and s3 --------- Co-authored-by: kartikpersistent <[email protected]> Co-authored-by: Prakriti Solankey <[email protected]> Co-authored-by: abhishekkumar-27 <[email protected]> Co-authored-by: aashipandya <[email protected]> * fixed the layout issue * Populate graph schema (#399) * crreate new endpoint populate_graph_schema and update the query for getting lables from DB * Added main.py changes * conditionally-including-the-gcs-login-flow-in-gcs-as-source (#396) * added the condtion * removed llms * Fixed issue : Remove extra unused param * get emb only if used (#278) * Chatbot chunks (#402) * Added file name to the content sent to LLM * added chunk text in the response * increased the docs parts sent to llm * Modified graph query * mardown rendering * youtube starttime * icons * offset changes * removed the files due to codespace space issue --------- Co-authored-by: vasanthasaikalluri <[email protected]> Co-authored-by: kartikpersistent <[email protected]> * Settings modal to support generating the labels from the llm by using text given by user (#405) * added the json * added schema from text dialog * integrated the schemaAPI * added the alert * resize fixes * fixed css issue * fixed status blank issue * Modified response when no docs is retrived (#413) * Fixed env/docker-compose for local deployments + README doc (#410) * Fixed env/docker-compose for local deployments + README doc * wrong place for ENV in README * by default, removed langsmith + fixed knn score string to float * by default, removed langsmith + fixed knn score string to float * Fixed strings in docker-compose env * Added requirements (neo4j 5.15 or later, APOC, and instructions for Neo4j Desktop) * Missed the TIME_PER_PAGE env, was causing NaN issue in the approx time processing notification. fixed that * Support for all unstructured files (#401) * all unstructured files * responsiveness * added file type * added the extensions * spell mistake * ppt file changes --------- Co-authored-by: kartikpersistent <[email protected]> * Settings modal to support generating the labels from the llm by using text given by user with checkbox (#415) * added the json * added schema from text dialog * integrated the schemaAPI * added the alert * resize fixes * Extract schema using direct ChatOpenAI API and Chain * integrated the checkbox for schema to text dialog * Update SettingModal.tsx --------- Co-authored-by: Pravesh Kumar <[email protected]> * gcs file content read via storage client (#417) * gcs file content read via storage client * added the access token the file state --------- Co-authored-by: kartikpersistent <[email protected]> * pypdf2 to read files from gcs (#420) * 407 remove driver from frontend (#416) * removed driver * removed API * connecting to database on page refresh --------- Co-authored-by: kartikpersistent <[email protected]> * Css handling of info modal and Tooltips (#418) * css change * toolTips * Sidebar Tooltips * copy to clip * css change * added image types * added gcs * type fix * docker changes * speech * added the toolip for dropzone sources --------- Co-authored-by: kartikpersistent <[email protected]> * Fixed retrival bugs (#421) * yarn format fixes * changed the delete message * added the cancel button * changed the message on tooltip * added space * UI fixes * tooltip for setting * updated req * wikipedia URL input (#424) * accept only wikipedia links * added wikipedia link * added wikilink regex * wikipedia single url only * changed the alert message * wording change * pushed validation state persist error --------- Co-authored-by: aashipandya <[email protected]> * speech and copy (#422) * speech and copy * startTime * added chunk properties * tooltips --------- Co-authored-by: vasanthasaikalluri <[email protected]> Co-authored-by: kartikpersistent <[email protected]> * Fixed issue for out of range in KNN API * solved conflicts * conflict solved * Remove logging info from update KNN API * tooltip changes * format and lint fixes * responsiveness changes * Fixed issue for total pages GCS, S3 * UI polishing (#428) * button and tooltip changes * checking validation on change * settings module populate fix * format fixes * opening the modal after auth success * removed the limit * added the scrobar for dropdowns * speech state (#426) * speech state * Button Details changes * delete wording change * Total pages in buckets (#431) * page number NA for buckets * added N/A for gcs and s3 pages * total pages for gcs * remove unwanted logger --------- Co-authored-by: kartikpersistent <[email protected]> * removed the max width * Update FileTable.tsx * Update the docker file * Modified prompt (#438) * Update Dockerfile * Update Dockerfile * Update Dockerfile * rendering Fix * Local file upload gcs (#442) * Uplaod file to GCS * GCS local upload fixed issue and delete file from GCS after processing and failed or cancelled * Add life cycle rule on uploaded bucket * pdf upload local and gcs bucket check * delete files when processed and extract changes --------- Co-authored-by: Pravesh Kumar <[email protected]> * Modified chat length and entities used (#443) * metadata for unstructured files (#446) * Unstructured file metadata (#447) * metadata for unstructured files * sleep in gcs upload * updated * icons added to chunks (#435) * icons added to chunks * info modal icons --------- Co-authored-by: abhishekkumar-27 <[email protected]> Co-authored-by: Pravesh Kumar <[email protected]> Co-authored-by: kartikpersistent <[email protected]> Co-authored-by: vasanthasaikalluri <[email protected]> Co-authored-by: Prakriti Solankey <[email protected]> Co-authored-by: Ajay Meena <[email protected]> Co-authored-by: Morgan Senechal <[email protected]> Co-authored-by: karanchellani <[email protected]> * DEV to STAGING (#461) * Integration_qa test (#375) * Test IntegrationQA added * update test cases * update test * update node count assertions * test changes * update changes * modification test * Code refatctor test cases * Handle allowedlist issue in test * test changes * update test * test case execution * test chatbot updates * test case update file * added file --------- Co-authored-by: Pravesh Kumar <[email protected]> * recent merges * pdf deletion due to out of diskspace * fixed status blank issue * Rendering the file name instead of link for gcs and s3 sources in the info modal * Convert is_cancelled value from string to bool * added the default page size * Issue fixed Processed chunked as 0 when file re-process again * Youtube timestamps (#386) * Wikipedia source to accept all valid urls * wikipedia url to support multiple languages * integrated wiki langauge param for extract api * Youtube video timestamps --------- Co-authored-by: kartikpersistent <[email protected]> * groq llm integration backend (#286) * groq llm integration backend * groq and description in node properties * added groq in options --------- Co-authored-by: kartikpersistent <[email protected]> * offset in chunks (#389) * page number in gcs loader (#393) * added youtube timestamps (#392) * chat pop up button (#387) * expand * minimize-icon * css changes * chat history * chatbot wider Side Nav * expand icon * chatbot UI * Delete * merge fixes * code suggestions --------- Co-authored-by: kartikpersistent <[email protected]> * chunks create before extraction using is_pre_process variable (#383) * chunks create before extraction using is_pre_process variable * Return total pages for Model * update requirement.txt * total pages on uplaod API * added the Confirmation Dialog * added the selected files into the confirmation modal * format and lint fixes * added the stop watch image * fileselection on alert dialog * Add timeout in docker for gunicorn workers * Add cancel icon to info popup (#384) * Info Modal Changes * css changes * recent merges * Integration_qa test (#375) * Test IntegrationQA added * update test cases * update test * update node count assertions * test changes * update changes * modification test * Code refatctor test cases * Handle allowedlist issue in test * test changes * update test * test case execution * test chatbot updates * test case update file * added file --------- Co-authored-by: Pravesh Kumar <[email protected]> * fixed status blank issue * Rendering the file name instead of link for gcs and s3 sources in the info modal * added the default page size * Convert is_cancelled value from string to bool * Issue fixed Processed chunked as 0 when file re-process again * Youtube timestamps (#386) * Wikipedia source to accept all valid urls * wikipedia url to support multiple languages * integrated wiki langauge param for extract api * Youtube video timestamps --------- Co-authored-by: kartikpersistent <[email protected]> * groq llm integration backend (#286) * groq llm integration backend * groq and description in node properties * added groq in options --------- Co-authored-by: kartikpersistent <[email protected]> * Save Total Pages in DB * Added total Pages * file selection when we didn't select anything from Main table * added the danger icon only for large files * added the overflow for more files and file selection for all new files * moved the interface to types * added the icon accoroding to the source * set total page for wiki and youtube * h3 heading * merge * updated the alert on basis if total pages * deleted chunks * polling based on total pages * isNan check * large file based on file size for s3 and gcs * file source in server side event * time calculation based on chunks for gcs and s3 --------- Co-authored-by: kartikpersistent <[email protected]> Co-authored-by: Prakriti Solankey <[email protected]> Co-authored-by: abhishekkumar-27 <[email protected]> Co-authored-by: aashipandya <[email protected]> * fixed the layout issue * Populate graph schema (#399) * crreate new endpoint populate_graph_schema and update the query for getting lables from DB * Added main.py changes * conditionally-including-the-gcs-login-flow-in-gcs-as-source (#396) * added the condtion * removed llms * Fixed issue : Remove extra unused param * get emb only if used (#278) * Chatbot chunks (#402) * Added file name to the content sent to LLM * added chunk text in the response * increased the docs parts sent to llm * Modified graph query * mardown rendering * youtube starttime * icons * offset changes * removed the files due to codespace space issue --------- Co-authored-by: vasanthasaikalluri <[email protected]> Co-authored-by: kartikpersistent <[email protected]> * Settings modal to support generating the labels from the llm by using text given by user (#405) * added the json * added schema from text dialog * integrated the schemaAPI * added the alert * resize fixes * fixed css issue * fixed status blank issue * Modified response when no docs is retrived (#413) * Fixed env/docker-compose for local deployments + README doc (#410) * Fixed env/docker-compose for local deployments + README doc * wrong place for ENV in README * by default, removed langsmith + fixed knn score string to float * by default, removed langsmith + fixed knn score string to float * Fixed strings in docker-compose env * Added requirements (neo4j 5.15 or later, APOC, and instructions for Neo4j Desktop) * Missed the TIME_PER_PAGE env, was causing NaN issue in the approx time processing notification. fixed that * Support for all unstructured files (#401) * all unstructured files * responsiveness * added file type * added the extensions * spell mistake * ppt file changes --------- Co-authored-by: kartikpersistent <[email protected]> * Settings modal to support generating the labels from the llm by using text given by user with checkbox (#415) * added the json * added schema from text dialog * integrated the schemaAPI * added the alert * resize fixes * Extract schema using direct ChatOpenAI API and Chain * integrated the checkbox for schema to text dialog * Update SettingModal.tsx --------- Co-authored-by: Pravesh Kumar <[email protected]> * gcs file content read via storage client (#417) * gcs file content read via storage client * added the access token the file state --------- Co-authored-by: kartikpersistent <[email protected]> * pypdf2 to read files from gcs (#420) * 407 remove driver from frontend (#416) * removed driver * removed API * connecting to database on page refresh --------- Co-authored-by: kartikpersistent <[email protected]> * Css handling of info modal and Tooltips (#418) * css change * toolTips * Sidebar Tooltips * copy to clip * css change * added image types * added gcs * type fix * docker changes * speech * added the toolip for dropzone sources --------- Co-authored-by: kartikpersistent <[email protected]> * Fixed retrival bugs (#421) * yarn format fixes * changed the delete message * added the cancel button * changed the message on tooltip * added space * UI fixes * tooltip for setting * updated req * wikipedia URL input (#424) * accept only wikipedia links * added wikipedia link * added wikilink regex * wikipedia single url only * changed the alert message * wording change * pushed validation state persist error --------- Co-authored-by: aashipandya <[email protected]> * speech and copy (#422) * speech and copy * startTime * added chunk properties * tooltips --------- Co-authored-by: vasanthasaikalluri <[email protected]> Co-authored-by: kartikpersistent <[email protected]> * Fixed issue for out of range in KNN API * solved conflicts * conflict solved * Remove logging info from update KNN API * tooltip changes * format and lint fixes * responsiveness changes * Fixed issue for total pages GCS, S3 * UI polishing (#428) * button and tooltip changes * checking validation on change * settings module populate fix * format fixes * opening the modal after auth success * removed the limit * added the scrobar for dropdowns * speech state (#426) * speech state * Button Details changes * delete wording change * Total pages in buckets (#431) * page number NA for buckets * added N/A for gcs and s3 pages * total pages for gcs * remove unwanted logger --------- Co-authored-by: kartikpersistent <[email protected]> * removed the max width * Update FileTable.tsx * Update the docker file * Modified prompt (#438) * Update Dockerfile * Update Dockerfile * Update Dockerfile * rendering Fix * Local file upload gcs (#442) * Uplaod file to GCS * GCS local upload fixed issue and delete file from GCS after processing and failed or cancelled * Add life cycle rule on uploaded bucket * pdf upload local and gcs bucket check * delete files when processed and extract changes --------- Co-authored-by: Pravesh Kumar <[email protected]> * Modified chat length and entities used (#443) * metadata for unstructured files (#446) * Unstructured file metadata (#447) * metadata for unstructured files * sleep in gcs upload * updated * icons added to chunks (#435) * icons added to chunks * info modal icons * fixed gcs status message issue * added if check for failed count * Null issue Fixed from backend for upload API and graph_document when model name mismatch * added word break issue * Added neo4j-rust-ext * processing time estimation based on bytes * File extension upper case fixed, File delete from GCS or local based on env variable. * timer per byte * Update Dockerfile * Adding sort rows on the table (#451) * Gcs upload folder hashed (#453) * implement foldername hashed in GCS bucket uplaod * Raise exception if invalid model selected * folder name for gcs upload --------- Co-authored-by: aashipandya <[email protected]> * upload all unstructuredfiles to gcs (#455) * Mofified chunk query (#454) * Added libre office for fixing error -- soffice command was not found. Please install libreoffice on your system and try again. - Install instructions: https://www.libreoffice.org/get-help/install-howto/ - Mac: https://formulae.brew.sh/cask/libreoffice - Debian: https://wiki.debian.org/LibreOffice" * Fix the PARTIAL CONTENT issue * File-table no data found (#456) * 'file-table'' * review comment * Llm format change (#459) * changed the llm models format to lowercase * added the error message * llm model changes * format fixes * removed unused import * added the capitalize method * delete files from merged_file_path only if source is local file --------- Co-authored-by: aashipandya <[email protected]> * commented total page code (#460) * format fixes * removed the disabled check on dropdown * Large file env --------- Co-authored-by: abhishekkumar-27 <[email protected]> Co-authored-by: kartikpersistent <[email protected]> Co-authored-by: aashipandya <[email protected]> Co-authored-by: vasanthasaikalluri <[email protected]> Co-authored-by: Prakriti Solankey <[email protected]> Co-authored-by: Ajay Meena <[email protected]> Co-authored-by: Morgan Senechal <[email protected]> Co-authored-by: karanchellani <[email protected]> * DEV to STAGING (#462) * Integration_qa test (#375) * Test IntegrationQA added * update test cases * update test * update node count assertions * test changes * update changes * modification test * Code refatctor test cases * Handle allowedlist issue in test * test changes * update test * test case execution * test chatbot updates * test case update file * added file --------- Co-authored-by: Pravesh Kumar <[email protected]> * recent merges * pdf deletion due to out of diskspace * fixed status blank issue * Rendering the file name instead of link for gcs and s3 sources in the info modal * Convert is_cancelled value from string to bool * added the default page size * Issue fixed Processed chunked as 0 when file re-process again * Youtube timestamps (#386) * Wikipedia source to accept all valid urls * wikipedia url to support multiple languages * integrated wiki langauge param for extract api * Youtube video timestamps --------- Co-authored-by: kartikpersistent <[email protected]> * groq llm integration backend (#286) * groq llm integration backend * groq and description in node properties * added groq in options --------- Co-authored-by: kartikpersistent <[email protected]> * offset in chunks (#389) * page number in gcs loader (#393) * added youtube timestamps (#392) * chat pop up button (#387) * expand * minimize-icon * css changes * chat history * chatbot wider Side Nav * expand icon * chatbot UI * Delete * merge fixes * code suggestions --------- Co-authored-by: kartikpersistent <[email protected]> * chunks create before extraction using is_pre_process variable (#383) * chunks create before extraction using is_pre_process variable * Return total pages for Model * update requirement.txt * total pages on uplaod API * added the Confirmation Dialog * added the selected files into the confirmation modal * format and lint fixes * added the stop watch image * fileselection on alert dialog * Add timeout in docker for gunicorn workers * Add cancel icon to info popup (#384) * Info Modal Changes * css changes * recent merges * Integration_qa test (#375) * Test IntegrationQA added * update test cases * update test * update node count assertions * test changes * update changes * modification test * Code refatctor test cases * Handle allowedlist issue in test * test changes * update test * test case execution * test chatbot updates * test case update file * added file --------- Co-authored-by: Pravesh Kumar <[email protected]> * fixed status blank issue * Rendering the file name instead of link for gcs and s3 sources in the info modal * added the default page size * Convert is_cancelled value from string to bool * Issue fixed Processed chunked as 0 when file re-process again * Youtube timestamps (#386) * Wikipedia source to accept all valid urls * wikipedia url to support multiple languages * integrated wiki langauge param for extract api * Youtube video timestamps --------- Co-authored-by: kartikpersistent <[email protected]> * groq llm integration backend (#286) * groq llm integration backend * groq and description in node properties * added groq in options --------- Co-authored-by: kartikpersistent <[email protected]> * Save Total Pages in DB * Added total Pages * file selection when we didn't select anything from Main table * added the danger icon only for large files * added the overflow for more files and file selection for all new files * moved the interface to types * added the icon accoroding to the source * set total page for wiki and youtube * h3 heading * merge * updated the alert on basis if total pages * deleted chunks * polling based on total pages * isNan check * large file based on file size for s3 and gcs * file source in server side event * time calculation based on chunks for gcs and s3 --------- Co-authored-by: kartikpersistent <[email protected]> Co-authored-by: Prakriti Solankey <[email protected]> Co-authored-by: abhishekkumar-27 <[email protected]> Co-authored-by: aashipandya <[email protected]> * fixed the layout issue * Populate graph schema (#399) * crreate new endpoint populate_graph_schema and update the query for getting lables from DB * Added main.py changes * conditionally-including-the-gcs-login-flow-in-gcs-as-source (#396) * added the condtion * removed llms * Fixed issue : Remove extra unused param * get emb only if used (#278) * Chatbot chunks (#402) * Added file name to the content sent to LLM * added chunk text in the response * increased the docs parts sent to llm * Modified graph query * mardown rendering * youtube starttime * icons * offset changes * removed the files due to codespace space issue --------- Co-authored-by: vasanthasaikalluri <[email protected]> Co-authored-by: kartikpersistent <[email protected]> * Settings modal to support generating the labels from the llm by using text given by user (#405) * added the json * added schema from text dialog * integrated the schemaAPI * added the alert * resize fixes * fixed css issue * fixed status blank issue * Modified response when no docs is retrived (#413) * Fixed env/docker-compose for local deployments + README doc (#410) * Fixed env/docker-compose for local deployments + README doc * wrong place for ENV in README * by default, removed langsmith + fixed knn score string to float * by default, removed langsmith + fixed knn score string to float * Fixed strings in docker-compose env * Added requirements (neo4j 5.15 or later, APOC, and instructions for Neo4j Desktop) * Missed the TIME_PER_PAGE env, was causing NaN issue in the approx time processing notification. fixed that * Support for all unstructured files (#401) * all unstructured files * responsiveness * added file type * added the extensions * spell mistake * ppt file changes --------- Co-authored-by: kartikpersistent <[email protected]> * Settings modal to support generating the labels from the llm by using text given by user with checkbox (#415) * added the json * added schema from text dialog * integrated the schemaAPI * added the alert * resize fixes * Extract schema using direct ChatOpenAI API and Chain * integrated the checkbox for schema to text dialog * Update SettingModal.tsx --------- Co-authored-by: Pravesh Kumar <[email protected]> * gcs file content read via storage client (#417) * gcs file content read via storage client * added the access token the file state --------- Co-authored-by: kartikpersistent <[email protected]> * pypdf2 to read files from gcs (#420) * 407 remove driver from frontend (#416) * removed driver * removed API * connecting to database on page refresh --------- Co-authored-by: kartikpersistent <[email protected]> * Css handling of info modal and Tooltips (#418) * css change * toolTips * Sidebar Tooltips * copy to clip * css change * added image types * added gcs * type fix * docker changes * speech * added the toolip for dropzone sources --------- Co-authored-by: kartikpersistent <[email protected]> * Fixed retrival bugs (#421) * yarn format fixes * changed the delete message * added the cancel button * changed the message on tooltip * added space * UI fixes * tooltip for setting * updated req * wikipedia URL input (#424) * accept only wikipedia links * added wikipedia link * added wikilink regex * wikipedia single url only * changed the alert message * wording change * pushed validation state persist error --------- Co-authored-by: aashipandya <[email protected]> * speech and copy (#422) * speech and copy * startTime * added chunk properties * tooltips --------- Co-authored-by: vasanthasaikalluri <[email protected]> Co-authored-by: kartikpersistent <[email protected]> * Fixed issue for out of range in KNN API * solved conflicts * conflict solved * Remove logging info from update KNN API * tooltip changes * format and lint fixes * responsiveness changes * Fixed issue for total pages GCS, S3 * UI polishing (#428) * button and tooltip changes * checking validation on change * settings module populate fix * format fixes * opening the modal after auth success * removed the limit * added the scrobar for dropdowns * speech state (#426) * speech state * Button Details changes * delete wording change * Total pages in buckets (#431) * page number NA for buckets * added N/A for gcs and s3 pages * total pages for gcs * remove unwanted logger --------- Co-authored-by: kartikpersistent <[email protected]> * removed the max width * Update FileTable.tsx * Update the docker file * Modified prompt (#438) * Update Dockerfile * Update Dockerfile * Update Dockerfile * rendering Fix * Local file upload gcs (#442) * Uplaod file to GCS * GCS local upload fixed issue and delete file from GCS after processing and failed or cancelled * Add life cycle rule on uploaded bucket * pdf upload local and gcs bucket check * delete files when processed and extract changes --------- Co-authored-by: Pravesh Kumar <[email protected]> * Modified chat length and entities used (#443) * metadata for unstructured files (#446) * Unstructured file metadata (#447) * metadata for unstructured files * sleep in gcs upload * updated * icons added to chunks (#435) * icons added to chunks * info modal icons * fixed gcs status message issue * added if check for failed count * Null issue Fixed from backend for upload API and graph_document when model name mismatch * added word break issue * Added neo4j-rust-ext * processing time estimation based on bytes * File extension upper case fixed, File delete from GCS or local based on env variable. * timer per byte * Update Dockerfile * Adding sort rows on the table (#451) * Gcs upload folder hashed (#453) * implement foldername hashed in GCS bucket uplaod * Raise exception if invalid model selected * folder name for gcs upload --------- Co-authored-by: aashipandya <[email protected]> * upload all unstructuredfiles to gcs (#455) * Mofified chunk query (#454) * Added libre office for fixing error -- soffice command was not found. Please install libreoffice on your system and try again. - Install instructions: https://www.libreoffice.org/get-help/install-howto/ - Mac: https://formulae.brew.sh/cask/libreoffice - Debian: https://wiki.debian.org/LibreOffice" * Fix the PARTIAL CONTENT issue * File-table no data found (#456) * 'file-table'' * review comment * Llm format change (#459) * changed the llm models format to lowercase * added the error message * llm model changes * format fixes * removed unused import * added the capitalize method * delete files from merged_file_path only if source is local file --------- Co-authored-by: aashipandya <[email protected]> * commented total page code (#460) * format fixes * removed the disabled check on dropdown * Large file env --------- Co-authored-by: abhishekkumar-27 <[email protected]> Co-authored-by: kartikpersistent <[email protected]> Co-authored-by: aashipandya <[email protected]> Co-authored-by: vasanthasaikalluri <[email protected]> Co-authored-by: Prakriti Solankey <[email protected]> Co-authored-by: Ajay Meena <[email protected]> Co-authored-by: Morgan Senechal <[email protected]> Co-authored-by: karanchellani <[email protected]> * Dev to staging (#466) * Integration_qa test (#375) * Test IntegrationQA added * update test cases * update test * update node count assertions * test changes * update changes * modification test * Code refatctor test cases * Handle allowedlist issue in test * test changes * update test * test case execution * test chatbot updates * test case update file * added file --------- Co-authored-by: Pravesh Kumar <[email protected]> * recent merges * pdf deletion due to out of diskspace * fixed status blank issue * Rendering the file name instead of link for gcs and s3 sources in the info modal * Convert is_cancelled value from string to bool * added the default page size * Issue fixed Processed chunked as 0 when file re-process again * Youtube timestamps (#386) * Wikipedia source to accept all valid urls * wikipedia url to support multiple languages * integrated wiki langauge param for extract api * Youtube video timestamps --------- Co-authored-by: kartikpersistent <[email protected]> * groq llm integration backend (#286) * groq llm integration backend * groq and description in node properties * added groq in options --------- Co-authored-by: kartikpersistent <[email protected]> * offset in chunks (#389) * page number in gcs loader (#393) * added youtube timestamps (#392) * chat pop up button (#387) * expand * minimize-icon * css changes * chat history * chatbot wider Side Nav * expand icon * chatbot UI * Delete * merge fixes * code suggestions --------- Co-authored-by: kartikpersistent <[email protected]> * chunks create before extraction using is_pre_process variable (#383) * chunks create before extraction using is_pre_process variable * Return total pages for Model * update requirement.txt * total pages on uplaod API * added the Confirmation Dialog * added the selected files into the confirmation modal * format and lint fixes * added the stop watch image * fileselection on alert dialog * Add timeout in docker for gunicorn workers * Add cancel icon to info popup (#384) * Info Modal Changes * css changes * recent merges * Integration_qa test (#375) * Test IntegrationQA added * update test cases * update test * update node count assertions * test changes * update changes * modification test * Code refatctor test cases * Handle allowedlist issue in test * test changes * update test * test case execution * test chatbot updates * test case update file * added file --------- Co-authored-by: Pravesh Kumar <[email protected]> * fixed status blank issue * Rendering the file name instead of link for gcs and s3 sources in the info modal * added the default page size * Convert is_cancelled value from string to bool * Issue fixed Processed chunked as 0 when file re-process again * Youtube timestamps (#386) * Wikipedia source to accept all valid urls * wikipedia url to support multiple languages * integrated wiki langauge param for extract api * Youtube video timestamps --------- Co-authored-by: kartikpersistent <[email protected]> * groq llm integration backend (#286) * groq llm integration backend * groq and description in node properties * added groq in options --------- Co-authored-by: kartikpersistent <[email protected]> * Save Total Pages in DB * Added total Pages * file selection when we didn't select anything from Main table * added the danger icon only for large files * added the overflow for more files and file selection for all new files * moved the interface to types * added the icon accoroding to the source * set total page for wiki and youtube * h3 heading * merge * updated the alert on basis if total pages * deleted chunks * polling based on total pages * isNan check * large file based on file size for s3 and gcs * file source in server side event * time calculation based on chunks for gcs and s3 --------- Co-authored-by: kartikpersistent <[email protected]> Co-authored-by: Prakriti Solankey <[email protected]> Co-authored-by: abhishekkumar-27 <[email protected]> Co-authored-by: aashipandya <[email protected]> * fixed the layout issue * Populate graph schema (#399) * crreate new endpoint populate_graph_schema and update the query for getting lables from DB * Added main.py changes * conditionally-including-the-gcs-login-flow-in-gcs-as-source (#396) * added the condtion * removed llms * Fixed issue : Remove extra unused param * get emb only if used (#278) * Chatbot chunks (#402) * Added file name to the content sent to LLM * added chunk text in the response * increased the docs parts sent to llm * Modified graph query * mardown rendering * youtube starttime * icons * offset changes * removed the files due to codespace space issue --------- Co-authored-by: vasanthasaikalluri <[email protected]> Co-authored-by: kartikpersistent <[email protected]> * Settings modal to support generating the labels from the llm by using text given by user (#405) * added the json * added schema from text dialog * integrated the schemaAPI * added the alert * resize fixes * fixed css issue * fixed status blank issue * Modified response when no docs is retrived (#413) * Fixed env/docker-compose for local deployments + README doc (#410) * Fixed env/docker-compose for local deployments + README doc * wrong place for ENV in README * by default, removed langsmith + fixed knn score string to float * by default, removed langsmith + fixed knn score string to float * Fixed strings in docker-compose env * Added requirements (neo4j 5.15 or later, APOC, and instructions for Neo4j Desktop) * Missed the TIME_PER_PAGE env, was causing NaN issue in the approx time processing notification. fixed that * Support for all unstructured files (#401) * all unstructured files * responsiveness * added file type * added the extensions * spell mistake * ppt file changes --------- Co-authored-by: kartikpersistent <[email protected]> * Settings modal to support generating the labels from the llm by using text given by user with checkbox (#415) * added the json * added schema from text dialog * integrated the schemaAPI * added the alert * resize fixes * Extract schema using direct ChatOpenAI API and Chain * integrated the checkbox for schema to text dialog * Update SettingModal.tsx --------- Co-authored-by: Pravesh Kumar <[email protected]> * gcs file content read via storage client (#417) * gcs file content read via storage client * added the access token the file state --------- Co-authored-by: kartikpersistent <[email protected]> * pypdf2 to read files from gcs (#420) * 407 remove driver from frontend (#416) * removed driver * removed API * connecting to database on page refresh --------- Co-authored-by: kartikpersistent <[email protected]> * Css handling of info modal and Tooltips (#418) * css change * toolTips * Sidebar Tooltips * copy to clip * css change * added image types * added gcs * type fix * docker changes * speech * added the toolip for dropzone sources --------- Co-authored-by: kartikpersistent <[email protected]> * Fixed retrival bugs (#421) * yarn format fixes * changed the delete message * added the cancel button * changed the message on tooltip * added space * UI fixes * tooltip for setting * updated req * wikipedia URL input (#424) * accept only wikipedia links * added wikipedia link * added wikilink regex * wikipedia single url only * changed the alert message * wording change * pushed validation state persist error --------- Co-authored-by: aashipandya <[email protected]> * speech and copy (#422) * speech and copy * startTime * added chunk properties * tooltips --------- Co-authored-by: vasanthasaikalluri <[email protected]> Co-authored-by: kartikpersistent <[email protected]> * Fixed issue for out of range in KNN API * solved conflicts * conflict solved * Remove logging info from update KNN API * tooltip changes * format and lint fixes * responsiveness changes * Fixed issue for total pages GCS, S3 * UI polishing (#428) * button and tooltip changes * checking validation on change * settings module populate fix * format fixes * opening the modal after auth success * removed the limit * added the scrobar for dropdowns * speech state (#426) * speech state * Button Details changes * delete wording change * Total pages in buckets (#431) * page number NA for buckets * added N/A for gcs and s3 pages * total pages for gcs * remove unwanted logger --------- Co-authored-by: kartikpersistent <[email protected]> * removed the max width * Update FileTable.tsx * Update the docker file * Modified prompt (#438) * Update Dockerfile * Update Dockerfile * Update Dockerfile * rendering Fix * Local file upload gcs (#442) * Uplaod file to GCS * GCS local upload fixed issue and delete file from GCS after processing and failed or cancelled * Add life cycle rule on uploaded bucket * pdf upload local and gcs bucket check * delete files when processed and extract changes --------- Co-authored-by: Pravesh Kumar <[email protected]> * Modified chat length and entities used (#443) * metadata for unstructured files (#446) * Unstructured file metadata (#447) * metadata for unstructured files * sleep in gcs upload * updated * icons added to chunks (#435) * icons added to chunks * info modal icons * fixed gcs status message issue * added if check for failed count * Null issue Fixed from backend for upload API and graph_document when model name mismatch * added word break issue * Added neo4j-rust-ext * processing time estimation based on bytes * File extension upper case fixed, File delete from GCS or local based on env variable. * timer per byte * Update Dockerfile * Adding sort rows on the table (#451) * Gcs upload folder hashed (#453) * implement foldername hashed in GCS bucket uplaod * Raise exception if invalid model selected * folder name for gcs upload --------- Co-authored-by: aashipandya <[email protected]> * upload all unstructuredfiles to gcs (#455) * Mofified chunk query (#454) * Added libre office for fixing error -- soffice command was not found. Please install libreoffice on your system and try again. - Install instructions: https://www.libreoffice.org/get-help/install-howto/ - Mac: https://formulae.brew.sh/cask/libreoffice - Debian: https://wiki.debian.org/LibreOffice" * Fix the PARTIAL CONTENT issue * File-table no data found (#456) * 'file-table'' * review comment * Llm format change (#459) * changed the llm models format to lowercase * added the error message * llm model changes * format fixes * removed unused import * added the capitalize method * delete files from merged_file_path only if source is local file --------- Co-authored-by: aashipandya <[email protected]> * commented total page code (#460) * format fixes * removed the disabled check on dropdown * Large file env * added upload api * changed the dropzone error message --------- Co-authored-by: abhishekkumar-27 <[email protected]> Co-authored-by: kartikpersistent <[email protected]> Co-authored-by: aashipandya <[email protected]> Co-authored-by: vasanthasaikalluri <[email protected]> Co-authored-by: Prakriti Solankey <[email protected]> Co-authored-by: Ajay Meena <[email protected]> Co-authored-by: Morgan Senechal <[email protected]> Co-authored-by: karanchellani <[email protected]> --------- Co-authored-by: aashipandya <[email protected]> Co-authored-by: abhishekkumar-27 <[email protected]> Co-authored-by: kartikpersistent <[email protected]> Co-authored-by: vasanthasaikalluri <[email protected]> Co-authored-by: Prakriti Solankey <[email protected]> Co-authored-by: Ajay Meena <[email protected]> Co-authored-by: Morgan Senechal <[email protected]> Co-authored-by: karanchellani <[email protected]>
1 parent 1513bf8 commit d327368

29 files changed

+423
-342
lines changed

backend/Dockerfile

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,7 @@ EXPOSE 8000
66
RUN apt-get update && \
77
apt-get install -y --no-install-recommends \
88
libgl1-mesa-glx \
9+
libreoffice \
910
cmake \
1011
poppler-utils \
1112
tesseract-ocr && \
@@ -19,4 +20,4 @@ RUN pip install --no-cache-dir --upgrade -r requirements.txt
1920
# Copy application code
2021
COPY . /code
2122
# Set command
22-
CMD ["gunicorn", "score:app", "--workers", "2", "--worker-class", "uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:8000", "--timeout", "300"]
23+
CMD ["gunicorn", "score:app", "--workers", "8","--threads", "8", "--worker-class", "uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:8000", "--timeout", "300"]

backend/requirements.txt

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -89,7 +89,7 @@ matplotlib==3.7.2
8989
mpmath==1.3.0
9090
multidict==6.0.5
9191
mypy-extensions==1.0.0
92-
neo4j==5.20.0
92+
neo4j-rust-ext
9393
networkx==3.2.1
9494
nltk==3.8.1
9595
numpy==1.26.4

backend/score.py

Lines changed: 13 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -162,7 +162,7 @@ async def extract_knowledge_graph_from_file(
162162
merged_file_path = os.path.join(MERGED_DIR,file_name)
163163
logging.info(f'File path:{merged_file_path}')
164164
result = await asyncio.to_thread(
165-
extract_graph_from_file_local_file, graph, model, merged_file_path, file_name, allowedNodes, allowedRelationship)
165+
extract_graph_from_file_local_file, graph, model, merged_file_path, file_name, allowedNodes, allowedRelationship, uri)
166166

167167
elif source_type == 's3 bucket' and source_url:
168168
result = await asyncio.to_thread(
@@ -190,8 +190,14 @@ async def extract_knowledge_graph_from_file(
190190
message=f"Failed To Process File:{file_name} or LLM Unable To Parse Content "
191191
error_message = str(e)
192192
graphDb_data_Access.update_exception_db(file_name,error_message)
193+
gcs_file_cache = os.environ.get('GCS_FILE_CACHE')
193194
if source_type == 'local file':
194-
delete_file_from_gcs(BUCKET_UPLOAD,file_name)
195+
if gcs_file_cache == 'True':
196+
folder_name = create_gcs_bucket_folder_name_hashed(uri,file_name)
197+
delete_file_from_gcs(BUCKET_UPLOAD,folder_name,file_name)
198+
else:
199+
logging.info(f'Deleted File Path: {merged_file_path} and Deleted File Name : {file_name}')
200+
delete_uploaded_local_file(merged_file_path,file_name)
195201
josn_obj = {'message':message,'error_message':error_message, 'file_name': file_name,'status':'Failed','db_url':uri,'failed_count':1, 'source_type': source_type}
196202
logger.log_struct(josn_obj)
197203
logging.exception(f'File Failed in extraction: {josn_obj}')
@@ -351,10 +357,11 @@ async def upload_large_file_into_chunks(file:UploadFile = File(...), chunkNumber
351357
return create_api_response('Success', message=result)
352358
except Exception as e:
353359
# job_status = "Failed"
354-
message="Unable to upload large file into chunks or saving the chunks"
360+
message="Unable to upload large file into chunks. "
355361
error_message = str(e)
356362
logging.info(message)
357363
logging.exception(f'Exception:{error_message}')
364+
return create_api_response('Failed', message=message + error_message[:100], error=error_message, file_name = originalname)
358365
finally:
359366
close_db_connection(graph, 'upload')
360367

@@ -368,11 +375,11 @@ async def get_structured_schema(uri=Form(None), userName=Form(None), password=Fo
368375
logger.log_struct(josn_obj)
369376
return create_api_response('Success', data=result)
370377
except Exception as e:
371-
job_status = "Failed"
372378
message="Unable to get the labels and relationtypes from neo4j database"
373379
error_message = str(e)
374380
logging.info(message)
375381
logging.exception(f'Exception:{error_message}')
382+
return create_api_response("Failed", message=message, error=error_message)
376383
finally:
377384
close_db_connection(graph, 'schema')
378385

@@ -427,7 +434,7 @@ async def delete_document_and_entities(uri=Form(None),
427434
try:
428435
graph = create_graph_database_connection(uri, userName, password, database)
429436
graphDb_data_Access = graphDBdataAccess(graph)
430-
result, files_list_size = await asyncio.to_thread(graphDb_data_Access.delete_file_from_graph, filenames, source_types, deleteEntities, MERGED_DIR)
437+
result, files_list_size = await asyncio.to_thread(graphDb_data_Access.delete_file_from_graph, filenames, source_types, deleteEntities, MERGED_DIR, uri)
431438
entities_count = result[0]['deletedEntities'] if 'deletedEntities' in result[0] else 0
432439
message = f"Deleted {files_list_size} documents with {entities_count} entities from database"
433440
josn_obj = {'api_name':'delete_document_and_entities','db_url':uri}
@@ -480,7 +487,7 @@ async def get_document_status(file_name, url, userName, password, database):
480487
async def cancelled_job(uri=Form(None), userName=Form(None), password=Form(None), database=Form(None), filenames=Form(None), source_types=Form(None)):
481488
try:
482489
graph = create_graph_database_connection(uri, userName, password, database)
483-
result = manually_cancelled_job(graph,filenames, source_types, MERGED_DIR)
490+
result = manually_cancelled_job(graph,filenames, source_types, MERGED_DIR, uri)
484491

485492
return create_api_response('Success',message=result)
486493
except Exception as e:

backend/src/QA_integration.py

Lines changed: 6 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -78,14 +78,13 @@ def get_llm(model: str,max_tokens=1000) -> Any:
7878
"""Retrieve the specified language model based on the model name."""
7979

8080
model_versions = {
81-
"OpenAI GPT 3.5": "gpt-3.5-turbo-16k",
82-
"Gemini Pro": "gemini-1.0-pro-001",
83-
"Gemini 1.5 Pro": "gemini-1.5-pro-preview-0409",
84-
"OpenAI GPT 4": "gpt-4-0125-preview",
85-
"Diffbot" : "gpt-4-0125-preview",
86-
"OpenAI GPT 4o":"gpt-4o"
81+
"gpt-3.5": "gpt-3.5-turbo-16k",
82+
"gemini-1.0-pro": "gemini-1.0-pro-001",
83+
"gemini-1.5-pro": "gemini-1.5-pro-preview-0409",
84+
"gpt-4": "gpt-4-0125-preview",
85+
"diffbot" : "gpt-4-0125-preview",
86+
"gpt-4o":"gpt-4o"
8787
}
88-
8988
if model in model_versions:
9089
model_version = model_versions[model]
9190
logging.info(f"Chat Model: {model}, Model Version: {model_version}")

backend/src/chunkid_entities.py

Lines changed: 41 additions & 48 deletions
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,7 @@
11
import logging
22
from neo4j import graph
3-
from langchain_community.graphs import graph_document
43
from src.graph_query import *
54

6-
75
CHUNK_QUERY = """
86
match (chunk:Chunk) where chunk.id IN $chunksIds
97
@@ -15,41 +13,51 @@
1513
RETURN collect(distinct r) as rels
1614
}
1715
WITH d, collect(distinct chunk) as chunks, apoc.coll.toSet(apoc.coll.flatten(collect(rels))) as rels
18-
return [r in rels | [startNode(r), endNode(r), r]] as entities
16+
RETURN d as doc, [chunk in chunks | chunk {.*, embedding:null}] as chunks,
17+
[r in rels | {startNode:{element_id:elementId(startNode(r)), labels:labels(startNode(r)), properties:{id:startNode(r).id,description:startNode(r).description}},
18+
endNode:{element_id:elementId(endNode(r)), labels:labels(endNode(r)), properties:{id:endNode(r).id,description:endNode(r).description}},
19+
relationship: {type:type(r), element_id:elementId(r)}}] as entities
1920
"""
2021

21-
CHUNK_TEXT_QUERY = "match (doc)<-[:PART_OF]-(chunk:Chunk) WHERE chunk.id IN $chunkIds RETURN doc, collect(chunk {.*, embedding:null}) as chunks"
22-
2322

24-
def process_record(record, elements_data):
23+
def process_records(records):
2524
"""
2625
Processes a record to extract and organize node and relationship data.
2726
"""
28-
try:
29-
entities = record["entities"]
30-
for entity in entities:
31-
for element in entity:
32-
element_id = element.element_id
33-
if isinstance(element, graph.Node):
34-
if element_id not in elements_data["seen_nodes"]:
35-
elements_data["seen_nodes"].add(element_id)
36-
node_element = process_node(element)
37-
elements_data["nodes"].append(node_element)
38-
else:
39-
if element_id not in elements_data["seen_relationships"]:
40-
elements_data["seen_relationships"].add(element_id)
41-
nodes = element.nodes
42-
if len(nodes) < 2:
43-
logging.warning(f"Relationship with ID {element_id} does not have two nodes.")
44-
continue
45-
relationship = {
46-
"element_id": element_id,
47-
"type": element.type,
48-
"start_node_element_id": process_node(nodes[0])["element_id"],
49-
"end_node_element_id": process_node(nodes[1])["element_id"],
50-
}
51-
elements_data["relationships"].append(relationship)
52-
return elements_data
27+
try:
28+
nodes = []
29+
relationships = []
30+
seen_nodes = set()
31+
seen_relationships = set()
32+
33+
for record in records:
34+
for element in record["entities"]:
35+
start_node = element['startNode']
36+
end_node = element['endNode']
37+
relationship = element['relationship']
38+
39+
if start_node['element_id'] not in seen_nodes:
40+
nodes.append(start_node)
41+
seen_nodes.add(start_node['element_id'])
42+
43+
if end_node['element_id'] not in seen_nodes:
44+
nodes.append(end_node)
45+
seen_nodes.add(end_node['element_id'])
46+
47+
if relationship['element_id'] not in seen_relationships:
48+
relationships.append({
49+
"element_id": relationship['element_id'],
50+
"type": relationship['type'],
51+
"start_node_element_id": start_node['element_id'],
52+
"end_node_element_id": end_node['element_id']
53+
})
54+
seen_relationships.add(relationship['element_id'])
55+
output = {
56+
"nodes": nodes,
57+
"relationships": relationships
58+
}
59+
60+
return output
5361
except Exception as e:
5462
logging.error(f"chunkid_entities module: An error occurred while extracting the nodes and relationships from records: {e}")
5563

@@ -97,24 +105,9 @@ def get_entities_from_chunkids(uri, username, password, chunk_ids):
97105
chunk_ids_list = chunk_ids.split(",")
98106
driver = get_graphDB_driver(uri, username, password)
99107
records, summary, keys = driver.execute_query(CHUNK_QUERY, chunksIds=chunk_ids_list)
100-
elements_data = {
101-
"nodes": [],
102-
"seen_nodes": set(),
103-
"relationships": [],
104-
"seen_relationships": set()
105-
}
106-
for record in records:
107-
elements_data = process_record(record, elements_data)
108-
108+
result = process_records(records)
109109
logging.info(f"Nodes and relationships are processed")
110-
111-
chunk_data,summary, keys = driver.execute_query(CHUNK_TEXT_QUERY, chunkIds=chunk_ids_list)
112-
chunk_properties = process_chunk_data(chunk_data)
113-
result = {
114-
"nodes": elements_data["nodes"],
115-
"relationships": elements_data["relationships"],
116-
"chunk_data": chunk_properties
117-
}
110+
result["chunk_data"] = process_chunk_data(records)
118111
logging.info(f"Query process completed successfully for chunk ids")
119112
return result
120113

backend/src/document_sources/gcs_bucket.py

Lines changed: 71 additions & 68 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@
88
import io
99
from google.oauth2.credentials import Credentials
1010
import time
11+
from .local_file import load_document_content, get_pages_with_page_numbers
1112

1213
def get_gcs_bucket_files_info(gcs_project_id, gcs_bucket_name, gcs_bucket_folder, creds):
1314
storage_client = storage.Client(project=gcs_project_id, credentials=creds)
@@ -51,88 +52,90 @@ def get_documents_from_gcs(gcs_project_id, gcs_bucket_name, gcs_bucket_folder, g
5152
blob_name = gcs_bucket_folder+'/'+gcs_blob_filename
5253
else:
5354
blob_name = gcs_blob_filename
54-
#credentials, project_id = google.auth.default()
55+
5556
logging.info(f"GCS project_id : {gcs_project_id}")
56-
#loader = GCSFileLoader(project_name=gcs_project_id, bucket=gcs_bucket_name, blob=blob_name, loader_func=load_pdf)
57-
# pages = loader.load()
58-
# file_name = gcs_blob_filename
59-
#creds= Credentials(access_token)
57+
6058
if access_token is None:
6159
storage_client = storage.Client(project=gcs_project_id)
60+
loader = GCSFileLoader(project_name=gcs_project_id, bucket=gcs_bucket_name, blob=blob_name, loader_func=load_document_content)
61+
pages = loader.load()
62+
if (gcs_blob_filename.split('.')[-1]).lower() != 'pdf':
63+
pages = get_pages_with_page_numbers(pages)
6264
else:
6365
creds= Credentials(access_token)
6466
storage_client = storage.Client(project=gcs_project_id, credentials=creds)
65-
print(f'BLOB Name: {blob_name}')
66-
bucket = storage_client.bucket(gcs_bucket_name)
67-
blob = bucket.blob(blob_name)
68-
content = blob.download_as_bytes()
69-
pdf_file = io.BytesIO(content)
70-
pdf_reader = PdfReader(pdf_file)
71-
72-
# Extract text from all pages
73-
text = ""
74-
for page in pdf_reader.pages:
75-
text += page.extract_text()
76-
pages = [Document(page_content = text)]
77-
return gcs_blob_filename, pages
78-
79-
def upload_file_to_gcs(file_chunk, chunk_number, original_file_name, bucket_name):
80-
storage_client = storage.Client()
8167

82-
file_name = f'{original_file_name}_part_{chunk_number}'
83-
bucket = storage_client.bucket(bucket_name)
84-
file_data = file_chunk.file.read()
85-
# print(f'data after read {file_data}')
86-
87-
blob = bucket.blob(file_name)
88-
file_io = io.BytesIO(file_data)
89-
blob.upload_from_file(file_io)
90-
# Define the lifecycle rule to delete objects after 6 hours
91-
# rule = {
92-
# "action": {"type": "Delete"},
93-
# "condition": {"age": 1} # Age in days (24 hours = 1 days)
94-
# }
95-
96-
# # Get the current lifecycle policy
97-
# lifecycle = list(bucket.lifecycle_rules)
98-
99-
# # Add the new rule
100-
# lifecycle.append(rule)
68+
bucket = storage_client.bucket(gcs_bucket_name)
69+
blob = bucket.blob(blob_name)
70+
if blob.exists():
71+
content = blob.download_as_bytes()
72+
pdf_file = io.BytesIO(content)
73+
pdf_reader = PdfReader(pdf_file)
74+
# Extract text from all pages
75+
text = ""
76+
for page in pdf_reader.pages:
77+
text += page.extract_text()
78+
pages = [Document(page_content = text)]
79+
else:
80+
raise Exception('Blob Not Found')
81+
return gcs_blob_filename, pages
10182

102-
# # Set the lifecycle policy on the bucket
103-
# bucket.lifecycle_rules = lifecycle
104-
# bucket.patch()
105-
time.sleep(1)
106-
logging.info('Chunk uploaded successfully in gcs')
107-
108-
def merge_file_gcs(bucket_name, original_file_name: str):
83+
def upload_file_to_gcs(file_chunk, chunk_number, original_file_name, bucket_name, folder_name_sha1_hashed):
84+
try:
10985
storage_client = storage.Client()
110-
# Retrieve chunks from GCS
111-
blobs = storage_client.list_blobs(bucket_name, prefix=f"{original_file_name}_part_")
112-
chunks = []
113-
for blob in blobs:
114-
chunks.append(blob.download_as_bytes())
115-
blob.delete()
116-
117-
# Merge chunks into a single file
118-
merged_file = b"".join(chunks)
119-
blob = storage_client.bucket(bucket_name).blob(original_file_name)
120-
logging.info('save the merged file from chunks in gcs')
121-
file_io = io.BytesIO(merged_file)
86+
87+
file_name = f'{original_file_name}_part_{chunk_number}'
88+
bucket = storage_client.bucket(bucket_name)
89+
file_data = file_chunk.file.read()
90+
file_name_with__hashed_folder = folder_name_sha1_hashed +'/'+file_name
91+
logging.info(f'GCS folder pathin upload: {file_name_with__hashed_folder}')
92+
blob = bucket.blob(file_name_with__hashed_folder)
93+
file_io = io.BytesIO(file_data)
12294
blob.upload_from_file(file_io)
123-
pdf_reader = PdfReader(file_io)
124-
file_size = len(merged_file)
125-
total_pages = len(pdf_reader.pages)
126-
127-
return total_pages, file_size
95+
time.sleep(1)
96+
logging.info('Chunk uploaded successfully in gcs')
97+
except Exception as e:
98+
raise Exception('Error in while uploading the file chunks on GCS')
99+
100+
def merge_file_gcs(bucket_name, original_file_name: str, folder_name_sha1_hashed, total_chunks):
101+
try:
102+
storage_client = storage.Client()
103+
bucket = storage_client.bucket(bucket_name)
104+
# Retrieve chunks from GCS
105+
# blobs = storage_client.list_blobs(bucket_name, prefix=folder_name_sha1_hashed)
106+
# print(f'before sorted blobs: {blobs}')
107+
chunks = []
108+
for i in range(1,total_chunks+1):
109+
blob_name = folder_name_sha1_hashed + '/' + f"{original_file_name}_part_{i}"
110+
blob = bucket.blob(blob_name)
111+
if blob.exists():
112+
print(f'Blob Name: {blob.name}')
113+
chunks.append(blob.download_as_bytes())
114+
blob.delete()
115+
116+
merged_file = b"".join(chunks)
117+
file_name_with__hashed_folder = folder_name_sha1_hashed +'/'+original_file_name
118+
logging.info(f'GCS folder path in merge: {file_name_with__hashed_folder}')
119+
blob = storage_client.bucket(bucket_name).blob(file_name_with__hashed_folder)
120+
logging.info('save the merged file from chunks in gcs')
121+
file_io = io.BytesIO(merged_file)
122+
blob.upload_from_file(file_io)
123+
# pdf_reader = PdfReader(file_io)
124+
file_size = len(merged_file)
125+
# total_pages = len(pdf_reader.pages)
126+
127+
return file_size
128+
except Exception as e:
129+
raise Exception('Error in while merge the files chunks on GCS')
128130

129-
def delete_file_from_gcs(bucket_name, file_name):
131+
def delete_file_from_gcs(bucket_name,folder_name, file_name):
130132
try:
131133
storage_client = storage.Client()
132134
bucket = storage_client.bucket(bucket_name)
133-
blob = bucket.blob(file_name)
135+
folder_file_name = folder_name +'/'+file_name
136+
blob = bucket.blob(folder_file_name)
134137
if blob.exists():
135138
blob.delete()
136139
logging.info('File deleted from GCS successfully')
137-
except:
138-
raise Exception('BLOB not exists in GCS')
140+
except Exception as e:
141+
raise Exception(e)

0 commit comments

Comments
 (0)