Document Migration for Cropwizard 1.6 #392
This pull request has been ignored for the connected project Preview Branches by Supabase.
Overall this PR looks good, @drshika. However, if the script errors out or fails, there is no way to resume, and rerunning it will create duplicate records for what has already been processed. Can you please fix that?
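One way to address the resume concern is a checkpoint file that records which source IDs have already been copied. The following is a minimal sketch, not the PR's implementation: `copy_with_resume`, `insert_fn`, the record shape, and the checkpoint path are all hypothetical names chosen for illustration.

```python
import json
import os


def load_checkpoint(path):
    """Return the set of source IDs already copied, or an empty set."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(json.load(f))


def save_checkpoint(path, done_ids):
    """Write the checkpoint via a temp file so a crash cannot leave it half-written."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(sorted(done_ids), f)
    os.replace(tmp, path)


def copy_with_resume(records, insert_fn, checkpoint_path):
    """Copy records, skipping any whose id is already in the checkpoint.

    `records` is an iterable of dicts with an "id" key, and `insert_fn`
    stands in for whatever writes to the destination (both hypothetical).
    Returns the number of records copied in this run.
    """
    done = load_checkpoint(checkpoint_path)
    copied = 0
    for record in records:
        if record["id"] in done:
            continue  # already copied in an earlier run
        insert_fn(record)
        done.add(record["id"])
        save_checkpoint(checkpoint_path, done)
        copied += 1
    return copied
```

Rewriting the checkpoint after every record keeps the sketch simple but is slow for large courses; saving once per batch would be the obvious refinement. A crash between the insert and the checkpoint write can still duplicate one record, so a unique constraint or an upsert on the destination would be needed to close that gap completely.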
…ion URLs arguments.
Error message: I've been running 4 threads with a batch size of 1000 at a time, but this crashes Supabase.
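A common way to reduce load on the destination is to send smaller batches sequentially and back off when a batch fails. The sketch below illustrates that idea under stated assumptions: `insert_batch` is a hypothetical callable that writes one list of rows and raises on failure, and the default batch size, retry count, and delay are arbitrary illustrative values, not tested recommendations for Supabase.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def insert_in_batches(rows, insert_batch, batch_size=100, max_retries=3,
                      base_delay=1.0, sleep=time.sleep):
    """Insert rows batch by batch, retrying a failed batch with backoff.

    `insert_batch` is a hypothetical callable (an assumption, not an API
    from this PR). `sleep` is injectable so the delay can be stubbed out.
    Returns the number of rows inserted.
    """
    inserted = 0
    for batch in chunked(rows, batch_size):
        for attempt in range(max_retries + 1):
            try:
                insert_batch(batch)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up after the last allowed retry
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** attempt)
        inserted += len(batch)
    return inserted
```

Retrying a whole batch is only safe if a failed batch insert writes nothing, or if the destination tolerates repeated rows; otherwise the retry itself could introduce duplicates.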
Utilities for Document Migration for Cropwizard 1.5
Summary
This PR introduces three utility scripts: two for managing course vectors in Qdrant and one for copying Supabase documents.
- export_course_vectors.py: Exports all vectors for a given course to a JSON file or stdout.
- copy_course_vectors.py: Copies all vectors from a source course to a destination course, assigning new UUIDs to avoid ID collisions.
- document_copier.py: Copies all Supabase Documents from a source course to a destination course. Includes a Dry Run mode to ensure the documents being copied are correct.

Currently these rely on a .env file, as I didn't incorporate them into the Flask app.
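The UUID reassignment step can be illustrated without a Qdrant connection. This is a sketch of the idea only, not the code in copy_course_vectors.py: the point shape (dicts with "id", "vector", and "payload" keys), the "course_name" payload field, and the function name `remap_points` are assumptions made for the example.

```python
import uuid


def remap_points(points, dest_course):
    """Return copies of `points` with fresh UUIDs and the destination course.

    Assumes each point is a dict with "id", "vector" and "payload" keys and
    that the course is stored under payload["course_name"] (hypothetical).
    Also returns a mapping from old IDs to new IDs for auditing.
    """
    id_map = {}
    copied = []
    for point in points:
        new_id = str(uuid.uuid4())  # fresh ID avoids collisions in the destination
        id_map[point["id"]] = new_id
        payload = dict(point["payload"])  # shallow copy; source payload is untouched
        payload["course_name"] = dest_course
        copied.append({
            "id": new_id,
            "vector": list(point["vector"]),
            "payload": payload,
        })
    return copied, id_map
```

Returning the old-to-new ID map alongside the copies makes it possible to check afterwards that every source point produced exactly one destination point.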
Usage
(Make sure you run with Infisical variables set for the source/dest course URLs, keys, and collection names.)
Export vectors:
python export_course_vectors.py "<course_name>" [output.json]
Copy vectors between courses:
Don't use quotes for this:
Copy documents between courses
Test Results
Vector Copying
Tested on two courses:
Example output:
New documents added:
QMK-Setup-MacOS.sh, latex.pdf
https://uiuc.chat/hihello (feel free to test yourself by asking the chatbot questions about these docs, it will respond and cite the docs correctly).
Document Copying
Testing with no documents. NOTE: make sure you use the service API key, not the anonymous key, so that RLS is bypassed.
Testing with documents that have already been ingested: