@drshika drshika commented Apr 22, 2025

Utilities for Document Migration for Cropwizard 1.5

Summary

This PR introduces three utility scripts for migrating a course's data: two for its vectors in Qdrant and one for its documents in Supabase.

  • export_course_vectors.py: Exports all vectors for a given course to a JSON file or stdout.
  • copy_course_vectors.py: Copies all vectors from a source course to a destination course, assigning new UUIDs to avoid ID collisions.
  • document_copier.py: Copies all Supabase Documents from a source course to a destination course. Includes a Dry Run mode to ensure the documents being copied are correct.
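To illustrate the "new UUIDs" step in copy_course_vectors.py, here is a minimal sketch of the remapping on plain dictionaries. It is not the script's actual code: the point layout (id / vector / payload) and the payload key course_name are assumptions made for the example, and the Qdrant read and write calls are left out on purpose.

```python
import uuid
from typing import Any, Dict, Iterable, List


def remap_points(
    points: Iterable[Dict[str, Any]],
    dest_course: str,
    course_key: str = "course_name",  # assumed payload key, not confirmed by the PR
) -> List[Dict[str, Any]]:
    """Return copies of the points with fresh IDs and the destination course set."""
    remapped = []
    for point in points:
        # Copy the payload so the source point is left untouched.
        payload = dict(point.get("payload") or {})
        payload[course_key] = dest_course
        remapped.append(
            {
                # A new random UUID avoids colliding with IDs already in the collection.
                "id": str(uuid.uuid4()),
                "vector": point["vector"],
                "payload": payload,
            }
        )
    return remapped
```

Because every copied point gets a random ID, running the copy twice would add the same vectors twice; a deterministic ID (for example uuid.uuid5 over the source ID and destination course) would be one way to make reruns overwrite instead, though that is a design suggestion rather than what the script does.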

Currently these rely on a .env file, as I haven't incorporated them into the Flask app.

Usage

(Make sure you run with the Infisical variables set for the source/destination course URLs, keys, and collection names.)
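A small pre-flight check can catch a missing variable before any copying starts. This is only a sketch: the variable names in EXAMPLE_REQUIRED are hypothetical placeholders, not the names the scripts actually read.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Hypothetical names for illustration only; substitute the ones the scripts use.
EXAMPLE_REQUIRED = ["SOURCE_QDRANT_URL", "SOURCE_QDRANT_API_KEY", "DEST_QDRANT_URL"]


def missing_variables(
    required: Iterable[str],
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]
```

Calling missing_variables(EXAMPLE_REQUIRED) at the top of a script and exiting when the list is non-empty gives a clearer failure than a connection error halfway through a run.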

Export vectors:

python export_course_vectors.py "<course_name>" [output.json]

Copy vectors between courses:

python copy_course_vectors.py "<source_course>" "<destination_course>"

Copy documents between courses (don't put quotes around the course names here):

python document_copier.py --source-course <source_course> --target-course <destination_course> \
  --source-url "[redacted]" --source-key "[redacted]" \
  --destination-url "[redacted]" --destination-key "[redacted]"

Test Results

Vector Copying

Tested on two courses:

Course      Vectors exported (before copy)   Vectors copied   Vectors exported (after copy)
hihello     69                               +33              102
sep13test   33                               -                33

Example output:

❯ python export_course_vectors.py "hihello" output.json
Exported 69 vectors to output.json

❯ python export_course_vectors.py "sep13test" output.json
Exported 33 vectors to output.json

❯ python copy_course_vectors.py "sep13test" "hihello"
Copied 33 vectors so far...
Done! Total vectors copied from 'sep13test' to 'hihello': 33

❯ python export_course_vectors.py "hihello" output.json
Exported 102 vectors to output.json

New documents added: QMK-Setup-MacOS.sh, latex.pdf
https://uiuc.chat/hihello (feel free to test this yourself by asking the chatbot questions about these docs; it responds and cites them correctly).

Document Copying

Testing with no documents in the destination course. NOTE: make sure you use the service API key, not the anonymous key, so that the copy bypasses RLS.

python document_copier.py --source-course hihello --target-course hihello1 \
  --source-url "[redacted]" --source-key "[redacted]" \
  --destination-url "[redacted]" --destination-key "[redacted]"
Successfully connected to both source and destination databases for course: hihello

First 10 documents in source course 'hihello':
- 24-SEGIP-ADA-041123-A11Y-2023-4-16-.pdf (ID: 595984)
- realm-export.json (ID: 666162)
- latex.pdf (ID: 1131619)
- QMK-Setup-MacOS.sh (ID: 1131620)

First 10 documents in destination course 'hihello1':
Starting batch copy from hihello to hihello1 (batch size: 1000)
Fetching documents 0 to 999...
Processing batch of 4 documents...
Copied document: 24-SEGIP-ADA-041123-A11Y-2023-4-16-.pdf to hihello1
Copied document: realm-export.json to hihello1
Copied document: latex.pdf to hihello1
Copied document: QMK-Setup-MacOS.sh to hihello1
Operation completed. 4 documents copied.

Testing with documents that have already been ingested:

python document_copier.py --source-course hihello --target-course hihello1 \
  --source-url "[redacted]" --source-key "[redacted]" \
  --destination-url "[redacted]" --destination-key "[redacted]"
  
Successfully connected to both source and destination databases for course: hihello

First 10 documents in source course 'hihello':
- 24-SEGIP-ADA-041123-A11Y-2023-4-16-.pdf (ID: 595984)
- realm-export.json (ID: 666162)
- latex.pdf (ID: 1131619)
- QMK-Setup-MacOS.sh (ID: 1131620)

First 10 documents in destination course 'hihello1':
- 24-SEGIP-ADA-041123-A11Y-2023-4-16-.pdf (ID: 17)
- realm-export.json (ID: 18)
- latex.pdf (ID: 19)
- QMK-Setup-MacOS.sh (ID: 20)
Starting batch copy from hihello to hihello1 (batch size: 1000)
Fetching documents 0 to 999...
Processing batch of 4 documents...
Skipping existing document: 24-SEGIP-ADA-041123-A11Y-2023-4-16-.pdf in hihello1
Skipping existing document: realm-export.json in hihello1
Skipping existing document: latex.pdf in hihello1
Skipping existing document: QMK-Setup-MacOS.sh in hihello1
Operation completed. 0 documents copied.
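The second run above skips all four documents because they already exist in hihello1. A minimal sketch of that kind of check follows, assuming documents are matched by filename; the field name readable_filename and the row layout are assumptions, and the real script may compare on something else.

```python
from typing import Any, Dict, Iterable, List, Tuple


def partition_documents(
    source_docs: Iterable[Dict[str, Any]],
    existing_names: Iterable[str],
    name_field: str = "readable_filename",  # assumed column name
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split source rows into (rows to copy, rows skipped as already present)."""
    seen = set(existing_names)
    to_copy, skipped = [], []
    for doc in source_docs:
        name = doc[name_field]
        if name in seen:
            skipped.append(doc)
            continue
        # Drop the source primary key so the destination assigns its own ID,
        # which matches the differing IDs shown in the logs above.
        to_copy.append({k: v for k, v in doc.items() if k != "id"})
        # Track the name so duplicates within the same batch are also skipped.
        seen.add(name)
    return to_copy, skipped
```

Matching on filename alone would treat two different files with the same name as one document, so a composite key (for example filename plus storage path) may be safer if such collisions are possible.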

Note:
Warnings about API key usage over insecure connection and Qdrant client/server version mismatch are shown but do not affect functionality.


supabase bot commented Apr 22, 2025

This pull request has been ignored for the connected project twzwfuydgnnjcaopyfdv because there are no changes detected in supabase directory. You can change this behaviour in Project Integrations Settings ↗︎.



@drshika drshika requested a review from rohan-uiuc April 29, 2025 20:19
@drshika drshika requested a review from rohan-uiuc May 1, 2025 15:40
@rohan-uiuc

Overall this PR looks good, @drshika. I do see that if the script errors out or fails, there is no way to resume, and rerunning it will create duplicate records for what has already been processed. Can you please fix that?
Also, I'm setting up Qdrant, and we will need a different destination for SQL. Can you please add support for a new destination URL for both of them?
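One way to make the copy resumable is to persist the last completed offset after each batch. The sketch below is not taken from the PR: fetch_batch and write_batch are hypothetical callables standing in for the source read and destination write, and offset-based resuming assumes the source is read in a stable order.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Callable, List


def load_checkpoint(path: Path) -> int:
    """Return the saved offset, or 0 when no checkpoint file exists yet."""
    if not path.exists():
        return 0
    return int(json.loads(path.read_text())["offset"])


def save_checkpoint(path: Path, offset: int) -> None:
    """Write the offset atomically so a crash cannot leave a half-written file."""
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump({"offset": offset}, handle)
    os.replace(tmp_name, path)


def copy_with_resume(
    fetch_batch: Callable[[int, int], List[Any]],  # hypothetical: (offset, limit) -> rows
    write_batch: Callable[[List[Any]], None],      # hypothetical: writes rows to destination
    checkpoint_path: Path,
    batch_size: int = 1000,
) -> int:
    """Copy batches starting at the saved offset; return rows copied in this run."""
    offset = load_checkpoint(checkpoint_path)
    copied = 0
    while True:
        batch = fetch_batch(offset, batch_size)
        if not batch:
            break
        write_batch(batch)
        offset += len(batch)
        # Checkpoint only after the write succeeded.
        save_checkpoint(checkpoint_path, offset)
        copied += len(batch)
    return copied
```

A crash between the write and the checkpoint would replay one batch on restart, so the destination write still needs to be idempotent (an upsert or a skip-existing check) for the rerun to be free of duplicates.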


drshika commented Jun 10, 2025

Error message:

Fetching documents 519100 to 520099...
Error: The read operation timed out
Worker 2 exited with code 1
Worker 3 exited with code 1
Error: The read operation timed out

I've been running 4 threads with a batch size of 1000 each, but this crashes Supabase.
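A possible mitigation for the timeouts, sketched under assumptions: wrap each batch fetch in a retry with exponential backoff so a slow read delays a worker instead of killing it. Which exception the Supabase client raises for "The read operation timed out" is not confirmed here, so retry_on is a parameter and TimeoutError is only a placeholder default.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError,),  # placeholder default
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on the given exceptions with doubling delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: let the caller see the original error.
                raise
            # Delays grow as base_delay, 2x, 4x, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("max_attempts must be at least 1")
```

Retries alone do not reduce load on the database; lowering the batch size or the number of workers at offsets this deep (around 519,000 rows) is probably also needed, since large offsets tend to make each read more expensive.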
