feat(vespa-data-migration): migrating vespa using s3 #1101
base: main
@@ -0,0 +1,40 @@
#!/bin/bash
set -e
set -o pipefail

# ------------------------------------------------------------
# STEP 7: Retrieve and Decrypt Vespa Dump
# ------------------------------------------------------------

# ---------- Option 1 — using AWS S3 ----------
# ⚠️ Replace with your actual bucket name and path
aws s3 cp s3://your-bucket-name/dumps/dump.json.gz.enc .
Contributor
Replace the placeholder bucket name. The S3 bucket name your-bucket-name is a placeholder; consider using environment variables for configuration:

-aws s3 cp s3://your-bucket-name/dumps/dump.json.gz.enc .
+# Read bucket name from environment variable
+BUCKET_NAME="${VESPA_BACKUP_BUCKET:?Error: VESPA_BACKUP_BUCKET environment variable not set}"
+aws s3 cp "s3://${BUCKET_NAME}/dumps/dump.json.gz.enc" .
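The `${VAR:?message}` expansion used in the suggestion above is standard POSIX shell and fails fast when the variable is missing. A minimal self-contained sketch of that behavior (the variable and bucket names are illustrative, not part of the PR):

```shell
#!/bin/bash
# Sketch: ${VAR:?message} aborts with an error when VAR is unset or empty,
# so a missing bucket name fails fast instead of producing a bad S3 URL.
# VESPA_BACKUP_BUCKET is an assumed variable name.
bucket_url() {
  local bucket="${VESPA_BACKUP_BUCKET:?VESPA_BACKUP_BUCKET not set}"
  printf 's3://%s/dumps/dump.json.gz.enc\n' "$bucket"
}

VESPA_BACKUP_BUCKET="example-backups"
bucket_url   # prints s3://example-backups/dumps/dump.json.gz.enc
```

Running the function with the variable unset prints the error message and exits non-zero, which `set -e` then propagates.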

# Decrypt AES-256 encrypted dump (you’ll be prompted for password)
openssl enc -d -aes-256-cbc -pbkdf2 -salt \
  -in dump.json.gz.enc \
  -out dump.json.gz

# ---------- Option 2 — using GPG ----------
# Uncomment these lines if you used GPG encryption instead of OpenSSL

# yum install -y pinentry || apt install -y pinentry
# gpgconf --kill gpg-agent
# export GPG_TTY=$(tty)
# echo $GPG_TTY
# gpg --import my-private-key.asc
# gpg --list-secret-keys
# gpg --output dump.json.gz --decrypt dump.json.gz.gpg

# ------------------------------------------------------------
# STEP 8: Decompress and Feed into Vespa
# ------------------------------------------------------------
gunzip dump.json.gz
vespa-feed-client dump.json
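Before feeding, it can be worth verifying the archive: `gzip -t` checks integrity without extracting. A small self-contained round trip (the payload here is a stand-in, not the real dump):

```shell
#!/bin/bash
set -e
# Sketch: verify a gzip archive before decompressing and feeding it.
printf '{"put": "id:ns:doc::1", "fields": {}}\n' > dump.json  # stand-in payload
gzip -9 dump.json        # creates dump.json.gz, removes dump.json
gzip -t dump.json.gz     # exits non-zero if the archive is corrupt
gunzip dump.json.gz      # restores dump.json for vespa-feed-client
```

A corrupt download then fails at the `gzip -t` step instead of partway through the feed.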

# ------------------------------------------------------------
# Done 🎉
# ------------------------------------------------------------
echo "✅ Vespa data restored successfully!"
Contributor
The script leaves behind the downloaded and decompressed dump files. Consider removing these intermediate artifacts once the feed succeeds so they don't accumulate on the restore host.
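One way to address this, as a sketch assuming the filenames used in the restore script above, is an EXIT trap that removes the intermediates:

```shell
#!/bin/bash
# Sketch: clean up intermediate dump artifacts when the script exits.
# Filenames are assumed to match the restore script above.
cleanup() {
  rm -f dump.json.gz.enc dump.json.gz dump.json
}
trap cleanup EXIT   # runs on normal exit and on errors under set -e
```

Dropping the trap (or guarding it behind a flag) keeps the files around for debugging a failed feed.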
@@ -0,0 +1,66 @@
#!/bin/bash
set -e
set -o pipefail

# ------------------------------------------------------------
# STEP 0: AWS Configuration (Non-interactive)
# ------------------------------------------------------------
# ⚠️ Replace with your real credentials
export AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
Contributor (on lines +9 to +10)
CRITICAL: Remove hardcoded AWS credentials. Hardcoding AWS credentials in version control is a critical security vulnerability, even if these appear to be example values. Credentials should never be committed to the repository. Remove them and use environment variables or AWS credential profiles instead:

-# ⚠️ Replace with your real credentials
-export AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE"
-export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
-export AWS_DEFAULT_REGION="ap-south-1"
-export AWS_DEFAULT_OUTPUT="json"
+# Use AWS credentials from environment or AWS config
+# Ensure AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set
+if [ -z "$AWS_ACCESS_KEY_ID" ] || [ -z "$AWS_SECRET_ACCESS_KEY" ]; then
+    echo "Error: AWS credentials not configured. Please set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY"
+    echo "Or configure AWS CLI with: aws configure"
+    exit 1
+fi
+
+export AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-ap-south-1}"
+export AWS_DEFAULT_OUTPUT="${AWS_DEFAULT_OUTPUT:-json}"

Additionally, scan the repository history for any committed credentials using a secret-scanning tool.

export AWS_DEFAULT_REGION="ap-south-1"
export AWS_DEFAULT_OUTPUT="json"
Contributor (on lines +8 to +12)
Hardcoding AWS credentials in a script, even as examples, is a significant security risk. It encourages a bad practice that can lead to accidentally committing real credentials. The script should instead rely on the standard AWS CLI credential chain (e.g., IAM roles, environment variables, or the shared credentials file).

# AWS performance tuning (optional)
aws configure set default.s3.max_concurrent_requests 20
aws configure set default.s3.multipart_threshold 64MB
aws configure set default.s3.multipart_chunksize 64MB
aws configure set default.s3.max_queue_size 100
aws configure set default.s3.multipart_upload_threshold 64MB
aws configure set default.s3.multipart_max_attempts 5
Contributor (on lines +14 to +20)
Modifying the user's global AWS configuration with `aws configure set` is a side effect that persists after the script exits; these settings should be scoped to the script instead. Note also that some of the settings may be redundant or unrecognized by the CLI.

aws sts get-caller-identity

# ------------------------------------------------------------
# STEP 1: Start Vespa container (optional if already running)
# ------------------------------------------------------------
# docker run -d --name vespa-testing \
#   -e VESPA_IGNORE_NOT_ENOUGH_MEMORY=true \
#   -p 8181:8080 \
#   -p 19171:19071 \
#   -p 2224:22 \
#   vespaengine/vespa:latest

# ------------------------------------------------------------
# STEP 2: Export Vespa data
# ------------------------------------------------------------
vespa visit --content-cluster my_content --make-feed > dump.json

# ------------------------------------------------------------
# STEP 3: Compress dump file
# ------------------------------------------------------------
apt install -y pigz || yum install -y pigz
pigz -9 dump.json  # creates dump.json.gz
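The install line assumes one of the two package managers succeeds. A slightly more defensive sketch falls back to plain `gzip` (same output format) when `pigz` is unavailable; the payload here is a stand-in:

```shell
#!/bin/bash
set -e
# Sketch: compress with pigz (parallel gzip) when present, else gzip.
printf 'stand-in dump contents\n' > dump.json
if command -v pigz >/dev/null 2>&1; then
  pigz -9 dump.json    # creates dump.json.gz, removes dump.json
else
  gzip -9 dump.json    # same .gz output, single-threaded
fi
```

Either branch produces a dump.json.gz that the restore script's `gunzip` handles identically.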

# ------------------------------------------------------------
# STEP 4: Encrypt dump file (AES-256)
# ------------------------------------------------------------
# ⚠️ You’ll be prompted for password — can automate with -pass if needed
openssl enc -aes-256-cbc -pbkdf2 -salt \
  -in dump.json.gz \
  -out dump.json.gz.enc
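For non-interactive runs, OpenSSL's `-pass` option can read the passphrase from an environment variable. A self-contained round trip as a sketch; DUMP_PASSPHRASE is an assumed variable name and the payload is a stand-in:

```shell
#!/bin/bash
set -e
# Sketch: non-interactive AES-256 encrypt/decrypt round trip.
# DUMP_PASSPHRASE is an assumed name; in practice load it from a secret store.
export DUMP_PASSPHRASE='change-me'
printf 'stand-in dump contents\n' > sample.gz.txt
openssl enc -aes-256-cbc -pbkdf2 -salt \
  -pass env:DUMP_PASSPHRASE \
  -in sample.gz.txt -out sample.gz.txt.enc
openssl enc -d -aes-256-cbc -pbkdf2 \
  -pass env:DUMP_PASSPHRASE \
  -in sample.gz.txt.enc -out sample.roundtrip.txt
```

Reading from the environment keeps the passphrase out of the command line (and thus out of `ps` output and shell history), unlike `-pass pass:...`.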

# ------------------------------------------------------------
# STEP 5: Upload to AWS S3
# ------------------------------------------------------------
aws s3 cp dump.json.gz.enc s3://your-bucket-name/dumps/
Contributor
Replace the placeholder S3 bucket name. The bucket name your-bucket-name is a placeholder; read it from an environment variable and add a timestamp so successive dumps don't overwrite each other:

-aws s3 cp dump.json.gz.enc s3://your-bucket-name/dumps/
+BUCKET_NAME="${VESPA_BACKUP_BUCKET:?Error: VESPA_BACKUP_BUCKET not set}"
+TIMESTAMP=$(date +%Y-%m-%d-%H%M%S)
+aws s3 cp dump.json.gz.enc "s3://${BUCKET_NAME}/dumps/dump-${TIMESTAMP}.json.gz.enc"
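Independent of bucket naming, it may also help to record a checksum before upload so the restore side can verify the download. A self-contained sketch using a stand-in file and `sha256sum` from GNU coreutils:

```shell
#!/bin/bash
set -e
# Sketch: write a checksum next to the encrypted dump before uploading,
# then verify it after download on the target machine.
printf 'stand-in encrypted dump\n' > dump.json.gz.enc
sha256sum dump.json.gz.enc > dump.json.gz.enc.sha256
# ... upload both files, download both on the restore host, then:
sha256sum -c dump.json.gz.enc.sha256   # non-zero exit on mismatch
```

This catches truncated or corrupted transfers before the decrypt step produces a confusing OpenSSL error.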

# Optional: show progress bar (Linux only)
# aws s3 cp dump.json.gz.enc s3://your-bucket-name/dumps/ --expected-size $(stat -c%s dump.json.gz.enc)

# ------------------------------------------------------------
# STEP 6: (Optional) Transfer over SSH
# ------------------------------------------------------------
# rsync -avzP --inplace --partial --append -e "ssh -p 2224" dump.json.gz.enc root@192.168.1.6:/home/root/

echo "✅ Vespa dump, compression, encryption, and upload completed successfully!"
@@ -0,0 +1,157 @@
/* #!/bin/bash

// ------------------------------------------------------------
// STEP 1: Start Vespa container for dump creation
// ------------------------------------------------------------

//docker run -d --name vespa-testing \
//-e VESPA_IGNORE_NOT_ENOUGH_MEMORY=true \
//-p 8181:8080 \
//-p 19171:19071 \
//-p 2224:22 \
//vespaengine/vespa:latest

// ------------------------------------------------------------
// STEP 2: Export Vespa data
// ------------------------------------------------------------

"vespa visit --content-cluster my_content --make-feed > dump.json"

// ------------------------------------------------------------
// STEP 3: Compress dump file
// ------------------------------------------------------------

"apt install -y pigz"
//# or yum install pigz

// pigz is parallel gzip (much faster)
// pigz -9 (1.15 hr, ~280 GB) or -7 (1 hr, ~320 GB)
"pigz -9 dump.json"
// creates dump.json.gz

// if pigz is not available, fallback to gzip
//gzip -9 dump.json
//(gzip -9 -c dump.json > dump.json.gz)

// ------------------------------------------------------------
// STEP 4: Encrypt dump file (OpenSSL password-based)
// ------------------------------------------------------------

"openssl enc -aes-256-cbc -pbkdf2 -salt \
  -in dump.json.gz \
  -out dump.json.gz.enc"

// Strong AES-256 encryption, password will be prompted
// dump.json.gz.enc → safe to transfer/upload

// ------------------------------------------------------------
// OPTIONAL: GPG-based encryption (if using keypair)
// ------------------------------------------------------------

//gpg --full-generate-key
//gpg --list-keys
//gpg --output dump.json.gz.gpg --encrypt --recipient <A1B2C3D4E5F6G7H8> dump.json.gz

// ------------------------------------------------------------
// STEP 5: Set up SSH access for container-to-host transfer
// ------------------------------------------------------------

// ssh-keygen -t rsa -b 4096 -C "mohd.shoaib@juspay.in"
// cat ~/.ssh/id_rsa.pub

// On your **remote machine (container)**:
//docker exec -it --user root vespa-testing /bin/bash

//apt-get or yum update
//apt-get or yum install -y openssh-client (openssh first)
//ssh-keygen -A
//yum install -y openssh-server
//mkdir -p /var/run/sshd
///usr/sbin/sshd

//mkdir -p ~/.ssh
//chmod 700 ~/.ssh
//echo "PASTE_YOUR_PUBLIC_KEY_HERE" >> ~/.ssh/authorized_keys
//chmod 600 ~/.ssh/authorized_keys

//yum install -y rsync

// ------------------------------------------------------------
// STEP 6: Test SSH + Transfer dump or key
// ------------------------------------------------------------

//ssh -p 2224 root@192.168.1.6 - testing

//yum install -y rsync

//gpg --export-secret-keys --armor BF4AF7E7E3955EF3A436A4ED7C59556BFC58DFAF > my-private-key.asc

//rsync -avzP --inplace --partial --append -e "ssh -p 2224" my-private-key.asc root@192.168.1.6:/home/

"brew install awscli"

"aws configure"

"AWS Access Key ID [None]: ****************"
"AWS Secret Access Key [None]: ********************"
"Default region name [None]: ap-south-1"
"Default output format [None]: json"

//AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
//AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
//Default region name [None]: ap-south-1
//Default output format [None]: json

// For fast file transfer
"aws configure set default.s3.max_concurrent_requests 20"
"aws configure set default.s3.multipart_threshold 64MB"
//Check your identity:
"aws sts get-caller-identity"

// for Making transfers faster (optional)
"aws configure set default.s3.multipart_chunksize 64MB"
"aws configure set default.s3.max_queue_size 100"
"aws configure set default.s3.multipart_upload_threshold 64MB"
"aws configure set default.s3.multipart_max_attempts 5"

"aws s3 cp dump.json.gz.enc s3://your-bucket-name/dumps/"

//Optional (show progress bar):
"aws s3 cp dump.json.gz.enc s3://xyne-vespa-backups/2025-10-13/ --expected-size $(stat -c%s dump.json.gz.enc)"

//rsync -avzP --inplace --partial --append -e "ssh -p 2224" dump.json.gz.gpg root@192.168.1.6:/home/root/

// ------------------------------------------------------------
// STEP 7: On the new machine
// ------------------------------------------------------------

// Option 1 — using AWS S3
"aws s3 cp s3://your-bucket-name/dumps/dump.json.gz.enc ."

"openssl enc -d -aes-256-cbc -pbkdf2 -salt \
  -in dump.json.gz.enc \
  -out dump.json.gz"

// Option 2 — if using GPG
//yum install -y pinentry
//gpgconf --kill gpg-agent
//export GPG_TTY=$(tty)
//echo $GPG_TTY

//gpg --import my-private-key.asc
//gpg --list-secret-keys
//gpg --output dump.json.gz --decrypt dump.json.gz.gpg

// ------------------------------------------------------------
// STEP 8: Decompress and feed into Vespa
// ------------------------------------------------------------

"gunzip dump.json.gz"

"vespa-feed-client dump.json"

// ------------------------------------------------------------
// Done 🎉
// ------------------------------------------------------------
*/
Contributor (on lines +1 to +157)
This file appears to be a collection of notes and scratchpad commands, and it's entirely commented out. It also contains example AWS credentials (lines 100-101), which is a security risk even when commented, as they can be flagged by security scanners and promote bad practices. It's best to remove this file from the pull request. If these are important notes, they should be moved to a more appropriate place like a README or a wiki, with any sensitive examples removed.

Contributor (on lines +1 to +157)
🛠️ Refactor suggestion | 🟠 Major: Convert to an appropriate file format for documentation. This file mixes C-style comment markers with quoted shell commands, so it is neither a runnable script nor readable documentation. Consider one of these options:

Option 1 (Recommended): Convert to Markdown. Rename the file with a `.md` extension and present each step as a fenced code block:

# Vespa Data Migration Guide

## Step 1: Start Vespa Container
```bash
docker run -d --name vespa-testing \
  -e VESPA_IGNORE_NOT_ENOUGH_MEMORY=true \
  -p 8181:8080 \
  ...
```

## Step 2: Export Vespa Data
```bash
vespa visit --content-cluster my_content --make-feed > dump.json
```
...

Option 2: Convert to an executable shell script. Rename the file with a `.sh` extension, drop the comment wrappers and quotes, and keep only the runnable commands.
Contributor
Hardcoding values like the S3 bucket name and file paths makes the script less reusable and harder to maintain. It's a best practice to define these as variables at the top of the script, which allows for easier configuration across different environments.