Brief Overview:
-
Install DVC with S3 support:
pip install "dvc[s3]"
-
Set environment variables:
export AWS_ACCESS_KEY_ID=<your-access-key>
export AWS_SECRET_ACCESS_KEY=<your-secret-key>
export AWS_DEFAULT_REGION=ca-central-1
export AWS_REQUEST_CHECKSUM_CALCULATION='WHEN_REQUIRED'
-
Pull data from S3 (approx. 10GB+; this can take some time):
dvc pull
-
All data will be stored in the data folder.
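Since the pull depends on the variables above being exported, a quick preflight check can save a failed attempt. The following is a minimal sketch, assuming a POSIX-style shell with printenv available; check_aws_env is a hypothetical helper name, not part of DVC or the AWS tooling.

```shell
# check_aws_env: hypothetical helper that reports which of the variables
# listed above are not exported (or are empty) in the current shell.
check_aws_env() {
  missing=0
  for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY \
              AWS_DEFAULT_REGION AWS_REQUEST_CHECKSUM_CALCULATION; do
    # printenv only sees exported variables, which is what a child
    # process such as dvc would actually receive.
    if [ -z "$(printenv "$name")" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

One way to use it is to run the pull only when the check passes: check_aws_env && dvc pull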
To push updates to S3, use DVC similarly to Git:
-
Add your data folder:
dvc add data/<your-folder>
-
Commit changes (this records the updated hash in Git; stage the .dvc file written by dvc add first if it is not already staged):
git commit -m "Update data hash"
-
Push your data to S3:
dvc push
-
Push changes to Git and open a PR with the new hash. This step is critical, because the hash in Git is what dvc pull uses to locate the data; if it is wrong or missing, the data cannot be retrieved:
git push
Always keep DVC and Git in sync: every dvc push should be accompanied by a git push of the matching hash, so that Git never points to data that is missing from S3.
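The push steps above can be chained so that a failure at any stage stops the rest, which helps keep the two tools in sync. This is only a sketch: run and publish_data are hypothetical helpers, DRY_RUN is an illustrative variable, and the git add paths assume the default dvc add behavior of writing a .dvc file next to the folder and updating the .gitignore in its parent directory.

```shell
# run: hypothetical helper that executes a command, or only prints it
# when DRY_RUN is set to a non-empty value.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# publish_data: hypothetical wrapper around the steps listed above.
# Assumption: dvc add writes "$folder.dvc" and updates the .gitignore
# in the folder's parent directory; adjust the paths if your setup differs.
publish_data() {
  folder=${1%/}   # drop a trailing slash so "$folder.dvc" is well-formed
  run dvc add "$folder" &&
    run git add "$folder.dvc" "$(dirname "$folder")/.gitignore" &&
    run git commit -m "Update data hash" &&
    run dvc push &&
    run git push
}
```

Running it with DRY_RUN=1 set (for example, DRY_RUN=1 publish_data data/<your-folder>) prints the commands without executing them, which is a cheap way to review the sequence before touching S3 or the remote repository.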