Conversation
Initial cut at implementation by GitHub Copilot. mitodl/hq#8703
Summary of Changes (Gemini Code Assist)
This pull request introduces a solution for maintaining data consistency of static resources across environments, particularly after database restores. By implementing a new Concourse pipeline, the system can now automatically and periodically synchronize content from a designated S3 bucket to the application's primary storage bucket. This feature enhances the resilience and operational efficiency of the platform by ensuring that all referenced static assets are readily available.
Code Review
This pull request introduces a new pipeline to periodically synchronize content between two S3 buckets. The implementation includes a new Concourse pipeline definition, a Celery task to manage it, and a corresponding management command. The changes are generally well-structured, but I've found a critical bug in the pipeline definition's shell command that would cause it to fail, and a minor redundancy in the class initialization. My specific comments are provided below.
        aws s3{CLI_ENDPOINT_URL} sync s3://{import_bucket}/ s3://{storage_bucket}/
        """

The multiline f-string used for sync_commands includes leading whitespace and newlines, which are passed verbatim to sh -exc. This may cause the command to misbehave; the command should be defined on a single line without leading whitespace:

sync_commands = f"aws s3{CLI_ENDPOINT_URL} sync s3://{import_bucket}/ s3://{storage_bucket}/"
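For illustration, a minimal sketch (not from the PR) of one way to keep the source readable while still producing a single-line shell command; it assumes the same names already in scope in the definition's __init__:

```python
# Sketch only: assemble the sync command from parts so no stray newlines or
# indentation from the Python source end up in the string handed to `sh -exc`.
sync_commands = " ".join(
    [
        f"aws s3{CLI_ENDPOINT_URL}",
        "sync",
        f"s3://{import_bucket}/",
        f"s3://{storage_bucket}/",
    ]
)
```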
Pull request overview
Adds a new Concourse pipeline + supporting Celery/management-command wiring to periodically sync static assets between S3 buckets (to address missing referenced resources after DB restores).
Changes:
- Introduces new settings for the source bucket and sync interval (AWS_IMPORT_STORAGE_BUCKET_NAME, AWS_S3_SYNC_INTERVAL).
- Adds a new Concourse pipeline definition and pipeline class to run aws s3 sync on a timer.
- Adds API/task/management command entrypoints to upsert (and optionally unpause) the pipeline (a usage sketch follows this list).
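For orientation, a hedged sketch of invoking the new management command from Python; the command name comes from the file list below, but the unpause option name is an assumption based on the task signature:

```python
from django.core.management import call_command

# Sketch: upsert the S3 bucket sync pipeline and (optionally) unpause it.
# The "unpause" option is assumed to mirror the Celery task's keyword argument.
call_command("upsert_s3_bucket_sync_pipeline", unpause=True)
```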
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Summary per file:
| File | Description |
|---|---|
| main/settings.py | Adds env settings for source bucket + interval for the sync pipeline. |
| content_sync/api.py | Exposes get_s3_bucket_sync_pipeline() for backend-specific pipeline construction. |
| content_sync/tasks.py | Adds Celery task to upsert/unpause the new pipeline. |
| content_sync/pipelines/base.py | Adds a base pipeline type/name for the S3 sync pipeline. |
| content_sync/pipelines/concourse.py | Implements the Concourse S3BucketSyncPipeline that upserts the new definition. |
| content_sync/pipelines/definitions/concourse/s3_bucket_sync_pipeline.py | New pipeline definition with timer resource + AWS CLI sync task. |
| content_sync/management/commands/upsert_s3_bucket_sync_pipeline.py | Adds management command to upsert (and optionally delete/unpause) the pipeline. |
@is_publish_pipeline_enabled
def get_s3_bucket_sync_pipeline(api: object | None = None) -> BasePipeline:
    """Get the S3 bucket sync pipeline"""
    return import_string(
        f"content_sync.pipelines.{settings.CONTENT_SYNC_PIPELINE_BACKEND}.S3BucketSyncPipeline"
    )(api=api)
Add unit tests for this new API entrypoint (at minimum: verify that the correct import path is used based on CONTENT_SYNC_PIPELINE_BACKEND, and that the decorator returns None when the pipeline backend is unset). Existing pipeline getters in this module are covered by api_test.py, but this one is not.
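A minimal sketch of what such a test could look like, assuming the pytest-django `settings` and pytest-mock `mocker` fixtures used elsewhere in this repo; the None-return behavior of the decorator is taken from this comment, not verified:

```python
import pytest

from content_sync import api


@pytest.mark.parametrize("backend", ["concourse"])
def test_get_s3_bucket_sync_pipeline(settings, mocker, backend):
    """The backend-specific pipeline class should be imported and instantiated"""
    settings.CONTENT_SYNC_PIPELINE_BACKEND = backend
    # Assumes import_string is imported into content_sync.api, as the diff suggests.
    import_mock = mocker.patch("content_sync.api.import_string")
    api.get_s3_bucket_sync_pipeline(api=None)
    import_mock.assert_called_once_with(
        f"content_sync.pipelines.{backend}.S3BucketSyncPipeline"
    )


def test_get_s3_bucket_sync_pipeline_disabled(settings):
    """None should be returned when no pipeline backend is configured"""
    settings.CONTENT_SYNC_PIPELINE_BACKEND = None
    assert api.get_s3_bucket_sync_pipeline(api=None) is None
```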
@app.task(acks_late=True)
def upsert_s3_bucket_sync_pipeline(
    unpause=False,  # noqa: FBT002
) -> bool:
    """Upsert the S3 bucket sync pipeline"""
    pipeline = api.get_s3_bucket_sync_pipeline()
    pipeline.upsert_pipeline()
    if unpause:
        pipeline.unpause()
    return True
Add test coverage for this new Celery task (e.g., mock api.get_s3_bucket_sync_pipeline(), assert upsert_pipeline is always called, and assert unpause() is called only when unpause=True). Similar tasks in this file are covered in tasks_test.py, but this new task currently isn’t.
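A hedged sketch of such a test, assuming pytest-mock and that the task can be invoked directly like the similar tasks in tasks_test.py; fixture details are assumptions:

```python
import pytest

from content_sync import tasks


@pytest.mark.parametrize("unpause", [True, False])
def test_upsert_s3_bucket_sync_pipeline_task(mocker, unpause):
    """The pipeline should always be upserted and only unpaused when requested"""
    mock_pipeline = mocker.Mock()
    mocker.patch(
        "content_sync.tasks.api.get_s3_bucket_sync_pipeline",
        return_value=mock_pipeline,
    )
    # Celery task objects are callable, so this runs the task body locally.
    tasks.upsert_s3_bucket_sync_pipeline(unpause=unpause)
    mock_pipeline.upsert_pipeline.assert_called_once()
    assert mock_pipeline.unpause.call_count == (1 if unpause else 0)
```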
class S3BucketSyncPipeline(BaseS3BucketSyncPipeline, GeneralPipeline):
    """Concourse pipeline to sync S3 buckets periodically"""

    PIPELINE_NAME = BaseS3BucketSyncPipeline.PIPELINE_NAME

    def __init__(self, api: PipelineApi | None = None):
        """Initialize the pipeline instance"""
        self.MANDATORY_SETTINGS = [
            *MANDATORY_CONCOURSE_SETTINGS,
            "AWS_STORAGE_BUCKET_NAME",
            "AWS_IMPORT_STORAGE_BUCKET_NAME",
        ]
        super().__init__(api=api)

    def upsert_pipeline(self):
        """
        Create or update the concourse pipeline
        """
        pipeline_definition = S3BucketSyncPipelineDefinition(
            import_bucket=settings.AWS_IMPORT_STORAGE_BUCKET_NAME,
            storage_bucket=settings.AWS_STORAGE_BUCKET_NAME,
            sync_interval=settings.AWS_S3_SYNC_INTERVAL,
        )
        self.upsert_config(pipeline_definition.json(), self.PIPELINE_NAME)
Add tests for the new S3BucketSyncPipeline class (similar to existing ThemeAssetsPipeline/MassBuildSitesPipeline tests): verify mandatory settings are enforced and that upsert_pipeline uses S3BucketSyncPipelineDefinition with the expected buckets/interval, and upserts config under the expected pipeline name.
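A rough sketch of what such a test might look like, assuming (as in the existing Concourse pipeline tests) that the test environment supplies the mandatory Concourse settings so the class can be instantiated; names and fixtures are illustrative:

```python
import pytest

from content_sync.pipelines import concourse


def test_s3_bucket_sync_pipeline_upsert(settings, mocker):
    """upsert_pipeline should build the definition from settings and upsert it by name"""
    settings.AWS_IMPORT_STORAGE_BUCKET_NAME = "source-bucket"
    settings.AWS_STORAGE_BUCKET_NAME = "destination-bucket"
    settings.AWS_S3_SYNC_INTERVAL = "12h"
    definition_mock = mocker.patch(
        "content_sync.pipelines.concourse.S3BucketSyncPipelineDefinition"
    )
    # Assumes mandatory Concourse settings are present in the test environment.
    pipeline = concourse.S3BucketSyncPipeline(api=mocker.Mock())
    upsert_config_mock = mocker.patch.object(pipeline, "upsert_config")
    pipeline.upsert_pipeline()
    definition_mock.assert_called_once_with(
        import_bucket="source-bucket",
        storage_bucket="destination-bucket",
        sync_interval="12h",
    )
    upsert_config_mock.assert_called_once_with(
        definition_mock.return_value.json(), pipeline.PIPELINE_NAME
    )
```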
class S3BucketSyncPipelineDefinition(Pipeline):
    """
    A Pipeline that syncs S3 buckets periodically:

    - Triggers on a configurable time interval
    - Uses AWS CLI to sync from AWS_IMPORT_STORAGE_BUCKET_NAME to AWS_STORAGE_BUCKET_NAME

    Args:
        import_bucket(str): The S3 bucket to sync from (source)
        storage_bucket(str): The S3 bucket to sync to (destination)
        sync_interval(str): The time interval for syncing (e.g., "24h", "12h", "6h")
    """

    def __init__(
        self,
        import_bucket: str,
        storage_bucket: str,
        sync_interval: str = "24h",
        **kwargs,
    ):
        base = super()
        base.__init__(**kwargs)

        # Time-based trigger resource
        timer_resource = Resource(
            name=s3_sync_timer_identifier,
            type="time",
            icon="clock-outline",
            source={"interval": sync_interval},
        )

        resources = [timer_resource]
        resource_types = []

        # Get step for time trigger
        get_timer_step = GetStep(
            get=s3_sync_timer_identifier,
            trigger=True,
        )

        # AWS S3 sync task
        sync_commands = f"""
        aws s3{CLI_ENDPOINT_URL} sync s3://{import_bucket}/ s3://{storage_bucket}/
        """

        s3_sync_task = TaskStep(
            task=s3_sync_task_identifier,
            timeout="2h",
            attempts=3,
            config=TaskConfig(
                platform="linux",
                image_resource=AWS_CLI_REGISTRY_IMAGE,
                params=(
                    {}
                    if not is_dev()
                    else {
                        "AWS_ACCESS_KEY_ID": settings.AWS_ACCESS_KEY_ID or "",
                        "AWS_SECRET_ACCESS_KEY": settings.AWS_SECRET_ACCESS_KEY or "",
                    }
                ),
                run=Command(
                    path="sh",
                    args=["-exc", sync_commands],
                ),
            ),
        )

        tasks = [get_timer_step, s3_sync_task]
        job = Job(name=s3_sync_job_identifier, serial=True, plan=tasks)

        base.__init__(
            resource_types=resource_types, resources=resources, jobs=[job], **kwargs
        )
Add unit tests for this new pipeline definition (consistent with other definitions in this directory): assert the generated JSON includes the time resource interval, the job name, and the expected aws s3 sync command (including the dev endpoint URL behavior).
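A hedged sketch of such a definition test; the exact JSON key paths depend on how the ol-concourse Pipeline model serializes, so treat the indexing below as illustrative rather than exact:

```python
import json

from content_sync.pipelines.definitions.concourse.s3_bucket_sync_pipeline import (
    S3BucketSyncPipelineDefinition,
)


def test_s3_bucket_sync_pipeline_definition():
    """The rendered pipeline should contain the timer interval, the job, and the sync command"""
    definition = S3BucketSyncPipelineDefinition(
        import_bucket="source-bucket",
        storage_bucket="destination-bucket",
        sync_interval="6h",
    )
    rendered = json.loads(definition.json())
    # Key paths below assume the rendered JSON follows the standard Concourse schema.
    timer = next(res for res in rendered["resources"] if res["type"] == "time")
    assert timer["source"]["interval"] == "6h"
    job = rendered["jobs"][0]
    sync_step = job["plan"][1]
    command = sync_step["config"]["run"]["args"][-1]
    assert "aws s3" in command
    assert "sync s3://source-bucket/ s3://destination-bucket/" in command
```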
        sync_commands = f"""
        aws s3{CLI_ENDPOINT_URL} sync s3://{import_bucket}/ s3://{storage_bucket}/
        """

        s3_sync_task = TaskStep(
            task=s3_sync_task_identifier,
            timeout="2h",
            attempts=3,
            config=TaskConfig(
                platform="linux",
                image_resource=AWS_CLI_REGISTRY_IMAGE,
                params=(
                    {}
                    if not is_dev()
                    else {
                        "AWS_ACCESS_KEY_ID": settings.AWS_ACCESS_KEY_ID or "",
                        "AWS_SECRET_ACCESS_KEY": settings.AWS_SECRET_ACCESS_KEY or "",
                    }
                ),
                run=Command(
                    path="sh",
                    args=["-exc", sync_commands],
                ),
            ),
This sync task doesn’t currently apply the repository’s AWS CLI concurrency tuning (AWS_MAX_CONCURRENT_CONNECTIONS). Other large s3 sync operations configure default.s3.max_concurrent_requests from this setting before running aws s3 sync, which can materially affect runtime for large buckets. Consider setting that config at the start of this task and passing AWS_MAX_CONCURRENT_CONNECTIONS as a param/env so the setting is honored here too.
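A rough sketch of one way that could look inside the definition, assuming AWS_MAX_CONCURRENT_CONNECTIONS exists as this comment describes; the param plumbing shown here is an assumption, not the repo's actual pattern:

```python
# Sketch only: tune AWS CLI concurrency before the sync, mirroring the behavior
# this comment describes for other large sync tasks in the repo.
sync_commands = (
    "aws configure set default.s3.max_concurrent_requests "
    '"$AWS_MAX_CONCURRENT_CONNECTIONS" && '
    f"aws s3{CLI_ENDPOINT_URL} sync s3://{import_bucket}/ s3://{storage_bucket}/"
)

# ...and expose the value to the task's shell via params (setting name assumed):
params = {
    "AWS_MAX_CONCURRENT_CONNECTIONS": str(settings.AWS_MAX_CONCURRENT_CONNECTIONS),
}
```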
Initial cut at implementation by GitHub Copilot.
mitodl/hq#8703
What are the relevant tickets?
Fixes #8703
Description (What does it do?)
When doing a database restore in ocw-studio, you will have WebsiteContent that references static resources that don't exist in your environment. We would like to be able to periodically sync all static resources from another instance of ocw-studio into the AWS S3 bucket denoted by the env variable AWS_STORAGE_BUCKET_NAME.
How can this be tested?
I have no earthly idea, and that's a problem. I'll talk to @gumaerc about this on Monday.