feat: add auto update script for all datasets in marked channels #178 #181

Merged
Fedir-Yatsenko merged 6 commits into development from
feat/178-add-auto-update-script-for-all-datasets-in-marked-channels
Mar 5, 2026

Conversation

@Fedir-Yatsenko
Collaborator

@Fedir-Yatsenko Fedir-Yatsenko commented Mar 5, 2026

Applicable issues

Description of changes

Adds a new ADMIN_MODE=AUTO_UPDATE container mode that runs a batch auto-update script for all datasets in channels with allow_auto_update enabled.

The script (statgpt/admin/auto_update.py) performs four stages:

  1. Discovery — finds eligible channels and creates auto-update jobs for their datasets
  2. Processing — runs all jobs concurrently (respecting the @background_task semaphore)
  3. Deduplication — deduplicates dimensions for channels that triggered a reindex
  4. Results — logs per-channel summary with reindex status breakdown, e.g. channel 'statgpt-sample' (id=1): 3 NO_CHANGES, 2 REINDEX_TRIGGERED (1 COMPLETED, 1 FAILED)
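
The four stages above can be sketched roughly as follows. This is a minimal illustration, not the code in statgpt/admin/auto_update.py: `run_pipeline`, `process_job`, `deduplicate`, the job dictionaries, and the odd/even rule that picks a result are all hypothetical. The limit of 5 concurrent slots is assumed from the "Available slots: 4/5" lines in the logs, and a plain `asyncio.Semaphore` stands in for the `@background_task` semaphore.

```python
from __future__ import annotations

import asyncio
from collections import defaultdict

MAX_CONCURRENT = 5  # assumption taken from "Available slots: 4/5" in the logs


async def process_job(job: dict, semaphore: asyncio.Semaphore) -> dict:
    """Stage 2 worker: handle one job while holding a semaphore slot."""
    async with semaphore:
        await asyncio.sleep(0)  # placeholder for the real auto-update work
        # Hypothetical rule so the sketch yields both kinds of result.
        result = "REINDEX_TRIGGERED" if job["dataset_id"] % 2 else "NO_CHANGES"
        return {**job, "result": result}


async def deduplicate(channel_id: int) -> None:
    """Stage 3 placeholder: the real script deduplicates dimensions here."""
    await asyncio.sleep(0)


async def run_pipeline(channels: dict[int, list[int]]):
    # Stage 1, discovery: one job per dataset in each eligible channel.
    jobs = [
        {"channel_id": cid, "dataset_id": did}
        for cid, dataset_ids in sorted(channels.items())
        for did in dataset_ids
    ]
    # Stage 2, processing: all jobs start at once, the semaphore caps concurrency.
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    finished = await asyncio.gather(*(process_job(j, semaphore) for j in jobs))
    # Stage 3, deduplication: only for channels where a reindex was triggered.
    reindexed = sorted(
        {j["channel_id"] for j in finished if j["result"] == "REINDEX_TRIGGERED"}
    )
    for cid in reindexed:
        await deduplicate(cid)
    # Stage 4, results: collect per-channel outcomes for the summary.
    summary = defaultdict(list)
    for j in finished:
        summary[j["channel_id"]].append(j["result"])
    return dict(summary), reindexed
```

Calling `asyncio.run(run_pipeline({1: [1, 2, 3], 2: [2, 4]}))` returns the per-channel results together with `[1]`, the only channel that would be deduplicated in this toy input.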

Also adds make statgpt_auto_update and make statgpt_fix_statuses targets.

Checklist

By submitting this pull request, I confirm that my contribution is made under the terms of the MIT license.

Fedir-Yatsenko and others added 3 commits March 3, 2026 14:53
Add batch auto-update functionality that processes all datasets in channels with `allow_auto_update` enabled in their data_query config.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Refactor run_auto_update into discrete stage functions with log separators
- Add per-channel result summary in check_auto_update_results (grouped by AutoUpdateResult)
- Run deduplication automatically for channels that triggered reindex
- Refactor create_auto_update_jobs to accept channel_ids and return job schemas
- Use deployment_id in channel log messages for consistency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Split check_auto_update_results into get_reindex_channel_ids and
  get_auto_update_results (returns AutoUpdateChannelResult data)
- Move result logging to the auto_update script
- Run deduplication before printing final statistics
- Show reindex version status breakdown for REINDEX_TRIGGERED results

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Fedir-Yatsenko Fedir-Yatsenko self-assigned this Mar 5, 2026
@Fedir-Yatsenko Fedir-Yatsenko requested a review from ypldan as a code owner March 5, 2026 09:43
@Fedir-Yatsenko Fedir-Yatsenko added the enhancement and python labels Mar 5, 2026
@Fedir-Yatsenko Fedir-Yatsenko linked an issue Mar 5, 2026 that may be closed by this pull request
@Fedir-Yatsenko
Collaborator Author

Fedir-Yatsenko commented Mar 5, 2026

/deploy-review

GitHub actions run: 22712419305
Environment URL: review-environment | pipeline

@Fedir-Yatsenko
Collaborator Author

Fedir-Yatsenko commented Mar 5, 2026

Example of script logs

NOTE: Tested locally with docker-compose.

2026-03-05T09:26:36.074072625Z ADMIN_MODE = 'AUTO_UPDATE'
2026-03-05T09:26:55.031056018Z INFO: 2026-03-05 09:26:55 Starting batch auto-update script...
2026-03-05T09:26:55.034524310Z INFO: 2026-03-05 09:26:55 --------------------------------------------------
2026-03-05T09:26:55.035763515Z INFO: 2026-03-05 09:26:55 Attempting to create default engine (attempt 1/5)
2026-03-05T09:26:55.197901928Z INFO: 2026-03-05 09:26:55 default engine created and connection verified
2026-03-05T09:26:55.331037559Z INFO: 2026-03-05 09:26:55 Found 5 channel(s) with auto-update enabled
2026-03-05T09:26:55.409303046Z INFO: 2026-03-05 09:26:55 Created 4 auto-update job(s) for channel '{channel11}' (id=11)
2026-03-05T09:26:55.409379529Z INFO: 2026-03-05 09:26:55 Created 17 auto-update job(s) for channel '{channel5}' (id=5)
2026-03-05T09:26:55.409800505Z INFO: 2026-03-05 09:26:55 Created 2 auto-update job(s) for channel 'statgpt-sample' (id=4)
2026-03-05T09:26:55.409857146Z INFO: 2026-03-05 09:26:55 Created 8 auto-update job(s) for channel '{channel6}' (id=6)
2026-03-05T09:26:55.409948473Z INFO: 2026-03-05 09:26:55 Created 2 auto-update job(s) for channel 'statgpt-sample-hf' (id=10)
2026-03-05T09:26:55.419488304Z INFO: 2026-03-05 09:26:55 --------------------------------------------------
2026-03-05T09:26:55.419572105Z INFO: 2026-03-05 09:26:55 Created 33 auto-update job(s), starting processing...
2026-03-05T09:26:55.420155382Z INFO: 2026-03-05 09:26:55 [auto_update_in_background_task_1] Acquired semaphore. Available slots: 4/5
...
2026-03-05T09:35:04.459887331Z INFO: 2026-03-05 09:35:04 [auto_update_in_background_task_30] Completed successfully
2026-03-05T09:35:04.459933984Z INFO: 2026-03-05 09:35:04 [auto_update_in_background_task_30] Released semaphore. Available slots: 5/5
2026-03-05T09:35:04.471423827Z INFO: 2026-03-05 09:35:04 --------------------------------------------------
2026-03-05T09:35:04.471702934Z INFO: 2026-03-05 09:35:04 Running deduplication for 1 channel(s) with reindex: [6]
2026-03-05T09:35:04.471724916Z INFO: 2026-03-05 09:35:04 [deduplicate_dimensions_in_background_task_34] Acquired semaphore. Available slots: 4/5
2026-03-05T09:35:04.476420845Z INFO: 2026-03-05 09:35:04 Deduplicating available_dimensions for channel 6
2026-03-05T09:35:04.476531895Z INFO: 2026-03-05 09:35:04 creating langchain embeddings with the following params: ...
2026-03-05T09:35:04.504235577Z INFO: 2026-03-05 09:35:04 Initializing pgvector storage with following options: ...
2026-03-05T09:35:04.512037254Z INFO: 2026-03-05 09:35:04 Acquired exclusive locks on AvailableDimensions_6_metadata and AvailableDimensions_6
2026-03-05T09:35:04.518961784Z INFO: 2026-03-05 09:35:04 Remapped 2 metadata rows to keeper documents
2026-03-05T09:35:04.520781183Z INFO: 2026-03-05 09:35:04 Deleted 2 orphaned documents
2026-03-05T09:35:04.525029240Z INFO: 2026-03-05 09:35:04 Content deduplication completed: 2 metadata rows remapped, 2 documents deleted
2026-03-05T09:35:04.525193500Z INFO: 2026-03-05 09:35:04 Deduplicating special_dimensions for channel 6
2026-03-05T09:35:04.525341716Z INFO: 2026-03-05 09:35:04 creating langchain embeddings with the following params: ...
2026-03-05T09:35:04.556330604Z INFO: 2026-03-05 09:35:04 Initializing pgvector storage with following options: ...
2026-03-05T09:35:04.567479412Z INFO: 2026-03-05 09:35:04 Acquired exclusive locks on SpecialDimensions_6_metadata and SpecialDimensions_6
2026-03-05T09:35:04.572991397Z INFO: 2026-03-05 09:35:04 Remapped 0 metadata rows to keeper documents
2026-03-05T09:35:04.575890873Z INFO: 2026-03-05 09:35:04 Deleted 0 orphaned documents
2026-03-05T09:35:04.577614423Z INFO: 2026-03-05 09:35:04 Content deduplication completed: 0 metadata rows remapped, 0 documents deleted
2026-03-05T09:35:04.578218033Z INFO: 2026-03-05 09:35:04 Deduplication completed for channel 6
2026-03-05T09:35:04.578580573Z INFO: 2026-03-05 09:35:04 [deduplicate_dimensions_in_background_task_34] Completed successfully
2026-03-05T09:35:04.578597647Z INFO: 2026-03-05 09:35:04 [deduplicate_dimensions_in_background_task_34] Released semaphore. Available slots: 5/5
2026-03-05T09:35:04.578600052Z INFO: 2026-03-05 09:35:04 Deduplication complete
2026-03-05T09:35:04.578601806Z INFO: 2026-03-05 09:35:04 --------------------------------------------------
2026-03-05T09:35:04.611971757Z INFO: 2026-03-05 09:35:04 channel '{channel11}' (id=11): 4 NO_COMPLETED_VERSION
2026-03-05T09:35:04.612026779Z INFO: 2026-03-05 09:35:04 channel '{channel5}' (id=5): 7 NO_CHANGES, 10 NO_COMPLETED_VERSION
2026-03-05T09:35:04.612031456Z INFO: 2026-03-05 09:35:04 channel '{channel6}' (id=6): 1 REINDEX_TRIGGERED (1 COMPLETED), 7 CONFIG_UPDATED (7 COMPLETED)
2026-03-05T09:35:04.612129491Z INFO: 2026-03-05 09:35:04 channel 'statgpt-sample' (id=4): 2 NO_CHANGES
2026-03-05T09:35:04.612359561Z INFO: 2026-03-05 09:35:04 channel 'statgpt-sample-hf' (id=10): 2 NO_CHANGES
2026-03-05T09:35:04.612391830Z INFO: 2026-03-05 09:35:04 Auto-update complete: 33 succeeded, 0 failed out of 33 total
2026-03-05T09:35:04.612680730Z INFO: 2026-03-05 09:35:04 --------------------------------------------------
2026-03-05T09:35:04.612943892Z INFO: 2026-03-05 09:35:04 Batch auto-update script completed successfully

@Fedir-Yatsenko
Collaborator Author

Fedir-Yatsenko commented Mar 5, 2026

/deploy-review

The script itself was tested only locally; in the Review environment I verified that the admin backend still starts and runs.

GitHub actions run: 22713562504
Environment URL: review-environment | pipeline

@Fedir-Yatsenko Fedir-Yatsenko requested a review from kryachkow March 5, 2026 10:28
kryachkow
kryachkow previously approved these changes Mar 5, 2026
- Log exceptions from asyncio.gather in job processing and deduplication
- Move auto-update channel filtering from ChannelService to the script
- Remove unused get_auto_update_channels and selectinload import
- Use most_common() for deterministic result summary ordering
- Batch commit instead of per-iteration flush when creating jobs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
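
A per-channel summary line ordered with `Counter.most_common()` might be built along these lines. This is a sketch under assumptions: `format_channel_summary` and the `(result, reindex_status)` tuples are hypothetical and do not reflect the actual `AutoUpdateChannelResult` schema. `most_common()` sorts by descending count and keeps first-seen order for ties, which is what makes the output deterministic.

```python
from collections import Counter


def format_channel_summary(deployment_id, channel_id, results):
    """Build one summary line.

    `results` is a list of (auto_update_result, reindex_status) pairs, where
    reindex_status is None when no reindex version is attached (hypothetical shape).
    """
    result_counts = Counter(result for result, _ in results)
    parts = []
    # most_common() orders by count (descending); ties keep first-seen order.
    for result, count in result_counts.most_common():
        statuses = Counter(
            status for r, status in results if r == result and status is not None
        )
        part = f"{count} {result}"
        if statuses:
            breakdown = ", ".join(f"{n} {s}" for s, n in statuses.most_common())
            part += f" ({breakdown})"
        parts.append(part)
    return f"channel '{deployment_id}' (id={channel_id}): {', '.join(parts)}"


example = format_channel_summary(
    "statgpt-sample",
    1,
    [("NO_CHANGES", None)] * 3
    + [("REINDEX_TRIGGERED", "COMPLETED"), ("REINDEX_TRIGGERED", "FAILED")],
)
print(example)
```

With this input the sketch reproduces the example line from the PR description: `channel 'statgpt-sample' (id=1): 3 NO_CHANGES, 2 REINDEX_TRIGGERED (1 COMPLETED, 1 FAILED)`.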
Fedir-Yatsenko and others added 2 commits March 5, 2026 14:33
Sort channel IDs before asyncio.gather to ensure correct
error attribution in the zip loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
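
The reasoning behind sorting first can be illustrated with a small sketch. `asyncio.gather` returns results in the order of the awaitables passed in, so materializing the IDs into one sorted list and reusing that same list in `zip` pairs each exception with the channel that raised it. The names `deduplicate_channel` and `deduplicate_all`, and the simulated failure for channel 2, are hypothetical and only serve the example.

```python
from __future__ import annotations

import asyncio
import logging

logger = logging.getLogger("auto_update_sketch")


async def deduplicate_channel(channel_id: int) -> int:
    """Hypothetical stand-in for per-channel deduplication; channel 2 fails."""
    await asyncio.sleep(0)
    if channel_id == 2:
        raise RuntimeError("simulated failure")
    return channel_id


async def deduplicate_all(channel_ids: set[int]) -> dict[int, BaseException]:
    # Fix the order once and reuse it for both gather() and zip().
    ordered_ids = sorted(channel_ids)
    outcomes = await asyncio.gather(
        *(deduplicate_channel(cid) for cid in ordered_ids),
        return_exceptions=True,  # exceptions come back as values, in input order
    )
    failures: dict[int, BaseException] = {}
    for cid, outcome in zip(ordered_ids, outcomes):
        if isinstance(outcome, BaseException):
            logger.error("Deduplication failed for channel %s: %r", cid, outcome)
            failures[cid] = outcome
    return failures
```

Running `asyncio.run(deduplicate_all({3, 1, 2}))` logs one error and returns a dictionary whose only key is `2`.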
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Fedir-Yatsenko Fedir-Yatsenko merged commit 8f927f8 into development Mar 5, 2026
9 checks passed
@Fedir-Yatsenko Fedir-Yatsenko deleted the feat/178-add-auto-update-script-for-all-datasets-in-marked-channels branch March 5, 2026 13:21
Labels

enhancement (New feature or request), python (Pull requests that update python code)

Development

Successfully merging this pull request may close these issues.

Add auto-update script for all datasets in marked channels

2 participants