Commits
83 commits
8aad275
Set up customizable local processing service
vanessavmac Mar 23, 2025
61b45a4
Set up separate docker compose stack, rename ml backend services
vanessavmac Apr 3, 2025
4a03c7e
WIP: README.md
vanessavmac Apr 4, 2025
09d7dfb
Improve processing flow
vanessavmac Apr 5, 2025
996674e
fix: tests and postgres connection
vanessavmac Apr 5, 2025
ce973fc
Update READMEs with minimal/example setups
vanessavmac Apr 5, 2025
bf7178d
fix: transformers fixed version
vanessavmac Apr 5, 2025
41efa42
Add tests
vanessavmac Apr 5, 2025
78babeb
Typos, warn --> warnings
vanessavmac Apr 5, 2025
8d28d01
Add support for Darsa flat-bug
vanessavmac Apr 6, 2025
bb22514
chore: Change the Pipeline class name to FlatBugDetectorPipeline to a…
mohamedelabbas1996 Apr 7, 2025
1dbc5f0
Move README
vanessavmac Apr 8, 2025
fe1a9f4
Address comment tasks
vanessavmac Apr 13, 2025
7747f3a
Merge branch 'main' into 747-get-antenna-to-work-locally-on-laptops-f…
vanessavmac Apr 13, 2025
1978cbe
Update README
vanessavmac Apr 13, 2025
82ac82d
Pass in pipeline request config, properly cache models, simplifications
vanessavmac Apr 15, 2025
7d733f9
Pass in pipeline request config, properly cache models, simplifications
vanessavmac Apr 15, 2025
07d61d9
fix: update docker compose instructions & build path
mihow Apr 16, 2025
d129029
feat: use ["insect"] for the default zero-shot class
mihow Apr 16, 2025
76ce2d8
feat: try to use faster version of zero-shot detector
mihow Apr 16, 2025
035b952
feat: use gpu if available
mihow Apr 17, 2025
1230386
fix: update minimal docker compose build path
vanessavmac Apr 17, 2025
45dbacf
Add back crop_image_url
vanessavmac Apr 26, 2025
7361fb2
Support re-processing detections and skipping localizer
vanessavmac Apr 27, 2025
3f722c8
fix: correctly pass candidate labels for zero shot object detector
vanessavmac Apr 27, 2025
075a7ec
Support re-processing detections and skipping localizer
vanessavmac Apr 27, 2025
85c676d
fix: merge conflict
vanessavmac Apr 27, 2025
cbd7ae0
fix: allow empty pipeline request config
vanessavmac Apr 27, 2025
7d15ffb
fix: allow empty pipeline request config
vanessavmac Apr 27, 2025
c2881b4
clean up
vanessavmac Apr 27, 2025
14396ba
fix: ignore detection algorithm during reprocessing
vanessavmac Apr 29, 2025
6613366
remove flat bug
vanessavmac Apr 29, 2025
2cf0c0a
feat: only use zero shot and HF classifier algorithms
vanessavmac Apr 29, 2025
1dbf3b1
clean up
vanessavmac Apr 29, 2025
c82c076
Merge branch '747-get-antenna-to-work-locally-on-laptops-for-panama-t…
vanessavmac Apr 29, 2025
fb874c4
Function for creating detection instances from requests
vanessavmac May 17, 2025
f2ef5ff
Add reprocessing to minimal app
vanessavmac May 17, 2025
b6ce90f
Merge branch 'main' into 706-support-for-reprocessing-detections-and-…
vanessavmac May 17, 2025
8fe8b1d
Merge branch 'main' into 706-support-for-reprocessing-detections-and-…
vanessavmac Jun 28, 2025
d0f4f26
Add re-processing test
vanessavmac Jun 28, 2025
fc8470d
Merge branch 'main' into 706-support-for-reprocessing-detections-and-…
mihow Jul 7, 2025
3d3b820
Fix requirements
vanessavmac Jul 12, 2025
5c7af56
Address review comments
vanessavmac Jul 12, 2025
e7e579e
Only open source image once
vanessavmac Jul 12, 2025
cb74eac
Merge branch 'main' into 706-support-for-reprocessing-detections-and-…
vanessavmac Jul 12, 2025
ffea1aa
Setup processing service celery workers; basic task queueing/processing
vanessavmac Jul 28, 2025
6cb852b
Save results; update job progress
vanessavmac Aug 4, 2025
d0380b9
Improvements to handle large batches
vanessavmac Aug 11, 2025
2594049
Merge branch 'main' of github.com:RolnickLab/antenna into 515-new-asy…
mihow Aug 16, 2025
57e6691
Add batch processing unit test; bulk db updates; fix duplicate logs; …
vanessavmac Aug 30, 2025
f785dda
Fix for "get() returned more than one AlgorithmCategoryMap" error
vanessavmac Aug 30, 2025
0a22d53
Allow synchronous
vanessavmac Aug 31, 2025
d139734
Fix job progress if no images are submitted
vanessavmac Aug 31, 2025
a83dd20
Subscribe antenna celeryworker to all pipeline queues; add more task …
vanessavmac Sep 2, 2025
0707433
Rename celery to antenna queue; only query ml task records created af…
vanessavmac Sep 4, 2025
0103b7e
Merge branch 'main' into 515-new-async-distributed-ml-backend
vanessavmac Sep 4, 2025
3a3b881
Re-subscribe to queues before processing images; fix test issues
vanessavmac Sep 4, 2025
7c86612
Add missing migration; rename antenna celeryworker
vanessavmac Sep 4, 2025
c016d47
Use transaction.on_commit with all async celery tasks
vanessavmac Sep 5, 2025
fa510ed
Test clean up
vanessavmac Sep 5, 2025
6da55a9
feat: isolate the CI / test compose stack from other containers
mihow Sep 8, 2025
652f47f
feat: fix isoloated CI stack (rely on compose project name)
mihow Sep 8, 2025
f3b588a
fix: run migrations from celery start command, other fixes for tests
mihow Sep 9, 2025
5654ed0
fix: rabbitmq credentials for tests & local dev
mihow Sep 9, 2025
875d3cb
draft: methods for inspecting celery tasks during tests
mihow Sep 9, 2025
be351f8
feat: add health check; fix: rabbitmq credentials, minio ci set up
vanessavmac Sep 13, 2025
e550531
draft: unit test changes
vanessavmac Sep 15, 2025
2fa57ef
draft: some more unit test updates (working up to process_pipeline_re…
vanessavmac Sep 15, 2025
5c21be6
test fix: check the job status synchronously additional error logging
vanessavmac Sep 22, 2025
a1e8fa3
test fix: check the job status synchronously, batchify the save_resul…
vanessavmac Oct 17, 2025
57fba22
feat: celerybeat task to prevent dangling ML jobs
vanessavmac Oct 17, 2025
f8e374a
Merge branch 'main' into 515-new-async-distributed-ml-backend
vanessavmac Oct 17, 2025
cd593bc
fix: migration conflicts
vanessavmac Oct 17, 2025
bde423a
Address copilot review
vanessavmac Oct 17, 2025
a1238dc
fix: passing test checks
vanessavmac Oct 17, 2025
03390e2
revoke dangling jobs
vanessavmac Oct 18, 2025
460f27c
feat: move ML celeryworker to separate PR
vanessavmac Oct 18, 2025
6f87ca4
clean up unit test
vanessavmac Oct 18, 2025
bd86042
Admin action to revoke ml tasks; clean up logs; add tests for stale m…
vanessavmac Nov 4, 2025
b552d1e
fix: increase dangling job timeout and timezone bug
vanessavmac Nov 8, 2025
41b8c16
fix: pipeline results error handling
vanessavmac Nov 10, 2025
caa11db
Merge branch 'main' into 515-new-async-distributed-ml-backend
mihow Nov 11, 2025
0fd6369
Merge branch 'main' into 515-new-async-distributed-ml-backend
vanessavmac Nov 17, 2025
8 changes: 8 additions & 0 deletions ami/jobs/admin.py
@@ -66,3 +66,11 @@ class MLTaskRecordAdmin(AdminBase):
"task_name",
"status",
)

@admin.action()
def kill_task(self, request: HttpRequest, queryset: QuerySet[MLTaskRecord]) -> None:
for ml_task_record in queryset:
ml_task_record.kill_task()
self.message_user(request, f"Killed {queryset.count()} ML task(s).")

actions = [kill_task]
33 changes: 33 additions & 0 deletions ami/jobs/migrations/0023_alter_job_last_checked_alter_mltaskrecord_status.py
@@ -0,0 +1,33 @@
# Generated by Django 4.2.10 on 2025-11-04 11:44

import datetime

from django.db import migrations, models


class Migration(migrations.Migration):
    dependencies = [
        ("jobs", "0022_job_last_checked"),
    ]

    operations = [
        migrations.AlterField(
            model_name="job",
            name="last_checked",
            field=models.DateTimeField(blank=True, default=datetime.datetime.now, null=True),
        ),
        migrations.AlterField(
            model_name="mltaskrecord",
            name="status",
            field=models.CharField(
                choices=[
                    ("PENDING", "PENDING"),
                    ("STARTED", "STARTED"),
                    ("SUCCESS", "SUCCESS"),
                    ("FAIL", "FAIL"),
                    ("REVOKED", "REVOKED"),
                ],
                default="STARTED",
                max_length=255,
            ),
        ),
Comment on lines +13 to +32
Contributor

⚠️ Potential issue | 🟠 Major

Use a timezone-aware default for last_checked.

DateTimeField(default=datetime.datetime.now) returns naïve datetimes when USE_TZ=True, triggering warnings and risking incorrect conversions. Please switch to django.utils.timezone.now, which returns an aware datetime.

Suggested fix:

-import datetime
-from django.db import migrations, models
+import datetime
+from django.db import migrations, models
+from django.utils import timezone
@@
-            field=models.DateTimeField(blank=True, default=datetime.datetime.now, null=True),
+            field=models.DateTimeField(blank=True, default=timezone.now, null=True),

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In ami/jobs/migrations/0023_alter_job_last_checked_alter_mltaskrecord_status.py
around lines 13 to 32, the migrations.AlterField for Job.last_checked uses
datetime.datetime.now which produces naive datetimes under USE_TZ=True; replace
the default with django.utils.timezone.now to return timezone-aware datetimes,
and update the import at the top of the migration file to import timezone.now
(or import from django.utils import timezone and use timezone.now) so the
migration default is timezone-aware.

    ]
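
For reference, a minimal sketch of what the reviewer's timezone-aware suggestion would look like as a standalone migration (same dependency as above; this is an illustration, not a committed change):

# Sketch only: the timezone-aware variant suggested in the review, not part of the PR.
from django.db import migrations, models
from django.utils import timezone


class Migration(migrations.Migration):
    dependencies = [
        ("jobs", "0022_job_last_checked"),
    ]

    operations = [
        migrations.AlterField(
            model_name="job",
            name="last_checked",
            field=models.DateTimeField(blank=True, default=timezone.now, null=True),
        ),
    ]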
26 changes: 19 additions & 7 deletions ami/jobs/models.py
@@ -396,9 +396,9 @@ def check_inprogress_subtasks(cls, job: "Job") -> bool:
inprogress_subtask.task_id = save_results_task.id
task_id = save_results_task.id
inprogress_subtask.save()
job.logger.info(f"Started save results task {inprogress_subtask.task_id}")
job.logger.debug(f"Started save results task {inprogress_subtask.task_id}")
else:
job.logger.info("A save results task is already in progress, will not start another one yet.")
job.logger.debug("A save results task is already in progress, will not start another one yet.")
continue

task = AsyncResult(task_id)
@@ -407,12 +407,12 @@ def check_inprogress_subtasks(cls, job: "Job") -> bool:
inprogress_subtask.status = (
MLSubtaskState.SUCCESS.name if task.successful() else MLSubtaskState.FAIL.name
)
inprogress_subtask.raw_traceback = task.traceback

if task.traceback:
# TODO: Error logs will have many tracebacks
# could add some processing to provide a concise error summary
job.logger.error(f"Subtask {task_name} ({task_id}) failed: {task.traceback}")
Collaborator

This logs the error, but then still tries to parse a successful result. Can you mark the task as failed and then continue to the next one? The status check did its job correctly! But the subtask failed. Can you update the MLTaskRecord to say it failed?

inprogress_subtask.status = MLSubtaskState.FAIL.name
Collaborator

Does this get saved?

Collaborator Author

Yup, with continue we move on to checking the next subtask. And only every 10 subtasks do we do a bulk update to the tasks.

antenna/ami/jobs/models.py

Lines 441 to 452 in bd86042

if len(inprogress_subtasks_to_update) >= 10:
    MLTaskRecord.objects.bulk_update(
        inprogress_subtasks_to_update,
        [
            "status",
            "raw_traceback",
            "raw_results",
            "num_captures",
            "num_detections",
            "num_classifications",
        ],
    )
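
As an aside, an accumulate-and-flush pattern like this usually also needs a final flush after the loop so the last partial batch is not lost. A generic sketch with illustrative names, not the project's actual code:

# Generic accumulate-and-flush sketch; names are illustrative.
BATCH_SIZE = 10
pending_updates = []

for record in records_to_check:  # hypothetical iterable of MLTaskRecord instances
    record.status = compute_status(record)  # hypothetical status check
    pending_updates.append(record)
    if len(pending_updates) >= BATCH_SIZE:
        MLTaskRecord.objects.bulk_update(pending_updates, ["status", "raw_traceback"])
        pending_updates.clear()

if pending_updates:  # flush the remainder after the loop
    MLTaskRecord.objects.bulk_update(pending_updates, ["status", "raw_traceback"])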

inprogress_subtask.raw_traceback = task.traceback
continue

results_dict = task.result
if task_name == MLSubtaskNames.process_pipeline_request.name:
@@ -505,7 +505,7 @@ def check_inprogress_subtasks(cls, job: "Job") -> bool:
f"{inprogress_subtasks.count()} inprogress subtasks remaining out of {total_subtasks} total subtasks."
)
inprogress_task_ids = [task.task_id for task in inprogress_subtasks]
job.logger.info(f"Subtask ids: {inprogress_task_ids}") # TODO: remove this? not very useful to the user
job.logger.debug(f"Subtask ids: {inprogress_task_ids}")
return False
else:
job.logger.info("No inprogress subtasks left.")
@@ -999,6 +999,7 @@ class MLSubtaskState(str, OrderedEnum):
STARTED = "STARTED"
SUCCESS = "SUCCESS"
FAIL = "FAIL"
REVOKED = "REVOKED"


class MLTaskRecord(BaseModel):
@@ -1041,6 +1042,17 @@ def clean(self):
        if self.status == MLSubtaskState.PENDING.name and self.task_name != MLSubtaskNames.save_results.name:
            raise ValueError(f"{self.task_name} tasks cannot have a PENDING status.")

    def kill_task(self):
        """
        Kill the celery task associated with this MLTaskRecord.
        """
        from config.celery_app import app as celery_app

        if self.task_id:
            celery_app.control.revoke(self.task_id, terminate=True, signal="SIGTERM")
            self.status = MLSubtaskState.REVOKED.name
            self.save(update_fields=["status"])


class Job(BaseModel):
"""A job to be run by the scheduler"""
@@ -1050,7 +1062,7 @@ class Job(BaseModel):

name = models.CharField(max_length=255)
queue = models.CharField(max_length=255, default="default")
last_checked = models.DateTimeField(null=True, blank=True)
last_checked = models.DateTimeField(null=True, blank=True, default=datetime.datetime.now)
scheduled_at = models.DateTimeField(null=True, blank=True)
started_at = models.DateTimeField(null=True, blank=True)
finished_at = models.DateTimeField(null=True, blank=True)
93 changes: 92 additions & 1 deletion ami/jobs/tests.py
@@ -1,4 +1,5 @@
# from rich import print
import datetime
import logging
import time

@@ -406,4 +407,94 @@ def get_ml_job_subtask_details(task_name, job):
        self.assertEqual(job.status, JobState.SUCCESS.value)
        self.assertEqual(job.progress.summary.progress, 1)
        self.assertEqual(job.progress.summary.status, JobState.SUCCESS)
        job.save()


class TestStaleMLJob(TransactionTestCase):
    def setUp(self):
        self.project = Project.objects.first()  # get the original test project
        assert self.project
        self.source_image_collection = self.project.sourceimage_collections.get(name="Test Source Image Collection")
        self.pipeline = Pipeline.objects.get(slug="constant")

        # remove existing detections from the source image collection
        for image in self.source_image_collection.images.all():
            image.detections.all().delete()
            image.save()

    def test_kill_dangling_ml_job(self):
        """Test killing a dangling ML job."""
        from ami.ml.tasks import check_dangling_ml_jobs
        from config import celery_app

        job = Job.objects.create(
            job_type_key=MLJob.key,
            project=self.project,
            name="Test dangling job",
            delay=0,
            pipeline=self.pipeline,
            source_image_collection=self.source_image_collection,
        )

        job.run()
        connection.commit()
        job.refresh_from_db()

        # Simulate last_checked being older than 5 minutes
        job.last_checked = datetime.datetime.now() - datetime.timedelta(minutes=10)
        job.save(update_fields=["last_checked"])

        # Run the dangling job checker
        check_dangling_ml_jobs()

        # Refresh job from DB
        job.refresh_from_db()

        # Make sure no tasks are still in progress
        for ml_task_record in job.ml_task_records.all():
            self.assertEqual(ml_task_record.status, MLSubtaskState.REVOKED.value)

            # Also check celery queue to make sure all tasks have been revoked
            task_id = ml_task_record.task_id

            inspector = celery_app.control.inspect()
            active = inspector.active() or {}
            reserved = inspector.reserved() or {}

            not_running = all(
                task_id not in [t["id"] for w in active.values() for t in w] for w in active.values()
            ) and all(task_id not in [t["id"] for w in reserved.values() for t in w] for w in reserved.values())
Comment on lines +466 to +468
Contributor

⚠️ Potential issue | 🟠 Major

Fix redundant loop in task presence check.

The logic for verifying that task_id is not running has a redundant outer loop. The inner comprehension already iterates over all workers, so the outer for w in active.values() and for w in reserved.values() are unnecessary and may cause incorrect evaluation.

Apply this diff to simplify and correct the logic:

-            not_running = all(
-                task_id not in [t["id"] for w in active.values() for t in w] for w in active.values()
-            ) and all(task_id not in [t["id"] for w in reserved.values() for t in w] for w in reserved.values())
+            active_task_ids = [t["id"] for tasks in active.values() for t in tasks]
+            reserved_task_ids = [t["id"] for tasks in reserved.values() for t in tasks]
+            not_running = task_id not in active_task_ids and task_id not in reserved_task_ids
🤖 Prompt for AI Agents
In ami/jobs/tests.py around lines 466 to 468, the check uses an unnecessary
outer loop causing redundant iteration and potential incorrect results; replace
the double-generator form with a single membership check per collection so that
you test task_id not in [t["id"] for w in active.values() for t in w] and
task_id not in [t["id"] for w in reserved.values() for t in w] (i.e. remove the
extra "for w in ... " wrappers and evaluate each list comprehension once,
combining them with an and).


            self.assertTrue(not_running)

        self.assertEqual(job.status, JobState.REVOKED.value)

    def test_kill_task_prevents_execution(self):
        from ami.jobs.models import Job, MLSubtaskNames, MLTaskRecord
        from ami.ml.models.pipeline import process_pipeline_request
        from config import celery_app

        logger.info("Testing that killing a task prevents its execution.")
        result = process_pipeline_request.apply_async(args=[{}, 1], countdown=5)
        logger.info(f"Scheduled task with id {result.id} to run in 5 seconds.")
        task_id = result.id

        job = Job.objects.create(
            job_type_key=MLJob.key,
            project=self.project,
            name="Test killing job tasks",
            delay=0,
            pipeline=self.pipeline,
            source_image_collection=self.source_image_collection,
        )

        ml_task_record = MLTaskRecord.objects.create(
            job=job, task_name=MLSubtaskNames.process_pipeline_request.value, task_id=task_id
        )
        logger.info(f"Killing task {task_id} immediately.")
        ml_task_record.kill_task()

        async_result = celery_app.AsyncResult(task_id)
        time.sleep(5)  # the REVOKED status isn't visible until the task is actually run after the delay

        self.assertIn(async_result.state, ["REVOKED"])
        self.assertEqual(ml_task_record.status, "REVOKED")
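
For reference, the membership check suggested in the review above, factored into a small standalone helper (a sketch; the helper name is illustrative and the Celery app import mirrors the one used in kill_task):

# Sketch of a helper that checks whether a Celery task id is still queued or running.
from config.celery_app import app as celery_app


def task_is_queued_or_running(task_id: str) -> bool:
    inspector = celery_app.control.inspect()
    active = inspector.active() or {}  # {worker_name: [task_info, ...]}
    reserved = inspector.reserved() or {}
    active_ids = {t["id"] for tasks in active.values() for t in tasks}
    reserved_ids = {t["id"] for tasks in reserved.values() for t in tasks}
    return task_id in active_ids or task_id in reserved_ids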
80 changes: 1 addition & 79 deletions ami/ml/models/pipeline.py
@@ -15,7 +15,6 @@
from urllib.parse import urljoin

import requests
from celery.result import AsyncResult
from django.db import models, transaction
from django.utils.text import slugify
from django.utils.timezone import now
@@ -355,7 +354,7 @@ def handle_async_process_images(
)
ml_task_record.source_images.set(source_image_batches[idx])
ml_task_record.save()
task_logger.info(f"Created MLTaskRecord {ml_task_record} for task {task_id}")
task_logger.debug(f"Created MLTaskRecord {ml_task_record} for task {task_id}")
else:
task_logger.warning("No job ID provided, MLTaskRecord will not be created.")

@@ -1272,7 +1271,6 @@ def save_results(self, results: PipelineResultsResponse, job_id: int | None = No
def save_results_async(self, results: PipelineResultsResponse, job_id: int | None = None):
# Returns an AsyncResult
results_json = results.json()
logger.info("Submitting save results task...")
return save_results.delay(results_json=results_json, job_id=job_id)

def save(self, *args, **kwargs):
@@ -1282,79 +1280,3 @@ def save(self, *args, **kwargs):
unique_suffix = str(uuid.uuid4())[:8]
self.slug = f"{slugify(self.name)}-v{self.version}-{unique_suffix}"
return super().save(*args, **kwargs)

    def watch_single_batch_task(
        self,
        task_id: str,
        task_logger: logging.Logger | None = None,
    ) -> PipelineResultsResponse | None:
        """
        Helper function to watch a single batch process task and return the result.
        """
        task_logger = task_logger or logger

        result = AsyncResult(task_id)
        if result.ready():
            task_logger.info(f"Task {task_id} completed with status: {result.status}")
            if result.successful():
                task_logger.info(f"Task {task_id} completed successfully with result: {result.result}")
                task_logger.warning(f"Task {task_id} result: {result.result}")
                return PipelineResultsResponse(**result.result)
            else:
                task_logger.error(f"Task {task_id} failed with result: {result.result}")
                return PipelineResultsResponse(
                    pipeline="",
                    algorithms={},
                    total_time=0.0,
                    source_images=[],
                    detections=[],
                    errors=f"Task {task_id} failed with result: {result.result}",
                )
        else:
            task_logger.warning(f"Task {task_id} is not ready yet.")
            return None

    def watch_batch_tasks(
        self,
        task_ids: list[str],
        timeout: int = 300,
        poll_interval: int = 5,
        task_logger: logging.Logger | None = None,
    ) -> PipelineResultsResponse:
        """
        Helper function to watch batch process tasks and aggregate results into a single PipelineResultsResponse.

        @TODO: this is only used by the test_process view, keep this as just a useful helper
        function for that view? or can we somehow use it in the ML job too?
        """
        task_logger = task_logger or logger
        start_time = time.time()
        remaining = set(task_ids)

        results = None
        while remaining and (time.time() - start_time) < timeout:
            for task_id in list(remaining):
                result = self.watch_single_batch_task(task_id, task_logger=task_logger)
                if result is not None:
                    if not results:
                        results = result
                    else:
                        results.combine_with([result])
                    remaining.remove(task_id)
            time.sleep(poll_interval)

        if remaining and logger:
            logger.error(f"Timeout reached. The following tasks didn't finish: {remaining}")

        if results:
            results.total_time = time.time() - start_time
            return results
        else:
            return PipelineResultsResponse(
                pipeline="",
                algorithms={},
                total_time=0.0,
                source_images=[],
                detections=[],
                errors="No tasks completed successfully.",
            )
8 changes: 8 additions & 0 deletions ami/ml/tasks.py
@@ -141,6 +141,11 @@ def check_ml_job_status(ml_job_id: int):
job.update_status(JobState.FAILURE)
job.finished_at = datetime.datetime.now()
job.save()

# Remove remaining tasks from the queue
for ml_task_record in job.ml_task_records.all():
Collaborator

I think we have a Cancel action for the parent Job. Can you move or add this logic there?

ml_task_record.kill_task()

raise Exception(error_msg)
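
A rough sketch of the reviewer's suggestion above, revoking subtasks from the parent Job's cancel action instead (hypothetical method body, mirroring the dangling-job handling below; the project's actual cancel action may differ):

# Hypothetical Job.cancel() sketch, not the project's implementation.
def cancel(self):
    for ml_task_record in self.ml_task_records.all():
        ml_task_record.kill_task()
    self.update_status(JobState.REVOKED)
    self.finished_at = datetime.datetime.now()
    self.save()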


@@ -171,5 +176,8 @@ def check_dangling_ml_jobs():
job.update_status(JobState.REVOKED)
job.finished_at = datetime.datetime.now()
job.save()

for ml_task_record in job.ml_task_records.all():
ml_task_record.kill_task()
else:
logger.info(f"Job {job.pk} is active. Last checked at {last_checked}.")