
Commit d70def6

Merge pull request #328 from zooniverse/edit-batch-agg

BatchAgg Edits: May 2025

2 parents 85a20a8 + 575af66

File tree

4 files changed: +149 -79 lines changed

docs/user_guide.rst

Lines changed: 70 additions & 26 deletions
@@ -58,10 +58,15 @@ Uploading non-image media types
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 If you wish to upload subjects with non-image media (e.g. audio or video),
-you will need to make sure you have the ``libmagic`` library installed. If you
-don't already have ``libmagic``, please see the `dependency information for
-python-magic <https://github.com/ahupp/python-magic#dependencies>`_ for more
-details.
+it is desirable to have the ``libmagic`` library installed for type detection.
+If you don't already have ``libmagic``, please see the `dependency information
+for python-magic <https://github.com/ahupp/python-magic#installation>`_ for
+more details.
+
+If ``libmagic`` is not installed, assignment of MIME types (e.g., image/jpeg,
+video/mp4, text/plain, application/json, etc.) will be based on file extensions.
+Be aware that if file names and extensions aren't accurate, this could lead to
+issues when the media is loaded.

 Usage Examples
 --------------
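
For context, the upload flow that the revised paragraph documents can be sketched roughly as follows; this is a minimal sketch assuming an existing project and subject set (the credentials, IDs, and file name are placeholders), and when ``libmagic`` is unavailable the MIME type falls back to the file extension::

    from panoptes_client import Panoptes, Project, Subject, SubjectSet

    Panoptes.connect(username='example', password='example')  # placeholder credentials

    project = Project.find(1234)         # hypothetical project ID
    subject_set = SubjectSet.find(5678)  # hypothetical subject set ID

    subject = Subject()
    subject.links.project = project
    subject.add_location('recording.mp3')  # non-image media; type detected by libmagic or by extension
    subject.metadata.update({'source': 'example recording'})
    subject.save()

    subject_set.add(subject)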
@@ -414,58 +419,97 @@ You can also pass an optional `new_subject_set_name` parameter and this would be
 
     Project(project_id).copy(new_subject_set_name='My New Subject Set')
 
-Programmatic Data Exports
-~~~~~~~~~~~~~~~~~~~~~~~~~
+Data Exports
+~~~~~~~~~~~~
 The Panoptes Python Client allows you to generate, describe, and download data exports (e.g., classifications, subjects, workflows) via the Python ``panoptes_client`` library.
 
-Multiple types of exports can be generated using the Python Client, including project-level products (classifications, subjects, workflows) as smaller scale classification exports (for workflows and subject sets).
+Multiple types of exports can be generated using the Python Client, including project-level products (classifications, subjects, workflows) and smaller scale classification exports (for workflows and subject sets).
 For the examples below, we will demonstrate commands for a project wide classifications export, but these functions work for any export type.
 
 **Get Exports**
 
-As the name implies, this method downloads a data export over HTTP. This uses the `get_export` method and can be called by passing in the following parameters::
+As the name implies, this method downloads a data export over HTTP. This uses the `get_export` method and can be called by passing in the following parameters:
 
-    export_type #string specifying which type of export should be downloaded
+* *export_type*: string specifying which type of export should be downloaded.
+* *generate*: a boolean specifying whether to generate a new export and wait for it to be ready, or to just download the latest existing export. Default is False.
+* *wait*: a boolean specifying whether to wait for an in-progress export to finish, if there is one. Has no effect if `generate` is true (wait will occur in this case). Default is False.
+* *wait_timeout*: the number of seconds to wait if `wait` is True or `generate` is True. Has no effect if `wait` and `generate` are both False. Default is None (wait indefinitely).
 
-    generate #a boolean specifying if to generate a new export and wait for it to be ready, or to just download the latest existing export
+Examples::
 
-    wait #a boolean specifying whether to wait for an in-progress export to finish, if there is one. Has no effect if generate is true.
+    # Fetch existing export
+    classification_export = Project(1234).get_export('classifications')
 
-    wait_timeout #is the number of seconds to wait if wait is True. Has no effect if wait is False or if generate is True.
+    # Generate export, wait indefinitely for result to complete
+    classification_export = Project(1234).get_export('classifications', generate=True)
 
-    classification_export = Project(project_id).get_export(export_type="classifications")
+    # Fetch export currently being processed, wait up to 600 seconds for export to complete
+    classification_export = Project(1234).get_export('classifications', wait=True, wait_timeout=600)
 
 The returned Response object has two additional attributes as a convenience for working with the CSV content; `csv_reader` and `csv_dictreader`, which are wrappers for `csv.reader()` and `csv.DictReader` respectively.
 These wrappers take care of correctly decoding the export content for the CSV parser::
 
     classification_export = Project(1234).get_export('classifications')
     for row in classification_export.csv_dictreader():
-    print(row)
+        print(row)
 
 **Generate Exports**
 
 As the name implies, this method generates/starts a data export. This uses the `generate_export` method and can be called by passing in the `export_type` parameter::
 
-    export_info = Project(project_id).generate_export(export_type='classifications')
+    export_info = Project(1234).generate_export('classifications')
+
+This kicks off the export generation process and returns `export_info` as a dictionary containing the metadata on the selected export.
 
-This would return `export_info` as a dictionary containing the metadata on the selected export
+**Describing Exports**
 
-**Wait Exports**
+This method fetches information/metadata about a specific type of export. This uses the `describe_export` method and can be called by passing the `export_type` (e.g., classifications, subjects) this way::
 
-As the name implies, this method blocks/waits until an in-progress export is ready. It uses the `wait_export` method and can be called passing the following parameters::
+    export_info = Project(1234).describe_export('classifications')
 
-    export_type #string specifying which type of export should be downloaded
+This would return `export_info` as a dictionary containing the metadata on the selected export.
 
-    timeout #is the maximum number of seconds to wait.
+Subject Set Classification Exports
+++++++++++++++++++++++++++++++++++
 
-    export_info = Project(project_id).wait_export(export_type='classifications')
+As mentioned above, it is possible to request a classifications export at the project, workflow, or subject set scope.
+For the subject set classification export, classifications are included in the export if they satisfy two selection criteria:
 
-This would return `export_info` as a dictionary containing the metadata on the selected export and throw a `PanoptesAPIException` once the time limit is exceeded and the export is not ready
+1. The subject referenced in the classification is a member of the relevant subject set.
+2. The relevant subject set is currently linked to the workflow referenced in the classification.
 
-**Describing Exports**
+Example Usage::
+
+    # For a SubjectSet, check which Workflows it is currently linked to
+    subject_set = SubjectSet.find(1234)
+    for wf in subject_set.links.workflows:
+        print(wf.id, wf.display_name)
+
+    # Generate Export
+    subject_set_classification_export = subject_set.get_export('classifications', generate=True)
+
+Automated Aggregation of Classifications
+++++++++++++++++++++++++++++++++++++++++
+
+The Zooniverse supports research teams by maintaining the ``panoptes_aggregation`` Python package
+(see `docs <https://aggregation-caesar.zooniverse.org/docs>`_ and `repo <https://github.com/zooniverse/aggregation-for-caesar>`_).
+This software requires local installation to run, which can be a deterrent for its use.
+As an alternative to installing and running this aggregation code, we provide a Zooniverse-hosted service for producing aggregated results for simple datasets.
+This "batch aggregation" feature is built to perform simple workflow-level data aggregation that uses baseline extractors and reducers without any custom configuration.
+Please see :py:meth:`.Workflow.run_aggregation` and :py:meth:`.Workflow.get_batch_aggregation_links` docstrings for full details.
+
+Example Usage::
+
+    # Generate input data exports: workflow-level classification export and project-level workflows export
+    Workflow(1234).generate_export('classifications')
+    Project(2345).generate_export('workflows')
 
-This method fetches information/metadata about a specific type of export. This uses the `describe_export` method and can be called by passing in the export_type(classifications, subject_sets) this way::
+    # Request batch aggregation data product
+    Workflow(1234).run_aggregation()
 
-    export_info = Project(project_id).describe_export(export_type='classifications')
+    # Fetch batch aggregation download URLs
+    urls = Workflow(1234).get_batch_aggregation_links()
+    print(urls)
 
-This would return `export_info` as a dictionary containing the metadata on the selected export
+    # Load Reductions CSV using Pandas
+    pd.read_csv(urls['reductions'])
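
Putting the documented pieces together, the batch-aggregation flow can be sketched end to end; this is a minimal sketch that assumes the ``pandas`` package is installed, that the run has had time to finish, and that the credentials, IDs, and the 'completed' status string are placeholders rather than confirmed API values::

    import pandas as pd
    from panoptes_client import Panoptes, Project, Workflow

    Panoptes.connect(username='example', password='example')  # placeholder credentials

    # Generate the input exports consumed by the aggregation service
    Workflow(1234).generate_export('classifications')
    Project(2345).generate_export('workflows')

    # Start a batch aggregation run (defaults to the logged-in user)
    Workflow(1234).run_aggregation()

    # Later, check on the run and load the reductions CSV
    status = Workflow(1234).get_batch_aggregation_status()
    if status == 'completed':  # assumed status value; inspect the returned string in practice
        urls = Workflow(1234).get_batch_aggregation_links()
        reductions = pd.read_csv(urls['reductions'])
        print(reductions.head())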

panoptes_client/panoptes.py

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@
 if os.environ.get('PANOPTES_DEBUG'):
     logging.basicConfig(level=logging.DEBUG)
 else:
-    logging.basicConfig()
+    logging.basicConfig(level=logging.INFO)


 class Panoptes(object):
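
With INFO now the default level, messages such as the new "Aggregation exists for Workflow ..." notice in ``workflow.py`` are shown out of the box. A minimal sketch of adjusting that verbosity from client code; the logger name is taken from ``workflow.py`` and everything else is standard-library ``logging``::

    import logging
    import os

    # Opt in to DEBUG output by setting PANOPTES_DEBUG (any non-empty value)
    # before panoptes_client is imported, since basicConfig runs at import time.
    os.environ['PANOPTES_DEBUG'] = 'true'

    from panoptes_client import Panoptes  # noqa: E402

    # Alternatively, after import, silence the client's INFO messages:
    logging.getLogger('panoptes_client').setLevel(logging.WARNING)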

panoptes_client/tests/test_workflow.py

Lines changed: 38 additions & 29 deletions
@@ -216,57 +216,66 @@ def setUp(self):
         self.instance = Workflow(1)
         self.mock_user_id = 1
 
-    def _mock_aggregation(self):
-        mock_aggregation = MagicMock()
-        mock_aggregation.object_count = 1
-        mock_aggregation.next = MagicMock(return_value=MagicMock(id=1))
-        return mock_aggregation
-
     @patch.object(Aggregation, 'where')
-    @patch.object(Aggregation, 'find')
-    def test_run_aggregation_with_user_object(self, mock_find, mock_where):
-        mock_where.return_value = self._mock_aggregation()
-
+    def test_run_aggregation_existing(self, mock_where):
         mock_current_agg = MagicMock()
-        mock_find.return_value = mock_current_agg
+        mock_current_agg.delete = MagicMock()
+
+        mock_aggregations = MagicMock()
+        mock_aggregations.object_count = 1
+        mock_aggregations.__next__.return_value = mock_current_agg
+        mock_where.return_value = mock_aggregations
 
         result = self.instance.run_aggregation(self.mock_user_id, False)
 
+        mock_current_agg.delete.assert_not_called()
         self.assertEqual(result, mock_current_agg)
 
-    @patch.object(Aggregation, 'find')
     @patch.object(Aggregation, 'where')
     @patch.object(Aggregation, 'save')
-    def test_run_aggregation_with_delete_if_true(self, mock_save, mock_where, mock_find):
-        mock_where.return_value = self._mock_aggregation()
-
+    def test_run_aggregation_existing_and_delete(self, mock_save, mock_where):
         mock_current_agg = MagicMock()
         mock_current_agg.delete = MagicMock()
-        mock_find.return_value = mock_current_agg
 
-        mock_save_func = MagicMock()
+        mock_aggregations = MagicMock()
+        mock_aggregations.object_count = 1
+        mock_aggregations.__next__.return_value = mock_current_agg
+        mock_where.return_value = mock_aggregations
 
+        mock_save_func = MagicMock()
         mock_save.return_value = mock_save_func()
-        self.instance.run_aggregation(self.mock_user_id, True)
 
-        mock_current_agg.delete.assert_called_once()
+        result = self.instance.run_aggregation(self.mock_user_id, True)
 
+        mock_current_agg.delete.assert_called_once()
         mock_save_func.assert_called_once()
+        self.assertNotEqual(result, mock_current_agg)
 
-    @patch.object(Workflow, 'get_batch_aggregations')
-    def test_get_agg_property(self, mock_get_batch_aggregations):
+    @patch.object(Aggregation, 'where')
+    def test_get_batch_aggregation(self, mock_where):
+        mock_current_agg = MagicMock()
+        mock_aggregations = MagicMock()
+        mock_aggregations.__next__.return_value = mock_current_agg
+        mock_where.return_value = mock_aggregations
+
+        result = self.instance.get_batch_aggregation()
+
+        self.assertEqual(result, mock_current_agg)
+
+    @patch.object(Aggregation, 'where')
+    def test_get_batch_aggregation_failure(self, mock_where):
+        mock_where.return_value = iter([])
+
+        with self.assertRaises(PanoptesAPIException):
+            self.instance.get_batch_aggregation()
+
+    @patch.object(Workflow, 'get_batch_aggregation')
+    def test_get_agg_property(self, mock_get_batch_aggregation):
         mock_aggregation = MagicMock()
         mock_aggregation.test_property = 'returned_test_value'
 
-        mock_get_batch_aggregations.return_value = iter([mock_aggregation])
+        mock_get_batch_aggregation.return_value = mock_aggregation
 
         result = self.instance._get_agg_property('test_property')
 
         self.assertEqual(result, 'returned_test_value')
-
-    @patch.object(Workflow, 'get_batch_aggregations')
-    def test_get_agg_property_failed(self, mock_get_batch_aggregations):
-        mock_get_batch_aggregations.return_value = iter([])
-
-        with self.assertRaises(PanoptesAPIException):
-            self.instance._get_agg_property('test_property')
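
The switch from the old ``.next()`` helper to configuring ``__next__`` mirrors the production change from ``workflow_aggs.next()`` to the builtin ``next(workflow_aggs)``. A minimal standalone illustration of why the mocks are set up this way (hypothetical names; standard ``unittest.mock`` behaviour)::

    from unittest.mock import MagicMock

    fake_item = MagicMock()
    fake_results = MagicMock()

    # MagicMock supports magic methods, so builtin next() can be stubbed:
    fake_results.__next__.return_value = fake_item
    assert next(fake_results) is fake_item

    # Configuring fake_results.next would only stub an ordinary attribute
    # named "next", which builtin next() never calls.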

panoptes_client/workflow.py

Lines changed: 40 additions & 23 deletions
@@ -5,15 +5,15 @@
 from panoptes_client.subject_workflow_status import SubjectWorkflowStatus
 
 from panoptes_client.exportable import Exportable
-from panoptes_client.panoptes import PanoptesObject, LinkResolver, PanoptesAPIException
+from panoptes_client.panoptes import Panoptes, PanoptesObject, LinkResolver, PanoptesAPIException
 from panoptes_client.subject import Subject
 from panoptes_client.subject_set import SubjectSet
 from panoptes_client.utils import batchable
 
 from panoptes_client.caesar import Caesar
 from panoptes_client.user import User
 from panoptes_client.aggregation import Aggregation
-import six
+import logging
 
 class Workflow(PanoptesObject, Exportable):
     _api_slug = 'workflows'
@@ -536,39 +536,52 @@ def run_aggregation(self, user=None, delete_if_exists=False):
         """
         This method will start a new batch aggregation run, Will return a dict with the created aggregation if successful.
 
-        - **user** can be either a :py:class:`.User` or an ID.
-        - **delete_if_exists** parameter is optional if true, deletes any previous instance
-        -
+        - **user** can be either a :py:class:`.User` or an ID. Defaults to logged in user if not set.
+        - **delete_if_exists** parameter is optional; if true, deletes any previous instance.
+
         Examples::
 
-            Workflow(1234).run_aggregation(1234)
+            Workflow(1234).run_aggregation()
             Workflow(1234).run_aggregation(user=1234, delete_if_exists=True)
         """
 
         if(isinstance(user, User)):
             _user_id = user.id
         elif (isinstance(user, (int, str,))):
             _user_id = user
+        elif User.me():
+            _user_id = User.me().id
         else:
-            raise TypeError('Invalid user parameter')
+            raise TypeError('Invalid user parameter. Provide user ID or login.')
 
         try:
-            workflow_aggs = self.get_batch_aggregations()
+            workflow_aggs = Aggregation.where(workflow_id=self.id)
             if workflow_aggs.object_count > 0:
-                agg_id = workflow_aggs.next().id
-                current_wf_agg = Aggregation.find(agg_id)
+                current_wf_agg = next(workflow_aggs)
                 if delete_if_exists:
                     current_wf_agg.delete()
                     return self._create_agg(_user_id)
                 else:
+                    logging.getLogger('panoptes_client').info(
+                        'Aggregation exists for Workflow {}. '.format(self.id) +
+                        'Set delete_if_exists to True to create new aggregation.'
+                    )
                     return current_wf_agg
             else:
                 return self._create_agg(_user_id)
         except PanoptesAPIException as err:
             raise err
 
-    def get_batch_aggregations(self):
-        return Aggregation.where(workflow_id=self.id)
+    def get_batch_aggregation(self):
+        """
+        This method will fetch existing aggregation resource, if any.
+        """
+        try:
+            return next(Aggregation.where(workflow_id=self.id))
+        except StopIteration:
+            raise PanoptesAPIException(
+                'Could not find Aggregation for Workflow {}'.format(self.id)
+            )
 
     def _create_agg(self, user_id):
         new_agg = Aggregation()
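
The revised ``user`` handling accepts a :py:class:`.User` object, a plain ID, or nothing at all (falling back to ``User.me()``). A minimal sketch of the call patterns the new branches cover, with placeholder credentials and IDs::

    from panoptes_client import Panoptes, User, Workflow

    Panoptes.connect(username='example', password='example')  # placeholder credentials

    workflow = Workflow(1234)  # hypothetical workflow ID

    workflow.run_aggregation()                                   # defaults to User.me()
    workflow.run_aggregation(user=User.me())                     # explicit User object
    workflow.run_aggregation(user=5678)                          # plain user ID (int or str)
    workflow.run_aggregation(user=5678, delete_if_exists=True)   # replace an existing aggregation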
@@ -578,25 +591,29 @@ def _create_agg(self, user_id):
         return new_agg
 
     def _get_agg_property(self, param):
-        try:
-            aggs = self.get_batch_aggregations()
-            return getattr(six.next(aggs), param, None)
-        except StopIteration:
-            raise PanoptesAPIException(
-                "Could not find Aggregations for Workflow with id='{}'".format(self.id)
-            )
+        return getattr(self.get_batch_aggregation(), param, None)
 
-    def check_batch_aggregation_run_status(self):
+    def get_batch_aggregation_status(self):
         """
-        This method will fetch existing aggregation status if any.
+        This method will fetch existing aggregation status, if any.
         """
         return self._get_agg_property('status')
 
     def get_batch_aggregation_links(self):
         """
-        This method will fetch existing aggregation links if any.
+        This method will fetch existing aggregation links, if any.
+
+        Data product options, returned as dictionary of type/URL key-value pairs:
+        1. reductions: subject-level reductions results CSV
+        2. aggregation: a ZIP file containing all inputs (workflow-level classification export, project-level workflows export) and outputs (extracts, reductions)
         """
-        return self._get_agg_property('uuid')
+        uuid = self._get_agg_property('uuid')
+        aggregation_url = 'https://aggregationdata.blob.core.windows.net'
+        env = 'production'
+        if Panoptes.client().endpoint == 'https://panoptes-staging.zooniverse.org':
+            env = 'staging'
+        return {'reductions': f'{aggregation_url}/{env}/{uuid}/{self.id}_reductions.csv',
+                'aggregation': f'{aggregation_url}/{env}/{uuid}/{self.id}_aggregation.zip'}
 
     @property
     def versions(self):
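
Because ``get_batch_aggregation_links`` now returns ready-to-use HTTPS URLs rather than a bare UUID, callers can download the data products directly. A minimal sketch using only the standard library; the credentials and IDs are placeholders and the output file names are arbitrary::

    import urllib.request

    from panoptes_client import Panoptes, Workflow

    Panoptes.connect(username='example', password='example')  # placeholder credentials

    urls = Workflow(1234).get_batch_aggregation_links()  # hypothetical workflow ID

    # 'reductions' is the subject-level reductions CSV;
    # 'aggregation' is a ZIP of all inputs and outputs.
    urllib.request.urlretrieve(urls['reductions'], 'reductions.csv')
    urllib.request.urlretrieve(urls['aggregation'], 'aggregation.zip')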
