| 1 | +Processing Pipeline |
| 2 | +=================== |
| 3 | + |
| 4 | +Why? |
| 5 | +---- |
| 6 | + |
| 7 | +The application lets you define a series of tasks that will be checked, in order, for each piece of uploaded data. |
| 8 | + |
| 9 | +Tasks need to be defined by each app, but there is a library of common tasks to make this easier. |
| 10 | + |
| 11 | +This allows for maximum flexibility - each app can define the tasks it needs, including non-standard tasks that are not used by other CoVEs. |
| 12 | +(For example, BODS CoVE has a sample mode. When the user uploads a large file, they can choose to run sample mode and only check some of it. |
| 13 | +This is accomplished by a special task towards the start of the pipeline that generates a smaller file from the uploaded file.) |
| 14 | + |
| 15 | +What happens when the user uploads data? |
| 16 | +---------------------------------------- |
| 17 | + |
| 18 | +The background worker will start processing the data and the user will be redirected to the results page. |
| 19 | + |
| 20 | +What happens when the user looks at a results page? |
| 21 | +--------------------------------------------------- |
| 22 | + |
| 23 | +Every time a user views a results page, the system will check the state of that data. |
| 24 | + |
| 25 | +If it's currently being processed, the user will see a progress page with a wait message. |
| 26 | + |
| 27 | +If it's not currently being processed, the system will call the `is_processing_applicable` and `is_processing_needed` functions on each task to see if any work is needed. |
| 28 | + |
| 29 | +If there is work to do, it will start the work and the user will see a progress page with a wait message. |
| 30 | +This means that even after a task first finishes, it can change its mind and request to do more work. |
| 31 | +(The most common use case for this is when the software is upgraded and the way the processing is done has changed.) |
| 32 | + |
| 33 | +If there is no work to do, the system will show a results page to the user. |
| 34 | +`get_context` will be called on every task, so the task can load results from its cache and present them to the user. |
| 35 | + |
| 36 | +Other pages that may be shown to the user include: |
| 37 | + * An error page if a Python error occurred |
| 38 | + * An expired page, if the data is so old that it has expired and been removed from the system |
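The decision flow described in this section can be sketched roughly as follows. This is a simplified illustration, not the library's actual view code: `StubTask`, `choose_page`, and the page names are hypothetical, and only the method names `is_processing_applicable`, `is_processing_needed`, and `get_context` come from the text above. How the two checks are combined is also an assumption, and the error and expired pages are left out.

```python
class StubTask:
    """Hypothetical stand-in for a pipeline task (not the library's class)."""

    def __init__(self, applicable, needed, context):
        self._applicable = applicable
        self._needed = needed
        self._context = context

    def is_processing_applicable(self):
        return self._applicable

    def is_processing_needed(self):
        return self._needed

    def get_context(self):
        return self._context


def choose_page(tasks, currently_processing):
    """Return (page_name, context) for one results-page request."""
    if currently_processing:
        return "progress", {}
    # Assumption: a task has work to do only if it is applicable AND needed.
    work_to_do = any(
        t.is_processing_applicable() and t.is_processing_needed() for t in tasks
    )
    if work_to_do:
        # The real system would start the background worker at this point.
        return "progress", {}
    # No work left: merge every task's cached results into one context.
    context = {}
    for t in tasks:
        context.update(t.get_context())
    return "results", context
```

Because the checks run on every page view, a task whose `is_processing_needed` starts returning true again (for example after an upgrade) sends the user back to the progress page without any extra bookkeeping.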
| 39 | + |
| 40 | +How is the data actually processed? |
| 41 | +----------------------------------- |
| 42 | + |
| 43 | +To run a task, the background worker will call its `process` method. |
| 44 | +This can take as long as it needs, and the results should be cached for speedy loading later. |
| 45 | + |
| 46 | +Early tasks can also return data that will be passed to later tasks. |
| 47 | +This means any information or work that is needed in multiple tasks does not need to be done multiple times; it can be done once and then reused. |
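A minimal sketch of how data might flow between tasks. It assumes each task's `process` takes a dictionary of data from earlier tasks and returns it (possibly extended) for later ones. That signature, and the `LoadSchema`, `Validate`, and `run_pipeline` names, are illustrative assumptions rather than the library's confirmed API.

```python
class LoadSchema:
    """Hypothetical early task: does shared work once."""

    def process(self, process_data):
        # Pretend this is expensive; the result is handed to later tasks.
        process_data["schema"] = {"required": ["id"]}
        return process_data


class Validate:
    """Hypothetical later task: reuses what LoadSchema produced."""

    def process(self, process_data):
        schema = process_data["schema"]  # reused, not recomputed
        record = process_data["record"]
        process_data["errors"] = [
            key for key in schema["required"] if key not in record
        ]
        return process_data


def run_pipeline(tasks, process_data):
    """Call each task in order, threading the shared data through."""
    for task in tasks:
        process_data = task.process(process_data)
    return process_data
```

With this shape, the order of the task list matters: `Validate` must come after `LoadSchema`, just as the later tasks in a real pipeline depend on the earlier ones.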
| 48 | + |
| 49 | + |
| 50 | +How should I define my tasks? |
| 51 | +----------------------------- |
| 52 | + |
| 53 | + |
| 54 | +Each task should be defined by extending a class. :doc:`For more information on the base class, see here. <python-api/process/base>` |
| 55 | + |
| 56 | +Your tasks should then be listed in settings. :doc:`For more information on settings, see here. <django-settings>` |
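As a rough illustration of what a task might look like, the sketch below defines a small task that caches its result on disk. `BaseTask` is a hypothetical stand-in written for this example, not the library's real base class; its constructor argument, the `CountRecordsTask` name, and the cache file name are all assumptions. Only the four method names come from this page, so check the base class documentation linked above for the real signatures.

```python
import json
import os


class BaseTask:
    """Hypothetical stand-in for the library's base class."""

    def __init__(self, data_dir):
        self.data_dir = data_dir

    def is_processing_applicable(self):
        return False

    def is_processing_needed(self):
        return False

    def process(self, process_data):
        return process_data

    def get_context(self):
        return {}


class CountRecordsTask(BaseTask):
    """Counts uploaded records once and caches the answer on disk."""

    def _cache_path(self):
        return os.path.join(self.data_dir, "count_records.json")

    def is_processing_applicable(self):
        return True

    def is_processing_needed(self):
        # Work is needed only while there is no cached result.
        return not os.path.exists(self._cache_path())

    def process(self, process_data):
        result = {"record_count": len(process_data.get("records", []))}
        with open(self._cache_path(), "w") as fp:
            json.dump(result, fp)
        return process_data

    def get_context(self):
        # Load the cached result for the results page, if it exists.
        if not os.path.exists(self._cache_path()):
            return {}
        with open(self._cache_path()) as fp:
            return json.load(fp)
```

Keeping the slow work in `process` and only a cheap file read in `get_context` is what lets the results page load quickly on later views.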
| 57 | + |
| 58 | +An example task pipeline |
| 59 | +------------------------ |
| 60 | + |
| 61 | +.. code-block:: python |
| 62 | + |
| 63 | + |
| 64 | +    PROCESS_TASKS = [ |
| 65 | +        # Get data if not already on disk - if the user provided a URL |
| 66 | +        ("libcoveweb2.process.common_tasks.download_data_task", "DownloadDataTask"), |
| 67 | +        # BODS has a special sample mode. |
| 68 | +        # If that's activated, we'll make the sample data now for later tasks to use. |
| 69 | +        ("cove_bods.process", "Sample"), |
| 70 | +        # Make sure uploads are in the primary format - for BODS that is JSON |
| 71 | +        # So any spreadsheets uploaded should be converted |
| 72 | +        ("cove_bods.process", "WasJSONUploaded"), |
| 73 | +        ("cove_bods.process", "ConvertSpreadsheetIntoJSON"), |
| 74 | +        # Some information is reused in multiple tasks to come |
| 75 | +        # So we'll process it once now and later tasks can reuse it. |
| 76 | +        ("cove_bods.process", "GetDataReaderAndConfigAndSchema"), |
| 77 | +        # Convert from the primary JSON format into other output formats |
| 78 | +        ("cove_bods.process", "ConvertJSONIntoSpreadsheets"), |
| 79 | +        # Check and generate statistics from the JSON data |
| 80 | +        ("cove_bods.process", "AdditionalFieldsChecksTask"), |
| 81 | +        ("cove_bods.process", "PythonValidateTask"), |
| 82 | +        ("cove_bods.process", "JsonSchemaValidateTask"), |
| 83 | +    ] |
| 84 | + |
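Each entry in the list above is a pair of a module path and a class name. One plausible way to turn such pairs into classes is with the standard library's `importlib`, as sketched below. This is an assumption about how such a list could be consumed, not a description of how the library actually loads it; `load_task_classes` and `EXAMPLE_TASKS` are hypothetical names, and standard-library classes stand in for real tasks so the sketch runs on its own.

```python
import importlib


def load_task_classes(process_tasks):
    """Resolve (module path, class name) pairs into class objects, in order."""
    classes = []
    for module_path, class_name in process_tasks:
        module = importlib.import_module(module_path)
        classes.append(getattr(module, class_name))
    return classes


# Standard-library stand-ins, used here instead of real task modules.
EXAMPLE_TASKS = [
    ("collections", "OrderedDict"),
    ("json", "JSONDecoder"),
]
```

Referring to tasks by string keeps the settings file free of imports, and the order of the list is preserved, which matters because later tasks rely on earlier ones.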