## Considerations
* Harvesting MUST be able to respect rate limits
* Harvesting SHOULD be able to collect data from arbitrary date ranges
* Harvesting SHOULD NOT consume all available memory
* Harvesting SHOULD have a reasonable timeout (see the sketch after this list)
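
A minimal sketch of a harvester loop that tries to satisfy all four constraints at once: requests are spaced out to honor a rate limit, records are yielded lazily so only one page is in memory at a time, and a monotonic deadline acts as the timeout. `fetch_page` is a hypothetical callable standing in for whatever one request to the source looks like; none of this is confirmed project code.

```python
import time
from typing import Callable, Iterator, List, Optional, Tuple

# One page of records plus an opaque cursor for the next request (None = done).
FetchPage = Callable[[Optional[str]], Tuple[List[dict], Optional[str]]]


def harvest_pages(fetch_page: FetchPage, rate_limit: float, timeout: float) -> Iterator[dict]:
    """Yield harvested records one page at a time.

    rate_limit -- maximum requests per second
    timeout    -- overall deadline for the whole harvest, in seconds
    """
    deadline = time.monotonic() + timeout
    min_interval = 1.0 / rate_limit
    cursor = None
    while True:
        if time.monotonic() >= deadline:
            raise TimeoutError('harvest exceeded its deadline')
        started = time.monotonic()
        records, cursor = fetch_page(cursor)
        # Lazy iteration: callers pull records as they go, so memory use is
        # bounded by a single page rather than the whole date range.
        yield from records
        if cursor is None:
            return
        # Sleep off the rest of the request interval to respect the rate limit.
        elapsed = time.monotonic() - started
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
```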

## Parameters
* `source_config_id` -- The PK of the SourceConfig to use
* `start_date` -- The beginning of the date range to harvest
* `end_date` -- The end of the date range to harvest
* `limit` -- The maximum number of documents to collect. Defaults to `None` (unlimited)
* `superfluous` -- Take certain actions even if they have previously succeeded
* `transform` -- Should TransformJobs be launched for collected data? Defaults to `True`
* `no_split` -- Prevent the harvest job from being split into multiple jobs. Defaults to `False`
* `ignore_disabled` -- Run the task even with disabled source configs
* `force` -- Force the task to run, bypassing all preventative measures
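
Taken together, these suggest a task signature along the following lines (a sketch assuming Celery; the `app` object, retry policy, and body are placeholders, not the project's actual code):

```python
import celery

app = celery.Celery()  # placeholder app; real configuration omitted


@app.task(bind=True, max_retries=10)
def harvest(self, source_config_id, start_date=None, end_date=None,
            limit=None, superfluous=False, transform=True,
            no_split=False, ignore_disabled=False, force=False):
    # Steps below: preventative measures, setup, harvest, hand-off, clean up.
    raise NotImplementedError
```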

## Steps

### Preventative measures
* If the specified `source_config` is disabled and neither `force` nor `ignore_disabled` is set, crash
* For the given `source_config`, find up to the last 5 harvest jobs with the same harvester version
  * If they all failed, raise an exception (refuse to run)
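
A sketch of those guards, assuming Django-style `SourceConfig` and `HarvestLog` models; the field names (`disabled`, `date_modified`) and the literal status string are guesses based on this doc, not confirmed model definitions.

```python
def check_preconditions(source_config, harvester_version, force=False, ignore_disabled=False):
    # SourceConfig / HarvestLog are the models named in this doc (imports omitted).
    if source_config.disabled and not (force or ignore_disabled):
        raise RuntimeError(
            '{!r} is disabled; pass force or ignore_disabled to run anyway'.format(source_config))

    # Up to the last 5 harvest jobs for this config with the same harvester version.
    recent = list(HarvestLog.objects.filter(
        source_config=source_config,
        harvester_version=harvester_version,
    ).order_by('-date_modified')[:5])

    if recent and all(log.status == 'FAILED' for log in recent):
        raise RuntimeError(
            'The last {} harvest jobs all failed; refusing to run'.format(len(recent)))
```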

### Setup
* Lock the `source_config` (NOWAIT)
  * On failure, reschedule for a later run. (This should be allowed to happen many times before finally failing)
* Get or create HarvestLog(`source_config_id`, `harvester_version`, `start_date`, `end_date`)
  * If found and its status is:
    * `SUCCEEDED`, `SPLIT`, or `FAILED`: update timestamps and/or counts.
    * `STARTED`: log a warning (it should not have been possible to lock the `source_config`) and update timestamps and/or counts.
* Set the HarvestLog status to `STARTED`
* If the specified date range is >= [SOME LENGTH OF TIME] and `no_split` is `False`
  * Chunk the date range and spawn a harvest task for each chunk
  * Set status to `SPLIT` and exit
* Load the harvester for the given `source_config` (a sketch of this flow follows)
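
Roughly, the setup flow could look like this. A sketch only: `select_for_update(nowait=True)` is standard Django, but the 30-day threshold is a stand-in for the undecided [SOME LENGTH OF TIME], and the model fields, `chunk_dates` helper, and `harvest` task are assumptions carried over from the sketches above.

```python
import datetime
import logging

from django.db import DatabaseError, transaction

logger = logging.getLogger(__name__)

SPLIT_THRESHOLD = datetime.timedelta(days=30)  # stand-in for [SOME LENGTH OF TIME]


def chunk_dates(start, end, step):
    """Yield (chunk_start, chunk_end) pairs covering [start, end)."""
    while start < end:
        yield start, min(start + step, end)
        start += step


def setup(task, source_config_id, harvester_version, start_date, end_date, no_split):
    """Return (source_config, harvest_log), or None if this job was split."""
    try:
        # NOWAIT: fail immediately if another harvest holds the row lock.
        # (A full implementation would keep the transaction open across the
        # whole harvest so the lock is actually held throughout.)
        with transaction.atomic():
            source_config = (SourceConfig.objects
                             .select_for_update(nowait=True)
                             .get(pk=source_config_id))
    except DatabaseError:
        # Locked elsewhere -- reschedule for a later run. Celery's retry
        # counter lets this happen many times before finally failing.
        raise task.retry(countdown=60)

    harvest_log, created = HarvestLog.objects.get_or_create(
        source_config_id=source_config_id,
        harvester_version=harvester_version,
        start_date=start_date,
        end_date=end_date,
    )
    if not created and harvest_log.status == 'STARTED':
        logger.warning('%r is STARTED; locking its source config should have failed', harvest_log)
    harvest_log.status = 'STARTED'
    harvest_log.save()

    if end_date - start_date >= SPLIT_THRESHOLD and not no_split:
        for chunk_start, chunk_end in chunk_dates(start_date, end_date, SPLIT_THRESHOLD):
            harvest.apply_async((source_config_id,), {
                'start_date': chunk_start, 'end_date': chunk_end,
            })
        harvest_log.status = 'SPLIT'
        harvest_log.save()
        return None

    return source_config, harvest_log
```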

### Actually Harvest
* Harvest data between the specified datetimes, respecting `limit` and `source_config.rate_limit`
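
One way to wire `limit` in lazily (a sketch; the `harvester.harvest()` signature is an assumption about the loaded harvester's interface): `itertools.islice` caps the stream without ever pulling more than `limit` documents from the source.

```python
import itertools


def iter_data(harvester, source_config, start_date, end_date, limit=None):
    # Rate limiting is assumed to live inside the harvester itself, as in
    # the harvest_pages sketch under Considerations.
    raw_iter = harvester.harvest(start_date, end_date,
                                 rate_limit=source_config.rate_limit)
    if limit is not None:
        raw_iter = itertools.islice(raw_iter, limit)  # lazy cap on documents
    return raw_iter
```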

### Pass the data along
* Begin catching any exceptions
* For each piece of data received (preferably in bulk/chunks)
  * Get or create `SourceUniqueIdentifier(suid, source_id)`
    * Question: Should SUIDs depend on `source_config_id` instead of `source_id`? If we're harvesting data in multiple formats from the same source, we probably want to keep the respective states separate.
  * Get or create RawData(hash, suid)
* For each piece of data (after saving, to keep things as transactional as possible)
  * Get or create `TransformLog(raw_id, source_config_id, transformer_version)`
  * If the log already exists and `superfluous` is not set, exit
  * Start the `TransformTask(raw_id, source_config_id)` unless `transform` is `False`
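
A sketch of that hand-off, assuming Django-style models with the names used above; the `(identifier, blob)` shape of each datum, the SHA-256 hashing, and the `TransformTask` interface are illustrative assumptions.

```python
import hashlib
import itertools


def chunked(iterable, size=500):
    """Buffer an iterator into lists of at most `size` items (bounded memory)."""
    iterator = iter(iterable)
    while True:
        chunk = list(itertools.islice(iterator, size))
        if not chunk:
            return
        yield chunk


def pass_along(raw_iter, source_config, transformer_version,
               superfluous=False, transform=True):
    # SourceUniqueIdentifier, RawData, TransformLog, and TransformTask are
    # the names used in this doc; their fields/interfaces are assumptions.
    for chunk in chunked(raw_iter):
        saved = []
        for identifier, blob in chunk:  # assumed shape; blob is bytes
            suid, _ = SourceUniqueIdentifier.objects.get_or_create(
                identifier=identifier,
                source_id=source_config.source_id,
            )
            datum, _ = RawData.objects.get_or_create(
                suid=suid,
                hash=hashlib.sha256(blob).hexdigest(),
            )
            saved.append(datum)

        # Second pass, after saving, to keep things as transactional as possible.
        for datum in saved:
            log, created = TransformLog.objects.get_or_create(
                raw_id=datum.id,
                source_config_id=source_config.id,
                transformer_version=transformer_version,
            )
            if not created and not superfluous:
                continue  # this datum was already handled; skip it
            if transform:
                TransformTask.apply_async((datum.id, source_config.id))
```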

### Clean up
* If an exception was caught, set status to `FAILED` and insert the exception/traceback
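
The outer error handling can stay small (a sketch; the `context` field for the traceback, and setting `SUCCEEDED` on the happy path, are assumptions about the HarvestLog model):

```python
import traceback


def run_safely(harvest_log, do_harvest):
    """Wrap the harvest/hand-off steps, recording any failure on the HarvestLog."""
    try:
        do_harvest()
    except Exception:
        harvest_log.status = 'FAILED'
        harvest_log.context = traceback.format_exc()  # field name assumed
        harvest_log.save()
        raise
    else:
        harvest_log.status = 'SUCCEEDED'  # success path assumed from the status list
        harvest_log.save()
```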