## Considerations
* Harvesting MUST be able to respect rate limits
* Harvesting SHOULD be able to collect data from arbitrary date ranges
* Harvesting SHOULD NOT consume all available memory
* Harvesting SHOULD have a reasonable timeout (see the sketch after this list)
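
A minimal sketch of a harvester loop that tries to satisfy all four constraints at once: requests are spaced out to honor a rate limit, records are yielded lazily so only one page is in memory at a time, and a monotonic deadline acts as the timeout. `fetch_page` is a hypothetical callable standing in for whatever one request to the source looks like; none of this is confirmed project code.

```python
import time
from typing import Callable, Iterator, List, Optional, Tuple

# One page of records plus an opaque cursor for the next request (None = done).
FetchPage = Callable[[Optional[str]], Tuple[List[dict], Optional[str]]]


def harvest_pages(fetch_page: FetchPage, rate_limit: float, timeout: float) -> Iterator[dict]:
    """Yield harvested records one page at a time.

    rate_limit -- maximum requests per second
    timeout    -- overall deadline for the whole harvest, in seconds
    """
    deadline = time.monotonic() + timeout
    min_interval = 1.0 / rate_limit
    cursor = None
    while True:
        if time.monotonic() >= deadline:
            raise TimeoutError('harvest exceeded its deadline')
        started = time.monotonic()
        records, cursor = fetch_page(cursor)
        # Lazy iteration: callers pull records as they go, so memory use is
        # bounded by a single page rather than the whole date range.
        yield from records
        if cursor is None:
            return
        # Sleep off the rest of the request interval to respect the rate limit.
        elapsed = time.monotonic() - started
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
```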

## Parameters
* `source_config_id` -- The PK of the SourceConfig to use
* `start_date` -- The beginning of the date range to harvest
* `end_date` -- The end of the date range to harvest
* `limit` -- The maximum number of documents to collect. Defaults to `None` (unlimited)
* `superfluous` -- Take certain actions even if they have previously succeeded
* `transform` -- Should TransformJobs be launched for collected data? Defaults to `True`
* `no_split` -- Prevent the harvest job from being split into multiple jobs. Defaults to `False`
* `ignore_disabled` -- Run the task even with disabled source configs
* `force` -- Force the task to run, bypassing all preventative measures
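
Taken together, these suggest a task signature along the following lines (a sketch assuming Celery; the `app` object, retry policy, and body are placeholders, not the project's actual code):

```python
import celery

app = celery.Celery()  # placeholder app; real configuration omitted


@app.task(bind=True, max_retries=10)
def harvest(self, source_config_id, start_date=None, end_date=None,
            limit=None, superfluous=False, transform=True,
            no_split=False, ignore_disabled=False, force=False):
    # Steps below: preventative measures, setup, harvest, hand-off, clean up.
    raise NotImplementedError
```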

## Steps

### Preventative measures
* If the specified `source_config` is disabled and neither `force` nor `ignore_disabled` is set, crash
* For the given `source_config`, find up to the last 5 harvest jobs with the same harvester version
  * If they all failed, raise an exception (refuse to run)
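
A sketch of those guards, assuming Django-style `SourceConfig` and `HarvestLog` models; the field names (`disabled`, `date_modified`) and the literal status string are guesses based on this doc, not confirmed model definitions.

```python
def check_preconditions(source_config, harvester_version, force=False, ignore_disabled=False):
    # SourceConfig / HarvestLog are the models named in this doc (imports omitted).
    if source_config.disabled and not (force or ignore_disabled):
        raise RuntimeError(
            '{!r} is disabled; pass force or ignore_disabled to run anyway'.format(source_config))

    # Up to the last 5 harvest jobs for this config with the same harvester version.
    recent = list(HarvestLog.objects.filter(
        source_config=source_config,
        harvester_version=harvester_version,
    ).order_by('-date_modified')[:5])

    if recent and all(log.status == 'FAILED' for log in recent):
        raise RuntimeError(
            'The last {} harvest jobs all failed; refusing to run'.format(len(recent)))
```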

### Setup
* Lock the `source_config` (NOWAIT)
  * On failure, reschedule for a later run. (This should be allowed to happen many times before finally failing)
* Get or create HarvestLog(`source_config_id`, `harvester_version`, `start_date`, `end_date`)
  * If found and its status is:
    * `SUCCEEDED`, `SPLIT`, or `FAILED`: update timestamps and/or counts.
    * `STARTED`: log a warning (it should not have been possible to lock the `source_config`) and update timestamps and/or counts.
* Set the HarvestLog status to `STARTED`
* If the specified date range is >= [SOME LENGTH OF TIME] and `no_split` is `False`
  * Chunk the date range and spawn a harvest task for each chunk
  * Set status to `SPLIT` and exit
* Load the harvester for the given `source_config` (a sketch of this flow follows)
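
Roughly, the setup flow could look like this. A sketch only: `select_for_update(nowait=True)` is standard Django, but the 30-day threshold is a stand-in for the undecided [SOME LENGTH OF TIME], and the model fields, `chunk_dates` helper, and `harvest` task are assumptions carried over from the sketches above.

```python
import datetime
import logging

from django.db import DatabaseError, transaction

logger = logging.getLogger(__name__)

SPLIT_THRESHOLD = datetime.timedelta(days=30)  # stand-in for [SOME LENGTH OF TIME]


def chunk_dates(start, end, step):
    """Yield (chunk_start, chunk_end) pairs covering [start, end)."""
    while start < end:
        yield start, min(start + step, end)
        start += step


def setup(task, source_config_id, harvester_version, start_date, end_date, no_split):
    """Return (source_config, harvest_log), or None if this job was split."""
    try:
        # NOWAIT: fail immediately if another harvest holds the row lock.
        # (A full implementation would keep the transaction open across the
        # whole harvest so the lock is actually held throughout.)
        with transaction.atomic():
            source_config = (SourceConfig.objects
                             .select_for_update(nowait=True)
                             .get(pk=source_config_id))
    except DatabaseError:
        # Locked elsewhere -- reschedule for a later run. Celery's retry
        # counter lets this happen many times before finally failing.
        raise task.retry(countdown=60)

    harvest_log, created = HarvestLog.objects.get_or_create(
        source_config_id=source_config_id,
        harvester_version=harvester_version,
        start_date=start_date,
        end_date=end_date,
    )
    if not created and harvest_log.status == 'STARTED':
        logger.warning('%r is STARTED; locking its source config should have failed', harvest_log)
    harvest_log.status = 'STARTED'
    harvest_log.save()

    if end_date - start_date >= SPLIT_THRESHOLD and not no_split:
        for chunk_start, chunk_end in chunk_dates(start_date, end_date, SPLIT_THRESHOLD):
            harvest.apply_async((source_config_id,), {
                'start_date': chunk_start, 'end_date': chunk_end,
            })
        harvest_log.status = 'SPLIT'
        harvest_log.save()
        return None

    return source_config, harvest_log
```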

### Actually Harvest
* Harvest data between the specified datetimes, respecting `limit` and `source_config.rate_limit`
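
One way to wire `limit` in lazily (a sketch; the `harvester.harvest()` signature is an assumption about the loaded harvester's interface): `itertools.islice` caps the stream without ever pulling more than `limit` documents from the source.

```python
import itertools


def iter_data(harvester, source_config, start_date, end_date, limit=None):
    # Rate limiting is assumed to live inside the harvester itself, as in
    # the harvest_pages sketch under Considerations.
    raw_iter = harvester.harvest(start_date, end_date,
                                 rate_limit=source_config.rate_limit)
    if limit is not None:
        raw_iter = itertools.islice(raw_iter, limit)  # lazy cap on documents
    return raw_iter
```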

### Pass the data along
* Begin catching any exceptions
* For each piece of data received (preferably in bulk/chunks)
  * Get or create `SourceUniqueIdentifier(suid, source_id)`
    * Question: Should SUIDs depend on `source_config_id` instead of `source_id`? If we're harvesting data in multiple formats from the same source, we probably want to keep the respective states separate.
  * Get or create RawData(hash, suid)
* For each piece of data (after saving, to keep things as transactional as possible)
  * Get or create `TransformLog(raw_id, source_config_id, transformer_version)`
  * If the log already exists and `superfluous` is not set, exit
  * Start the `TransformTask(raw_id, source_config_id)` unless `transform` is `False`
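
A sketch of that hand-off, assuming Django-style models with the names used above; the `(identifier, blob)` shape of each datum, the SHA-256 hashing, and the `TransformTask` interface are illustrative assumptions.

```python
import hashlib
import itertools


def chunked(iterable, size=500):
    """Buffer an iterator into lists of at most `size` items (bounded memory)."""
    iterator = iter(iterable)
    while True:
        chunk = list(itertools.islice(iterator, size))
        if not chunk:
            return
        yield chunk


def pass_along(raw_iter, source_config, transformer_version,
               superfluous=False, transform=True):
    # SourceUniqueIdentifier, RawData, TransformLog, and TransformTask are
    # the names used in this doc; their fields/interfaces are assumptions.
    for chunk in chunked(raw_iter):
        saved = []
        for identifier, blob in chunk:  # assumed shape; blob is bytes
            suid, _ = SourceUniqueIdentifier.objects.get_or_create(
                identifier=identifier,
                source_id=source_config.source_id,
            )
            datum, _ = RawData.objects.get_or_create(
                suid=suid,
                hash=hashlib.sha256(blob).hexdigest(),
            )
            saved.append(datum)

        # Second pass, after saving, to keep things as transactional as possible.
        for datum in saved:
            log, created = TransformLog.objects.get_or_create(
                raw_id=datum.id,
                source_config_id=source_config.id,
                transformer_version=transformer_version,
            )
            if not created and not superfluous:
                continue  # this datum was already handled; skip it
            if transform:
                TransformTask.apply_async((datum.id, source_config.id))
```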

### Clean up
* If an exception was caught, set status to `FAILED` and insert the exception/traceback
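
The outer error handling can stay small (a sketch; the `context` field for the traceback, and setting `SUCCEEDED` on the happy path, are assumptions about the HarvestLog model):

```python
import traceback


def run_safely(harvest_log, do_harvest):
    """Wrap the harvest/hand-off steps, recording any failure on the HarvestLog."""
    try:
        do_harvest()
    except Exception:
        harvest_log.status = 'FAILED'
        harvest_log.context = traceback.format_exc()  # field name assumed
        harvest_log.save()
        raise
    else:
        harvest_log.status = 'SUCCEEDED'  # success path assumed from the status list
        harvest_log.save()
```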