Commit fd79f14

aaxelb authored and chrisseto committed

IngestConfig => SourceConfig

1 parent: 87a02eb

2 files changed: +25 −25 lines

whitepapers/Tables.md

Lines changed: 9 additions & 9 deletions
```diff
@@ -21,10 +21,10 @@ Identifier for a specific document from a specific source.
 | Column             | Type | Indexed | Nullable | FK | Default | Description                                    |
 | :----------------- | :--: | :-----: | :------: | :-: | :-----: | :--------------------------------------------- |
 | `identifier`       | text |         |          |    |         | Identifier given to the document by the source |
-| `ingest_config_id` | int  |         |          | ✓  |         | IngestConfig used to ingest the document       |
+| `source_config_id` | int  |         |          | ✓  |         | SourceConfig used to ingest the document       |
 
 #### Other indices
-* `source_doc_id`, `ingest_config_id` (unique)
+* `source_doc_id`, `source_config_id` (unique)
 
 ### RawData
 Raw data, exactly as it was given to SHARE.
```
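The unique index above means a SUID is identified by the pair (`identifier`, `source_config_id`), not by the identifier alone. A minimal sketch of that get-or-create behavior, with an in-memory dict standing in for the real table (names are illustrative, not SHARE's actual API):

```python
# Illustrative sketch of the (identifier, source_config_id) unique index;
# a dict keyed on the pair stands in for the database table.
_suids = {}


def get_or_create_suid(identifier, source_config_id):
    """Return the existing SUID row for this pair, or create a new one."""
    key = (identifier, source_config_id)
    if key not in _suids:
        _suids[key] = {
            "id": len(_suids) + 1,
            "identifier": identifier,
            "source_config_id": source_config_id,
        }
    return _suids[key]
```

The same source identifier seen under two different source configs yields two distinct SUIDs, which is exactly what including `source_config_id` in the unique index buys.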
```diff
@@ -36,9 +36,9 @@ Raw data, exactly as it was given to SHARE.
 | `sha256`       | text | unique |  |  |  | SHA-256 hash of `data` |
 | `harvest_logs` | m2m  |        |  |  |  | List of HarvestLogs for harvester runs that found this exact datum |
 
-## Ingest Configuration
+## Source Configuration
 
-### IngestConfig
+### SourceConfig
 Describes one way to harvest metadata from a Source, and how to transform the result.
 
 | Column | Type | Indexed | Nullable | FK | Default | Description |
```
```diff
@@ -48,11 +48,11 @@ Describes one way to harvest metadata from a Source, and how to transform the re
 | `earliest_date`        | date  |  | ✓ |   |       | Earliest date with available data |
 | `rate_limit_allowance` | int   |  |   |   | 5     | Number of requests allowed every `rate_limit_period` seconds |
 | `rate_limit_period`    | int   |  |   |   | 1     | Number of seconds for every `rate_limit_allowance` requests |
-| `harvester_id`         | int   |  |   | ✓ |       | Harvester to use |
+| `harvester_id`         | int   |  |   | ✓ |       | Harvester to use |
 | `harvester_kwargs`     | jsonb |  | ✓ |   |       | JSON object passed to the harvester as kwargs |
 | `transformer_id`       | int   |  |   | ✓ |       | Transformer to use |
 | `transformer_kwargs`   | jsonb |  | ✓ |   |       | JSON object passed to the transformer as kwargs, along with the harvested raw data |
-| `disabled`             | bool  |  |   |   | False | True if this ingest config should not be run automatically |
+| `disabled`             | bool  |  |   |   | False | True if this source config should not be run automatically |
 
 ### Source
 A Source is a place metadata comes from.
```
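The `rate_limit_allowance` / `rate_limit_period` pair describes a sliding-window limit: at most `allowance` requests per `period` seconds. A sketch of how a harvester might honor it (an injectable clock keeps it testable; this is not SHARE's actual implementation):

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most `allowance` requests per `period` seconds.

    Sketch of honoring SourceConfig.rate_limit_allowance and
    SourceConfig.rate_limit_period; defaults match the table above.
    """

    def __init__(self, allowance=5, period=1, clock=None):
        self.allowance = allowance
        self.period = period
        self.clock = clock or time.monotonic
        self._sent = deque()  # timestamps of requests inside the window

    def wait_time(self):
        """Record a request if allowed and return 0.0, else return
        the number of seconds to wait before the next request."""
        now = self.clock()
        # Drop timestamps that have fallen out of the window.
        while self._sent and now - self._sent[0] >= self.period:
            self._sent.popleft()
        if len(self._sent) < self.allowance:
            self._sent.append(now)
            return 0.0
        return self.period - (now - self._sent[0])
```

A harvester loop would sleep for `wait_time()` before each request, satisfying the "MUST respect rate limits" consideration in Harvest.md below.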
```diff
@@ -90,23 +90,23 @@ Log entries to track the status of a specific harvester run.
 
 | Column              | Type     | Indexed | Nullable | FK | Default | Description |
 | :------------------ | :------: | :-----: | :------: | :-: | :-----: | :---------- |
-| `ingest_config_id`  | int      |  |  | ✓ |         | IngestConfig for this harvester run |
+| `source_config_id`  | int      |  |  | ✓ |         | SourceConfig for this harvester run |
 | `harvester_version` | text     |  |  |   |         | Semantic version of the harvester, with each segment padded to 3 digits (e.g. '1.2.10' => '001.002.010') |
 | `start_date`        | datetime |  |  |   |         | Beginning of the date range to harvest |
 | `end_date`          | datetime |  |  |   |         | End of the date range to harvest |
 | `started`           | datetime |  |  |   |         | Time `status` was set to STARTED |
 | `status`            | text     |  |  |   | INITIAL | Status of the harvester run, one of {INITIAL, STARTED, SPLIT, SUCCEEDED, FAILED} |
 
 #### Other indices
-* `ingest_config_id`, `harvester_version`, `start_date`, `end_date` (unique)
+* `source_config_id`, `harvester_version`, `start_date`, `end_date` (unique)
 
 ### TransformLog
 Log entries to track the status of a transform task
 
 | Column                | Type     | Indexed | Nullable | FK | Default | Description |
 | :-------------------- | :------: | :-----: | :------: | :-: | :-----: | :---------- |
 | `raw_id`              | int      |  |  | ✓ |         | RawData to be transformed |
-| `ingest_config_id`    | int      |  |  | ✓ |         | IngestConfig used |
+| `source_config_id`    | int      |  |  | ✓ |         | SourceConfig used |
 | `transformer_version` | text     |  |  |   |         | Semantic version of the transformer, with each segment padded to 3 digits (e.g. '1.2.10' => '001.002.010') |
 | `started`             | datetime |  |  |   |         | Time `status` was set to STARTED |
 | `status`              | text     |  |  |   | INITIAL | Status of the transform task, one of {INITIAL, STARTED, RESCHEDULED, SUCCEEDED, FAILED} |
```

whitepapers/tasks/Harvest.md

Lines changed: 16 additions & 16 deletions
```diff
@@ -10,8 +10,8 @@
 
 
 ## Considerations
-* Ingestion MUST be able to respect rate limits
-* Ingestion SHOULD be able to collect data from arbitrary date ranges
-* Ingestion SHOULD NOT consume all available memory
-* Ingestion SHOULD have a reasonable timeout
+* Harvesting MUST be able to respect rate limits
+* Harvesting SHOULD be able to collect data from arbitrary date ranges
+* Harvesting SHOULD NOT consume all available memory
+* Harvesting SHOULD have a reasonable timeout
 
```
```diff
@@ -18,12 +18,12 @@
 
 ## Parameters
-* `ingest_config_id` -- The PK of the IngestConfig to use
+* `source_config_id` -- The PK of the SourceConfig to use
 * `start_date` --
 * `end_date` --
 * `limit` -- The maximum number of documents to collect. Defaults to `None` (Unlimited)
 * `superfluous` -- Take certain actions that have previously succeeded
 * `transform` -- Should TransformJobs be launched for collected data. Defaults to `True`
 * `no_split` -- Should harvest jobs be split into multiple? Defaults to `False`
-* `ignore_disabled` -- Run the task, even with disabled ingest configs
+* `ignore_disabled` -- Run the task, even with disabled source configs
 * `force` -- Force the task to run, against all odds
 
```
```diff
@@ -30,8 +30,8 @@
 
 ## Steps
 
 ### Preventative measures
-* If the specified `ingest_config` is disabled and `force` or `ignore_disabled` is not set, crash
-* For the given `ingest_config`, find up to the last 5 harvest jobs with the same harvester version
+* If the specified `source_config` is disabled and `force` or `ignore_disabled` is not set, crash
+* For the given `source_config`, find up to the last 5 harvest jobs with the same harvester version
   * If they all failed, throw an exception (refuse to run)
 
```
```diff
@@ -38,13 +38,13 @@
 ### Setup
-* Lock the `ingest_config` (NOWAIT)
+* Lock the `source_config` (NOWAIT)
   * On failure, reschedule for a later run. (This should be allowed to happen many times before finally failing)
-* Get or create HarvestLog(`ingest_config_id`, `harvester_version`, `start_date`, `end_date`)
+* Get or create HarvestLog(`source_config_id`, `harvester_version`, `start_date`, `end_date`)
   * if found and status is:
     * `SUCCEEDED`, `SPLIT`, or `FAILED`: update timestamps and/or counts.
-    * `STARTED`: Log a warning (Should not have been able to lock the `ingest_config`) and update timestamps and/or counts.
+    * `STARTED`: Log a warning (Should not have been able to lock the `source_config`) and update timestamps and/or counts.
 * Set HarvestLog status to `STARTED`
 * If the specified date range is >= [SOME LENGTH OF TIME] and `no_split` is False
   * Chunk the date range and spawn a harvest task for each chunk
   * Set status to `SPLIT` and exit
-* Load the harvester for the given `ingest_config`
+* Load the harvester for the given `source_config`
 
```
```diff
@@ -51,3 +51,3 @@
 ### Actually Harvest
-* Harvest data between the specified datetimes, respecting `limit` and `ingest_config.rate_limit`
+* Harvest data between the specified datetimes, respecting `limit` and `source_config.rate_limit`
 
```
```diff
@@ -54,13 +54,13 @@
 ### Pass the data along
 * Begin catching any exceptions
 * For each piece of data received (preferably in bulk/chunks)
   * Get or create `SourceUniqueIdentifier(suid, source_id)`
-    * Question: Should SUIDs depend on `ingest_config_id` instead of `source_id`? If we're harvesting data in multiple formats from the same source, we probably want to keep the respective states separate.
+    * Question: Should SUIDs depend on `source_config_id` instead of `source_id`? If we're harvesting data in multiple formats from the same source, we probably want to keep the respective states separate.
   * Get or create RawData(hash, suid)
 * For each piece of data (After saving, to keep as transactional as possible)
-  * Get or create `TransformLog(raw_id, ingest_config_id, transformer_version)`
+  * Get or create `TransformLog(raw_id, source_config_id, transformer_version)`
   * if the log already exists and `superfluous` is not set, exit
-  * Start the `TransformTask(raw_id, ingest_config_id)` unless `transform` is `False`
+  * Start the `TransformTask(raw_id, source_config_id)` unless `transform` is `False`
 
 ### Clean up
 * If an exception was caught, set status to `FAILED` and insert the exception/traceback
```
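The "Get or create RawData(hash, suid)" step relies on the `sha256` unique column in the RawData table: re-harvesting identical bytes finds the existing row instead of inserting a duplicate. A sketch with a dict standing in for the table (names are illustrative):

```python
import hashlib

_raw_data = {}  # sha256 hex digest -> stored row


def get_or_create_raw_data(data, suid_id):
    """Store a harvested datum exactly once, keyed by its SHA-256 hash.

    Returns (row, created); `created` is False when the same bytes were
    already harvested, mirroring the unique `sha256` column on RawData.
    """
    digest = hashlib.sha256(data).hexdigest()
    created = digest not in _raw_data
    if created:
        _raw_data[digest] = {"sha256": digest, "suid_id": suid_id, "data": data}
    return _raw_data[digest], created
```

The `created` flag is what lets the task skip work that has already been done unless `superfluous` is set.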
