Manifest-based file ingestion provides end users and systems a robust mechanism for loading metadata about datasets into a Microsoft Energy Data Services Preview instance. The system indexes this metadata, which allows end users to search the datasets.
Manifest-based file ingestion is opaque: it doesn't parse or attempt to understand the file contents. It creates a metadata record based on the manifest and makes the record searchable.
[!INCLUDE [preview features callout](./includes/preview/preview-callout.md)]
## What is a manifest?
A manifest is a JSON document with a pre-determined structure for capturing entities, each identified by a 'kind' that is registered as a schema with the Schema service (see [Well-known Schema (WKS) definitions](https://community.opengroup.org/osdu/data/data-definitions/-/blob/master/E-R/README.md#manifest-schemas)).
You can find an [example manifest JSON document](https://community.opengroup.org/osdu/data/data-definitions/-/tree/master/Examples/manifest#manifest-example) in the OSDU™ data definitions repository.
The manifest schema has containers for the following OSDU™ [Group types](https://community.opengroup.org/osdu/data/data-definitions/-/blob/master/Guides/Chapters/02-GroupType.md#2-group-type):
* **ReferenceData** (*zero or more*) - A set of permissible values to be used by other (master or transaction) data fields. Examples include *Unit of Measure (feet)* and *Currency*.
* **MasterData** (*zero or more*) - A single source of basic business data used across multiple systems, applications, and/or processes. Examples include *Wells* and *Wellbores*.
* **WorkProduct (WP)** (*one; must be present if loading WorkProductComponents*) - A session boundary or collection (project, study) that encompasses a set of entities that need to be processed together, for example, the ingestion of one or more log collections.
* **WorkProductComponents (WPC)** (*zero or more; must be present if loading datasets*) - A typed, smallest, independently usable unit of business data content transferred as part of a Work Product (a collection of things ingested together). Each WPC typically uses reference data, belongs to some master data, and maintains a reference to datasets. Examples include *Well Logs*, *Faults*, and *Documents*.
* **Datasets** (*zero or more; must be present if loading WorkProduct and WorkProductComponent records*) - Each WPC consists of one or more data containers known as datasets.
The manifest data is loaded in a particular sequence:

1. The `ReferenceData` array (if populated).
2. The `MasterData` array (if populated).
3. The `Data` structure (if populated). Inside the `Data` property, processing happens in the following order:
   1. The `Datasets` array.
   2. The `WorkProductComponents` array.
   3. The `WorkProduct`.

All arrays are ordered. Should there be interdependencies, the dependent items must be placed after their relationship targets. For example, a master-data Well record must be placed in the `MasterData` array before its Wellbores.
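The load order above can be sketched as a minimal manifest skeleton. This is an illustrative outline only; the `kind` value is a placeholder assumption, and real manifests must reference kinds registered with the Schema service:

```python
import json

# Minimal, illustrative manifest skeleton. The "kind" value is a
# placeholder assumption, not fetched from a live schema registry.
manifest = {
    "kind": "osdu:wks:Manifest:1.0.0",
    "ReferenceData": [],  # loaded first, if populated
    "MasterData": [],     # loaded second, if populated
    "Data": {             # processed last, in the order below
        "Datasets": [],
        "WorkProductComponents": [],
        "WorkProduct": {},
    },
}

print(json.dumps(manifest, indent=2))
```

Within each array, dependent items (such as a Wellbore that references a Well) would simply appear later in the same array than the records they depend on.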
## Manifest-based file ingestion workflow
A Microsoft Energy Data Services Preview instance has out-of-the-box support for the Manifest-based file ingestion workflow. The `Osdu_ingest` Airflow DAG is pre-configured in your instance.
The Manifest-based file ingestion workflow consists of the following components:
* **Workflow Service** - A wrapper service running on top of the Airflow workflow engine.
* **Airflow engine** - A workflow orchestration engine that executes workflows registered as DAGs (Directed Acyclic Graphs). Airflow is the workflow engine chosen by the [OSDU™](https://osduforum.org/) community to orchestrate and run ingestion workflows. Airflow isn't directly exposed; instead, its features are accessed through the Workflow Service.
* **Storage Service** - A service that saves the manifest metadata records into the data platform.
* **Schema Service** - A service that manages OSDU™ defined schemas in the data platform. Schemas are referenced during Manifest-based file ingestion.
* **Entitlements Service** - A service that manages access groups. This service is used during ingestion to verify ingestion permissions. It's also used during metadata record retrieval to validate "read" rights.
* **Legal Service** - A service that validates compliance through legal tags.
* **Search Service** - A service that performs referential integrity checks during the manifest ingestion process.
### Pre-requisites
Before running the Manifest-based file ingestion workflow, make sure that the user accounts running the workflow have access to the core services (Search, Storage, Schema, Entitlements, and Legal) and the Workflow Service (see [Entitlement roles](https://community.opengroup.org/osdu/platform/deployment-and-operations/infra-azure-provisioning/-/blob/master/docs/osdu-entitlement-roles.md) for details). As part of Microsoft Energy Data Services instance provisioning, the OSDU™ standard schemas and associated reference data are pre-loaded. Make sure that the user account used for ingesting the manifests is included in the appropriate owners and viewers ACLs, and that the manifests are configured with the correct legal tags, owners and viewers ACLs, reference data, and so on.
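Each manifest record carries its own access control and legal information. The snippet below sketches what the `acl` and `legal` blocks of a record look like, following the standard OSDU™ record layout; the group e-mail addresses and legal tag name are hypothetical placeholders:

```python
# Hypothetical acl/legal blocks for a manifest record, following the
# standard OSDU record layout. Group and tag names are placeholders.
record_access_block = {
    "acl": {
        "viewers": ["data.default.viewers@contoso.dataservices.energy"],
        "owners": ["data.default.owners@contoso.dataservices.energy"],
    },
    "legal": {
        "legaltags": ["contoso-demo-legaltag"],
        "otherRelevantDataCountries": ["US"],
    },
}

# The ingesting user account must be a member of the owners/viewers
# groups referenced here, and the legal tag must already exist.
print(record_access_block["legal"]["legaltags"][0])
```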
### Workflow sequence
The following illustration shows the Manifest-based file ingestion workflow:
:::image type="content" source="media/concepts-manifest-ingestion/concept-manifest-ingestion-sequence.png" alt-text="Screenshot of the manifest ingestion sequence.":::
A user submits a manifest to the Workflow Service using the manifest ingestion workflow name (`Osdu_ingest`). If the request is valid and the user is authorized to run the workflow, the workflow service loads the manifest and initiates the manifest ingestion workflow.
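As a sketch, submitting the manifest amounts to a POST against the Workflow Service with an execution context that carries the manifest. The endpoint path and payload shape below follow the OSDU™ Workflow service convention, but treat the exact URL, headers, and field names as assumptions to verify against your instance:

```python
import json

# Hypothetical instance URL and data partition; replace with your own.
instance_url = "https://<your-instance>.energy.azure.com"
workflow_name = "Osdu_ingest"

# Assumed OSDU Workflow service run endpoint (verify for your instance).
run_url = f"{instance_url}/api/workflow/v1/workflow/{workflow_name}/workflowRun"

# The manifest travels inside the executionContext of the run request.
payload = {
    "executionContext": {
        "Payload": {
            "AppKey": "test-app",
            "data-partition-id": "<partition-id>",
        },
        "manifest": {"kind": "osdu:wks:Manifest:1.0.0"},  # abbreviated
    }
}

print(run_url)
print(json.dumps(payload, indent=2))
```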
The workflow service executes a series of manifest syntax validations, such as validating the manifest structure and attributes against the defined schemas and checking for mandatory schema attributes. The system then performs referential integrity validation between Work Product Components and Datasets, for example, checking whether the referenced parent data exists.
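The syntax check described above can be approximated with a small, illustrative validator. This is not the actual workflow code; it only demonstrates the idea of checking for a `kind` and for mandatory attributes, with `required_by_kind` standing in for schema definitions that the real workflow fetches from the Schema service:

```python
# Illustrative stand-in for schema definitions retrieved from the
# Schema service: maps a kind to its schema-mandated attributes.
required_by_kind = {
    "osdu:wks:master-data--Well:1.0.0": ["id", "kind", "acl", "legal", "data"],
}

def syntax_check(entity: dict) -> list[str]:
    """Return a list of validation errors for one manifest entity."""
    errors = []
    kind = entity.get("kind")
    if kind is None:
        return ["entity has no 'kind' property"]
    for attr in required_by_kind.get(kind, ["id", "kind"]):
        if attr not in entity:
            errors.append(f"missing mandatory attribute '{attr}'")
    return errors

# An entity missing its 'acl' and 'legal' blocks fails the check.
entity = {
    "id": "partition:master-data--Well:1",
    "kind": "osdu:wks:master-data--Well:1.0.0",
    "data": {},
}
print(syntax_check(entity))
```

Any entity that doesn't pass these checks is rejected before the workflow writes anything to storage.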
Once the validations are successful, the system processes the content into storage by writing each valid entity into the data platform using the Storage Service API.
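As a final sketch, the storage write is a batch request to the Storage API's records endpoint (`PUT /api/storage/v2/records` in the OSDU™ convention), which accepts an array of records; the URL and record fields shown are assumptions to verify against your instance, and `acl`/`legal` blocks are omitted for brevity:

```python
import json

# Hypothetical instance URL; replace with your own.
storage_url = "https://<your-instance>.energy.azure.com/api/storage/v2/records"

# Storage accepts a JSON array of records; each valid manifest entity
# becomes one record ('acl'/'legal' blocks omitted here for brevity).
records = [
    {"kind": "osdu:wks:master-data--Well:1.0.0", "data": {"FacilityName": "Well-1"}},
]

body = json.dumps(records)
print(storage_url)
print(body)
```

The IDs returned by the Storage Service are what allow later records in the same run to resolve surrogate-key references to the entities written before them.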