Commit 5094ff2 — Merge pull request #211574 from MicrosoftDocs/release-preview-energy-data-services
Release preview energy data services--scheduled release at 0:00AM of 9/20
2 parents ccb69da + 94b3e93

101 files changed: +3053 −0 lines

Lines changed: 93 additions & 0 deletions
---
title: Microsoft Energy Data Services Preview CSV parser ingestion workflow concept #Required; page title is displayed in search results. Include the brand.
description: Learn how to use CSV parser ingestion. #Required; article description that is displayed in search results.
author: bharathim #Required; your GitHub user alias, with correct capitalization.
ms.author: bselvaraj #Required; microsoft alias of author; optional team alias.
ms.service: energy-data-services #Required; service per approved list. slug assigned by ACOM.
ms.topic: conceptual #Required; leave this attribute/value as-is.
ms.date: 08/18/2022
ms.custom: template-concept #Required; leave this attribute/value as-is.
---

# CSV parser ingestion concepts

One of the simplest generic data formats supported by the Microsoft Energy Data Services Preview ingestion process is the comma-separated values (CSV) format. CSV files are processed through a CSV Parser DAG definition.

The CSV Parser DAG implements an ELT approach to data loading: data is extracted and loaded first, then transformed on the platform. Customers can use the CSV Parser DAG to load data that doesn't match the [OSDU™](https://osduforum.org) canonical schema. To do so, they create and register a custom schema with the schema service that matches the format of the CSV file.

[!INCLUDE [preview features callout](./includes/preview/preview-callout.md)]

## What does CSV ingestion do?

* **Schema validation** – Ensures the CSV file conforms to the schema.
* **Type conversion** – Ensures that each field has the defined type, and converts it to that type if it doesn't.
* **ID generation** – Generates the IDs used to upload records into the storage service. Because the ID generation logic is idempotent, rerunning an ingestion that failed halfway doesn't create duplicate data on the platform.
* **Reference handling** – Enables customers to refer to, and access, actual data on the platform.
* **Persistence** – Persists each row after validation by calling the storage service API. Once persisted, the data is available for consumption through the search and storage service APIs.
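
To make the schema-validation step concrete, the sketch below checks a CSV header against the field names declared in a schema. The schema shape and helper name are hypothetical simplifications for illustration; the real validation is performed by the CSV Parser DAG against a schema registered with the schema service, and real OSDU™ schemas are richer JSON documents.

```python
import csv
import io

def validate_header(csv_text: str, schema_fields: dict) -> list:
    """Compare a CSV header row against the fields declared in a schema.

    `schema_fields` maps column name -> declared type (a hypothetical,
    simplified shape). Returns a list of validation errors; an empty list
    means the header conforms to the schema.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    errors = []
    for column in header:
        if column not in schema_fields:
            errors.append(f"column '{column}' is not declared in the schema")
    for field in schema_fields:
        if field not in header:
            errors.append(f"schema field '{field}' is missing from the CSV header")
    return errors

# An illustrative well-record schema, reduced to name -> type.
schema = {"WellName": "string", "TotalDepth": "number", "SpudDate": "date"}

ok = validate_header("WellName,TotalDepth,SpudDate\nA-1,3200,2001-05-04\n", schema)
bad = validate_header("WellName,Depth\nA-1,3200\n", schema)
```

As described above, non-breaking errors would be logged and ingestion would proceed, while breaking mismatches would fail the run.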

## CSV Parser ingestion functionality

The CSV parser ingestion currently supports the following functionality as a one-step DAG:

- The CSV file is parsed as per the schema (one row in the CSV = one record ingested into the data platform).
- The CSV file contents are checked against the provided schema.
  - **Success**: validate the schema against the header of the CSV file and the values of the first n rows. Use the schema for all downstream tasks to build the metadata.
  - **Fail**: log the error(s) in the schema validation, and proceed with ingestion if the errors are non-breaking.
- All characters are converted to UTF-8; characters that can't be converted are handled or replaced gracefully.
- Each object in the data platform gets a unique data identity: CSV ingestion generates a unique identifier (ID) for each record by combining the source, the entity type, and a base64-encoded string formed by concatenating the natural key(s) in the data. If the schema used for CSV ingestion doesn't contain any natural keys, the storage service generates random IDs for every record.
- Values are typecast to JSON-supported data types:
  - **Number** – Typecast integers, doubles, floats, etc. as described in the schema to "number". Some common spatial formats, such as Degrees/Minutes/Seconds (DMS) or Easting/Northing, should be typecast to "string"; special handling of these string formats is done in the Spatial Data Handling Task.
  - **Date** – Typecast dates as described in the schema to a date, converting the date format to ISO8601TZ (for fully qualified dates). Some date fragments (such as years) can't be easily converted to this format and should be typecast to a number instead; textual date representations, for example "July", should be typecast to string.
  - **Others** – All other encountered attributes should be typecast as string.
- A batch of records is stored in the context of a particular ingestion job. Fragments/outputs from the previous steps are collected into a batch and formatted in a way that is compatible with the storage service, with the appropriate additional information such as ACLs, legal tags, etc.
- Frame of reference handling is supported:
  - **Unit** – Declared frame of reference information is converted into the appropriate persistable reference as per the unit service. This information is stored in the meta[] block.
  - **CRS** – The CRS frame of reference (FoR) information should be included in the schema of the data, including the source CRS (either geographic or projected); if projected, the CRS info and persistable reference (if provided in the schema) are stored in the meta[] block.
- Relationships are created as declared in the source schema.
- The status of ingested/failed records is published on GSM.
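
The ID rule above (source + entity type + base64-encoded natural keys) and the typecasting rules can be sketched as follows. The exact delimiters and ID layout used by the platform aren't documented here, so treat the format below as an illustrative assumption rather than the platform's actual encoding.

```python
import base64

def typecast(value: str, declared_type: str):
    """Typecast a CSV cell to a JSON-supported type per the schema.

    Anything that isn't a number falls back to string, mirroring the rules
    above. Date handling (ISO8601TZ conversion, year fragments as numbers,
    textual months as strings) is omitted for brevity.
    """
    if declared_type == "number":
        return float(value) if "." in value else int(value)
    return str(value)

def generate_record_id(source: str, entity_type: str, natural_keys: list) -> str:
    """Build an idempotent record ID from source, entity type, and natural keys.

    The same input row always yields the same ID, so rerunning a failed
    ingestion doesn't duplicate records. The ':'-separated layout is an
    illustrative assumption.
    """
    encoded = base64.b64encode(":".join(natural_keys).encode("utf-8")).decode("ascii")
    return f"{source}:{entity_type}:{encoded}"

row = {"WellName": "A-1", "TotalDepth": typecast("3200", "number")}
record_id = generate_record_id("mysource", "well", [row["WellName"]])
rerun_id = generate_record_id("mysource", "well", ["A-1"])  # identical on rerun
```

Because the ID is a pure function of the row's natural keys, re-submitting the same file after a partial failure overwrites the same records instead of creating new ones.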

## CSV parser ingestion components

* **File service** – Facilitates management of files on the data platform. Uploading, secure discovery, and downloading of files are capabilities provided by the file service.
* **Schema service** – Facilitates management of schemas on the data platform. Creating, fetching, and searching for schemas are capabilities provided by the schema service.
* **Storage service** – A JSON object store that facilitates storage of metadata information for domain entities. It also raises storage events when records are saved through the storage service.
* **Unit service** – Facilitates management and conversion of units.
* **Workflow service** – Facilitates management of workflows on the data platform. It wraps the workflow engine and abstracts many of its technical nuances from consumers.
* **Airflow engine** – The heart of the ingestion framework and the actual workflow orchestrator.
* **DAGs** – Workflows based on the Directed Acyclic Graph concept that are authored, orchestrated, managed, and monitored by the workflow engine.

## CSV ingestion components diagram

:::image type="content" source="media/concepts-csv-parser-ingestion/csv-ingestion-components-diagram.png" alt-text="Screenshot of the CSV ingestion components diagram.":::

## CSV ingestion sequence diagram

:::image type="content" source="media/concepts-csv-parser-ingestion/csv-ingestion-sequence-diagram.png" alt-text="Screenshot of the CSV ingestion sequence diagram." lightbox="media/concepts-csv-parser-ingestion/csv-ingestion-sequence-diagram-expanded.png":::

## CSV parser ingestion workflow

### Prerequisites

* To trigger the APIs, the user must have a valid authorization token and the following access:
  * Access to services: Search, Storage, Schema, File, Entitlement, Legal
  * Access to the Workflow service.
* The following service-level groups are required to register and execute a DAG using the workflow service:
  * "service.workflow.creator"
  * "service.workflow.viewer"
  * "service.workflow.admin"

### Steps to execute a DAG using the workflow service

* **Create a schema** – Define the kind of records that will be created as the outcome of the ingestion workflow. The schema is uploaded through the schema service and must be registered using the schema service.
* **Upload the file** – Use the file service to upload a file. The file service provides a signed URL, which enables customers to upload the data without credential requirements.
* **Create a metadata record for the file** – Use the file service to create metadata. The metadata enables discovery of the file and secure downloads. It also provides a mechanism to attach information associated with the file that is needed while the file is processed.
* The file ID created is provided to the CSV parser, which takes care of downloading the file, ingesting it, and ingesting the records with the help of the workflow service. Customers also need to register the workflow; the CSV parser DAG is already deployed in Airflow.
* **Trigger the workflow service** – To trigger the workflow, the customer provides the file ID, the kind of the file, and the data partition ID. Once the workflow is triggered, the customer gets a run ID.

The workflow service provides an API to monitor the status of each workflow run. Once the CSV parser run is completed, the data is ingested into the OSDU™ Data Platform and can be searched through the search service.
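
The trigger step above can be sketched as the request a client would assemble for the workflow service. The endpoint path, payload keys, and example values below follow the general shape of the OSDU™ workflow API but are illustrative assumptions; check the workflow service API reference for the exact contract.

```python
def build_trigger_request(workflow_name: str, file_id: str, kind: str,
                          data_partition_id: str) -> dict:
    """Assemble an illustrative 'trigger workflow run' request.

    A client would POST `body` to something like
    /api/workflow/v1/workflow/{workflow_name}/workflowRun with the
    data-partition-id header set; all names here are assumptions.
    """
    return {
        "method": "POST",
        "path": f"/api/workflow/v1/workflow/{workflow_name}/workflowRun",
        "headers": {"data-partition-id": data_partition_id},
        "body": {
            "executionContext": {
                "id": file_id,                        # file ID from the file service
                "dataPartitionId": data_partition_id,
                "kind": kind,                         # kind of the registered custom schema
            }
        },
    }

request = build_trigger_request(
    workflow_name="csv-parser",
    file_id="file-id-from-file-service",
    kind="my-partition:source:wellbore-csv:1.0.0",   # hypothetical custom-schema kind
    data_partition_id="my-partition",
)
# The response to such a call would contain a run ID, which can then be
# polled for status via the workflow service's run-status endpoint.
```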

OSDU™ is a trademark of The Open Group.

## Next steps

Advance to the CSV parser tutorial and learn how to perform a CSV parser ingestion.
> [!div class="nextstepaction"]
> [Tutorial: Sample steps to perform a CSV parser ingestion](tutorial-csv-ingestion.md)
Lines changed: 94 additions & 0 deletions
---
title: Domain data management services concepts #Required; page title is displayed in search results. Include the brand.
description: Learn how to use Domain Data Management Services #Required; article description that is displayed in search results.
author: marielgherz #Required; your GitHub user alias, with correct capitalization.
ms.author: marielherzog #Required; microsoft alias of author; optional team alias.
ms.service: energy-data-services #Required; service per approved list. slug assigned by ACOM.
ms.topic: conceptual #Required; leave this attribute/value as-is.
ms.date: 08/18/2022
ms.custom: template-concept #Required; leave this attribute/value as-is.
---

# Domain data management service concepts

A **Domain Data Management Service (DDMS)** is a platform component that extends the [OSDU™](https://osduforum.org) core data platform with domain-specific models and optimizations. A DDMS is a platform-extension mechanism that:

* delivers optimized handling of data for each (non-overlapping) "domain," where a domain is:
  * a single vertical discipline or business area, for example, Petrophysics, Geophysics, Seismic
  * a functional aspect of one or more vertical disciplines or business areas, for example, Earth Model
* delivers high-performance capabilities not supported by the generic OSDU™ APIs.
* can help extend the scope of OSDU™ to new business areas.
* may be developed in a distributed manner with separate resources/sponsors.

The OSDU™ Technical Standard defines the following OSDU™ application types:

| Application Type | Description |
| ---------------- | ----------- |
| OSDU™ Embedded Applications | An application developed and managed within the OSDU™ open-source community that is built on, and deployed as part of, the OSDU™ Data Platform distribution. |
| ISV Extension Applications | An application developed and managed in the marketplace that is NOT part of the OSDU™ Data Platform distributions, and that, when selected, is deployed within the OSDU™ Data Platform as an add-on. |
| ISV Third-Party Applications | An application developed and managed in the marketplace that integrates with the OSDU™ Data Platform and runs outside it. |

| Characteristics | Embedded | Extension | Third Party |
| --------------- | -------- | --------- | ----------- |
| Developed, managed, and deployed by | The OSDU™ Data Platform | ISV | ISV |
| Software license | Apache 2 | ISV | ISV |
| Mandatory as part of an OSDU™ distribution | Yes | No | No |
| Replaceable | Yes, with preservation of behavior | Yes | Yes |
| Architecture compliance | The OSDU™ Standard | The OSDU™ Standard | ISV |
| Examples | OS CRS <br /> Wellbore DDMS | ESRI CRS <br /> Petrel DS | Petrel |

[!INCLUDE [preview features callout](./includes/preview/preview-callout.md)]

## Who did we build this for?

**IT developers** build systems to connect data to domain applications (internal and external – for example, Petrel), which enables data managers to deliver projects to geoscientists. The DDMS suite on Microsoft Energy Data Services helps automate these workflows and eliminates time spent managing updates.

**Geoscientists** use domain applications for key Exploration and Production (E&P) workflows such as seismic interpretation and well tie analysis. While these users won't directly interact with the DDMS, their expectations for data performance and accessibility drive the requirements for the DDMS in the Foundation Tier. Azure enables geoscientists to stream cross-domain data instantly in OSDU&trade; compatible applications (for example, Petrel) connected to Microsoft Energy Data Services.

**Data managers** spend a significant amount of time fulfilling requests for data retrieval and delivery. The Seismic, Wellbore, and Petrel Data Services enable them to discover and manage data in one place while tracking version changes as derivatives are created.

## Platform landscape

Microsoft Energy Data Services is an OSDU&trade; compatible product, which means that its landscape and release model depend on OSDU&trade;.

The OSDU&trade; certification and release processes aren't fully defined yet; they should be defined as part of the Microsoft Energy Data Services Foundation architecture.

OSDU&trade; R3 M8, as the latest stable, tested version of the platform, is the base for the scope of the Microsoft Energy Data Services Foundation Private Preview.

## Learn more: OSDU&trade; DDMS community principles

The [OSDU&trade; community DDMS overview](https://community.opengroup.org/osdu/documentation/-/wikis/OSDU&trade;-(C)/Design-and-Implementation/Domain-&-Data-Management-Services#ddms-requirements) provides an extensive overview of the motivation behind DDMS and of the community requirements from a user, technical, and business perspective. These principles extend to Microsoft Energy Data Services.

## DDMS requirements

A DDMS meets the following requirements, classified into capability, architectural, operational, and openness/extensibility requirements:

| **#** | **Description** | **Business rationale** | **Principle** |
| ----- | --------------- | ---------------------- | ------------- |
| 1 | Data can be ingested with low friction | Need to seamlessly integrate with systems of record, starting with the industry standards | Capability |
| 2 | New data is available in workflows with minimal latency | Deliver new data in the context of the end-user workflow – seamlessly and fast | Capability |
| 3 | Domain data and services are highly usable | The business anticipates a large set of use cases where domain data is used in various workflows; consumption must be simple and efficient | Capability |
| 4 | Scalable performance for E&P workflows | E&P data has specific access requirements, well beyond standard cloud storage; scalable E&P data requires E&P workflow experience and insights | Capability |
| 5 | Data is available for visual analytics and discovery (Viz/BI) | Deliver a minimum set of visualization capabilities on the data | Capability |
| 6 | One source of truth for data | Drive towards reduction of duplication | Capability |
| 7 | Data is secured, and access is governed | Securely stored and managed | Architectural |
| 8 | All data is preserved and immutable | Ability to associate data with milestones and have data/workflows traceable across the ecosystem | Architectural |
| 9 | Data is globally identifiable | No risk of overwriting or creating non-unique relationships between data and activities | Architectural |
| 10 | Data lineage is tracked | Required for auditability, re-creation of the workflow, and learning from work previously done | Architectural |
| 11 | Data is discoverable | Possible to find and consume ingested data | Architectural |
| 12 | Provisioning | Efficient provisioning of the DDMS and automatic integration with the data ecosystem | Operational |
| 13 | Business continuity | Deliver on industry expectations for business continuity (RPO, RTO, SLA) | Operational |
| 14 | Cost | Cost-efficient delivery of data | Operational |
| 15 | Auditability | Deliver the forensics required to support cybersecurity incident investigations | Operational |
| 16 | Accessibility | Deliver technology | Operational |
| 17 | Domain-Centric Data APIs | | Openness and Extensibility |
| 18 | Workflow composability and customizations | | Openness and Extensibility |
| 19 | Data-Centric Extensibility | | Openness and Extensibility |

OSDU&trade; is a trademark of The Open Group.

## Next steps

Advance to the Seismic DDMS sdutil tutorial to learn how to use sdutil to load seismic data into the seismic store.
> [!div class="nextstepaction"]
> [Tutorial: Seismic store sdutil](tutorial-seismic-ddms-sdutil.md)
