- Start Date: 2024-11-12
- RFC PR: https://github.com/datahub-project/rfcs/pull/9
- Implementation PR(s):

# Import/Export Feature

## Summary

This feature adds the ability to export datasets to CSV files and to import them back into DataHub from those CSV files, all through the UI. Code for this feature has already been implemented, though further work may be needed. This RFC details the implementation in its current state.

## Motivation

> **Reviewer comment:** A section on the User Journey and Motivations would be useful here.
>
> e.g. Why is the user exporting to CSV? Do they intend to make some changes to metadata in bulk and then import it back?
>
> Are there scenarios where users are trying to import CSVs containing metadata that has been hand-written or sourced from non-DataHub catalogs? In those scenarios, how will users provide the "urn" field, which represents the main identity of the dataset on DataHub?


This feature was developed with the intention of mimicking the import/export functionality present in Collibra. It can be used for moving datasets between instances of DataHub, which may be useful for enterprise-level users. Though it is not a strictly necessary feature, the DataHub team has expressed interest in adding it to the DataHub project.

## Requirements

This feature as it is currently implemented is only intended to support:
- Export to CSV of individual datasets.

  > **@jayacryl (Mar 13, 2025):** and their schemas!
  >
  > **Author:** In the current implementation, schemas are not present within the CSV files. During schema-level export, only the datasets within a given schema are exported.

- Export to CSV of all datasets within a container.

  > **@jayacryl (Mar 13, 2025):** Or we could simply say: all datasets that match a search predicate?
  >
  > **Author:** The feature as it is currently implemented is not designed to support that.

- Import from CSV of previously exported data.

## Non-Requirements

This feature is not intended to add a REST API for import/export like that of Collibra. It is only intended for use through the UI. Additionally, we do not intend for this feature to be used to import datasets built from scratch. The feature is only intended to import CSV files that have been previously exported from DataHub.

## Detailed design

This feature will add three new options to the existing `SearchExtendedMenu` dropdown, as can be seen in figure 1. The first option exports all datasets within a container, the second exports individual datasets, and the third is used to import previously exported data into DataHub. The export options create CSV files from data existing in DataHub, while the import option adds new data to DataHub from CSV files.

| ![Figure 1: Search extended menu](search_extended_menu.png "Figure 1") |
|:--:|
| *Figure 1: Search extended menu* |

Below is a list of the column names used in the CSV files for this feature. Within the CSV files, each row describes an individual dataset or schema field.

``` csv
resource,asset_type,subresource,glossary_terms,tags,owners,ownership_type,description,domain
```

Here is information on how these CSV columns are used, and how the data stored within them is formatted:

- `resource`: The URN of the dataset. For schema field rows, this is the URN of the dataset that contains the schema field.
- `asset_type`: The type of asset described by the row, either a dataset or a schema field.
- `subresource`: The name of the schema field. This is unused by rows containing datasets.
- `glossary_terms`: A semicolon-separated list of glossary term URNs. This column is currently unused, but is planned to be used by both dataset and schema field rows.
- `tags`: A semicolon-separated list of tag URNs. This column is currently unused, but is planned to be used by both dataset and schema field rows.
- `owners`: A semicolon-separated list of owner URNs. Currently, this is populated on export, but unused on import.
- `ownership_type`: A list of mappings from owner URN to ownership type. Currently, this column is unused, and its format has yet to be determined.
- `description`: The description of a given asset. This is used by both dataset and schema field rows.
- `domain`: The URN of a domain associated with the dataset. This is unused by rows containing schema fields.
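
For illustration, a hypothetical dataset row and schema field row might look like the following. The URNs and values are made up for this example, and the exact strings used in the `asset_type` column are an assumption rather than taken from the implementation:

``` csv
"urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.orders,PROD)",dataset,,,,urn:li:corpuser:jdoe,,Orders fact table,urn:li:domain:sales
"urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.orders,PROD)",schema_field,order_id,,,,,Primary key of the order,
```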

### Export

Within the `SearchExtendedMenu` dropdown, the container-level export option is only available when a container is being viewed. At all other times, it is grayed out and cannot be pressed. This is done using a React effect, which disables the button unless the URL of the current page contains the word "container".
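
The following is a minimal sketch of that check, assuming a hypothetical hook name; the real logic lives in `SearchExtendedMenu` and may be structured differently.

``` typescript
// Hypothetical hook: enable container-level export only on container pages.
import { useEffect, useState } from 'react';
import { useLocation } from 'react-router-dom';

export function useContainerExportEnabled(): boolean {
    const location = useLocation();
    const [enabled, setEnabled] = useState(false);

    useEffect(() => {
        // Enable the option only when the current URL contains the word "container".
        setEnabled(location.pathname.includes('container'));
    }, [location.pathname]);

    return enabled;
}
```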

> **Reviewer comment:** A screenshot here would be great.
>
> **Reviewer comment:** Here's the dataset one!
>
> **Author:** I'll be updating the RFC with screenshots from the current implementation we have.


When either export option is selected, a modal opens that prompts the user to enter the name of the CSV file to be created (see figures 2 and 3). For dataset-level export, the user is also prompted to enter the data source, database, schema, and table name of the dataset to be exported. Notably, these fields assume a specific number of containers to be present, which may not be the case for every data source; as such, this modal may need to be altered. This is what the fields presently refer to (a sketch of how a dataset URN might be assembled from these fields follows the list):
- Data source: The name of the data platform containing the dataset.
- Database: A container representing a database within the data source.
- Schema: A container representing a schema within the source database.
- Table name: The name of the dataset.
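
As a rough illustration, a dataset URN could be assembled from the modal fields as sketched below. The `database.schema.table` naming convention and the `PROD` environment are assumptions based on the two-container layout the modal currently expects; the real code may differ.

``` typescript
// Hypothetical helper: build a dataset URN from the dataset-level export modal fields.
function buildDatasetUrn(dataSource: string, database: string, schema: string, tableName: string): string {
    const platformUrn = `urn:li:dataPlatform:${dataSource.toLowerCase()}`;
    const datasetName = `${database}.${schema}.${tableName}`;
    return `urn:li:dataset:(${platformUrn},${datasetName},PROD)`;
}

// buildDatasetUrn('snowflake', 'analytics', 'public', 'orders')
// -> 'urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.orders,PROD)'
```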

| ![Figure 2: Dataset download modal](download_dataset_modal.png "Figure 2") |
|:--:|
| *Figure 2: Dataset download modal* |

| ![Figure 3: Schema download modal](download_schema_modal.png "Figure 3") |
|:--:|
| *Figure 3: Schema download modal* |

Upon entry, the following steps occur:

1. The modal is made invisible, but continues executing code for the export process. A notification is created to inform the user that the export process is ongoing (see figure 4).
2. The URN of the dataset or container is determined, by either:
- Pulling from [`EntityContext`](https://github.com/datahub-project/datahub/blob/master/datahub-web-react/src/app/entity/shared/EntityContext.ts) in the case of container-level export.
- Manually constructing the URN from data entered into the modal in the case of dataset-level export.
3. The modal uses the URN as input into GraphQL queries, which are used to fetch the metadata for the datasets to be exported.
- Container-level export will first execute a GraphQL query to determine how many datasets are present in the container. If no datasets are present, execution will end early, and a notification is sent to the user informing them of such. Datasets are not searched for recursively.
- Additionally, container-level export will only fetch 50 datasets per GraphQL execution. If more than 50 datasets are present in the container, this query will be executed multiple times, with each execution producing and downloading separate CSV files.
4. The metadata returned from the GraphQL query is transformed into a CSV-compatible JSON object using a shared function, `convertToCSVRows`. Each row in this JSON object contains the columns described in the prior section.
5. The existing `downloadRowsAsCsv` function in [`csvUtils`](https://github.com/datahub-project/datahub/blob/master/datahub-web-react/src/app/search/utils/csvUtils.ts) is used to create the download (a condensed sketch of this flow follows the list).
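
Below is a condensed sketch of the container-level export loop described in the steps above. `getDatasetsInContainer`, `notifyUser`, and the per-page file naming are assumptions made for illustration; `convertToCSVRows` and `downloadRowsAsCsv` are the shared helpers named in this RFC, but their exact signatures are assumed here.

``` typescript
// Stand-ins for the GraphQL query and the shared helpers; signatures are assumed.
declare function getDatasetsInContainer(urn: string, start: number, count: number): Promise<{ total: number; searchResults: unknown[] }>;
declare function convertToCSVRows(searchResults: unknown[]): string[][];
declare function downloadRowsAsCsv(rows: string[][], fileName: string): void;
declare function notifyUser(message: string): void;

const PAGE_SIZE = 50; // container-level export fetches at most 50 datasets per GraphQL execution

async function exportContainer(containerUrn: string, fileName: string): Promise<void> {
    let start = 0;
    let page = 0;
    let total = Number.POSITIVE_INFINITY;

    while (start < total) {
        const result = await getDatasetsInContainer(containerUrn, start, PAGE_SIZE);
        total = result.total;
        if (total === 0) {
            // No datasets in the container: end early and notify the user.
            notifyUser('No datasets were found in this container.');
            return;
        }
        // Each page of results is converted and downloaded as a separate CSV file
        // (the file-naming scheme here is an assumption).
        const rows = convertToCSVRows(result.searchResults);
        downloadRowsAsCsv(rows, `${fileName}_${page}.csv`);
        start += PAGE_SIZE;
        page += 1;
    }
}
```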

| ![Figure 4: download notification](downloading_schema.png "Figure 4") |
|:--:|
| *Figure 4: Download notification* |

#### GraphQL queries

These GraphQL queries are used for container-level export and dataset-level export, respectively:

``` graphql
query getDatasetByUrn($urn: String!, $start: Int!, $count: Int!) {
search(input: { type: DATASET, query: "*", orFilters: [{and: [{field: "container", values: [$urn]}]}], start: $start, count: $count }) {
start
count
total
searchResults {
entity {
... on Dataset {
urn
type
name
platform {
urn
}
domain {
associatedUrn
domain {
urn
type
}
}
properties {
name
description
}

editableProperties {
description
}

ownership {
owners {
owner {
... on Entity {
urn
}
}
ownershipType {
urn
}
}
}
tags {
tags {
associatedUrn
}
}
glossaryTerms {
terms {
associatedUrn
}
}
editableSchemaMetadata {
editableSchemaFieldInfo {
description
fieldPath
tags {
tags {
associatedUrn
}
}
}
}
schemaMetadata {
name
fields {
description
type
fieldPath
nativeDataType
tags {
tags {
tag {
name
}
}
}
}
}
}
}
}
}
}

query getTable($urn: String!, $start: Int!, $count: Int!) {
search(
input: {
type: DATASET
query: "*"
start: $start
count: $count
orFilters: { and: [{ field: "urn", values: [$urn], condition: EQUAL }] }
}
) {
start
count
total
searchResults {
entity {
... on Dataset {
urn
type
name
platform {
urn
}
domain {
associatedUrn
domain {
urn
type
}
}
properties {
name
description
}

editableProperties {
description
}

ownership {
owners {
owner {
... on Entity {
urn
}
}
ownershipType {
urn
}
}
}
tags {
tags {
associatedUrn
}
}
glossaryTerms {
terms {
associatedUrn
}
}
editableSchemaMetadata {
editableSchemaFieldInfo {
description
fieldPath
tags {
tags {
associatedUrn
}
}
}
}
schemaMetadata {
name
fields {
description
type
fieldPath
nativeDataType
tags {
tags {
tag {
name
}
}
}
}
}
}
}
}
}
}
```

### Import

In the case of import, the button first opens a prompt to upload a file, using the following snippet of code.

``` jsx
<input id="file" type="file" onChange={changeHandler} style={{ opacity: 0 }} />
```
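
A hypothetical sketch of what `changeHandler` might do follows: pull the selected file out of the change event and hand it off to the CSV parsing step described below. `importCsvFile` is an assumed helper; a sketch of it appears at the end of this section.

``` typescript
import type { ChangeEvent } from 'react';

// Assumed helper; see the import sketch at the end of this section.
declare function importCsvFile(file: File): void;

const changeHandler = (event: ChangeEvent<HTMLInputElement>): void => {
    const file = event.target.files?.[0];
    if (!file) {
        return; // the user cancelled the file picker
    }
    importCsvFile(file);
};
```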

After the user has chosen a file for upload, a notification is shown to inform the user that the upload is in progress, as can be seen in figure 5.

| ![Figure 5: import notifications](import_notification.png "Figure 5") |
|:--:|
| *Figure 5: Import notifications* |

The `papaparse` library is used to parse the CSV file and iterate over each row present within it. The data is then fed into GraphQL mutations to create datasets. Notably, a new GraphQL mutation had to be created to allow the upserting of schema metadata. Here is the specification for that new mutation:

> **Reviewer comment:** What sort of scale do we want to advertise for this feature? How much have we tested up to? Should the UI reject the import if there are too many datasets? How should progress be displayed to the user?
>
> **@jayacryl (Mar 13, 2025):** I had previously tested a prototype that broke requests down even at the row level, i.e., one request for glossary terms, one request per schema column changed, etc. Furthermore, it cached key fields of the exported file locally. This allowed for only submitting the diffs to GraphQL.
>
> This approach helped us quickly identify if a specific cell in the CSV failed to apply, while still succeeding with all the rest. This was easily presented in a final upload report at the end. It seemed to work well with up to 100k items changed.
>
> **Author:** At present, the implementation we've written does not display a progress bar, nor does it reject the user if too many datasets are present for import or export. However, importing a CSV file can take a great deal of time if there are many datasets present in the file. I believe we've tested up to 36 datasets at once.


``` graphql
type Mutation {
upsertDataset(urn: String!, input: DatasetUpsertInput!): Dataset
}

input DatasetUpsertInput {
name: String!
description: String
schemaMetadata: SchemaMetadataInput!
globalTagUrns: [String],
domainUrn: String
}

input SchemaMetadataInput {
version: Int!,
schemaName: String!,
platformUrn: String!,
fields: [SchemaFieldInput]!
}

input SchemaFieldInput {
fieldPath: String!
type: SchemaFieldDataType!
nativeDataType: String!
globalTagUrns: [String]
description: String
}
```
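
For illustration, invoking the new mutation for a single dataset might look like the following; the URN, field values, and the response selection are made-up examples rather than taken from the implementation:

``` graphql
mutation {
  upsertDataset(
    urn: "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.orders,PROD)"
    input: {
      name: "analytics.public.orders"
      description: "Orders fact table"
      schemaMetadata: {
        version: 0
        schemaName: ""
        platformUrn: "urn:li:dataPlatform:snowflake"
        fields: [
          {
            fieldPath: "order_id"
            type: NULL
            nativeDataType: "Unknown Type"
            description: "Primary key of the order"
          }
        ]
      }
    }
  ) {
    urn
  }
}
```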

Alongside this new GraphQL mutation, the necessary mappers and resolvers have been added to `datahub-graphql-core` to properly send the input to GMS. Notably, several fields required by the GraphQL mutation are not present in the CSV schema. Such fields are filled with these values on import:

- `name`: The name is extracted from the dataset URN stored in the `resource` CSV field.
- `version`: 0
- `schemaName`: An empty string.
- `platformUrn`: The platform URN is extracted from the dataset URN stored in the `resource` CSV field.
- `type`: `SchemaFieldDataType.Null`
- `nativeDataType`: "Unknown Type"

Presently, the `glossary_terms`, `tags`, `owners`, and `ownership_type` CSV fields are unused in the import of datasets.
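
The sketch below summarizes the import flow in simplified form. Only the use of `papaparse`, the CSV column names, and the default values listed above come from this RFC; the helper names (`importCsvFile`, `upsertDatasetMutation`), the URN parsing, and the omission of schema-field handling are assumptions made for illustration.

``` typescript
import Papa from 'papaparse';

interface CsvRow {
    resource: string;   // dataset URN
    asset_type: string; // dataset or schema-field marker (exact values are an assumption)
    subresource: string;
    description: string;
    domain: string;
}

// Stand-in for the generated GraphQL mutation call; signature is assumed.
declare function upsertDatasetMutation(urn: string, input: Record<string, unknown>): Promise<void>;

export function importCsvFile(file: File): void {
    Papa.parse<CsvRow>(file, {
        header: true,
        skipEmptyLines: true,
        complete: async (results) => {
            for (const row of results.data) {
                if (row.asset_type !== 'dataset') continue; // schema-field rows omitted in this sketch
                // Assumes the standard URN shape urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>).
                const inner = row.resource.slice(row.resource.indexOf('(') + 1, -1);
                const [platformUrn, name] = inner.split(',');
                await upsertDatasetMutation(row.resource, {
                    name,                             // extracted from the dataset URN
                    description: row.description || undefined,
                    domainUrn: row.domain || undefined,
                    schemaMetadata: {
                        version: 0,                   // default: 0
                        schemaName: '',               // default: empty string
                        platformUrn,                  // extracted from the dataset URN
                        fields: [],
                    },
                });
            }
        },
    });
}
```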

## How we teach this

A user guide should be written on how to use the feature. In particular, we will need to highlight:
- Under what circumstances the container-level export button becomes available.
- How to fill out the form presented in the dataset-level export modal.

This feature would be best presented as a wholly new functionality. Though it is presently possible to download search results as CSV, the format of the resulting CSV files differs significantly from that of this import/export feature. As such, the files cannot be used as import input.

The GraphQL documentation may also need to be updated to include the new mutation added alongside this feature, should the DataHub team decide to make the mutation available via the UI.

## Drawbacks and Alternatives

As mentioned before, this feature is only intended for use within the UI. As the code is currently written, it would not be possible to extend the import and export functionality to a different API (e.g., REST), as all of the code is written in React.

> **@jayacryl (Mar 13, 2025):** We could convert it to Python without too much effort, given all the core logic is in .ts, with no JSX stuff involved.
>
> **Author:** Yes, we could do such a thing, but most of the existing code we've written would need to be scrapped and rewritten in another language (Python, in this case).


A possible alternative would be to move the code that performs the CSV data mutations into the Metadata Service or the Frontend Server, so that it is accessible throughout the DataHub stack through a REST API or similar. This does not come without drawbacks, however: we would need to rewrite the existing code entirely, and we would be introducing additional complexity through the new API endpoints.

It is also notable that, because the format of these CSV files differs significantly from the files produced by the existing search-result download, CSV files from that existing feature cannot be used to import datasets. This may cause confusion among users and may be worth remediating.

Separately, an ingestion source already exists for DataHub that adds very similar functionality ([link](https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/csv/)). This has not been investigated in detail, but if it duplicates this feature, the feature may not be worth integrating into DataHub.

> **@jayacryl (Mar 13, 2025):** I believe folks usually use this as a one-off rather than on a loop, so we could sunset the source once there is a mechanism for it in the UI. And especially if we copy the React to Python as mentioned above, then developers can leverage the Python tool to do this programmatically.


## Rollout / Adoption Strategy

This feature does not change or break any existing functionality, and therefore no specific migration strategy is required.

## Future Work

Out of the required GraphQL fields that are presently missing on import, the `type` and `nativeDataType` fields are the only ones that have a noticeable visual impact within DataHub when absent. Because of this, it is an absolute necessity that we add corresponding columns to the CSV schema so that we can populate those fields. Notably, if we add a `type` column, it could also be used to store the sub-types of datasets.

Additionally, the `glossary_terms`, `tags`, and `ownership_type` CSV columns are presently unused. It would be fairly simple to fill those columns in, however, as we are already fetching the necessary information in the search results of our GraphQL query. At the same time, the import code should be updated to make use of the `owners` column. To upsert this data to DataHub, either the `upsertDataset` mutation would need to be updated to handle the new information, or additional existing GraphQL mutations would need to be invoked during import.

As discussed in further detail below, the dataset-level export will also need to be refactored to be more flexible, as it was designed to work only with data sources that have two layers of containers in DataHub. It remains to be decided how it should be redesigned.

## Unresolved questions

It's notable that the dataset-level export component of this feature was designed specifically for data sources with two layers of containers in DataHub. This is unlikely to always be the case, and as such, this component will likely need to be refactored to be more flexible. We will need to determine what shape the component should take before performing this refactoring.

Additionally, this feature would add the ability to create datasets through GraphQL as a side effect. It will need to be evaluated whether this is an acceptable outcome and, if it is, whether it should be made accessible through the GraphiQL interface.

> **Reviewer comment:** I wonder what the implications of creating and deleting datasets via GraphQL would be for the overall DataHub platform?
