Commit afd65e1

feat(cli): delete cli v2 (#8068)
1 parent 3c0d720 commit afd65e1

File tree

17 files changed: +1049 additions, -564 deletions

docker/README.md

Lines changed: 2 additions & 2 deletions

````diff
@@ -25,8 +25,8 @@ DataHub Docker Images:
 
 Do not use `latest` or `debug` tags for any of the image as those are not supported and present only due to legacy reasons. Please use `head` or tags specific for versions like `v0.8.40`. For production we recommend using version specific tags not `head`.
 
-* [linkedin/datahub-ingestion](https://hub.docker.com/r/linkedin/datahub-ingestion/) - This contains the Python CLI. If you are looking for docker image for every minor CLI release you can find them under [acryldata/datahub-ingestion](https://hub.docker.com/r/acryldata/datahub-ingestion/).
-* [linkedin/datahub-gms](https://hub.docker.com/repository/docker/linkedin/datahub-gms/).
+* [acryldata/datahub-ingestion](https://hub.docker.com/r/acryldata/datahub-ingestion/)
+* [linkedin/datahub-gms](https://hub.docker.com/repository/docker/linkedin/datahub-gms/)
 * [linkedin/datahub-frontend-react](https://hub.docker.com/repository/docker/linkedin/datahub-frontend-react/)
 * [linkedin/datahub-mae-consumer](https://hub.docker.com/repository/docker/linkedin/datahub-mae-consumer/)
 * [linkedin/datahub-mce-consumer](https://hub.docker.com/repository/docker/linkedin/datahub-mce-consumer/)
````

docs/cli.md

Lines changed: 5 additions & 10 deletions

````diff
@@ -138,14 +138,9 @@ The `check` command allows you to check if all plugins are loaded correctly as w
 
 ### delete
 
-The `delete` command allows you to delete metadata from DataHub. Read this [guide](./how/delete-metadata.md) to understand how you can delete metadata from DataHub.
-:::info
-Deleting metadata using DataHub's CLI and GraphQL API is a simple, systems-level action. If you attempt to delete an Entity with children, such as a Container, it will not automatically delete the children, you will instead need to delete each child by URN in addition to deleting the parent.
-:::
+The `delete` command allows you to delete metadata from DataHub.
 
-```console
-datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)" --soft
-```
+The [metadata deletion guide](./how/delete-metadata.md) covers the various options for the delete command.
 
 ### exists
 
@@ -534,11 +529,11 @@ Old Entities Migrated = {'urn:li:dataset:(urn:li:dataPlatform:hive,logging_event
 
 ### Using docker
 
-[![Docker Hub](https://img.shields.io/docker/pulls/linkedin/datahub-ingestion?style=plastic)](https://hub.docker.com/r/linkedin/datahub-ingestion)
-[![datahub-ingestion docker](https://github.com/datahub-project/datahub/actions/workflows/docker-ingestion.yml/badge.svg)](https://github.com/datahub-project/datahub/actions/workflows/docker-ingestion.yml)
+[![Docker Hub](https://img.shields.io/docker/pulls/acryldata/datahub-ingestion?style=plastic)](https://hub.docker.com/r/acryldata/datahub-ingestion)
+[![datahub-ingestion docker](https://github.com/acryldata/datahub/actions/workflows/docker-ingestion.yml/badge.svg)](https://github.com/acryldata/datahub/actions/workflows/docker-ingestion.yml)
 
 If you don't want to install locally, you can alternatively run metadata ingestion within a Docker container.
-We have prebuilt images available on [Docker hub](https://hub.docker.com/r/linkedin/datahub-ingestion). All plugins will be installed and enabled automatically.
+We have prebuilt images available on [Docker hub](https://hub.docker.com/r/acryldata/datahub-ingestion). All plugins will be installed and enabled automatically.
 
 You can use the `datahub-ingestion` docker image as explained in [Docker Images](../docker/README.md). In case you are using Kubernetes you can start a pod with the `datahub-ingestion` docker image, log onto a shell on the pod and you should have the access to datahub CLI in your kubernetes cluster.
````

docs/how/delete-metadata.md

Lines changed: 160 additions & 55 deletions

````diff
@@ -1,102 +1,207 @@
 # Removing Metadata from DataHub
 
+:::tip
+To follow this guide, you'll need the [DataHub CLI](../cli.md).
+:::
+
 There are a two ways to delete metadata from DataHub:
 
-1. Delete metadata attached to entities by providing a specific urn or filters that identify a set of entities
-2. Delete metadata created by a single ingestion run
+1. Delete metadata attached to entities by providing a specific urn or filters that identify a set of urns (delete CLI).
+2. Delete metadata created by a single ingestion run (rollback).
 
-To follow this guide you need to use [DataHub CLI](../cli.md).
+:::caution Be careful when deleting metadata
 
-Read on to find out how to perform these kinds of deletes.
+- Always use `--dry-run` to test your delete command before executing it.
+- Prefer reversible soft deletes (`--soft`) over irreversible hard deletes (`--hard`).
 
-_Note: Deleting metadata should only be done with care. Always use `--dry-run` to understand what will be deleted before proceeding. Prefer soft-deletes (`--soft`) unless you really want to nuke metadata rows. Hard deletes will actually delete rows in the primary store and recovering them will require using backups of the primary metadata store. Make sure you understand the implications of issuing soft-deletes versus hard-deletes before proceeding._
+:::
 
+## Delete CLI Usage
 
 :::info
-Deleting metadata using DataHub's CLI and GraphQL API is a simple, systems-level action. If you attempt to delete an Entity with children, such as a Domain, it will not delete those children, you will instead need to delete each child by URN in addition to deleting the parent.
+
+Deleting metadata using DataHub's CLI is a simple, systems-level action. If you attempt to delete an entity with children, such as a container, it will not delete those children. Instead, you will need to delete each child by URN in addition to deleting the parent.
+
 :::
-## Delete By Urn
 
-To delete all the data related to a single entity, run
+All the commands below support the following options:
 
-### Soft Delete (the default)
+- `-n/--dry-run`: Execute a dry run instead of the actual delete.
+- `--force`: Skip confirmation prompts.
 
-This sets the `Status` aspect of the entity to `Removed`, which hides the entity and all its aspects from being returned by the UI.
-```
+### Selecting entities to delete
+
+You can either provide a single urn to delete, or use filters to select a set of entities to delete.
+
+```shell
+# Soft delete a single urn.
 datahub delete --urn "<my urn>"
+
+# Soft delete using a filter.
+datahub delete --platform snowflake
+
+# Filters can be combined, which will select entities that match all filters.
+datahub delete --platform looker --entity-type chart
+datahub delete --platform bigquery --env PROD
 ```
-or
-```
-datahub delete --urn "<my urn>" --soft
-```
 
-### Hard Delete
+When performing hard deletes, you can optionally add the `--only-soft-deleted` flag to only hard delete entities that were previously soft deleted.
+
+### Performing the delete
+
+#### Soft delete an entity (default)
+
+By default, the delete command will perform a soft delete.
 
-This physically deletes all rows for all aspects of the entity. This action cannot be undone, so execute this only after you are sure you want to delete all data associated with this entity.
+This will set the `status` aspect's `removed` field to `true`, which will hide the entity from the UI. However, you'll still be able to view the entity's metadata in the UI with a direct link.
 
+```shell
+# The `--soft` flag is redundant since it's the default.
+datahub delete --urn "<urn>" --soft
+# or using a filter
+datahub delete --platform snowflake --soft
 ```
+
+#### Hard delete an entity
+
+This will physically delete all rows for all aspects of the entity. This action cannot be undone, so execute this only after you are sure you want to delete all data associated with this entity.
+
+```shell
 datahub delete --urn "<my urn>" --hard
+# or using a filter
+datahub delete --platform snowflake --hard
 ```
 
-As of datahub v0.8.35 doing a hard delete by urn will also provide you with a way to remove references to the urn being deleted across the metadata graph. This is important to use if you don't want to have ghost references in your metadata model and want to save space in the graph database.
-For now, this behaviour must be opted into by a prompt that will appear for you to manually accept or deny.
+As of datahub v0.10.2.3, hard deleting tags, glossary terms, users, and groups will also remove references to those entities across the metadata graph.
 
-You can optionally add `-n` or `--dry-run` to execute a dry run before issuing the final delete command.
-You can optionally add `-f` or `--force` to skip confirmations
-You can optionally add `--only-soft-deleted` flag to remove soft-deleted items only.
+#### Hard delete a timeseries aspect
 
-:::note
+It's also possible to delete a range of timeseries aspect data for an entity without deleting the entire entity.
 
-Make sure you surround your urn with quotes! If you do not include the quotes, your terminal may misinterpret the command._
+For these deletes, the aspect and time ranges are required. You can delete all data for a timeseries aspect by providing `--start-time min --end-time max`.
 
-:::
+```shell
+datahub delete --urn "<my urn>" --aspect <aspect name> --start-time '-30 days' --end-time '-7 days'
+# or using a filter
+datahub delete --platform snowflake --entity-type dataset --aspect datasetProfile --start-time '0' --end-time '2023-01-01'
+```
 
-If you wish to hard-delete using a curl request you can use something like below. Replace the URN with the URN that you wish to delete
+The start and end time fields filter on the `timestampMillis` field of the timeseries aspect. Allowed start and end times formats:
 
-```
-curl "http://localhost:8080/entities?action=delete" -X POST --data '{"urn": "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"}'
-```
+- `YYYY-MM-DD`: a specific date
+- `YYYY-MM-DD HH:mm:ss`: a specific timestamp, assumed to be in UTC unless otherwise specified
+- `+/-<number> <unit>` (e.g. `-7 days`): a relative time, where `<number>` is an integer and `<unit>` is one of `days`, `hours`, `minutes`, `seconds`
+- `ddddddddd` (e.g. `1684384045`): a unix timestamp
+- `min`, `max`, `now`: special keywords
 
-## Delete by filters
+## Delete CLI Examples
 
-_Note: All these commands below support the soft-delete option (`-s/--soft`) as well as the dry-run option (`-n/--dry-run`).
+:::note
 
+Make sure you surround your urn with quotes! If you do not include the quotes, your terminal may misinterpret the command.
 
-### Delete all Datasets from the Snowflake platform
+:::
+
+_Note: All of the commands below support `--dry-run` and `--force` (skips confirmation prompts)._
+
+#### Soft delete a single entity
+
+```shell
+datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"
 ```
-datahub delete --entity_type dataset --platform snowflake
+
+#### Hard delete a single entity
+
+```shell
+datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)" --hard
 ```
 
-### Delete all containers for a particular platform
+#### Delete everything from the Snowflake DEV environment
+
+```shell
+datahub delete --platform snowflake --env DEV
 ```
-datahub delete --entity_type container --platform s3
+
+#### Delete all BigQuery datasets in the PROD environment
+
+```shell
+# Note: this will leave BigQuery containers intact.
+datahub delete --env PROD --entity-type dataset --platform bigquery
 ```
 
-### Delete all datasets in the DEV environment
+#### Delete all pipelines and tasks from Airflow
+
+```shell
+datahub delete --platform "airflow"
 ```
-datahub delete --env DEV --entity_type dataset
+
+#### Delete all containers for a particular platform
+
+```shell
+datahub delete --entity-type container --platform s3
 ```
 
-### Delete all Pipelines and Tasks in the DEV environment
+#### Delete everything in the DEV environment
+
+```shell
+# This is a pretty broad filter, so make sure you know what you're doing!
+datahub delete --env DEV
 ```
-datahub delete --env DEV --entity_type "dataJob"
-datahub delete --env DEV --entity_type "dataFlow"
+
+#### Delete all Looker dashboards and charts
+
+```shell
+datahub delete --platform looker
 ```
 
-### Delete all bigquery datasets in the PROD environment
+#### Delete all Looker charts (but not dashboards)
+
+```shell
+datahub delete --platform looker --entity-type chart
 ```
-datahub delete --env PROD --entity_type dataset --platform bigquery
+
+#### Clean up old datasetProfiles
+
+```shell
+datahub delete --entity-type dataset --aspect datasetProfile --start-time 'min' --end-time '-60 days'
 ```
 
-### Delete all looker dashboards and charts
+#### Delete a tag
+
+```shell
+# Soft delete.
+datahub delete --urn 'urn:li:tag:Legacy' --soft
+
+# Or, using a hard delete. This will automatically clean up all tag associations.
+datahub delete --urn 'urn:li:tag:Legacy' --hard
 ```
-datahub delete --entity_type dashboard --platform looker
-datahub delete --entity_type chart --platform looker
+
+#### Delete all datasets that match a query
+
+```shell
+# Note: the query is an advanced feature, but can sometimes select extra entities - use it with caution!
+datahub delete --entity-type dataset --query "_tmp"
 ```
 
-### Delete all datasets that match a query
+#### Hard delete everything in Snowflake that was previously soft deleted
+
+```shell
+datahub delete --platform snowflake --only-soft-deleted --hard
 ```
-datahub delete --entity_type dataset --query "_tmp"
+
+## Deletes using the SDK and APIs
+
+The Python SDK's [DataHubGraph](../../python-sdk/clients.md) client supports deletes via the following methods:
+
+- `soft_delete_entity`
+- `hard_delete_entity`
+- `hard_delete_timeseries_aspect`
+
+Deletes via the REST API are also possible, although we recommend using the SDK instead.
+
+```shell
+# hard delete an entity by urn
+curl "http://localhost:8080/entities?action=delete" -X POST --data '{"urn": "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"}'
 ```
 
 ## Rollback Ingestion Run
````
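The `--start-time`/`--end-time` formats enumerated in the new docs above can be modeled with a small parser. This is an illustrative sketch only, not DataHub's actual implementation: the function name `parse_time` is invented and the `max` sentinel is an arbitrary far-future value.

```python
from __future__ import annotations

import re
from datetime import datetime, timedelta, timezone


def parse_time(value: str, now: datetime | None = None) -> int:
    """Return epoch milliseconds for a time string in the formats listed above.

    Illustrative only; the real CLI parser may differ in details.
    """
    now = now or datetime.now(timezone.utc)
    if value == "min":
        return 0
    if value == "max":
        # Arbitrary far-future sentinel for illustration.
        return int(datetime(9999, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)
    if value == "now":
        return int(now.timestamp() * 1000)
    # Relative offsets like '-7 days' or '+12 hours'.
    m = re.fullmatch(r"([+-]?\d+) (days|hours|minutes|seconds)", value)
    if m:
        delta = timedelta(**{m.group(2): int(m.group(1))})
        return int((now + delta).timestamp() * 1000)
    # Bare unix timestamps, in seconds.
    if value.isdigit():
        return int(value) * 1000
    # Absolute dates/timestamps, assumed to be UTC.
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d"):
        try:
            parsed = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
            return int(parsed.timestamp() * 1000)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized time format: {value!r}")
```

For example, `parse_time("-30 days")` and `parse_time("-7 days")` bound the same window as the `--start-time '-30 days' --end-time '-7 days'` invocation in the diff.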
````diff
@@ -105,26 +210,27 @@ The second way to delete metadata is to identify entities (and the aspects affec
 
 To view the ids of the most recent set of ingestion batches, execute
 
-```
+```shell
 datahub ingest list-runs
 ```
 
 That will print out a table of all the runs. Once you have an idea of which run you want to roll back, run
 
-```
+```shell
 datahub ingest show --run-id <run-id>
 ```
 
 to see more info of the run.
 
-Alternately, you can execute a dry-run rollback to achieve the same outcome.
-```
+Alternately, you can execute a dry-run rollback to achieve the same outcome.
+
+```shell
 datahub ingest rollback --dry-run --run-id <run-id>
 ```
 
 Finally, once you are sure you want to delete this data forever, run
 
-```
+```shell
 datahub ingest rollback --run-id <run-id>
 ```
 
````

````diff
@@ -133,10 +239,9 @@ This deletes both the versioned and the timeseries aspects associated with these
 
 ### Unsafe Entities and Rollback
 
-> **_NOTE:_** Preservation of unsafe entities has been added in datahub `0.8.32`. Read on to understand what it means and how it works.
-
 In some cases, entities that were initially ingested by a run might have had further modifications to their metadata (e.g. adding terms, tags, or documentation) through the UI or other means. During a roll back of the ingestion that initially created these entities (technically, if the key aspect for these entities are being rolled back), the ingestion process will analyse the metadata graph for aspects that will be left "dangling" and will:
-1. Leave these aspects untouched in the database, and soft-delete the entity. A re-ingestion of these entities will result in this additional metadata becoming visible again in the UI, so you don't lose any of your work.
+
+1. Leave these aspects untouched in the database, and soft delete the entity. A re-ingestion of these entities will result in this additional metadata becoming visible again in the UI, so you don't lose any of your work.
 2. The datahub cli will save information about these unsafe entities as a CSV for operators to later review and decide on next steps (keep or remove).
 
 The rollback command will report how many entities have such aspects and save as a CSV the urns of these entities under a rollback reports directory, which defaults to `rollback_reports` under the current directory where the cli is run, and can be configured further using the `--reports-dir` command line arg.
````
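The rollback report described above is a CSV of unsafe-entity URNs under `rollback_reports`. Since the column layout isn't documented in this diff, a reader can avoid assuming column names and simply scan cells for the `urn:li:` prefix. A minimal sketch (the function name `load_unsafe_urns` is invented here):

```python
import csv


def load_unsafe_urns(report_path: str) -> list[str]:
    """Collect entity URNs from a rollback report CSV.

    Illustrative sketch: rather than assuming the report's column names,
    this scans every cell for the `urn:li:` prefix.
    """
    urns: list[str] = []
    with open(report_path, newline="") as f:
        for row in csv.reader(f):
            urns.extend(cell for cell in row if cell.startswith("urn:li:"))
    return urns
```

Operators could feed the resulting list back into `datahub delete --urn ...` once they've decided which entities to keep or remove.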

docs/how/updating-datahub.md

Lines changed: 2 additions & 0 deletions

````diff
@@ -7,6 +7,8 @@ This file documents any backwards-incompatible changes in DataHub and assists pe
 ### Breaking Changes
 
 - #7900: The `catalog_pattern` and `schema_pattern` options of the Unity Catalog source now match against the fully qualified name of the catalog/schema instead of just the name. Unless you're using regex `^` in your patterns, this should not affect you.
+- #8068: In the `datahub delete` CLI, if an `--entity-type` filter is not specified, we automatically delete across all entity types. The previous behavior was to use a default entity type of dataset.
+- #8068: In the `datahub delete` CLI, the `--start-time` and `--end-time` parameters are now required for timeseries aspect hard deletes. To recover the previous behavior of deleting all data, use `--start-time min --end-time max`.
 
 ### Potential Downtime
 
````
Lines changed: 9 additions & 4 deletions

````diff
@@ -1,15 +1,20 @@
 import logging
 
-from datahub.cli import delete_cli
 from datahub.emitter.mce_builder import make_dataset_urn
-from datahub.emitter.rest_emitter import DatahubRestEmitter
+from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
 
 log = logging.getLogger(__name__)
 logging.basicConfig(level=logging.INFO)
 
-rest_emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
+graph = DataHubGraph(
+    config=DatahubClientConfig(
+        server="http://localhost:8080",
+    )
+)
+
 dataset_urn = make_dataset_urn(name="fct_users_created", platform="hive")
 
-delete_cli._delete_one_urn(urn=dataset_urn, soft=True, cached_emitter=rest_emitter)
+# Soft-delete the dataset.
+graph.delete_entity(urn=dataset_urn, hard=False)
 
 log.info(f"Deleted dataset {dataset_urn}")
````
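The example above builds its URN with `make_dataset_urn`. For readers unfamiliar with the shape that helper produces, here is a minimal illustrative stand-in, matching the dataset URNs quoted throughout this commit. In practice, prefer the real `datahub.emitter.mce_builder.make_dataset_urn`, which also handles validation and edge cases this sketch ignores.

```python
def make_dataset_urn_sketch(platform: str, name: str, env: str = "PROD") -> str:
    """Illustrative stand-in showing the dataset URN shape only.

    Not the SDK's implementation; it skips the validation and encoding
    the real helper performs.
    """
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"
```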
