257 changes: 257 additions & 0 deletions active/000-buisness-attributes/000-business-attributes.md
@@ -0,0 +1,257 @@
- Start Date: 2023-10-19
- RFC PR: 000-business-attributes.md
- Discussion Issue:
- Implementation PR(s):

# Business Attributes

## Summary

A business attribute is a centrally managed logical field that represents a unique schema field entity. This construct is global: it is not bound to a project or application implementation. Instead, it identifies the same field across datasets owned by different projects and applications. Projects or applications use a business attribute to model a column in a dataset and inherit information about it, such as definitions, data type, data quality rules/assertions, tags and glossary terms, from the global definition. Data architects can use business attributes to validate whether applications conform to the metadata defined for the attribute. By abstracting common business metadata into a logical model, personas with the appropriate business knowledge can define pertinent details, such as a rich definition, the business use of the attribute, its classification (e.g. PII, sensitive, shareable), the specific data rules that govern the attribute, and its connection to glossary terms.

## Motivation

* Improve the metadata definition, consistency, and meaning for commonly used attributes across applications / data systems in the organisation.
* Business users/consumers can easily discover datasets they are looking for with the help of business attributes
* Add the ability to define data rules to govern the data in this attribute across applications


## Requirements

The primary personas involved are the Data Architect, the Dataset Owner and the Business User.

### Data Architect

#### Must Haves
1. Ability to define a business attribute, and associate a description, datatype, glossary terms, tags and DQ rules among other aspects of a general field
2. Ability to update a business attribute record centrally
3. Ability to track any discrepancies between business attributes and associated fields

#### Good-to-Haves
1. Approval workflow for creating and updating business attributes

### Dataset Owner

#### Must Haves
1. Ability to inherit all attributes from a business attribute to a field by attaching the business attribute to the field
1. Ability to inherit all updates from a business attribute to a field

#### Good-to-Haves

1. Ability to intelligently inherit attributes from business attributes to fields in datasets via system auto-assignment

### Business User

#### Must Haves
1. Ability to search for fields using business description/tags/glossary attached to business attribute

Is the expectation that this expanded search should be the default universal search experience on the main search bar?
How should ranking work when you have a match on a business description through a business attribute attached to a field, versus a match on a field-level description, or a match on the table description?

Our ranking strategy prioritizes field-level descriptions, followed by descriptions of business attributes. The primary objective is to ensure that the corresponding dataset is displayed as long as there is a match with the business attribute description.
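
The ranking order described in this reply can be sketched as follows. This is only an illustration: the weights and the `score`/`rank` helpers are hypothetical, and the document keys reuse the index field names shown in the sample search hit later in this RFC. A real implementation would express this through search-engine boosts rather than Python.

```python
from typing import Dict, List

# Hypothetical weights: a field-level match outranks a business attribute match.
FIELD_DESCRIPTION_WEIGHT = 2.0
BUSINESS_ATTRIBUTE_WEIGHT = 1.0


def score(doc: Dict, term: str) -> float:
    """Score one dataset document against a search term (case-insensitive)."""
    term = term.lower()
    total = 0.0
    if any(term in d.lower() for d in doc.get("editedFieldDescriptions", [])):
        total += FIELD_DESCRIPTION_WEIGHT
    if any(term in d.lower() for d in doc.get("businessAttributeDescription", [])):
        total += BUSINESS_ATTRIBUTE_WEIGHT
    return total


def rank(docs: List[Dict], term: str) -> List[str]:
    """Return URNs of matching datasets, best match first.

    A dataset is returned as long as either description matches, so a
    business attribute match alone is enough for the dataset to appear.
    """
    scored = [(score(d, term), d) for d in docs]
    matched = [(s, d) for s, d in scored if s > 0]
    matched.sort(key=lambda pair: pair[0], reverse=True)
    return [d["urn"] for _, d in matched]
```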



## Non-Requirements
* Allow dataset owners to propose changes to business attributes
* Allow dataset owners to partially inherit from business attributes (eg only inherit description but not tags)
* Allow fields to have multiple business attributes
* Update descriptions in data sources

## Detailed design

### Sample Mockups

#### Show Business Attribute Launcher
![](business-attribute-rfc-1.png)

#### List Business Attributes
![](business-attribute-rfc-2.png)

#### Create Business Attribute
![](business-attribute-rfc-3.png)

#### Business Attribute Details
![](business-attribute-rfc-4.png)

#### Enrich Schema
The existing feature for enriching the schema fields of a dataset remains unchanged.

![](business-attribute-rfc-5.png)

#### Attaching Business Attribute
![](business-attribute-rfc-6.png)

#### Metadata Model Enhancements

The diagram below shows the current relationships between the Dataset entity and its aspects in Datahub.

![](business-attribute-rfc-7.png)


### Model Business Attribute Entity

We're suggesting the introduction of a new entity called BusinessAttribute and a new Aspect known as BusinessAttributeInfo. This new entity will be a top-level entity, enabling it to be independently defined and managed by the business team.

Users will have the ability to attach Business Attributes exclusively to dataset schema Fields. We're also suggesting necessary modifications to the user interface to manage this feature.

We're proposing a new base class called EditableSchemaFieldBase, which will be included in both EditableSchemaFieldInfo and BusinessAttributeInfo. The main goal is to repurpose existing Records to model BusinessAttributeInfo.

Furthermore, we're introducing a new field named businessAttribute of the type BusinessAttributeAssociation in EditableSchemaFieldInfo. This field will contain the urn for the BusinessAttribute attached to the dataset schema field.

![](business-attribute-rfc-8.png)
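
To make the shape of the proposed model concrete, here is a minimal sketch using Python dataclasses in place of the actual schema definitions. The class names come from this RFC; the individual properties (`description`, `tags`, `glossary_terms`, `name`, `data_type`, `field_path`, `business_attribute_urn`) are assumptions made for illustration, and inheritance stands in for the "included in" relationship.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EditableSchemaFieldBase:
    """Metadata shared by schema fields and business attributes."""
    description: Optional[str] = None
    tags: List[str] = field(default_factory=list)            # tag URNs
    glossary_terms: List[str] = field(default_factory=list)  # glossary term URNs


@dataclass
class BusinessAttributeAssociation:
    """Link from a schema field to a business attribute (holds its URN)."""
    business_attribute_urn: str


@dataclass
class BusinessAttributeInfo(EditableSchemaFieldBase):
    """Aspect of the top-level BusinessAttribute entity."""
    name: str = ""
    data_type: Optional[str] = None  # assumed property, e.g. "STRING"


@dataclass
class EditableSchemaFieldInfo(EditableSchemaFieldBase):
    """Editable metadata of one dataset schema field."""
    field_path: str = ""
    business_attribute: Optional[BusinessAttributeAssociation] = None
```

Because both records share the same base, a business attribute can carry exactly the kinds of metadata that a schema field can be enriched with today.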

### URN Representation for Business Attribute
```
urn:li:businessAttribute:b3916dfe-a27c-4916-97be-60773dca90d7
```
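
As a rough sketch of how such URNs might be minted and validated, assuming the identifier is a random UUID as in the example above (the helper names are hypothetical, and the parser deliberately accepts any non-empty identifier):

```python
import uuid

URN_PREFIX = "urn:li:businessAttribute:"


def new_business_attribute_urn() -> str:
    """Mint a URN whose identifier is a random UUID (assumed scheme)."""
    return f"{URN_PREFIX}{uuid.uuid4()}"


def parse_business_attribute_urn(urn: str) -> str:
    """Return the identifier part, or raise ValueError for any other URN."""
    if not urn.startswith(URN_PREFIX):
        raise ValueError(f"not a businessAttribute URN: {urn!r}")
    attribute_id = urn[len(URN_PREFIX):]
    if not attribute_id:
        raise ValueError("businessAttribute URN has an empty identifier")
    return attribute_id
```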

### Business Attribute Attachment to SchemaField
To create a link between the dataset schema field and the business attribute, we are proposing the introduction of a new aspect, BusinessAttributeAssociation, and a new field of the same type named businessAttribute. This field will be located in EditableSchemaFieldInfo and will hold the urn for the business attribute.
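
The attachment step could look roughly like the sketch below. It works on plain dictionaries that mimic a list of `editableSchemaFieldInfo` entries; the `businessAttributeUrn` key and the `attach_business_attribute` helper are assumptions for illustration. Since fields with multiple business attributes are a non-requirement, attaching replaces any previous association.

```python
from typing import Any, Dict, List


def attach_business_attribute(
    field_infos: List[Dict[str, Any]], field_path: str, attribute_urn: str
) -> Dict[str, Any]:
    """Attach a business attribute to the field identified by field_path.

    The field keeps at most one association, so an existing one is replaced.
    If the field has no editable entry yet, one is created.
    """
    association = {"businessAttributeUrn": attribute_urn}
    for info in field_infos:
        if info["fieldPath"] == field_path:
            info["businessAttribute"] = association
            return info
    info = {"fieldPath": field_path, "businessAttribute": association}
    field_infos.append(info)
    return info
```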

### Metadata Graph

#### Enabling Capability of searching Dataset entities as per Description/tags of Business Attributes

We propose to introduce a new annotation "@SearchableRef" to enhance the search capability of Dataset entities based on the Description/tags of Business Attributes. This annotation will allow us to populate the Elasticsearch indexes with expanded details about the referenced entity.

For instance, the new field 'businessAttribute' in EditableSchemaFieldInfo can be annotated with @SearchableRef. This instructs the relevant hooks to populate the Elasticsearch index 'datasetindex_v2' with details of the business attribute, which enables searching for datasets by the properties of the business attribute. Once a business attribute is linked to a schema field, a document in 'datasetindex_v2' is expected to look like the following search hit:

```json
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.6931471,
    "hits": [
      {
        "_index": "datasetindex_v2",
        "_id": "urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Apostgres%2Cpostgres.public.orders%2CPROD%29",
        "_score": 0.6931471,
        "_source": {
          "urn": "urn:li:dataset:(urn:li:dataPlatform:postgres,postgres.public.orders,PROD)",
          "removed": false,
          "container": "urn:li:container:a208486b83be39fa411922e07701d984",
          "hasContainer": true,
          "typeNames": [
            "Table"
          ],
          "runId": [
            "postgres-2023_09_29-12_54_35",
            "postgres-2023_09_29-12_55_29",
            "postgres-2023_09_29-12_55_41"
          ],
          "customProperties": [],
          "name": "orders",
          "hasDescription": false,
          "fieldPaths": [
            "order_id",
            "user_id",
            "product_id",
            "quantity",
            "order_date"
          ],
          "fieldGlossaryTerms": [],
          "fieldDescriptions": [],
          "fieldLabels": [],
          "fieldTags": [],
          "origin": "PROD",
          "id": "postgres.public.orders",
          "platform": "urn:li:dataPlatform:postgres",
          "browsePaths": [
            "/prod/postgres/postgres/public"
          ],
          "hasGlossaryTerms": false,
          "glossaryTerms": [],
          "editedFieldBusinessAttribute": [],
          "editedFieldGlossaryTerms": [
            "urn:li:glossaryTerm:5fb9fd4e-5ed2-46e9-afa2-6ab5bb1f4851"
          ],
          "editedFieldBusinessAttributev2": [
            "urn:li:businessAttributev2:first-attribute",
            "urn:li:businessAttributev2:second-attribute"
          ],
          "editedFieldDescriptions": [
            "lorem ipsume and order\\_id and testing\n\n ",
            "lorem ipsum",
            " quantiy-loremipsume",
            "this is product id",
            "order\\_date-loremipsum wiht lot of love and testing and more love"
          ],
          "editedFieldTags": [
            "urn:li:tag:TestTag",
            "urn:li:tag:TestTag",
            "urn:li:tag:TestTag"
          ],
          "businessAttributeGlossary": [
            "urn:li:glossaryTerm:5fb9fd4e-5ed2-46e9-afa2-6ab5bb1f4851",
            "urn:li:glossaryTerm:2b441fe5-c466-479a-8ce7-5b3f2db7ce1a"
          ],
          "businessAttributeTags": [
            "urn:li:tag:TestTag",
            "urn:li:tag:TestTag"
          ],
          "businessAttributeDescription": [
            " quantiy-loremipsume",
            " productLoremipsume"
          ]
        }
      }
    ]
  }
}
```
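
One way to picture what a hook driven by `@SearchableRef` would do is the sketch below: for each schema field with an attached business attribute, it resolves the attribute and copies its searchable properties into the dataset's search document. The output keys match the `businessAttribute*` fields in the sample above; the input shapes, the `businessAttributeUrn` key and the `expand_business_attributes` helper are hypothetical.

```python
from typing import Any, Dict, List


def expand_business_attributes(
    field_infos: List[Dict[str, Any]], attributes: Dict[str, Dict[str, Any]]
) -> Dict[str, List[str]]:
    """Build the business-attribute part of a dataset search document.

    field_infos: editable schema field entries, optionally carrying a
                 "businessAttribute" association (assumed shape).
    attributes:  business attribute properties keyed by URN (assumed lookup).
    """
    doc: Dict[str, List[str]] = {
        "businessAttributeDescription": [],
        "businessAttributeTags": [],
        "businessAttributeGlossary": [],
    }
    for info in field_infos:
        association = info.get("businessAttribute")
        if not association:
            continue  # field has no business attribute attached
        attribute = attributes.get(association["businessAttributeUrn"])
        if attribute is None:
            continue  # dangling reference: nothing to expand
        if attribute.get("description"):
            doc["businessAttributeDescription"].append(attribute["description"])
        doc["businessAttributeTags"].extend(attribute.get("tags", []))
        doc["businessAttributeGlossary"].extend(attribute.get("glossaryTerms", []))
    return doc
```

The existing `editedField*` entries are left untouched, so field-level enrichment and inherited business attribute metadata stay distinguishable in the index.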

#### Cascade changes made in Business Attribute to related Datasets

When a referenced entity changes (here, the business attribute), for example when a tag or glossary term is added or removed or the description is updated, the change needs to be cascaded to the Elasticsearch index entries of the attached datasets. For this we propose either creating a new annotation or extending "`@SearchableRef`" with incoming and outgoing references, similar to "`@Relationship`". Under the hood, we plan to use "`@Relationship`" to fetch the attached dataset entities and update their corresponding Elasticsearch index entries to reflect the change in the referenced entity.
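
The cascade can be sketched as a reverse lookup followed by one re-index call per attached dataset. In this illustration an in-memory list of `(dataset_urn, attribute_urn)` pairs stands in for the relationship lookup, and `reindex` is a caller-supplied callback standing in for the index update; both, along with the function names, are assumptions rather than existing Datahub APIs.

```python
from typing import Callable, Iterable, List, Tuple


def datasets_referencing(
    attribute_urn: str, edges: Iterable[Tuple[str, str]]
) -> List[str]:
    """Return the distinct dataset URNs that reference the given attribute."""
    return sorted({dataset for dataset, attribute in edges if attribute == attribute_urn})


def cascade_update(
    attribute_urn: str,
    edges: Iterable[Tuple[str, str]],
    reindex: Callable[[str], None],
) -> List[str]:
    """Re-index every dataset attached to a changed business attribute.

    The number of reindex calls grows linearly with the number of attached
    datasets; each dataset is re-indexed once even if several of its fields
    reference the same attribute.
    """
    targets = datasets_referencing(attribute_urn, edges)
    for dataset_urn in targets:
        reindex(dataset_urn)
    return targets
```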


## How we teach this

For technical users, the concept of business attribute is best understood as user-defined abstract schema fields. In this context, anyone in the technology space can visualize it as the abstract record which can be inherited from by the implementation record, which is the actual physical schema field.

For architects, these are best understood as field templates. They define a template for a field by assigning metadata to each field which corresponds to a specification.

They are meant to be entirely transparent to business users and as such should not need to be introduced to them.

While the change doesn't fundamentally alter any existing Datahub concept, it greatly empowers users and acts as an accelerator to enrichment. As such, it should be taught as a fundamentally new concept which can be leveraged by large Datahub customers to greatly reduce their metadata management overhead.

## Drawbacks

Cascading changes from a business attribute to the Elasticsearch indexes of related datasets puts load on Kafka and Elasticsearch. A change to a business attribute produces Kafka events; a consumer processes these events and updates the Elasticsearch index entries of the referencing datasets with the new business attribute details. In the current implementation, the Kafka consumer does not finish processing a message until it has updated the index. If a business attribute is referenced by a large number of datasets (on the order of 100K), the resulting message volume can therefore degrade performance.

Agree that this write amplification might make this approach infeasible to implement. Curious if you've run any benchmarks to see how this scales if you have 1M entities attached to a business attribute, and you have a change in the description of the attribute.

This approach gives us eventual consistency. When something changes, for example a business attribute description, the change must be propagated to the Elasticsearch indexes of all referencing datasets. During propagation the search experience may be inconsistent, but the indexes will converge once propagation completes. We have not run any benchmarks yet. Once the design is approved, we will cover testing and benchmarking during implementation.


Also, if the consumer fails to update the Elasticsearch index in a timely manner, the data in Kafka and the data in Elasticsearch can diverge, which could lead to inaccurate query results.

## Alternatives

### Model Business Attribute

We have also explored an alternative approach to model the BusinessAttributeInfo aspect (as shown in the diagram below). In this approach, we include the reference of Business Attribute in EditableSchemaFieldInfo. However, this method could potentially lead to cyclic dependencies, which would not correctly represent our intended data model.

![](business-attribute-rfc-9.png)

### Business Attribute Attachment with Schema Field
We also considered the approach below, but are not going forward with it because of the cons listed.

When the user attaches a business attribute to a schema field, we can copy the properties of the business attribute, such as `Tag`, `Description` and `Glossary Terms`, onto the `Schema Field` of the Dataset entity, and the corresponding `MetadataChangeLog` events and Platform Events are generated for the Dataset entity. An entry with the business attribute id is also stored in the `editableSchemaFieldInfo` of the Dataset, which establishes the relationship between the business attribute and the corresponding schema field.

#### PROS
This approach leverages existing Datahub behaviour. Copying the properties is equivalent to enriching the field from the UI, which already takes care of Elasticsearch indexing and generates the corresponding MCL and platform events.

#### CONS
This approach does not scale when the properties of a business attribute change, and it puts a heavy load on both MySQL and Elasticsearch. For example, if a tag, description or glossary term is updated on or removed from a business attribute, the change must be replicated to all attached datasets (on the order of 100K). Besides the heavy load on MySQL, this can leave the Dataset details page in an inconsistent state while replication is in progress.


## Rollout / Adoption Strategy

This is not a breaking change and can easily be adopted by users, since it is generic enough. Based on the current design, we will need to write migration scripts to finalise it.

## Future Work

Show all the tags, glossary terms and descriptions associated with a business attribute in a disabled state.

![Future Work](business-attribute-rfc-10.png)