metadata-ingestion/docs/dev_guides/classification.md
Lines changed: 3 additions & 2 deletions
@@ -9,6 +9,7 @@ Note that a `.` is used to denote nested fields in the YAML recipe.
 | Field | Required | Type | Description | Default |
 | --- | --- | --- | --- | -- |
 | enabled || boolean | Whether classification should be used to auto-detect glossary terms | False |
+| sample_size || int | Number of sample values used for classification. | 100 |
 | info_type_to_term || Dict[str,string]| Optional mapping to provide glossary term identifier for info type. | By default, info type is used as glossary term identifier. |
 | classifiers || Array of object | Classifiers to use to auto-detect glossary terms. If more than one classifier is given, infotype predictions from the classifier defined later in the sequence take precedence. |[{'type': 'datahub', 'config': None}]|
 | table_pattern || AllowDenyPattern (see below for fields) | Regex patterns to filter tables for classification. This is used in combination with other patterns in parent config. Specify regex to match the entire table name in `database.schema.table` format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*' | {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
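The `table_pattern` row above can be illustrated with a minimal sketch. The helper `is_table_allowed` is hypothetical and is not DataHub's `AllowDenyPattern` implementation; it assumes that patterns are matched from the start of the `database.schema.table` name, that deny patterns win over allow patterns, and that matching is case-insensitive when `ignoreCase` is true.

```python
import re
from typing import List


def is_table_allowed(
    name: str, allow: List[str], deny: List[str], ignore_case: bool = True
) -> bool:
    """Hypothetical helper, for illustration only (not DataHub's code)."""
    flags = re.IGNORECASE if ignore_case else 0
    # Assumption: a table is rejected as soon as any deny pattern matches.
    if any(re.match(p, name, flags) for p in deny):
        return False
    # Otherwise it is accepted if at least one allow pattern matches.
    return any(re.match(p, name, flags) for p in allow)


# The example regex from the table_pattern description.
pattern = "Customer.public.customer.*"

print(is_table_allowed("Customer.public.customer_orders", [pattern], []))
print(is_table_allowed("Customer.public.invoices", [pattern], []))
print(is_table_allowed("customer.PUBLIC.customers", [pattern], []))
```

Note that an unescaped `.` in the regex matches any character, so a stricter pattern would escape the separators as `Customer\.public\.customer.*`.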
@@ -28,8 +29,8 @@ DataHub Classifier is the default classifier implementation, which uses [acryl-d
 
 | Field | Required | Type | Description | Default |
 | --- | --- | --- | --- | -- |
-| confidence_level_threshold || number || 0.6|
-| info_types || list[string]| List of infotypes to be predicted. By default, all supported infotypes are considered. If specified, this should be a subset of ['Email_Address', 'Gender', 'Credit_Debit_Card_Number', 'Phone_Number', 'Street_Address', 'Full_Name', 'Age', 'IBAN', 'US_Social_Security_Number', 'Vehicle_Identification_Number', 'IP_Address_v4', 'IP_Address_v6', 'US_Driving_License_Number', 'Swift_Code']| None |
+| confidence_level_threshold || number || 0.68|
+| info_types || list[string]| List of infotypes to be predicted. By default, all supported infotypes are considered. If specified, this should be a subset of `['Email_Address', 'Gender', 'Credit_Debit_Card_Number', 'Phone_Number', 'Street_Address', 'Full_Name', 'Age', 'IBAN', 'US_Social_Security_Number', 'Vehicle_Identification_Number', 'IP_Address_v4', 'IP_Address_v6', 'US_Driving_License_Number', 'Swift_Code']`| None |
 | info_types_config || Dict[str, InfoTypeConfig]| Configuration details for infotypes | See [reference_input.py](https://github.com/acryldata/datahub-classify/blob/main/datahub-classify/src/datahub_classify/reference_input.py) for default configuration. |
 | info_types_config.`key`.prediction_factors_and_weights | ❓ (required if info_types_config.`key` is set) | Dict[str,number]| Factors and their weights to consider when predicting info types ||
 | info_types_config.`key`.name || NameFactorConfig (see below for fields) |||
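To make the `info_types` and `confidence_level_threshold` rows concrete, here is a small sketch. Both functions are hypothetical and do not come from `datahub-classify`. The sketch assumes that a prediction is kept when its confidence is at or above the threshold, which the diff does not state.

```python
from typing import Dict, List, Optional

# The supported infotypes as listed in the info_types row above.
SUPPORTED_INFO_TYPES = [
    "Email_Address", "Gender", "Credit_Debit_Card_Number", "Phone_Number",
    "Street_Address", "Full_Name", "Age", "IBAN", "US_Social_Security_Number",
    "Vehicle_Identification_Number", "IP_Address_v4", "IP_Address_v6",
    "US_Driving_License_Number", "Swift_Code",
]


def select_info_types(requested: Optional[List[str]]) -> List[str]:
    """Hypothetical helper: None means all supported infotypes."""
    if requested is None:
        return list(SUPPORTED_INFO_TYPES)
    unknown = sorted(set(requested) - set(SUPPORTED_INFO_TYPES))
    if unknown:
        raise ValueError(f"Unsupported info types: {unknown}")
    return list(requested)


def filter_predictions(
    scores: Dict[str, float], threshold: float = 0.68
) -> Dict[str, float]:
    """Hypothetical helper: keep predictions at or above the threshold.

    Assumption: the comparison is inclusive; the real classifier may differ.
    """
    return {name: score for name, score in scores.items() if score >= threshold}


print(select_info_types(["Email_Address", "Age"]))
print(filter_predictions({"Email_Address": 0.9, "Age": 0.65}))
```

Under these assumptions, a prediction scoring 0.65 passes the old default of 0.6 but is dropped by the new default of 0.68.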
metadata-ingestion/src/datahub/ingestion/glossary/classifier.py
Lines changed: 5 additions & 0 deletions
@@ -31,6 +31,11 @@ class ClassificationConfig(ConfigModel):
         default=False,
         description="Whether classification should be used to auto-detect glossary terms",
     )
+
+    sample_size: int = Field(
+        default=100, description="Number of sample values used for classification."
+    )
+
     table_pattern: AllowDenyPattern = Field(
         default=AllowDenyPattern.allow_all(),
         description="Regex patterns to filter tables for classification. This is used in combination with other patterns in parent config. Specify regex to match the entire table name in `database.schema.table` format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'",
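The new `sample_size` field only declares how many values are used for classification. The diff does not show how those values are drawn. As a rough sketch, a consumer of the setting might cap the values passed to a classifier as shown below. The function `take_sample` is hypothetical, and taking the first non-null values is an assumption, not DataHub's sampling strategy.

```python
from itertools import islice
from typing import Iterable, List


def take_sample(values: Iterable[object], sample_size: int = 100) -> List[object]:
    """Hypothetical helper: return at most `sample_size` non-null values.

    Assumption: the sample is simply the first non-null values encountered.
    """
    non_null = (v for v in values if v is not None)
    return list(islice(non_null, sample_size))


column_values = ["a@example.com", None, "b@example.com", None, "c@example.com"]
print(take_sample(column_values, sample_size=2))
print(len(take_sample(range(1000))))
```

A larger `sample_size` gives the classifier more evidence per column but means more values have to be read from each table.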