
Commit 798ce3d

feat(classification): configurable sample size (#8096)
Co-authored-by: david-leifker <[email protected]>
1 parent 8357fc8 commit 798ce3d

File tree

4 files changed (+15, -6 lines)


metadata-ingestion/docs/dev_guides/classification.md

Lines changed: 3 additions & 2 deletions

@@ -9,6 +9,7 @@ Note that a `.` is used to denote nested fields in the YAML recipe.
 | Field | Required | Type | Description | Default |
 | --- | --- | --- | --- | --- |
 | enabled | | boolean | Whether classification should be used to auto-detect glossary terms | False |
+| sample_size | | int | Number of sample values used for classification. | 100 |
 | info_type_to_term | | Dict[str,string] | Optional mapping to provide glossary term identifier for info type. | By default, info type is used as glossary term identifier. |
 | classifiers | | Array of object | Classifiers to use to auto-detect glossary terms. If more than one classifier is configured, infotype predictions from the classifier defined later in the sequence take precedence. | [{'type': 'datahub', 'config': None}] |
 | table_pattern | | AllowDenyPattern (see below for fields) | Regex patterns to filter tables for classification. This is used in combination with other patterns in the parent config. Specify a regex to match the entire table name in `database.schema.table` format, e.g. to match all tables starting with customer in the Customer database and public schema, use the regex 'Customer.public.customer.*' | {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
@@ -28,8 +29,8 @@ DataHub Classifier is the default classifier implementation, which uses [acryl-d

 | Field | Required | Type | Description | Default |
 | --- | --- | --- | --- | --- |
-| confidence_level_threshold | | number | | 0.6 |
-| info_types | | list[string] | List of infotypes to be predicted. By default, all supported infotypes are considered. If specified, this should be a subset of ['Email_Address', 'Gender', 'Credit_Debit_Card_Number', 'Phone_Number', 'Street_Address', 'Full_Name', 'Age', 'IBAN', 'US_Social_Security_Number', 'Vehicle_Identification_Number', 'IP_Address_v4', 'IP_Address_v6', 'US_Driving_License_Number', 'Swift_Code'] | None |
+| confidence_level_threshold | | number | | 0.68 |
+| info_types | | list[string] | List of infotypes to be predicted. By default, all supported infotypes are considered. If specified, this should be a subset of `['Email_Address', 'Gender', 'Credit_Debit_Card_Number', 'Phone_Number', 'Street_Address', 'Full_Name', 'Age', 'IBAN', 'US_Social_Security_Number', 'Vehicle_Identification_Number', 'IP_Address_v4', 'IP_Address_v6', 'US_Driving_License_Number', 'Swift_Code']` | None |
 | info_types_config | | Dict[str, InfoTypeConfig] | Configuration details for infotypes | See [reference_input.py](https://github.com/acryldata/datahub-classify/blob/main/datahub-classify/src/datahub_classify/reference_input.py) for default configuration. |
 | info_types_config.`key`.prediction_factors_and_weights | ❓ (required if info_types_config.`key` is set) | Dict[str,number] | Factors and their weights to consider when predicting info types | |
 | info_types_config.`key`.name | | NameFactorConfig (see below for fields) | | |
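The `info_types` field above must be a subset of the supported infotypes. A minimal sketch of that subset rule (`validate_info_types` is a hypothetical helper for illustration, not the library's actual validation code; the supported list is copied from the table above):

```python
# Hypothetical helper illustrating the subset rule for `info_types`.
SUPPORTED_INFOTYPES = {
    "Email_Address", "Gender", "Credit_Debit_Card_Number", "Phone_Number",
    "Street_Address", "Full_Name", "Age", "IBAN", "US_Social_Security_Number",
    "Vehicle_Identification_Number", "IP_Address_v4", "IP_Address_v6",
    "US_Driving_License_Number", "Swift_Code",
}

def validate_info_types(requested):
    """Reject any requested infotype that is not in the supported set."""
    unknown = set(requested) - SUPPORTED_INFOTYPES
    if unknown:
        raise ValueError(f"Unsupported infotypes: {sorted(unknown)}")
    return list(requested)
```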

metadata-ingestion/src/datahub/ingestion/glossary/classifier.py

Lines changed: 5 additions & 0 deletions

@@ -31,6 +31,11 @@ class ClassificationConfig(ConfigModel):
         default=False,
         description="Whether classification should be used to auto-detect glossary terms",
     )
+
+    sample_size: int = Field(
+        default=100, description="Number of sample values used for classification."
+    )
+
     table_pattern: AllowDenyPattern = Field(
         default=AllowDenyPattern.allow_all(),
         description="Regex patterns to filter tables for classification. This is used in combination with other patterns in parent config. Specify regex to match the entire table name in `database.schema.table` format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'",
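The change above amounts to one more field on `ClassificationConfig`. A minimal stand-in sketch of how the defaults behave (a plain dataclass trimmed to the relevant fields; the real class is a pydantic-based `ConfigModel`):

```python
from dataclasses import dataclass

# Stand-in for the pydantic ConfigModel, trimmed to the fields
# touched by this commit.
@dataclass
class ClassificationConfig:
    enabled: bool = False   # whether to auto-detect glossary terms
    sample_size: int = 100  # new in this commit; sample values per column

# Omitting sample_size keeps the default of 100; a recipe can override it.
cfg = ClassificationConfig(sample_size=250)
```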

metadata-ingestion/src/datahub/ingestion/glossary/datahub_classifier.py

Lines changed: 1 addition & 1 deletion

@@ -71,7 +71,7 @@ class Config:
 # TODO: Generate Classification doc (classification.md) from python source.
 class DataHubClassifierConfig(ConfigModel):
     confidence_level_threshold: float = Field(
-        default=0.6,
+        default=0.68,
         init=False,
         description="The confidence threshold above which the prediction is considered as a proposal",
     )
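The effect of raising the default threshold from 0.6 to 0.68 can be shown with a toy filtering step (hypothetical prediction scores, not the classifier's actual API; the "at or above passes" comparison is an illustrative choice):

```python
# Toy illustration: predictions meeting the threshold become proposals.
predictions = {"Email_Address": 0.91, "Full_Name": 0.65, "Age": 0.70}

old_threshold, new_threshold = 0.6, 0.68
old_proposals = {t for t, c in predictions.items() if c >= old_threshold}
new_proposals = {t for t, c in predictions.items() if c >= new_threshold}
# Full_Name (0.65) cleared the old 0.6 default but not the new 0.68 one,
# so the raised default trims lower-confidence proposals.
```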

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_v2.py

Lines changed: 6 additions & 3 deletions

@@ -1436,17 +1436,20 @@ def inspect_session_metadata(self) -> None:

     # Ideally we do not want null values in sample data for a column.
     # However that would require separate query per column and
-    # that would be expensive, hence not done.
+    # that would be expensive, hence not done. To compensate for the
+    # possibility of some null values in the collected sample, we fetch
+    # extra (20% more) rows than the configured sample_size.
     def get_sample_values_for_table(self, table_name, schema_name, db_name):
         # Create a cursor object.
         logger.debug(
             f"Collecting sample values for table {db_name}.{schema_name}.{table_name}"
         )
+
+        actual_sample_size = self.config.classification.sample_size * 1.2
         with PerfTimer() as timer:
             cur = self.get_connection().cursor()
-            NUM_SAMPLED_ROWS = 1000
             # Execute a statement that will generate a result set.
-            sql = f'select * from "{db_name}"."{schema_name}"."{table_name}" sample ({NUM_SAMPLED_ROWS} rows);'
+            sql = f'select * from "{db_name}"."{schema_name}"."{table_name}" sample ({actual_sample_size} rows);'

             cur.execute(sql)
             # Fetch the result set from the cursor and deliver it as the Pandas DataFrame.
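The 20% oversampling above is simple arithmetic; a small standalone sketch of what gets interpolated into the query (note that `sample_size * 1.2` is a float, and the diff interpolates it into the SQL as-is):

```python
# Sketch of the oversampling: fetch 20% extra rows so that null values
# in the sample still leave roughly sample_size usable values per column.
sample_size = 100                        # ClassificationConfig default
actual_sample_size = sample_size * 1.2   # a float, exactly as in the diff
sql = f'select * from "DB"."SCH"."TBL" sample ({actual_sample_size} rows);'
# If the SAMPLE (<n> ROWS) clause requires an integer count, a caller
# would want to round first, e.g. int(actual_sample_size).
```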

0 commit comments
