Environment Details
- SDMetrics version: 0.14.1
Error Description
In the Quality Report, the Column Pair Trends and Intertable Trends properties both use the ContingencySimilarity metric to compute a score.
This underlying metric's performance may not be optimized when a column has extremely high cardinality. When computing the score between two columns A and B, the metric builds the cross-tabulation of the two columns, and the size of that table depends on their cardinalities. E.g. if column A is categorical with cardinality a and column B is categorical with cardinality b, then the contingency table contains a x b values. This can be slow if a or b is very large.
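For illustration, here is a minimal sketch of a contingency-based comparison using pandas cross-tabulation. This is an assumption about the general approach, not the exact internals of ContingencySimilarity; the function name `contingency_score` is hypothetical. It shows where the a x b table comes from:

```python
import pandas as pd

def contingency_score(real: pd.DataFrame, synthetic: pd.DataFrame,
                      col_a: str, col_b: str) -> float:
    # Normalized contingency tables: each cell holds the joint frequency of one
    # (col_a, col_b) category pair, so each table has up to a x b cells.
    real_freq = pd.crosstab(real[col_a], real[col_b], normalize=True)
    synth_freq = pd.crosstab(synthetic[col_a], synthetic[col_b], normalize=True)

    # Align on the union of category pairs so missing combinations count as 0.
    real_freq, synth_freq = real_freq.align(synth_freq, fill_value=0)

    # Score is 1 minus half the total variation distance between the two tables.
    return 1 - (real_freq - synth_freq).abs().to_numpy().sum() / 2
```

The cross-tabulation and alignment steps are where the cost grows with a x b, which is why high-cardinality columns dominate the runtime.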
Additional Context
We are not interested in replacing ContingencySimilarity with another metric. Rather, we should optimize its performance. Some ideas include:
- looking at the base operations for cross tabulation and figuring out if there are any faster ones
- taking a random subset
- considering only the top n most frequently occurring categories for the cross tabulation, where "top n" is calculated based on only the real data and the exact same set of n categories is used for the synthetic data (see the sketch after this list)
- etc.
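As a rough illustration of the "top n categories" idea, the sketch below selects the n most frequent categories from the real data only and applies the same category set to both real and synthetic data before cross-tabulation. The function name `limit_to_top_n` and the `__OTHER__` placeholder are hypothetical, not existing SDMetrics API:

```python
import pandas as pd

def limit_to_top_n(real: pd.Series, synthetic: pd.Series, n: int = 100):
    # Pick the n most frequent categories based on the real data alone.
    top_categories = real.value_counts().nlargest(n).index

    # Collapse everything else into a single placeholder category so the
    # contingency table stays at most (n + 1) wide for this column.
    real_limited = real.where(real.isin(top_categories), other='__OTHER__')
    synth_limited = synthetic.where(synthetic.isin(top_categories), other='__OTHER__')
    return real_limited, synth_limited
```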
Any solution will have to be vetted to ensure that the overall quality score being returned does not differ too much from the status quo.
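A possible acceptance check for any optimization, assuming a tolerance we would need to agree on (0.01 below is an arbitrary placeholder, not an SDMetrics constant):

```python
def scores_are_close(baseline_score: float, optimized_score: float,
                     tolerance: float = 0.01) -> bool:
    # Both scores fall in [0, 1]; flag the optimization if it diverges from
    # the status quo score by more than the assumed tolerance.
    return abs(baseline_score - optimized_score) <= tolerance
```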