Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
Hello,
This is a feature I haven't seen in any data preparation/ETL tool. The core idea is to detect the unique key of a DataFrame. More often than not, you have to work with a dataset without knowing what makes a row unique. This can lead to misinterpreting the data, Cartesian products on joins, and other fun stuff.
Feature Description
How do I imagine it?
Input parameters: one DataFrame, plus the ability to specify a maximum number of fields per combination (empty or 0 = no maximum).
Algorithm: for every combination of fields, compare the distinct count of that combination with the number of rows.
Result: a DataFrame with one row per field combination that is unique. If there is no result: "No field combination is unique. Check for duplicates or the need for aggregation upstream."
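The description above could be sketched roughly as follows. This is only an illustration of the idea, not an existing pandas API: the function name `find_unique_keys`, the parameter `max_fields`, and the output column names are all hypothetical. It counts distinct rows with `drop_duplicates`, which treats missing values as equal to each other.

```python
from itertools import combinations

import pandas as pd


def find_unique_keys(df, max_fields=0):
    """Return one row per column combination that is unique across all rows.

    Hypothetical helper (not part of pandas). max_fields=0 or None means
    "no maximum number of fields per combination".
    """
    columns = list(df.columns)
    limit = min(max_fields, len(columns)) if max_fields else len(columns)
    n_rows = len(df)
    found = []
    for size in range(1, limit + 1):
        for combo in combinations(columns, size):
            # Distinct count of this field combination vs. the row count.
            n_distinct = len(df.drop_duplicates(subset=list(combo)))
            if n_distinct == n_rows:
                found.append(
                    {
                        "number of fields": size,
                        "field combination": ",".join(map(str, combo)),
                    }
                )
    if not found:
        print(
            "No field combination is unique. "
            "Check for duplicates or the need for aggregation upstream."
        )
    return pd.DataFrame(found, columns=["number of fields", "field combination"])


# Tiny illustrative frame: neither column is unique alone, the pair is.
demo = pd.DataFrame({"a": [1, 1, 2], "b": [1, 2, 1]})
keys = find_unique_keys(demo)
```

Note that the number of combinations grows exponentially with the number of columns, which is why a cap such as `max_fields` matters on wide frames.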
Example:
order_id | line_id | amount | customer | site |
---|---|---|---|---|
1 | 1 | 100 | A | U_250 |
1 | 2 | 12 | A | U_250 |
1 | 3 | 45 | A | U_250 |
2 | 1 | 75 | A | U_250 |
2 | 2 | 12 | A | U_250 |
3 | 1 | 15 | B | U_250 |
4 | 1 | 45 | B | U_251 |
The user first selects every field except amount (they know that amount would make no sense in a key).
The algorithm then tests the following candidate keys:
- each single field
- each combination of two fields
- each combination of three fields
- each combination of four fields

looking for those whose distinct count matches the number of rows (7).
It then returns something like this:
choice | number of fields | field combination |
---|---|---|
very good | 2 | order_id, line_id |
average | 3 | order_id, line_id, customer |
average | 3 | order_id, line_id, site |
bad | 4 | order_id, line_id, site, customer |
… | … | …. |
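For what it's worth, the example can be reproduced with existing pandas calls. The snippet below is a rough sketch: the variable names are made up, and it uses `DataFrame.duplicated(subset=...)` rather than an explicit distinct count, which gives the same answer. The "choice" labels are not defined precisely above, so the sketch only separates minimal keys (no smaller unique combination inside them) from the rest, on the assumption that this is what the ranking is meant to express.

```python
from itertools import combinations

import pandas as pd

# The sample data from the table above.
orders = pd.DataFrame(
    {
        "order_id": [1, 1, 1, 2, 2, 3, 4],
        "line_id": [1, 2, 3, 1, 2, 1, 1],
        "amount": [100, 12, 45, 75, 12, 15, 45],
        "customer": ["A", "A", "A", "A", "A", "B", "B"],
        "site": ["U_250", "U_250", "U_250", "U_250", "U_250", "U_250", "U_251"],
    }
)

# The user excludes "amount" from the candidate key fields.
candidates = [col for col in orders.columns if col != "amount"]

unique_combos = []
for size in range(1, len(candidates) + 1):
    for combo in combinations(candidates, size):
        # A combination is a key when no row repeats on those fields.
        if not orders.duplicated(subset=list(combo)).any():
            unique_combos.append(combo)

# Minimal keys: unique combinations that contain no smaller unique combination.
minimal_keys = [
    combo
    for combo in unique_combos
    if not any(set(other) < set(combo) for other in unique_combos)
]

for combo in unique_combos:
    label = "minimal" if combo in minimal_keys else "superset of a smaller key"
    print(len(combo), ", ".join(combo), "->", label)
```

On this data the only minimal key is `order_id, line_id`; the two three-field combinations and the four-field one are unique only because they contain it.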
Best regards,
Simon
Alternative Solutions
N/A
Additional Context
No response