Commit 4ee7fd3
feat: add stats for each field
Read the record batches from the Arrow files in the staging directory.
Run DataFusion queries to fetch the count, the distinct count,
and the count of each distinct value for every field in the dataset.
Store the results in the <dataset>_pmeta dataset.
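The three per-field statistics described above can be sketched as plain SQL aggregates. This is only an illustration under stated assumptions: it runs against Python's built-in `sqlite3` as a stand-in for DataFusion, the `logs` table, its sample rows, and the `field_stats` helper are hypothetical, and excluding NULLs from the counts is an assumption rather than something the commit message states.

```python
import sqlite3

# Hypothetical sample data standing in for record batches read from staging.
rows = [("200",), ("200",), ("404",), ("500",), ("200",), (None,)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (status_code TEXT)")
conn.executemany("INSERT INTO logs VALUES (?)", rows)


def field_stats(conn, table, field):
    """Return (count, distinct_count, {value: count}) for one field.

    Identifiers are interpolated directly into the SQL, which is acceptable
    only because this sketch uses trusted, hard-coded names.
    """
    # COUNT(col) and COUNT(DISTINCT col) both skip NULLs (assumed behaviour).
    count, distinct_count = conn.execute(
        f'SELECT COUNT("{field}"), COUNT(DISTINCT "{field}") FROM "{table}"'
    ).fetchone()
    # One row per distinct non-NULL value with its number of occurrences.
    per_value = dict(
        conn.execute(
            f'SELECT "{field}", COUNT(*) FROM "{table}" '
            f'WHERE "{field}" IS NOT NULL GROUP BY "{field}"'
        ).fetchall()
    )
    return count, distinct_count, per_value


stats = field_stats(conn, "logs", "status_code")
print(stats)
```

Running one such pair of queries per field yields the values that the commit stores as `field_stats_count`, `field_stats_distinct_count`, and the per-value `field_stats_distinct_stats_count` columns, assuming those column names map to the statistics in the obvious way.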
The UI calls the SQL query below to fetch the stats from this dataset:
```
SELECT
    field_name,
    field_count,
    distinct_count,
    distinct_value,
    distinct_value_count
FROM (
    SELECT
        field_stats_field_name as field_name,
        field_stats_distinct_stats_distinct_value as distinct_value,
        SUM(field_stats_count) as field_count,
        field_stats_distinct_count as distinct_count,
        SUM(field_stats_distinct_stats_count) as distinct_value_count,
        ROW_NUMBER() OVER (
            PARTITION BY field_stats_field_name
            ORDER BY SUM(field_stats_count) DESC
        ) as rn
    FROM <dataset>_pmeta
    WHERE field_stats_field_name = 'status_code'
        AND field_stats_distinct_stats_distinct_value IS NOT NULL
    GROUP BY field_stats_field_name, field_stats_distinct_stats_distinct_value, field_stats_distinct_count
) ranked
WHERE rn <= 5
ORDER BY field_name, distinct_value_count DESC;
```
1 parent cfd1348 commit 4ee7fd3
1 file changed: +223 -25 lines changed
- src/parseable