Partition by row id #32886
Hi, I retrieve parquet files from a SQL database. Typically I partition the data by month using a time-based partitions definition, something like:

```python
@asset(partitions_def=MonthlyPartitionsDefinition(..))
def example(context, conn):
    return conn.sql(f"SELECT x FROM table WHERE date = {context.partition_key}")
```

However, in some cases the `date` column is not indexed, and retrieving the data takes a very long time. We want to partition the data by the `id` column instead, meaning we retrieve the data in batches of, for instance, 1000 rows. I would like something like:

```python
@asset(partitions_def=???)
def example(context, conn):
    return conn.sql(f"SELECT x FROM table WHERE id BETWEEN {partition_key} AND {partition_key + batch_size}")
```

Should I use static or dynamic partitioning? The issue is that I don't know the total number of partitions in advance: the row count, and therefore the partition count, will change over time. I would need to run `SELECT MAX(id)` to determine it. What do you suggest?
Replies: 1 comment
Sorry, I completely misunderstood you.
Yes, use dynamic partitions, and populate them from another asset or job. Refresh them with a sensor or something similar.
Associate partition keys with id ranges: e.g. a key covers the rows from `int(key)` up to `int(key) + batch_size`.
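A minimal sketch of the key-to-range mapping described above, in plain Python. Here each partition key is the first id of its batch, and a sensor-style function computes which new keys to register given `SELECT MAX(id)`. The names `key_to_bounds` and `new_partition_keys` are illustrative helpers, not Dagster APIs; in Dagster you would feed the returned keys into a `DynamicPartitionsDefinition` via a sensor.

```python
BATCH_SIZE = 1000

def key_to_bounds(key: str, batch_size: int = BATCH_SIZE) -> tuple[int, int]:
    """A partition key is the first id of its batch; return (lo, hi) inclusive,
    suitable for `WHERE id BETWEEN lo AND hi`."""
    lo = int(key)
    return lo, lo + batch_size - 1

def new_partition_keys(existing: set[str], max_id: int,
                       batch_size: int = BATCH_SIZE) -> list[str]:
    """Keys for every batch whose start id is <= max_id and is not yet
    registered. A sensor would run this against SELECT MAX(id) and add the
    result as dynamic partitions."""
    keys = []
    start = 0
    while start <= max_id:
        key = str(start)
        if key not in existing:
            keys.append(key)
        start += batch_size
    return keys

# Example: table currently has MAX(id) == 2500, batch "0" already registered.
print(key_to_bounds("1000"))            # (1000, 1999)
print(new_partition_keys({"0"}, 2500))  # ['1000', '2000']
```

The key stays a string (partition keys are strings in Dagster), and the inclusive upper bound avoids the off-by-one overlap that `BETWEEN key AND key + batch_size` in the original snippet would cause at batch boundaries.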