Using Hop to transform data and write to another database #6308
Replies: 2 comments
@harrisward created a sample repository that loads data from the MySQL Sakila database into a small data warehouse; you'll find some great examples there. To aggregate, use the Group By transform, which requires sorted input. For small data sets that fit entirely in memory, you can use the Memory Group By transform instead. For the deleted users: either exclude those users in your input query or filter them out with a Filter Rows transform.
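To make the two points above concrete, here is a small Python sketch (not Hop itself, just an illustration of the same semantics): filtering out deleted rows first, then sorting before grouping. Like Hop's Group By, `itertools.groupby` only merges *adjacent* rows with the same key, which is why the sort step is mandatory. The sample data and field names mirror the `Users` table described later in the thread and are made up for illustration.

```python
from itertools import groupby

users = [
    {"id": 1, "created_at": "2024-01-01", "deleted_at": None},
    {"id": 2, "created_at": "2024-01-01", "deleted_at": "2024-02-01"},
    {"id": 3, "created_at": "2024-01-02", "deleted_at": None},
]

# Filter Rows equivalent: keep only rows where deleted_at is null.
active = [u for u in users if u["deleted_at"] is None]

# Sort rows first -- groupby (like Hop's Group By) only merges adjacent
# rows sharing a key, so unsorted input would produce fragmented groups.
active.sort(key=lambda u: u["created_at"])

daily = {day: sum(1 for _ in rows)
         for day, rows in groupby(active, key=lambda u: u["created_at"])}
print(daily)  # active users created per day
```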
For the daily summary data, I use Select values (the Meta-data tab) to convert the time field into a String in the desired format, e.g. "yyyy-MM-dd" for daily counts. I then sort on that field from smallest to largest with Sort rows and do a Group By with a Sum aggregation, grouping on the "yyyy-MM-dd" field. This approach also lets you change the resolution from daily to hourly by adjusting the format to "yyyy-MM-dd HH", or to whatever level you need.
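The three steps above (convert timestamp to a formatted string, sort, group-and-sum) can be sketched in plain Python. The sample rows and the `amount` field are invented for illustration; only the format mask and the step order come from the answer above.

```python
from datetime import datetime
from itertools import groupby

rows = [
    {"ts": datetime(2024, 3, 1, 9, 30),  "amount": 2},
    {"ts": datetime(2024, 3, 2, 11, 0),  "amount": 5},
    {"ts": datetime(2024, 3, 1, 17, 45), "amount": 3},
]

fmt = "%Y-%m-%d"  # switch to "%Y-%m-%d %H" for hourly resolution
for r in rows:
    r["day"] = r["ts"].strftime(fmt)   # Select values: datetime -> string

rows.sort(key=lambda r: r["day"])      # Sort rows before grouping
summary = {day: sum(r["amount"] for r in grp)
           for day, grp in groupby(rows, key=lambda r: r["day"])}
print(summary)  # one summed row per day
```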
Hello Apache-Hop Community,
Disclaimer: I am new to Apache Hop and using it for my Master's thesis. I appreciate any guidance!
The Goal: I am building a pipeline to populate a BI fact table from a production Postgres database.
Source: A Users table containing raw transactional data (Granularity: 1 row per user).
Fields: ID, created_at, deleted_at, status etc.
Target: A Daily_Stats BI table (Granularity: 1 row per day).
Desired Fields: Date, NewUsersThatDay, TotalUserCount (Cumulative).
The Problem: I am struggling to transform the row-level user data into the aggregated daily format within a single pipeline.
In particular, I'm struggling with the different number of input fields in each part of the stream.
Furthermore, users whose deleted_at field is not null are deleted after 30 days. This means that when a user's status changes from active to deleted, I need to reduce the running count by 1; otherwise the user would still be included in the rolling count even though it was deleted.
So my questions are:
Is all of this possible in one big pipeline? Or do I have to split it up into several?
How do I use scripting, SQL, or any other transform to calculate the rolling count, especially since I need the last rolling count as input from the same database table I will be writing my output to?
How do I calculate the NewUsersThatDay field, since I have to aggregate not within a row but across all the rows from that day?
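For what it's worth, here is a hedged Python sketch of one way the rolling count and NewUsersThatDay could be computed together, including the minus-one adjustment when a user is deleted. The seed value standing in for "last rolling count read from the Daily_Stats target table" and all sample rows are assumptions for illustration, not a statement of how Hop would do it.

```python
from collections import Counter

# Sample Users rows (made up); field names follow the question.
users = [
    {"id": 1, "created_at": "2024-01-01", "deleted_at": None},
    {"id": 2, "created_at": "2024-01-01", "deleted_at": "2024-01-03"},
    {"id": 3, "created_at": "2024-01-02", "deleted_at": None},
]

# Count creations and deletions per day.
created = Counter(u["created_at"] for u in users)
deleted = Counter(u["deleted_at"] for u in users if u["deleted_at"])

# In the real pipeline this seed would be SELECTed from Daily_Stats,
# so the cumulative count continues from the last written row.
last_total = 0

daily_stats = []
for day in sorted(set(created) | set(deleted)):
    new_users = created.get(day, 0)
    # Add new users, subtract users deleted that day.
    last_total += new_users - deleted.get(day, 0)
    daily_stats.append({"date": day,
                        "new_users_that_day": new_users,
                        "total_user_count": last_total})
print(daily_stats)
```

Note the design choice: the cumulative total is carried forward in a single variable rather than recomputed from scratch, which is what makes seeding it from the target table work across pipeline runs.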
Thank you so much in advance