Using Hop to transform data and write to another database #6308
Replies: 2 comments
@harrisward created a sample repository that loads data from the MySQL Sakila database into a small data warehouse; you'll find some great examples there. To aggregate, use the Group By transform, which requires sorted input. For small data sets that fit entirely in memory, you can use the Memory Group By transform instead. For the deleted users: either exclude those users in your input query or filter them out with a Filter Rows transform.
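To make the two points above concrete, here is a small Python sketch (not Hop itself, just an illustration of the same semantics): filtering out deleted rows first, then sorting before grouping. Like Hop's Group By, `itertools.groupby` only merges *adjacent* rows with the same key, which is why the sort step is mandatory. The sample data and field names mirror the `Users` table described later in the thread and are made up for illustration.

```python
from itertools import groupby

users = [
    {"id": 1, "created_at": "2024-01-01", "deleted_at": None},
    {"id": 2, "created_at": "2024-01-01", "deleted_at": "2024-02-01"},
    {"id": 3, "created_at": "2024-01-02", "deleted_at": None},
]

# Filter Rows equivalent: keep only rows where deleted_at is null.
active = [u for u in users if u["deleted_at"] is None]

# Sort rows first -- groupby (like Hop's Group By) only merges adjacent
# rows sharing a key, so unsorted input would produce fragmented groups.
active.sort(key=lambda u: u["created_at"])

daily = {day: sum(1 for _ in rows)
         for day, rows in groupby(active, key=lambda u: u["created_at"])}
print(daily)  # active users created per day
```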
For the daily summary data, I use Select values (the Meta-data tab) to convert the time field into a String in the desired format, e.g. "yyyy-MM-dd" for daily counts. I then sort on that field from smallest to largest with Sort rows and do a Group By with a Sum aggregation, grouping on the "yyyy-MM-dd" field. This approach also lets you change the resolution from daily to hourly by adjusting the format to "yyyy-MM-dd HH", or to whatever level you need.
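The three steps above (convert timestamp to a formatted string, sort, group-and-sum) can be sketched in plain Python. The sample rows and the `amount` field are invented for illustration; only the format mask and the step order come from the answer above.

```python
from datetime import datetime
from itertools import groupby

rows = [
    {"ts": datetime(2024, 3, 1, 9, 30),  "amount": 2},
    {"ts": datetime(2024, 3, 2, 11, 0),  "amount": 5},
    {"ts": datetime(2024, 3, 1, 17, 45), "amount": 3},
]

fmt = "%Y-%m-%d"  # switch to "%Y-%m-%d %H" for hourly resolution
for r in rows:
    r["day"] = r["ts"].strftime(fmt)   # Select values: datetime -> string

rows.sort(key=lambda r: r["day"])      # Sort rows before grouping
summary = {day: sum(r["amount"] for r in grp)
           for day, grp in groupby(rows, key=lambda r: r["day"])}
print(summary)  # one summed row per day
```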
Hello Apache-Hop Community,
Disclaimer: I am new to Apache Hop and using it for my Master's thesis. I appreciate any guidance!
The Goal: I am building a pipeline to populate a BI fact table from a production Postgres database.
Source: A Users table containing raw transactional data (Granularity: 1 row per user).
Fields: ID, created_at, deleted_at, status etc.
Target: A Daily_Stats BI table (Granularity: 1 row per day).
Desired Fields: Date, NewUsersThatDay, TotalUserCount (Cumulative).
The Problem: I am struggling to transform the row-level user data into the aggregated daily format within a single pipeline.
In particular, I'm struggling with the different number of input fields in each part of the stream.
Furthermore, users whose deleted_at field is not null are deleted after 30 days. This means that when a user's status changes from active to deleted, I need to reduce the running count by 1; otherwise the user would still be included in the rolling count even though it was deleted.
So my questions are:
Is all of this possible in one big pipeline? Or do I have to split it up into several?
How do I use scripting, SQL, or any other transform to calculate the rolling count, especially since I need the last rolling count as input from the same database table I will be writing my output to?
How do I calculate the NewUsersThatDay field, since I have to aggregate not within a row but across all the rows from that day?
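For what it's worth, here is a hedged Python sketch of one way the rolling count and NewUsersThatDay could be computed together, including the minus-one adjustment when a user is deleted. The seed value standing in for "last rolling count read from the Daily_Stats target table" and all sample rows are assumptions for illustration, not a statement of how Hop would do it.

```python
from collections import Counter

# Sample Users rows (made up); field names follow the question.
users = [
    {"id": 1, "created_at": "2024-01-01", "deleted_at": None},
    {"id": 2, "created_at": "2024-01-01", "deleted_at": "2024-01-03"},
    {"id": 3, "created_at": "2024-01-02", "deleted_at": None},
]

# Count creations and deletions per day.
created = Counter(u["created_at"] for u in users)
deleted = Counter(u["deleted_at"] for u in users if u["deleted_at"])

# In the real pipeline this seed would be SELECTed from Daily_Stats,
# so the cumulative count continues from the last written row.
last_total = 0

daily_stats = []
for day in sorted(set(created) | set(deleted)):
    new_users = created.get(day, 0)
    # Add new users, subtract users deleted that day.
    last_total += new_users - deleted.get(day, 0)
    daily_stats.append({"date": day,
                        "new_users_that_day": new_users,
                        "total_user_count": last_total})
print(daily_stats)
```

Note the design choice: the cumulative total is carried forward in a single variable rather than recomputed from scratch, which is what makes seeding it from the target table work across pipeline runs.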
Thank you so much in advance