Skip to content

Comments

feat: task logs for database operations#36

Merged
jason-lynch merged 12 commits intomainfrom
feat/PLAT-33/workflow-task-logs-split
Jun 2, 2025
Merged

feat: task logs for database operations#36
jason-lynch merged 12 commits intomainfrom
feat/PLAT-33/workflow-task-logs-split

Conversation

@jason-lynch
Copy link
Member

Summary

With this PR, each database operation has an associated task that can be tracked via the API. I've modified each operation endpoint to return a task alongside their normal response.

I've divided it up into more granular commits that can (hopefully!) be reviewed individually.

Details

In order to make this behavior consistent for deletes, the delete task is not deleted when a database is deleted, and it can still be fetched via API after the database is deleted. This way, API clients wont' get a 404 if they're slow to request the delete task, and they can track the delete process through to completion. The delete task will, however, get deleted if you recreate a new database with the same ID.

The backup and restore operations already had tasks associated with them, so I've introduced the concept of "parent tasks". Those existing tasks now have a parent_id that's set to the main operation's task ID.

I've also added optional timestamps to the task log API. You can enable them with a new include_timestamps parameter.

Since I was already making so many API changes (including adding a response type for the backup operation) I renamed initiate-database-backup to backup-database-node to be consistent with the other endpoints. This leaves room for a potential backup-database endpoint that backs up all nodes.

Remember to run restish api sync for each control plane server to get the latest API changes in your CLI.

PLAT-33

@jason-lynch jason-lynch requested review from mmols and tsivaprasad May 27, 2025 14:30
Base automatically changed from feat/PLAT-55/release-process to main May 28, 2025 14:51
Copy link
Member

@mmols mmols left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is working well for me - just a few questions.

return nil, fmt.Errorf("failed to list tasks: %w", err)
}
for _, t := range tasks {
// We want to leave the delete task in place so that API clients can
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if our default behavior should be to leave these tasks in the system, and handle task cleanup through a separate function / story where a retention is configurable.

If the user deletes a database, the current situation is that there is little context / logs other than the main Control Plane log.

Is there a technical reason to not keep older tasks around?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this because it felt like a strange user experience to see all of the old tasks if you recreate a database with the same identifier - especially since there's no other mechanism to remove them. That said, I might be biased because I reuse identifiers in order to make development easier. I don't know if that will be common for end-users.

If the user deletes a database, the current situation is that there is little context / logs other than the main Control Plane log.

The way this works is that the delete task is left in place, but all other tasks are removed. You're still able to list and fetch tasks for the database, but you'll only see the delete task.

If you then recreate the database with the same ID, the delete task is removed before the database is created.

Is there a technical reason to not keep older tasks around?

The only technical reason to clean these up is that they consume Etcd capacity, which is limited. But, these don't take up much space relative to the workflow records, so I mainly did this for user experience.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought more about this, and I've changed it so that tasks are not deleted when the database is deleted. They are still deleted on create, so it will clear out old tasks if you reuse a database identifier.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like a good compromise - I tested this and it meshes with what I'd expect. I think re-using the identifier won't be common for most users (compared to omitting it and having it generated).

return a.TaskSvc.AddLogLine(ctx, databaseID, taskID, msg)
}

msg := fmt.Sprintf("%s %s on host %s", verb, resource.Identifier(), a.Config.HostID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can retrieve the resource ID once and reuse it in multiple places within the function, instead of calling resource.Identifier() repeatedly. What do you think?"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I'm refactoring this a little bit for the structured logging change, so I'll make sure to include this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this as part of the change to structured logging in ea72b00.

func (a *Activities) LogTaskEvent(ctx context.Context, input *LogTaskEventInput) (*LogTaskEventOutput, error) {
logger := activity.Logger(ctx).With("database_id", input.DatabaseID.String())
logger.Info("updating database state")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add an early return to skip processing if len(input.Messages) == 0 to avoid unnecessary work. What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! That makes sense.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added this check to the logTaskEvent function that invokes this activity in 1950037.

Also renames `initiate-database-backup` to `backup-database-node` to be
consistent with other operations and to leave room for a potential
`backup-database` operation.

PLAT-33
Modifies database operation workflows to take a task ID and update the
task status during processing.

PLAT-33
Adds some already-existing fields to the task APIs and an
`include_timestamp` option to the task log API.

PLAT-33
Our task logs have a separate `timestamp` field, so disabling these
timestamps will make the task logs easier to consume consistently.

PLAT-33
@jason-lynch jason-lynch force-pushed the feat/PLAT-33/workflow-task-logs-split branch from 92172f9 to bba8a1b Compare May 31, 2025 16:05
@jason-lynch jason-lynch requested review from mmols and tsivaprasad June 2, 2025 13:03
@jason-lynch jason-lynch dismissed tsivaprasad’s stale review June 2, 2025 20:10

Fixed requested changes in ea72b00 and 1950037

@jason-lynch jason-lynch merged commit 37399cc into main Jun 2, 2025
2 checks passed
@jason-lynch jason-lynch deleted the feat/PLAT-33/workflow-task-logs-split branch June 2, 2025 20:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants