@mxm (Contributor) commented Feb 4, 2025

This adds state size, i.e. the size of the last completed checkpoint, to the
deployment status. It also exposes the state size as a deployment metric.
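
As a rough illustration of what "the size of the last completed checkpoint" means, the sketch below picks the state size from the most recent completed entry in a checkpoint history. It is a minimal, self-contained example, not the operator's actual implementation: `StateSizeSketch`, `Checkpoint`, and `lastCompletedStateSize` are hypothetical names introduced only for this illustration.

```java
import java.util.List;
import java.util.OptionalLong;

// Hypothetical sketch: none of these names come from the operator code base.
final class StateSizeSketch {

    /** Minimal stand-in for one entry of a job's checkpoint history. */
    static final class Checkpoint {
        final long id;
        final boolean completed;
        final long stateSizeBytes;

        Checkpoint(long id, boolean completed, long stateSizeBytes) {
            this.id = id;
            this.completed = completed;
            this.stateSizeBytes = stateSizeBytes;
        }
    }

    /** Returns the state size of the completed checkpoint with the highest id, if any. */
    static OptionalLong lastCompletedStateSize(List<Checkpoint> history) {
        Checkpoint latest = null;
        for (Checkpoint c : history) {
            // Skip checkpoints that did not complete (in progress or failed).
            if (c.completed && (latest == null || c.id > latest.id)) {
                latest = c;
            }
        }
        return latest == null ? OptionalLong.empty() : OptionalLong.of(latest.stateSizeBytes);
    }

    public static void main(String[] args) {
        List<Checkpoint> history = List.of(
                new Checkpoint(1, true, 1_000L),
                new Checkpoint(2, true, 4_000L),
                new Checkpoint(3, false, 0L)); // not completed, so ignored
        System.out.println(lastCompletedStateSize(history));
    }
}
```

Returning an empty `OptionalLong` when no checkpoint has completed keeps "unknown" distinct from a genuine state size of zero, which matters if the value is later reported in a status field or as a metric.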

@mxm mxm requested a review from gyfora February 4, 2025 16:04
@mxm mxm force-pushed the FLINK-37253 branch 5 times, most recently from f0ecf40 to dd22b3d Compare February 5, 2025 12:04
return tmTotalMemory + jmTotalMemory;
}

public static Long calculateClusterStateSize(Configuration conf, int taskManagerReplicas) {
Contributor

Shouldn't this be called totalClusterMemorySize instead of state?

Contributor

Also I don't think this is used anywhere... :)

Contributor Author

State size or checkpoint size isn't directly related to the cluster memory size. For the heap memory backend, we would expect the state size to be lower than the overall memory. For RocksDB, it could even exceed the cluster memory.

Contributor

OK, but I still don't understand three things:

  • Where is this used?
  • Why do we need this bad approximation if state size metrics are available from Flink?
  • This is basically just total memory, so why do we call it state size?

Contributor Author

I didn't see your comment in #941 (comment); it was somehow hidden when I replied.

Sorry, this was unused code. I have removed it.

mxm added 2 commits February 5, 2025 16:07
…rics

@gyfora (Contributor) commented Feb 5, 2025

Have you tested this in a (local) Kubernetes environment with different Flink versions? Does it work as expected?

@gyfora (Contributor) left a comment

If this was manually tested for correctness with the supported Flink versions and the e2e tests pass, then it's good to go.

@mxm (Contributor, Author) commented Feb 5, 2025

Tried it out on a local k8s cluster with various Flink versions:

(Four screenshots attached.)

@mxm mxm merged commit b7d6f9d into apache:main Feb 7, 2025
115 of 118 checks passed
@mxm mxm deleted the FLINK-37253 branch February 7, 2025 09:49