Flink tuning #14
I think we need to tune Flink before this forge can really cook. Opening this issue to start the discussion.
Here are some initial ideas:
- Anytime I scale up I start hitting ImagePullBackOff errors. To avoid this, we could clone our own Flink image and have the workers pull from it. I need to investigate the extent to which this can be set up in Terraform. Helm seems a natural place to start: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/helm/
- I believe those ImagePullBackOffs are also causing jobs to fail with `org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy`. By default, I don't think Flink will recover from any failures. Once we configure a restart strategy, I think execution will be much more reliable.
- As the Flink configuration becomes more complicated, it might make sense to keep a separate `flink-config.yaml` that's read in like `yamlencode(file("flink-config.yaml"))`. Some of this config may be tuned per runner, which is a reason for separating it from the static Terraform.
- I'm by no means an expert, but before diving deeper into Flink, we might revisit whether it's the right technology or whether Spark is a better choice for batch work. @ranchodeluxe has put in a lot getting Flink moving but has also expressed some frustration. I'm still holding out hope for Flink, but I think we'll have a better sense after some tuning. (Apparently Spark has been investigated here: Implement runner for apache spark pangeo-forge-runner#133)
- As discussed below, we can currently enable task restarts by configuring Flink via a recipe's `config.py`. But before we can enable job manager restarts (called High Availability in Flink), we'll need to add a shared filesystem where Flink can store job metadata.
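For the restart-strategy point above, here's a sketch of what the relevant Flink configuration might look like. The attempt count and delay are placeholder values I haven't tuned against our workloads:

```yaml
# Sketch: retry failed tasks instead of suppressing recovery
# (NoRestartBackoffTimeStrategy is the default when no strategy is set).
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 5   # placeholder, needs tuning
restart-strategy.fixed-delay.delay: 10 s   # placeholder, needs tuning
```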
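To illustrate the separate-`flink-config.yaml` idea, a Terraform sketch along these lines might work. The resource name, chart repository URL, and the `defaultConfiguration` values key are my assumptions about how we'd wire the operator's Helm chart, not something I've verified against our setup:

```hcl
# Sketch: feed a standalone flink-config.yaml into the operator's Helm release,
# keeping tunable Flink settings out of the static Terraform.
resource "helm_release" "flink_operator" {
  name       = "flink-kubernetes-operator"
  chart      = "flink-kubernetes-operator"
  repository = "https://downloads.apache.org/flink/flink-kubernetes-operator-1.8.0/" # assumed

  values = [
    yamlencode({
      defaultConfiguration = {
        # "defaultConfiguration" key assumed from the operator chart's values
        "flink-conf.yaml" = file("${path.module}/flink-config.yaml")
      }
    })
  ]
}
```

This would let us keep per-runner tuning in the YAML file while the Terraform stays fixed.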
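And for the High Availability point, the Flink-side configuration would roughly be the below. The storage path is a hypothetical bucket, and this still depends on us provisioning the shared storage first:

```yaml
# Sketch: Kubernetes-based HA for job manager restarts.
# Requires shared storage for job metadata, which we don't have yet.
high-availability: kubernetes
high-availability.storageDir: s3://our-flink-bucket/ha  # hypothetical bucket
```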