Flink tuning #14
I think we need to tune Flink before this forge can really cook. Opening this issue to start the discussion.
Here are some initial ideas:
- Anytime I scale up I start hitting ImagePullBackOff errors. To avoid this, we could clone our own Flink image and have the workers pull from it. I need to investigate the extent to which this can be set up in Terraform. Helm seems a natural place to start: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/helm/
- I believe those ImagePullBackOffs are also causing jobs to fail with `org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy`. By default, I don't think Flink will recover from any failures. Once we configure a restart strategy, I think execution will be much more reliable.
- As the Flink configuration becomes more complicated, it might make sense to keep a separate `flink-config.yaml` that's read in like `yamlencode(file("flink-config.yaml"))`. Some of this config may be tuned per runner, which is a reason for separating it from the static Terraform.
- I'm by no means an expert, but before diving deeper into Flink, we might revisit whether it's the right technology or whether Spark is a better choice for batch work. @ranchodeluxe has put in a lot getting Flink moving but has also expressed some frustration. I'm still holding out hope for Flink, but I think we'll have a better sense after some tuning. (Apparently Spark has been investigated here: Implement runner for apache spark pangeo-forge-runner#133)
- As discussed below, we can currently enable task restarts by configuring Flink via a recipe's `config.py`. But before we can enable job manager restarts (called High Availability in Flink), we'll need to add a shared filesystem where Flink can store job metadata.
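For the restart-strategy point above, here's a sketch of what the relevant Flink configuration might look like. The attempt count and delay are placeholder values I haven't tuned against our workloads:

```yaml
# Sketch: retry failed tasks instead of suppressing recovery
# (NoRestartBackoffTimeStrategy is the default when no strategy is set).
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 5   # placeholder, needs tuning
restart-strategy.fixed-delay.delay: 10 s   # placeholder, needs tuning
```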
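To illustrate the separate-`flink-config.yaml` idea, a Terraform sketch along these lines might work. The resource name, chart repository URL, and the `defaultConfiguration` values key are my assumptions about how we'd wire the operator's Helm chart, not something I've verified against our setup:

```hcl
# Sketch: feed a standalone flink-config.yaml into the operator's Helm release,
# keeping tunable Flink settings out of the static Terraform.
resource "helm_release" "flink_operator" {
  name       = "flink-kubernetes-operator"
  chart      = "flink-kubernetes-operator"
  repository = "https://downloads.apache.org/flink/flink-kubernetes-operator-1.8.0/" # assumed

  values = [
    yamlencode({
      defaultConfiguration = {
        # "defaultConfiguration" key assumed from the operator chart's values
        "flink-conf.yaml" = file("${path.module}/flink-config.yaml")
      }
    })
  ]
}
```

This would let us keep per-runner tuning in the YAML file while the Terraform stays fixed.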
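And for the High Availability point, the Flink-side configuration would roughly be the below. The storage path is a hypothetical bucket, and this still depends on us provisioning the shared storage first:

```yaml
# Sketch: Kubernetes-based HA for job manager restarts.
# Requires shared storage for job metadata, which we don't have yet.
high-availability: kubernetes
high-availability.storageDir: s3://our-flink-bucket/ha  # hypothetical bucket
```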