Flink tuning #14

@thodson-usgs

I think we need to tune Flink before this forge can really cook. Opening this issue to start the discussion.

Here are some initial ideas:

  1. Anytime I scale up, I start hitting ImagePullBackOff errors. To avoid this, we could clone our own Flink image and have the workers pull from it. I need to investigate the extent to which this can be set up in Terraform. Helm seems a natural place to start: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/helm/
  2. I believe those ImagePullBackOffs are also causing jobs to fail with org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy. By default, I don't think Flink restarts a job after any failure. Once we configure a restart strategy, I think execution will be much more reliable.
  3. As the Flink configuration becomes more complicated, it might make sense to keep a separate flink-config.yaml that's read in like yamlencode(file("flink-config.yaml")). Some of this config may be tuned per runner, which is a reason for separating it from the static Terraform.
  4. I'm by no means an expert, but before diving deeper into Flink, we might revisit whether it's the right technology or whether Spark is a better choice for batch work. @ranchodeluxe has put a lot of work into getting Flink moving but has also expressed some frustration. I'm still holding out hope for Flink, but I think we'll have a better sense after some tuning. (Apparently Spark has been investigated here: "Implement runner for apache spark", pangeo-forge-runner#133.)
  5. As discussed below, we can currently enable task restarts by configuring Flink via a recipe's config.py. But before we can enable job manager restarts (called High Availability in Flink), we'll need to add a shared filesystem where Flink can store job metadata.
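
To make items 2, 3, and 5 concrete, here is a minimal sketch of what the task-restart settings could look like when built in Python. The option names are Flink's documented fixed-delay restart keys (newer Flink releases also accept `restart-strategy.type`). The helper `restart_config`, its defaults, and the per-runner `overrides` argument are hypothetical, and the commented-out trait name at the bottom is an assumption to verify against your pangeo-forge-runner version.

```python
# Sketch only: Flink restart settings expressed as a plain dict.
# `restart_config` and its defaults are hypothetical, not project code.
from typing import Optional


def restart_config(
    attempts: int = 3,
    delay_seconds: int = 10,
    overrides: Optional[dict] = None,
) -> dict:
    """Return Flink options for a fixed-delay restart strategy.

    `overrides` lets a runner adjust or extend the shared defaults.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    if delay_seconds < 0:
        raise ValueError("delay_seconds must not be negative")

    config = {
        # Retry failed tasks instead of failing the whole job immediately.
        "restart-strategy": "fixed-delay",
        "restart-strategy.fixed-delay.attempts": str(attempts),
        "restart-strategy.fixed-delay.delay": f"{delay_seconds} s",
    }
    # Per-runner values win over the shared defaults.
    config.update(overrides or {})
    return config


flink_configuration = restart_config(attempts=3, delay_seconds=10)

# In a recipe's config.py the dict might then be handed to the runner.
# Assumption: the trait name below must be checked before use.
# c.FlinkOperatorBakery.flink_configuration = flink_configuration
```

This only covers task restarts. Job manager restarts would additionally need Flink's high-availability options and the shared storage mentioned in item 5, which are left out here.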
