rfcs/20190829-tfx-container-component-execution.md
The container launcher streams the log from the user's Docker container or Kubernetes
API. It will start a thread which constantly pulls new logs and outputs them to
local stdout.

## Discussions
*   In the Argo runner, each step requires 2 pods with a total of 3 containers
    (launcher main container + launcher Argo wait container + user main
    container) to run. Although each launcher container requires minimal
    Kubernetes resources, resource usage is still a concern.
    *   The additional pod gives the launcher more control over execution and
        reduces the discrepancy between different orchestrators. We decided to
        go with the platform launcher approach, and the additional container's
        resource usage can be ignored.
*   For the executor container spec, will environment variables be supported?
    *   They are not part of the container spec right now. Most values can be
        passed down using the command line and arguments, so there is a
        workaround. Also, environment variables can be platform specific:
        Kubernetes, for example, has certain conventions that don't apply on
        other platforms. Hence, this could be part of platform_config instead
        of the spec.
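The workaround above, passing configuration through the command line and arguments rather than environment variables, can be sketched as follows. This is a minimal illustration: the field names on `ExecutorContainerSpec` and the example image and flags are assumptions for this sketch, not the authoritative API.

```python
from dataclasses import dataclass, field
from typing import List

# Minimal stand-in for the spec discussed in this RFC; the field names
# are assumptions for illustration only.
@dataclass
class ExecutorContainerSpec:
    image: str
    command: List[str] = field(default_factory=list)
    args: List[str] = field(default_factory=list)

# Configuration is passed via command line and arguments instead of
# environment variables; platform-specific env vars would belong in a
# separate platform_config.
spec = ExecutorContainerSpec(
    image="gcr.io/my-project/my-executor:0.1.0",  # hypothetical image
    command=["python", "-m", "my_executor"],      # hypothetical entrypoint
    args=["--learning-rate", "0.01", "--train-steps", "1000"],
)
```

Because the values travel as ordinary arguments, the same spec works unchanged whether the container runs locally under Docker or on Kubernetes.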
*   Step 3 indicates we first create a DAG, then use node identity to apply
    platform configuration. Another possibility is to do it directly during
    DAG construction. For example, if the user goes back and changes the DAG,
    the platform configuration may stop mapping well and will need to be
    revisited by the user. Did we consider the second option?
    *   We don't want users to add platform configuration at the pipeline
        level since it won't be portable. We want the same pipeline to be
        compatible with, say, both a local run and running on Kubernetes.
        Right now, we'd like to keep the pipeline specification itself clean
        of platform-specific abstractions.
547
+
548
+
* The current proposal uses the component name as the key for binding. Different instantiations may have late binding for their names. For example, if I have 3 ExampleGen, should we be binding to the name instead of the type?
549
+
550
+
* The names need to be unique else compilation fails. The instance names are actually component id, which is enforced to be unique at compile time.
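To illustrate the binding discussed above, here is a hedged sketch of attaching platform configuration to an already-constructed DAG, keyed by unique component id. The `Component` and `apply_platform_config` names are invented for this example and are not part of the proposal's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Component:
    id: str  # unique instance name, enforced at compile time
    platform_config: Dict[str, str] = field(default_factory=dict)

def apply_platform_config(components: List[Component],
                          config: Dict[str, Dict[str, str]]) -> None:
    """Bind per-platform settings to components by their unique id."""
    ids = [c.id for c in components]
    if len(ids) != len(set(ids)):
        # Mirrors the compile-time uniqueness check described above.
        raise ValueError("component ids must be unique")
    by_id = {c.id: c for c in components}
    for component_id, settings in config.items():
        by_id[component_id].platform_config.update(settings)

# The logical pipeline stays platform-free; Kubernetes resource requests
# are attached afterwards, keyed by instance id (ids here are hypothetical).
pipeline = [Component("example_gen"), Component("trainer")]
apply_platform_config(pipeline, {"trainer": {"cpu": "4", "memory": "8Gi"}})
```

Because the mapping lives outside the pipeline definition, the same logical pipeline can be run locally with no configuration attached at all.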
*   How is caching affected by container tags? We should be careful with using
    container tags, since these are mutable. We should rely on digests
    instead. If we cannot safely get the digest, we should disable caching so
    we don't fail due to the inability to obtain the digest at runtime. E.g.
    `latest` and `nightly` tags are not good candidates.
    *   By default, if we provide a tag name, the image will be cached in the
        cluster. We should log an error if the caching requirement cannot be
        met at runtime. Note that currently, executor code changes don't
        affect caching behaviour. We should change this and ensure caching
        takes the above into account as well.
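The digest-versus-tag rule above can be sketched as a small policy helper. This is an illustration of the policy, not the actual implementation; the set of mutable tags is an assumption.

```python
# Tags assumed to be mutable; caching keyed on them risks stale results.
MUTABLE_TAGS = {"latest", "nightly"}

def cache_key(image_ref):
    """Return a stable cache key for an image reference, or None to
    signal that caching should be disabled for this reference."""
    # Digest-pinned references (name@sha256:...) are immutable.
    if "@sha256:" in image_ref:
        return image_ref
    # Look at the last path component so a registry port
    # (e.g. localhost:5000/img) is not mistaken for a tag.
    last = image_ref.rsplit("/", 1)[-1]
    tag = last.rsplit(":", 1)[-1] if ":" in last else "latest"
    return None if tag in MUTABLE_TAGS else image_ref
```

A `None` result would translate to "skip the cache and log an error if caching was required", per the answer above.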
*   Will we ensure we have adequate test coverage?
    *   We will add e2e tests for both the Docker container and Kubernetes
        container launchers: the former using Beam as the orchestrator, and
        the latter using the Kubeflow orchestrator.
*   What are the major user-facing differences between using TFX DSL with
    these extensions compared with KFP's SDK today?
    *   There is a difference in how users specify platform-specific
        configuration in the pipeline. In KFP's SDK, the user specifies this
        in place when writing the logical pipeline. In TFX DSL, the need to
        keep the logical pipeline portable means the platform configuration
        must be specified outside the logical pipeline, which may be slightly
        more cumbersome than the KFP experience today. Note that this
        separation of configuration sometimes appears in the KFP SDK too, when
        users want to apply global settings.
*   We don't support or provide mechanisms for users to control container
    lifetime, e.g. container cancellation.
    *   A lot of cancellation operations are best effort anyway, and failures
        of cancel operations are hard to handle. Users need to understand from
        the document that we are not aiming to enable such operations.
    *   If the user starts a long-running job from a container and the
        pipeline is cancelled, users may want the container to receive this
        message and cancel gracefully.
    *   Can we guarantee that workflows will not stop until we get
        confirmation of cancellation of long-running operations?
    *   This seems difficult, and best effort may be enough, given that this
        is all Kubernetes itself does today.
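For containers that do want best-effort graceful cancellation on their own, one conventional pattern (not part of this proposal) is to trap SIGTERM, which Kubernetes sends to a pod's main process before escalating to SIGKILL. The handler body below is a hypothetical sketch.

```python
import signal
import sys

def handle_sigterm(signum, frame):
    # Best-effort cleanup: e.g. ask the long-running external job to
    # cancel here, then exit before the termination grace period ends.
    print("SIGTERM received; cancelling long-running work")
    sys.exit(0)

# The container entrypoint would register this handler at startup.
signal.signal(signal.SIGTERM, handle_sigterm)
```

This stays consistent with the best-effort stance above: the handler may never run (e.g. on SIGKILL), so nothing downstream should depend on it having completed.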