
Commit c8b0981

Update the discussions
Add discussions from the review meeting and remove the cancellation section, which is out of scope for the initial RFC.
1 parent 661fad6 commit c8b0981

File tree

1 file changed: +35 -16 lines changed


rfcs/20190829-tfx-container-component-execution.md

Lines changed: 35 additions & 16 deletions
@@ -528,24 +528,43 @@ The container launcher streams the log from user’s docker container or Kuberne
API. It will start a thread which constantly pulls new logs and outputs them to
local stdout.

-### Cancellation
-
-How the container launcher handles cancellation request varies by orchestrators:
-
-* Airflow natively supports cancellation propagation to operator. We will need
-  to pass the cancellation request from operator into executor.
-* Argo doesn’t natively support cancellation propagation. Currently KFP relies
-  on Argo’s pod annotation to workaround the limitation and has been proven
-  to work. We will use the same process to propagate cancellation requests to
-  user’s container.
-
-In order to allow the user to specify the cancellation command line entrypoint, the Kubernetes
-pod launcher will support an optional parameter called `cancellation_command`
-from `ExecutorContainerSpec`.
-
-## Open discussions
+## Discussions

* In the Argo runner, each step requires 2 pods with total 3 containers (launcher
  main container + launcher argo wait container + user main container) to run.
  Although each launcher container requires minimal Kubernetes resources,
  resource usage is still a concern.
+
+    * With an additional pod, the launcher gets more control over execution, and the discrepancy between different orchestrators is reduced. We decided to go with the platform launcher approach; the additional container’s resource usage can be ignored.
+
+* For the executor container spec, will environment variables be supported?
+
+    * They are not part of the container spec right now. Most values can be passed down using the command line and arguments, so there is a workaround today. Also, environment variables can be platform specific; Kubernetes, for example, has certain conventions that don’t apply on other platforms. Hence, this could be part of platform_config instead of the spec (see the sketch after the diff).
+
+* Step 3 indicates we first create a DAG, then use node identity to apply platform configuration. Another possibility is to do it directly during DAG construction. For example, if the user goes back and changes the DAG, the platform configuration may stop mapping well and will need to be revisited by the user. Did we consider the second option?
+
+    * We don’t want users to add platform configuration at the pipeline level, since it won’t be portable. We want the same pipeline to be compatible with a local run and, say, running on Kubernetes. Right now, we’d like to keep the pipeline specification itself clean of platform-specific abstractions (see the binding sketch after the diff).
+
+* The current proposal uses the component name as the key for binding. Different instantiations may have late binding for their names. For example, if I have 3 ExampleGen instances, should we be binding to the name instead of the type?
+
+    * The names need to be unique, or compilation fails. The instance names are actually component ids, which are enforced to be unique at compile time.
+
+* How is caching affected by container tags? We should be careful with container tags, since they are mutable. We should rely on the digest instead. If we cannot safely get the digest, we should disable caching so we don’t fail due to the inability to obtain the digest at runtime. E.g. ‘latest’ and ‘nightly’ tags are not good candidates.
+
+    * By default, if we provide a tag name, the image will be cached in the cluster. We should log an error if the caching requirement cannot be met at runtime. Note that, currently, executor code changes don’t affect caching behavior; we should change this so that caching takes executor code changes into account as well (see the digest sketch after the diff).
+
+* Will we ensure we have adequate test coverage?
+
+    * We will add e2e tests for both the Docker container launcher and the Kubernetes container launcher, the former using Beam as the orchestrator and the latter using the Kubeflow orchestrator.
+
+
+* What are the major user-facing differences between using the TFX DSL with these extensions compared with KFP’s SDK today?
+
+    * There is a difference in how users specify platform-specific configuration in the pipeline. In KFP’s SDK, the user specifies this in place when writing the logical pipeline. In the TFX DSL, the need to keep the logical pipeline portable means the platform configuration has to be specified outside the logical pipeline, which may be slightly more cumbersome than the KFP experience today. Note that this separation of configuration sometimes appears in the KFP SDK too, when users want to apply global settings.
+
+* We don’t support or provide mechanisms for users to control container lifetime, e.g. container cancellation.
+
+    * A lot of cancellation operations are best effort anyway. Failures on cancel operations are hard to handle. Users need to understand from the document that we are not aiming to enable such operations.
+    * If the user starts a long-running job from a container and the pipeline is canceled, they may want the container to receive this message and cancel gracefully.
+    * Can we guarantee that workflows will not stop until we get confirmation of cancellation of long-running operations?
+        * This seems difficult, and best effort may be enough, given that this is all Kubernetes itself does today.
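
The unchanged context lines at the top of the diff describe the launcher streaming logs from the user’s Docker container or the Kubernetes API on a background thread. Purely as an illustration of that pattern, and not the launcher’s actual implementation, here is a minimal sketch that assumes the Docker Python SDK (the RFC does not prescribe a client library):

```python
import sys
import threading

import docker  # Docker Python SDK; an assumption made for this sketch only.


def stream_container_logs(container) -> threading.Thread:
  """Starts a daemon thread that copies a container's log stream to stdout.

  This only sketches the behaviour described in the RFC: a thread that
  constantly pulls new logs and writes them to local stdout.
  """

  def _pump():
    # follow=True keeps the stream open until the container exits.
    for chunk in container.logs(stream=True, follow=True):
      sys.stdout.write(chunk.decode('utf-8', errors='replace'))
      sys.stdout.flush()

  thread = threading.Thread(target=_pump, daemon=True)
  thread.start()
  return thread


if __name__ == '__main__':
  client = docker.from_env()
  container = client.containers.run('busybox', ['echo', 'hello'], detach=True)
  log_thread = stream_container_logs(container)
  container.wait()              # Block until the user container finishes.
  log_thread.join(timeout=5)    # Give the log pump a moment to drain.
```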
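
For the environment-variable discussion, here is a hedged sketch of the suggested workaround: pass values through the container spec’s command and args, and keep platform-specific settings in a separate platform configuration. The `ExecutorContainerSpec` fields follow the image/command/args shape used elsewhere in this RFC, while `KubernetesPlatformConfig` and its `env` field are hypothetical placeholders, not part of the proposal:

```python
from typing import Dict, List, NamedTuple


class ExecutorContainerSpec(NamedTuple):
  # Portable container spec: image plus command-line entrypoint and arguments.
  image: str
  command: List[str]
  args: List[str]


class KubernetesPlatformConfig(NamedTuple):
  # Hypothetical holder for platform-specific knobs such as env vars,
  # kept out of the portable pipeline definition.
  env: Dict[str, str]


# Workaround discussed above: pass values as command-line arguments...
trainer_spec = ExecutorContainerSpec(
    image='gcr.io/my-project/trainer:1.0',
    command=['python', '-m', 'trainer.task'],
    args=['--learning-rate', '0.001', '--num-steps', '10000'],
)

# ...while anything genuinely platform specific (e.g. Kubernetes-style
# environment variables) would travel with the platform configuration.
trainer_platform_config = KubernetesPlatformConfig(
    env={'GOOGLE_APPLICATION_CREDENTIALS': '/secrets/sa.json'},
)
```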
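
For the binding discussion, here is a hedged sketch of the “build the portable DAG first, then attach platform settings keyed by component id” idea. The `Pipeline`, `Component`, and `apply_platform_config` names below are illustrative stand-ins, not the RFC’s actual API:

```python
from typing import Dict, List


class Component:
  def __init__(self, component_id: str):
    # Component ids are enforced to be unique at compile time, so they can
    # serve as the binding key for platform configuration.
    self.id = component_id
    self.platform_config: Dict[str, object] = {}


class Pipeline:
  def __init__(self, components: List[Component]):
    self.components = components


def apply_platform_config(pipeline: Pipeline,
                          config_by_id: Dict[str, Dict[str, object]]) -> None:
  """Binds per-node platform settings onto an already-constructed DAG."""
  for component in pipeline.components:
    if component.id in config_by_id:
      component.platform_config.update(config_by_id[component.id])


# The logical pipeline stays portable...
pipeline = Pipeline([Component('ExampleGen'), Component('Trainer')])

# ...and Kubernetes-only settings are layered on afterwards, keyed by
# component id rather than written into the pipeline definition.
apply_platform_config(pipeline, {
    'Trainer': {'resources': {'limits': {'nvidia.com/gpu': 1}}},
})
```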
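
For the caching discussion, here is a small sketch of the digest-versus-tag check, under the assumption that only digest-pinned image references are usable as stable cache keys; `cache_key_or_none` is a hypothetical helper, not part of the proposal:

```python
def is_digest_pinned(image_reference: str) -> bool:
  """Returns True if the image is referenced by an immutable digest."""
  return '@sha256:' in image_reference


def cache_key_or_none(image_reference: str):
  """Returns a cache key for the executor image, or None to disable caching."""
  if is_digest_pinned(image_reference):
    return image_reference
  # Mutable tags such as 'latest' or 'nightly' cannot be trusted as cache
  # keys: disable caching instead of failing at runtime.
  return None


assert cache_key_or_none(
    'gcr.io/my-project/trainer@sha256:' + 'a' * 64) is not None
assert cache_key_or_none('gcr.io/my-project/trainer:latest') is None
```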

0 commit comments
