@@ -12,7 +12,7 @@ This RFC proposes an orchestrator agnostic way to reliably execute a user’s
container in the TFX pipeline. The proposal can support:

* Running an arbitrary container in either a local Docker environment or a remote
-   k8s cluster.
+   Kubernetes cluster.
* Passing data into the container
* Passing output data from the container
* Capturing logs from the container
@@ -21,9 +21,9 @@ container in the TFX pipeline. The proposal can support:

## Motivation

- Currently, in a TFX pipeline, there is no way to execute a generic container as
- one of its steps. Without this feature, users cannot bring their own containers
- into the pipeline. This blocks following use cases:
+ Currently, the execution of a generic container as a step in a TFX pipeline is
+ not supported. Without this feature, users cannot bring their own containers
+ into the pipeline. This blocks the following use cases:

* User already has a docker image and wants to run the image as one of the
  steps in a TFX pipeline.
@@ -42,7 +42,7 @@ The execution may occurs in local Docker container or in a remote Kubernetes clu

Today, KFP’s ContainerOp leverages
[Argo container template API](https://github.com/argoproj/argo/blob/master/pkg/apis/workflow/v1alpha1/workflow_types.go)
- to launch user’s container in a k8s pod. Argo, as the orchestrator, controls when
+ to launch user’s container in a Kubernetes pod. Argo, as the orchestrator, controls when
to launch the POD and it uses a sidecar container to report output files back
and wait for user’s container to complete. We are not proposing to use Argo API
because of the following reasons:
@@ -55,9 +55,9 @@ because of the following reasons:
* Argo doesn’t provide an easy way to recover from user’s transient errors,
  which is critical in production workload.

- #### Airflow k8s pod operator
+ #### Airflow Kubernetes pod operator

- Airflow supports launching a k8s pod by an
+ Airflow supports launching a Kubernetes pod by an
[operator](https://github.com/apache/airflow/blob/master/airflow/contrib/operators/kubernetes_pod_operator.py).
This approach is closer to what we are proposing in the document. However, we
cannot directly use the operator because:
@@ -76,12 +76,11 @@ cannot directly use the operator because:

### TLDR

- We propose to solve the above problems by the following design.
+ We propose to solve the above problems with the following design:

- * Define container as an executor spec.
- * Launch container by component launcher in either local docker or k8s pod.
- * Use platform config to specify platform specific settings like k8s pod
-   config.
+ * Define a container as an executor spec.
+ * Launch a container via a component launcher in either a local Docker environment or a Kubernetes pod.
+ * Use a platform config to specify platform-specific settings such as the Kubernetes pod config.

The proposed solution has the following parts:

@@ -92,9 +91,9 @@ The proposed solution has the following parts:
  * `DockerComponentLauncher` which launches `ExecutorContainerSpec` in
    a Docker environment.
  * `KubernetesPodComponentLauncher` which launches `ExecutorContainerSpec`
-     in a k8s environment.
+     in a Kubernetes environment.
* Extensible `PlatformConfig` framework.
-   * `KubernetesPodPlatformConfig` to support k8s pod spec as a config.
+   * `KubernetesPodPlatformConfig` to support Kubernetes pod spec as a config.
  * `DockerPlatformConfig` to support docker run configs.

### Architecture
@@ -105,7 +104,7 @@ Architecture that allows local container execution.

Architecture that allows Kubernetes container execution.

- ![TFX k8s container execution](20190829-tfx-container-component-execution/tfx-k8s-container-execution.png)
+ ![TFX Kubernetes container execution](20190829-tfx-container-component-execution/tfx-Kubernetes-container-execution.png)

Class diagram that allows container execution

@@ -114,8 +113,7 @@ Class diagram that allows container execution
### Python DSL experience

In order to use container base component in TFX DSL, user needs follow these
- steps. Step 1 and Step 2 follow the DSL extension proposed by the other RFC
- (https://github.com/tensorflow/community/pull/146).
+ steps. Step 1 and Step 2 follow the DSL extension proposed by [TFX Generic Container-based Component](https://github.com/tensorflow/community/pull/146).

#### Step 1: Define the container based component by `ExecutorContainerSpec`

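The concrete snippet for Step 1 falls outside the hunks shown here. As a hedged illustration only, a container-based executor could be declared with `ExecutorContainerSpec` roughly as follows; the module path and the field names (`image`, `command`, `args`) are assumptions, not text from this diff.

```python
# Sketch only: the import path and constructor fields are assumed.
from tfx.components.base.executor_spec import ExecutorContainerSpec

# A container-based executor: run grep inside a stock Alpine image.
grep_executor_spec = ExecutorContainerSpec(
    image='alpine:3.10',
    command=['grep'],
    args=['-e', 'foo', '/data/input.txt'],
)
```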
@@ -169,7 +167,7 @@ _ = BeamRunner(platform_configs={
}).run(create_pipeline())
```

- #### Step 3(b): Set k8s platform config via runner’s config
+ #### Step 3(b): Set Kubernetes platform config via runner’s config

```python
_ = KubeflowDagRunner(platform_configs={
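# Sketch, not RFC text: the hunk above cuts this example off right after the
# opening brace. Mirroring the Step 3(a) call, a completed invocation might
# look roughly like the following, where the mapping key and the
# KubernetesPodPlatformConfig argument are assumptions about the proposed API:
#
#   _ = KubeflowDagRunner(platform_configs={
#       'MyContainerComponent': [KubernetesPodPlatformConfig(my_pod_spec)],
#   }).run(create_pipeline())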
@@ -199,7 +197,7 @@ different target platforms. For example:
  process.
* `DockerComponentLauncher` can launch a container executor in a Docker
  environment.
- * `KubernetesPodComponentLauncher` can launch a container executor in a k8s
+ * `KubernetesPodComponentLauncher` can launch a container executor in a Kubernetes
  environment.
* A Dataflow launcher can launch a beam executor in Dataflow service.

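To make the launcher extensibility described above concrete, here is a minimal sketch (not taken from the RFC) of how a runner could pick a launcher: each launcher class reports whether it can handle a given executor spec and platform config, and the runner takes the first match. All names below are illustrative.

```python
from typing import Any, Sequence, Type


class BaseComponentLauncher:
  """A launcher knows how to run one kind of executor spec on one platform."""

  @classmethod
  def can_launch(cls, executor_spec: Any, platform_config: Any) -> bool:
    raise NotImplementedError


def select_launcher(
    executor_spec: Any,
    platform_config: Any,
    launcher_classes: Sequence[Type[BaseComponentLauncher]],
) -> Type[BaseComponentLauncher]:
  """Returns the first launcher class that supports both inputs."""
  for launcher_cls in launcher_classes:
    if launcher_cls.can_launch(executor_spec, platform_config):
      return launcher_cls
  raise ValueError('No registered launcher supports this executor spec and platform config.')
```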
@@ -274,7 +272,7 @@ class KubernetesPodComponentLauncher(BaseComponentLauncher):
      input_dict: Dict[Text, List[types.Artifact]],
      output_dict: Dict[Text, List[types.Artifact]],
      exec_properties: Dict[Text, Any]) -> None:
-     # k8s pod launcher implementation
+     # Kubernetes pod launcher implementation
      …
```

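The method body above is elided in the hunk. Purely as an assumption about what a Kubernetes pod launcher could do with the official Kubernetes Python client, the core of the launch might create a pod for the user's container and poll it until it reaches a terminal phase:

```python
import time

from kubernetes import client


def _launch_and_wait(core_api: client.CoreV1Api, namespace: str, pod_name: str,
                     image: str, command) -> str:
  """Creates a pod for the user's container and blocks until it terminates."""
  pod = client.V1Pod(
      metadata=client.V1ObjectMeta(name=pod_name),
      spec=client.V1PodSpec(
          restart_policy='Never',
          containers=[client.V1Container(name='main', image=image, command=command)],
      ),
  )
  core_api.create_namespaced_pod(namespace=namespace, body=pod)
  while True:
    # Poll the pod phase until the user's container finishes.
    phase = core_api.read_namespaced_pod(pod_name, namespace).status.phase
    if phase in ('Succeeded', 'Failed'):
      return phase
    time.sleep(10)
```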
@@ -467,7 +465,7 @@ definitions:
```

The output.json file is optional, but if the user’s container writes to the file. It
- overrides the default handling of the k8s pod launcher. The output fields are:
+ overrides the default handling of the Kubernetes pod launcher. The output fields are:

* error_status: tells the executor whether it should retry or fail
* outputs and exec_properties: used to override the execution and
@@ -478,55 +476,55 @@ MLMD from executor.
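As a sketch of how a launcher might consume this optional file: only the field names (`error_status`, `outputs`, `exec_properties`) come from the text above; the file location, the `retriable` flag, and the exact structure are assumptions.

```python
import json
import os


def load_container_output(workdir: str):
  """Returns the parsed output.json, or None to keep the launcher's default handling."""
  path = os.path.join(workdir, 'output.json')
  if not os.path.exists(path):
    return None
  with open(path) as f:
    return json.load(f)


def should_retry(output) -> bool:
  """Assumed convention: error_status carries a flag marking the error retriable."""
  if not output:
    return False
  return bool((output.get('error_status') or {}).get('retriable'))
```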

### Auth context resolution

- The k8s pod launcher internally uses the k8s Python client. The auth context resolution
+ The Kubernetes pod launcher internally uses the Kubernetes Python client. The auth context resolution
logic is as follows:

1.  If the current env is in a cluster, use `load_incluster_config` to load k8s
    context.
- 1.  If not, use default k8s active context to connect to remote cluster.
+ 1.  If not, use the default Kubernetes active context to connect to the remote cluster.

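A minimal sketch of this resolution order using the official Kubernetes Python client (the helper function itself is illustrative):

```python
from kubernetes import client, config


def make_core_api() -> client.CoreV1Api:
  """Prefers the in-cluster service account, then falls back to the active kubeconfig context."""
  try:
    config.load_incluster_config()
  except config.ConfigException:
    config.load_kube_config()
  return client.CoreV1Api()
```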
### Pod launcher resiliency

In this design section, we focused more on the launcher resiliency under
`KubeflowDAGRunner`. In `AirflowDAGRunner`, the launcher code is running in the
- same process of Airflow orchestrator which we rely on Airflow to ensure its
- resiliency. `BeamDAGRunner`, however, is considered mainly for local testing
+ same process as the Airflow orchestrator, and we rely on Airflow to ensure the
+ resiliency of that process. `BeamDAGRunner`, however, is considered mainly for local testing
purpose and we won’t add support for it to be resilient.

In `KubeflowDAGRunner`, a pipeline step will create two pods in order to execute
user’s container:

- * A launcher pod which contains the driver, k8s pod launcher, and publisher code.
+ * A launcher pod which contains the driver, Kubernetes pod launcher, and publisher code.
* A user pod with user’s container.

- A pod in k8s is not resilient by itself. We will use Argo’s retry feature to make
+ A pod in Kubernetes is not resilient by itself. We will use Argo’s retry feature to make
the launcher pod partially resilient. The details are as follows:

* Each Argo launcher step will be configured with a default retry count.
* Argo will retry the step in case of failure, no matter what type of error.
* The launcher container will create a tmp workdir in `pipeline_root`.
* It will keep intermediate results (for example, the ID of the created pod) in the tmp workdir.
- * The k8s pod launcher will be implemented in a way that it will resume the
+ * The Kubernetes pod launcher will be implemented in a way that it will resume the
  operation based on the intermediate results in the tmp workdir.
* The launcher will also record a permanent failure data in the tmp workdir so
  it won’t resume the operation in case of non-retriable failures.

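A sketch of the resume behavior described in the list above; the file name and layout of the tmp workdir are assumptions. The idea is that a retried launcher step re-attaches to the pod recorded by an earlier attempt instead of creating a duplicate:

```python
import os


def get_or_create_pod(workdir: str, create_pod_fn) -> str:
  """Returns the pod name, creating the pod only if no earlier attempt recorded one."""
  os.makedirs(workdir, exist_ok=True)
  record_path = os.path.join(workdir, 'pod_name.txt')
  if os.path.exists(record_path):
    with open(record_path) as f:
      return f.read().strip()  # Resume: the pod was created by a previous attempt.
  pod_name = create_pod_fn()   # e.g. the create_namespaced_pod call sketched earlier.
  with open(record_path, 'w') as f:
    f.write(pod_name)
  return pod_name
```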
### Default retry strategy

K8s pod launcher supports exponential backoff retry. This strategy applies to
- all runners which can support k8s pod launcher. Docker launchers are not in the
+ all runners which can support the Kubernetes pod launcher. Docker launchers are not in the
scope of the design as it is mainly for local development use case.

The retry only happens if the error is retriable. An error is retriable only
when:

- * It’s a transient error code from k8s pod API.
- * The output.json file from artifact store indicates it’s a retriable error.
- * The pod get deleted (For example: GKE preemptible pod feature).
+ * It’s a transient error code from the Kubernetes pod API.
+ * Or, the output.json file from the artifact store indicates it’s a retriable error.
+ * Or, the pod gets deleted (for example, by the GKE preemptible pod feature).

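These rules can be read as a retry predicate feeding an exponential backoff loop. A generic sketch, with the `launch_once` and `is_retriable_error` hooks assumed rather than taken from the RFC:

```python
import time


def launch_with_retry(launch_once, is_retriable_error, max_retries: int = 5,
                      initial_delay_sec: float = 1.0, backoff_factor: float = 2.0):
  """Retries launch_once() with exponential backoff while errors stay retriable."""
  delay = initial_delay_sec
  for attempt in range(max_retries + 1):
    try:
      return launch_once()
    except Exception as error:  # In practice, narrow this to the launcher's error types.
      if attempt == max_retries or not is_retriable_error(error):
        raise
      time.sleep(delay)
      delay *= backoff_factor  # Exponential backoff between attempts.
```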
### Log streaming

- The container launcher streams the log from user’s docker container or k8s pod through the
+ The container launcher streams logs from the user’s docker container or Kubernetes pod through the
API. It will start a thread which constantly pulls new logs and outputs them to
local stdout.

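A sketch of the streaming thread for the Docker case, assuming the docker-py client; the Kubernetes case would follow the pod log API in the same pattern:

```python
import sys
import threading


def stream_container_logs(container) -> threading.Thread:
  """Follows a docker-py Container's log stream and mirrors it to local stdout."""

  def _pump():
    for chunk in container.logs(stream=True, follow=True):
      sys.stdout.write(chunk.decode('utf-8', errors='replace'))

  thread = threading.Thread(target=_pump, daemon=True)
  thread.start()
  return thread
```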
@@ -541,13 +539,13 @@ How the container launcher handles cancellation request varies by orchestrators:
  to work. We will use the same process to propagate cancellation requests to
  user’s container.

- In order to allow the user to specify the cancellation command line entrypoint, the k8s
+ In order to allow the user to specify the cancellation command line entrypoint, the Kubernetes
pod launcher will support an optional parameter called `cancellation_command`
from `ExecutorContainerSpec`.

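As an assumption about how the Kubernetes pod launcher could honor `cancellation_command`, the launcher might exec the command inside the running user container through the Kubernetes exec API when a cancellation request arrives:

```python
from kubernetes import client
from kubernetes.stream import stream


def cancel_user_container(core_api: client.CoreV1Api, pod_name: str,
                          namespace: str, cancellation_command) -> None:
  """Runs the user-provided cancellation entrypoint inside the user's container."""
  if not cancellation_command:
    return  # No graceful cancellation entrypoint was configured.
  stream(
      core_api.connect_get_namespaced_pod_exec,
      pod_name,
      namespace,
      command=list(cancellation_command),
      stdout=True, stderr=True, stdin=False, tty=False,
  )
```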
## Open discussions

* In the Argo runner, each step requires 2 pods with total 3 containers (launcher
  main container + launcher argo wait container + user main container) to run.
-   Although each launcher container requires minimal k8s resources,
+   Although each launcher container requires minimal Kubernetes resources,
  resource usage is still a concern.