
Commit 6c8fddf

add ecosystem.
1 parent d28c1a3 commit 6c8fddf

14 files changed: +1209 -40 lines changed
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+++
title = "MXNet on Volcano"

date = 2025-07-20
lastmod = 2025-07-20

draft = false  # Is this a draft? true/false
toc = true  # Show table of contents? true/false
type = "docs"  # Do not modify.

# Add menu entry to sidebar.
linktitle = "MXNet"
[menu.docs]
  parent = "zoology"
  weight = 3

+++
# MXNet Introduction

MXNet is an open-source deep learning framework designed for efficient and flexible training and deployment of deep neural networks. It scales seamlessly from a single GPU to multiple GPUs, and on to distributed multi-machine, multi-GPU setups.

# MXNet on Volcano

Combining MXNet with Volcano lets you fully leverage Kubernetes' container orchestration capabilities and Volcano's batch scheduling to achieve efficient distributed training.

Click [here](https://github.com/apache/mxnet/blob/master/example/distributed_training-horovod/gluon_mnist.py) to view the example provided by the MXNet team. The containing directory includes the following files:

- Dockerfile: Builds the standalone worker image.
- Makefile: Used to build the above image.
- train-mnist-cpu.yaml: Volcano Job specification.

To run the example, edit the image name and version in `train-mnist-cpu.yaml`. Then run:

```bash
kubectl apply -f train-mnist-cpu.yaml -n ${NAMESPACE}
```

to create the Job.

Then use:

```bash
kubectl -n ${NAMESPACE} describe job.batch.volcano.sh mxnet-job
```

to view the status.
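
For orientation, here is a minimal sketch of what such a Volcano Job typically looks like for MXNet parameter-server training. It is illustrative only: the image name and training script are placeholders, and the real `train-mnist-cpu.yaml` in the linked directory also wires up the remaining `DMLC_*` variables (scheduler address and port, server/worker counts) that MXNet's KVStore needs.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mxnet-job
spec:
  minAvailable: 3           # gang-schedule: start only when all three roles fit
  schedulerName: volcano
  plugins:
    env: []                 # inject per-task environment variables
    svc: []                 # headless service so the roles can reach each other
  tasks:
    - replicas: 1
      name: scheduler
      template:
        spec:
          containers:
            - name: mxnet
              image: my-registry/mxnet-mnist:latest   # placeholder: edit name and version
              command: ["python", "train_mnist.py"]   # placeholder training script
              env:
                - name: DMLC_ROLE
                  value: scheduler
          restartPolicy: Never
    - replicas: 1
      name: server
      template:
        spec:
          containers:
            - name: mxnet
              image: my-registry/mxnet-mnist:latest
              command: ["python", "train_mnist.py"]
              env:
                - name: DMLC_ROLE
                  value: server
          restartPolicy: Never
    - replicas: 1
      name: worker
      template:
        spec:
          containers:
            - name: mxnet
              image: my-registry/mxnet-mnist:latest
              command: ["python", "train_mnist.py"]
              env:
                - name: DMLC_ROLE
                  value: worker
          restartPolicy: Never
```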

content/en/docs/argo_on_volcano.md

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
+++
title = "Argo on Volcano"

date = 2025-07-20
lastmod = 2025-07-20

draft = false  # Is this a draft? true/false
toc = true  # Show table of contents? true/false
type = "docs"  # Do not modify.

# Add menu entry to sidebar.
linktitle = "Argo"
[menu.docs]
  parent = "zoology"
  weight = 3

+++
### Argo Introduction

Argo is an open-source, Kubernetes-native workflow engine that allows users to define and execute containerized workflows. The Argo project includes multiple components; Argo Workflows is the core component, used for orchestrating parallel jobs on Kubernetes with support for DAGs (Directed Acyclic Graphs) and step templates.

### Argo on Volcano

By integrating Argo Workflows with Volcano, you can combine the advantages of both: Argo provides powerful workflow orchestration, while Volcano provides advanced scheduling.

#### Integration Method

Argo resource templates allow the creation, deletion, or updating of any type of Kubernetes resource (including CRDs). We can use resource templates to integrate Volcano Jobs into Argo Workflows, thereby adding job dependency management and DAG flow control on top of Volcano.

#### Configuring RBAC Permissions

Before integrating, ensure that Argo Workflows has sufficient permissions to manage Volcano resources:

1. Argo Workflows needs a serviceAccount, which can be specified at submission time (or via `serviceAccountName` in the Workflow spec, as in the example below):

```bash
argo submit --serviceaccount <name>
```

2. Add Volcano resource management permissions to that serviceAccount's role:

```yaml
- apiGroups:
    - batch.volcano.sh
  resources:
    - "*"
  verbs:
    - "*"
```
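
Concretely, the rule above lives inside a Role (or ClusterRole) that is then bound to the serviceAccount. A minimal sketch, assuming the serviceAccount is named `argo` in namespace `default` (the resource names here are illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: volcano-job-manager
  namespace: default
rules:
  - apiGroups:
      - batch.volcano.sh
    resources:
      - "*"
    verbs:
      - "*"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argo-volcano-job-manager
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: volcano-job-manager
subjects:
  - kind: ServiceAccount
    name: argo          # the serviceAccount Argo Workflows runs with
    namespace: default
```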
#### Example

Here is an example YAML for creating a Volcano Job from an Argo Workflow:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: volcano-job-
spec:
  entrypoint: nginx-tmpl
  serviceAccountName: argo        # Specify service account
  templates:
    - name: nginx-tmpl
      activeDeadlineSeconds: 120  # Limit workflow execution time
      resource:                   # Indicates this is a resource template
        action: create            # kubectl operation type
        successCondition: status.state.phase = Completed
        failureCondition: status.state.phase = Failed
        manifest: |
          apiVersion: batch.volcano.sh/v1alpha1
          kind: Job
          metadata:
            generateName: test-job-
            ownerReferences:      # Tie the Job's lifecycle to the Workflow
              - apiVersion: argoproj.io/v1alpha1
                blockOwnerDeletion: true
                kind: Workflow
                name: "{{workflow.name}}"
                uid: "{{workflow.uid}}"
          spec:
            minAvailable: 1
            schedulerName: volcano
            policies:
              - event: PodEvicted
                action: RestartJob
            plugins:
              ssh: []
              env: []
              svc: []
            maxRetry: 5
            queue: default
            tasks:
              - replicas: 2
                name: "default-nginx"
                template:
                  metadata:
                    name: web
                  spec:
                    containers:
                      - image: nginx:latest
                        imagePullPolicy: IfNotPresent
                        name: nginx
                        resources:
                          requests:
                            cpu: "100m"
                    restartPolicy: OnFailure
```
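
Because each Volcano Job is created by an ordinary resource template, Argo's DAG templates can sequence several of them. A minimal sketch (the template names `volcano-job-a` and `volcano-job-b` are hypothetical stand-ins for resource templates like `nginx-tmpl` above):

```yaml
templates:
  - name: pipeline
    dag:
      tasks:
        - name: stage-a
          template: volcano-job-a        # a resource template as shown above
        - name: stage-b
          dependencies: [stage-a]        # runs only after stage-a succeeds
          template: volcano-job-b
```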

For more information and advanced configurations, see the [Volcano Argo integration examples](https://github.com/volcano-sh/volcano/tree/master/example/integrations/argo).
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
+++
title = "Cromwell on Volcano"

date = 2025-07-20
lastmod = 2025-07-20

draft = false  # Is this a draft? true/false
toc = true  # Show table of contents? true/false
type = "docs"  # Do not modify.

# Add menu entry to sidebar.
linktitle = "Cromwell"
[menu.docs]
  parent = "zoology"
  weight = 3

+++
# Cromwell Introduction

Cromwell is a workflow management system designed for scientific workflows.

# Cromwell on Volcano

Cromwell can be integrated with Volcano to efficiently schedule and execute bioinformatics workflows in Kubernetes environments.

To make Cromwell dispatch jobs to a Volcano cluster, you can use the following basic backend configuration:

```hocon
Volcano {
  actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
  config {
    runtime-attributes = """
      Int runtime_minutes = 600
      Int cpus = 2
      Int requested_memory_mb_per_core = 8000
      String queue = "short"
    """

    submit = """
      vcctl job run -f ${script}
    """
    kill = "vcctl job delete -N ${job_id}"
    check-alive = "vcctl job view -N ${job_id}"
    job-id-regex = "(\\d+)"
  }
}
```

Please note that this configuration example is community-contributed and therefore not officially supported.
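
One possible way to wire this in (a sketch, not an officially supported setup): place the `Volcano` block above under `backend.providers` in your Cromwell configuration file, set `backend.default = "Volcano"`, and launch Cromwell with `java -Dconfig.file=<your-config> -jar cromwell.jar run <workflow.wdl>`.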
Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
+++
title = "Horovod on Volcano"

date = 2025-07-20
lastmod = 2025-07-20

draft = false  # Is this a draft? true/false
toc = true  # Show table of contents? true/false
type = "docs"  # Do not modify.

# Add menu entry to sidebar.
linktitle = "Horovod"
[menu.docs]
  parent = "zoology"
  weight = 3

+++
# Horovod Introduction

Horovod is a distributed deep learning training framework compatible with PyTorch, TensorFlow, Keras, and Apache MXNet. With Horovod, existing training scripts can be scaled to run on hundreds of GPUs with just a few lines of Python code, achieving near-linear speedups on large GPU clusters.

## Horovod on Volcano

Volcano, as a cloud-native batch system, provides native support for Horovod distributed training jobs. Through Volcano's scheduling capabilities, users can easily deploy and manage Horovod training tasks on Kubernetes clusters.

Below is an example configuration for running Horovod on Volcano:
```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: lm-horovod-job
  labels:
    "volcano.sh/job-type": Horovod
spec:
  minAvailable: 4
  schedulerName: volcano
  plugins:
    ssh: []   # set up passwordless SSH between the job's Pods
    svc: []   # generate /etc/volcano/*.host files and a headless service
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: master
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - /bin/sh
                - -c
                - |
                  WORKER_HOST=`cat /etc/volcano/worker.host | tr "\n" ","`;
                  mkdir -p /var/run/sshd; /usr/sbin/sshd;
                  mpiexec --allow-run-as-root --host ${WORKER_HOST} -np 3 python tensorflow_mnist_lm.py;
              image: volcanosh/horovod-tf-mnist:0.5
              name: master
              ports:
                - containerPort: 22
                  name: job-port
              resources:
                requests:
                  cpu: "500m"
                  memory: "1024Mi"
                limits:
                  cpu: "500m"
                  memory: "1024Mi"
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: default-secret
    - replicas: 3
      name: worker
      template:
        spec:
          containers:
            - command:
                - /bin/sh
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
              image: volcanosh/horovod-tf-mnist:0.5
              name: worker
              ports:
                - containerPort: 22
                  name: job-port
              resources:
                requests:
                  cpu: "1000m"
                  memory: "2048Mi"
                limits:
                  cpu: "1000m"
                  memory: "2048Mi"
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: default-secret
```

In this configuration, we define a Horovod distributed training job with the following key components:

1. Task structure: 1 master and 3 workers, 4 Pods in total (matching `minAvailable: 4`, so the job starts only when all Pods can be scheduled)
2. Communication mechanism: Volcano's `ssh` plugin provides passwordless SSH between nodes, and the `svc` plugin supplies the worker host list (`/etc/volcano/worker.host`) that the master passes to `mpiexec`
3. Resource allocation: the master is allocated fewer resources (500m CPU / 1Gi memory), while each worker receives more (1000m CPU / 2Gi memory)
4. Fault tolerance: when a Pod is evicted, the entire job restarts (`PodEvicted` / `RestartJob`)
5. Job completion policy: when the master task completes, the entire job is marked complete (`TaskCompleted` / `CompleteJob`)
