@@ -9,6 +9,144 @@ of models without relying on centralized orchestration tools like KubeFlow.
The `standalone.py` tool supports fetching generated SDG (Synthetic Data Generation) data from an AWS S3-compatible object store.
While AWS S3 is supported, alternative object storage solutions such as Ceph, NooBaa, and MinIO are also compatible.

+ ## Overall end-to-end workflow
+
+ ```text
+   +-------------------------------+
+   |        Kubernetes Job         |
+   |        "data-download"        |
+   +-------------------------------+
+   |        Init Container         |
+   | "download-data-object-store"  |
+   |   (Fetches data from object   |
+   |           storage)            |
+   +-------------------------------+
+   |        Main Container         |
+   |     "sdg-data-preprocess"     |
+   |   (Processes the downloaded   |
+   |             data)             |
+   +-------------------------------+
+                   |
+                   v
+   +-------------------------------+
+   |    "watch for completion"     |
+   +-------------------------------+
+                   |
+                   v
+ +-----------------------------------+
+ |  PyTorchJob CR training phase 1   |
+ |                                   |
+ |      +---------------------+      |
+ |      |     Master Pod      |      |
+ |      |     (Trains and     |      |
+ |      |   Coordinates the   |      |
+ |      |     distributed     |      |
+ |      |      training)      |      |
+ |      +---------------------+      |
+ |                 |                 |
+ |                 v                 |
+ |      +---------------------+      |
+ |      |    Worker Pod 1     |      |
+ |      |  (Handles part of   |      |
+ |      |    the training)    |      |
+ |      +---------------------+      |
+ |                 |                 |
+ |                 v                 |
+ |      +---------------------+      |
+ |      |    Worker Pod 2     |      |
+ |      |  (Handles part of   |      |
+ |      |    the training)    |      |
+ |      +---------------------+      |
+ +-----------------------------------+
+                   |
+                   v
+   +-------------------------------+
+   |     "wait for completion"     |
+   +-------------------------------+
+                   |
+                   v
+ +-----------------------------------+
+ |  PyTorchJob CR training phase 2   |
+ |                                   |
+ |      +---------------------+      |
+ |      |     Master Pod      |      |
+ |      |     (Trains and     |      |
+ |      |   Coordinates the   |      |
+ |      |     distributed     |      |
+ |      |      training)      |      |
+ |      +---------------------+      |
+ |                 |                 |
+ |                 v                 |
+ |      +---------------------+      |
+ |      |    Worker Pod 1     |      |
+ |      |  (Handles part of   |      |
+ |      |    the training)    |      |
+ |      +---------------------+      |
+ |                 |                 |
+ |                 v                 |
+ |      +---------------------+      |
+ |      |    Worker Pod 2     |      |
+ |      |  (Handles part of   |      |
+ |      |    the training)    |      |
+ |      +---------------------+      |
+ +-----------------------------------+
+                   |
+                   v
+   +-------------------------------+
+   |     "wait for completion"     |
+   +-------------------------------+
+                   |
+                   v
+   +-------------------------------+
+   |        Kubernetes Job         |
+   |        "eval-mt-bench"        |
+   +-------------------------------+
+   |        Init Container         |
+   |      "run-eval-mt-bench"      |
+   | (Runs evaluation on MT Bench) |
+   +-------------------------------+
+   |        Main Container         |
+   | "output-eval-mt-bench-scores" |
+   |  (Outputs evaluation scores)  |
+   +-------------------------------+
+                   |
+                   v
+   +-------------------------------+
+   |     "wait for completion"     |
+   +-------------------------------+
+                   |
+                   v
+   +-------------------------------+
+   |        Kubernetes Job         |
+   |         "eval-final"          |
+   +-------------------------------+
+   |        Init Container         |
+   |       "run-eval-final"        |
+   |    (Runs final evaluation)    |
+   +-------------------------------+
+   |        Main Container         |
+   |  "output-eval-final-scores"   |
+   |   (Outputs final evaluation   |
+   |            scores)            |
+   +-------------------------------+
+                   |
+                   v
+   +-------------------------------+
+   |     "wait for completion"     |
+   +-------------------------------+
+                   |
+                   v
+   +-------------------------------+
+   |        Kubernetes Job         |
+   |    "trained-model-upload"     |
+   +-------------------------------+
+   |        Main Container         |
+   |  "upload-data-object-store"   |
+   | (Uploads the trained model to |
+   |      the object storage)      |
+   +-------------------------------+
+ ```
+
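The diagram describes a strictly sequential pipeline: each resource is created, then the tool waits for it to complete before creating the next one. The following is a minimal sketch of that control flow only, not the actual `standalone.py` implementation. The `Step` type, the `run_workflow` function, the `submit` and `wait_for_completion` callables, and the `train-phase-1` / `train-phase-2` names are all hypothetical and introduced purely for illustration; the Job names are taken from the diagram. The sketch also waits after the final upload step, which the diagram does not show.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Step:
    kind: str  # "Job" or "PyTorchJob", as labelled in the diagram
    name: str  # resource name; PyTorchJob names below are placeholders


# Order mirrors the diagram, top to bottom.
WORKFLOW: List[Step] = [
    Step("Job", "data-download"),
    Step("PyTorchJob", "train-phase-1"),  # hypothetical name
    Step("PyTorchJob", "train-phase-2"),  # hypothetical name
    Step("Job", "eval-mt-bench"),
    Step("Job", "eval-final"),
    Step("Job", "trained-model-upload"),
]


def run_workflow(
    steps: List[Step],
    submit: Callable[[Step], None],
    wait_for_completion: Callable[[Step], bool],
) -> List[str]:
    """Submit each step in order and block until it completes.

    `submit` and `wait_for_completion` are injected so that the control
    flow can be exercised without a cluster. Returns the names of the
    steps that completed; raises on the first step that does not.
    """
    completed: List[str] = []
    for step in steps:
        submit(step)
        if not wait_for_completion(step):
            raise RuntimeError(f"{step.kind} {step.name!r} did not complete")
        completed.append(step.name)
    return completed


if __name__ == "__main__":
    # Dry run with stubs: nothing is created, every step "succeeds".
    for name in run_workflow(WORKFLOW, lambda s: None, lambda s: True):
        print(name)
```

Injecting the two callables keeps the ordering logic separate from any cluster access, so a real implementation could pass functions that create the resources and poll their status instead of the stubs used here.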
## Requirements
The `standalone.py` script is designed to run within a Kubernetes environment. The following requirements must be met: