# Scheduling and Executing Computational Tasks

Training a model consists of scheduling and executing the operators in
a computational graph. From a broad perspective, a training job runs a
computational graph for a defined number of iterations, relying on
efficient scheduling of tasks such as data loading and training (or
inference) execution. Within each iteration, we need to analyze
operator-level scheduling based on the graph topology, computational
dependencies, and control flows. We optimize the scheduling and
execution of computational graphs to make full use of computing
resources, improve computational efficiency, and shorten model training
and inference time. The following introduces typical techniques of
computational graph scheduling.

Scheduling and execution of a computational graph can be divided into
three modes according to how the graph is generated: operator
scheduling, whole-graph scheduling, and combined operator and subgraph
scheduling. These three modes correspond to the dynamic graph, static
graph, and hybrid dynamic-static mechanisms of computational graph
generation, respectively.

Next, we describe the scheduling and execution of computational graphs
in detail.

## Operator Scheduling

Operator scheduling means that the operators contained in the algorithm
or model are scheduled and executed one by one by the Python runtime.
This scheduling mechanism is used when the computational graph is
executed in dynamic graph mode, such as PyTorch's default execution
mode and TensorFlow's eager mode.

Operator scheduling involves two steps. First, following the call order
in which the model code declares its operators, the dynamic
computational graph derives a linear operator scheduling sequence.
Second, the framework dispatches the ordered operators to instruction
streams for execution.

In Figure :numref:`ch04/ch04-diaoduzhixing`, the directed acyclic graph on
the left contains five nodes a, b, c, d, and e and four dependency edges
a-\>d, b-\>c, c-\>d, and d-\>e (e.g., a-\>d indicates that d depends on
a). According to the operator call sequence of the model code, such as
a-\>b-\>c-\>d-\>e, all operator nodes are put into the queue in turn,
and the scheduling ends.


:label:`ch04/ch04-diaoduzhixing`
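
The following is a minimal, hypothetical sketch of this process (the
`run_op` helper and the lambda operators are illustrative, not any
framework's real API): each operator call is appended to a linear
scheduling queue in the order the Python code invokes it (step one) and
dispatched for execution right away (step two).

```python
from collections import deque

# Hypothetical recorder for eager (dynamic graph) execution.
schedule = deque()

def run_op(name, fn, *inputs):
    schedule.append(name)   # step 1: record the call order
    return fn(*inputs)      # step 2: dispatch the operator immediately

a = run_op("a", lambda: 1.0)
b = run_op("b", lambda: 2.0)
c = run_op("c", lambda x: x * 3.0, b)        # c depends on b
d = run_op("d", lambda x, y: x + y, a, c)    # d depends on a and c
e = run_op("e", lambda x: x ** 2, d)         # e depends on d

print(list(schedule))  # ['a', 'b', 'c', 'd', 'e'], matching the call order
```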

With the ordering, we then prepare to distribute the operators in the
ordering, together with the related data, to the GPU hardware for
execution. Figure :numref:`ch04/ch04-single-op-exec` shows the trace of
operator scheduling. Once the Python runtime calls an operator, the
machine learning framework initializes the operator by determining
information such as the operator precision, the type and size of each
input/output, and the target device. It then allocates memory for the
operator and copies the data to the target device for execution.


:label:`ch04/ch04-single-op-exec`
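
The sketch below walks through these dispatch steps with plain Python
and NumPy standing in for a real device runtime; the `dispatch`
function and its behavior are purely illustrative assumptions.

```python
import numpy as np

# Hypothetical per-operator dispatch, mirroring the trace above.
def dispatch(op_name, inputs, target_device="gpu:0"):
    # 1. Initialize: determine precision, input/output types and sizes,
    #    and the target device.
    dtypes = [x.dtype for x in inputs]
    shapes = [x.shape for x in inputs]
    assert all(s == shapes[0] for s in shapes), "shape mismatch"

    # 2. Allocate memory for the operator's output (a host-side NumPy
    #    buffer simulates device memory here).
    output = np.empty(shapes[0], dtype=dtypes[0])

    # 3. Copy the data to the target device and launch the kernel
    #    (both are only simulated in this sketch).
    print(f"copy inputs to {target_device} and launch {op_name}")
    np.add(inputs[0], inputs[1], out=output)  # stand-in for the real kernel
    return output

x = np.ones((2, 2), dtype=np.float32)
y = np.full((2, 2), 2.0, dtype=np.float32)
print(dispatch("add", [x, y]))
```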

The operator scheduling method offers high flexibility because operators
are directly scheduled by the Python runtime. It facilitates the
representation of complex computational logic (such as control flows)
and the use of Python-native data structures for implementing complex
algorithms. Operators are driven by the Python runtime to finish
computational tasks, facilitating easy collaboration with Python's
large, rich ecosystem.
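
For example, the following PyTorch-style snippet (assuming PyTorch is
installed) mixes data-dependent Python control flow with eagerly
dispatched operators; each branch and loop iteration simply triggers
more operator calls at run time.

```python
import torch

def step(x: torch.Tensor) -> torch.Tensor:
    # Data-dependent branch: the decision is made at run time, in Python.
    if x.norm() > 1.0:
        x = x / x.norm()        # normalize large inputs
    # Ordinary Python loop driving eager operator calls.
    for _ in range(3):
        x = torch.relu(x - 0.1)
    return x

print(step(torch.tensor([2.0, -1.0, 0.5])))
```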

Despite its advantages, operator scheduling also has some disadvantages.
One is that context-based runtime optimizations such as operator fusion
and algebraic simplification become difficult, because global
information about the computational graph is unavailable. Another
disadvantage is that computational tasks have to run in serial mode,
rather than in parallel, due to the lack of a global view of the
computation topology.

## Graph Scheduling

When a computational graph is executed with the static graph mechanism
for whole-graph scheduling, operators are still dispatched to the
hardware one by one in a certain execution order. However, because
global information about the computational graph is available, the
framework can analyze operator dependencies and the number of available
computing devices, and schedule and execute the entire graph in the
following two ways:

1. **Serial**: executes its tasks one at a time, in the order that they
    are added to the queue. This method expands a computational graph
    into a sequence of operators, which are then run separately.
    Operators are executed in a static order using a single thread,
    thereby requiring fewer resources.

2. **Parallel**: executes its tasks concurrently for higher
    efficiency. This method expands a computational graph based on
    operator dependencies. Operators are executed in the order defined
    by their input dependencies, and those without input dependencies
    are executed concurrently. This method executes operators in a
    dynamic order (which may vary in each iteration) using multiple
    threads, thereby consuming more system resources.

Within a computational graph, most operators depend on each other
directly or indirectly. When scheduling such operators, their execution
order must be guaranteed. Figure
:numref:`ch04/ch04-diaodu` shows a computational graph, where a
forward pass is run on the input data to produce a predicted value and
then the gradient of the loss function is computed for backpropagation.
In general, downstream operators depend on the output of upstream
operators. As such, we have to schedule the operators in this
computational graph into a serial queue in order to ensure that each
operator receives the necessary input.


:label:`ch04/ch04-diaodu`
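
A minimal sketch of building such a serial queue with a topological
sort (Kahn's algorithm) is shown below; the toy dependency graph and
operator names are assumptions made for illustration only.

```python
from collections import deque

# Toy graph: each operator maps to the list of operators it depends on.
deps = {"matmul": [], "add": ["matmul"], "loss": ["add"], "grad": ["loss"]}

def serial_queue(deps):
    indegree = {n: len(ins) for n, ins in deps.items()}
    consumers = {n: [m for m, ins in deps.items() if n in ins] for n in deps}
    ready = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)              # n's inputs are all scheduled already
        for c in consumers[n]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    return order

print(serial_queue(deps))  # ['matmul', 'add', 'loss', 'grad']
```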

A computational graph may also contain operators independent of each
other, for example, op1 and op2 shown in Figure
:numref:`ch04/ch04-para`. We can have each operator run on
different hardware devices to implement parallel computing. Compared
with the serial mode, parallel computing decreases execution time by
leveraging more computing resources at the same time.


:label:`ch04/ch04-para`
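
The sketch below extends the earlier toy graph to the parallel mode:
operators whose input dependencies are all satisfied are executed
concurrently, wave by wave, in a thread pool. The graph and `run_op`
are again illustrative assumptions rather than a real framework
implementation.

```python
from concurrent.futures import ThreadPoolExecutor, wait

# Toy graph from the earlier example: 'a' and 'b' have no inputs,
# so they form the first wave and can run concurrently.
deps = {"a": [], "b": [], "c": ["b"], "d": ["a", "c"], "e": ["d"]}

def run_op(name):
    print(f"executing {name}")   # stand-in for launching the real kernel

def execute_parallel(deps):
    done = set()
    with ThreadPoolExecutor() as pool:
        while len(done) < len(deps):
            # Every operator whose inputs are already computed is ready.
            ready = [n for n, ins in deps.items()
                     if n not in done and all(i in done for i in ins)]
            assert ready, "cycle detected in the graph"
            wait([pool.submit(run_op, n) for n in ready])
            done.update(ready)

execute_parallel(deps)   # waves: {a, b} -> {c} -> {d} -> {e}
```

A serial executor is the degenerate case of this sketch: a single
thread running the topological order from the previous example.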

Serial execution and parallel execution have their own advantages and
disadvantages, as summarized in Table
:numref:`ch04/ch4-graph`.

:Comparison between serial execution and parallel execution

| Property             | Serial execution | Parallel execution |
|----------------------|------------------|--------------------|
| Execution Order      | Static           | Dynamic            |
| Execution Threads    | Single thread    | Multiple threads   |
| Resource Consumption | Low              | High               |
:label:`ch04/ch4-graph`

A computing environment usually contains more than one type of
computing device, such as CPUs, GPUs, and other accelerators. A
computational graph consisting of operators that run on more than one
type of computing device is referred to as a heterogeneous
computational graph.

Based on the computing hardware, such a graph contains the following
types of operators:

- **CPU operators**: They are C++ operators that run on the host CPU.
    The computing performance of the CPU depends on the extent to which
    the multi-core capability of the CPU is utilized.

- **GPU operators**: They run on the GPU (e.g., NVIDIA GPU). GPU
    kernels are delivered from the host to the GPU one by one for
    execution. The GPU features ample parallel computing units that
    offer significant speedup to parallel algorithms.

- **Python operators**: They run on the host CPU. Unlike CPU
    operators, Python operators are interpreted and executed by the
    Python interpreter.

We mentioned earlier that the dynamic graph mechanism relies on the
Python interpreter to dispatch operators and execute them serially in
the order defined by the model code. In this mode, data often needs to
be transferred between different computing devices. Communication
bottlenecks then increase the time operators spend waiting for their
input data, reducing the overall execution efficiency of the
computational graph. Therefore, the first prerequisite for efficient
execution is to accurately identify the device on which each operator
runs and to avoid transferring data between devices as much as
possible, while independent operators are scheduled on different
devices in parallel. The static graph mechanism, in contrast, is free
from the constraints of the Python interpreter: the computational graph
is delivered to the device in one pass, which reduces the number of
interactions between the host and the computing chip and improves
computing efficiency and performance.
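
The following PyTorch-flavored snippet (assuming PyTorch and, ideally,
a CUDA-capable GPU are available) illustrates the placement principle:
keeping dependent tensors on the same device avoids host-device
transfers between operators.

```python
import torch

# Place parameters and inputs on the same device up front.
device = "cuda" if torch.cuda.is_available() else "cpu"
w = torch.randn(1024, 1024, device=device)
x = torch.randn(32, 1024, device=device)

# matmul and relu both execute on `device`; no host<->device copies
# happen between them because the tensors never leave the device.
y = torch.relu(x @ w)

# By contrast, y.cpu() would force a device-to-host transfer, exactly
# the kind of traffic the scheduler tries to minimize between
# dependent operators.
print(y.device, y.shape)
```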

The combined operator and subgraph scheduling mode blends the previous
two execution modes. Because computational graph structures are highly
flexible, executing an entire complex graph on the computing chip may
not be optimal. For example, computing chips excel at accelerating
floating-point operations, whereas CPUs are better at logical
judgments. The parts of the graph that execute inefficiently on the
computing chip can therefore be split out and handed to a device that
executes them more efficiently, such as the CPU, taking into account
both performance and flexibility.

There are different levels of parallelism: operator parallelism, model
parallelism, and data parallelism. Operator parallelism is not just
about executing independent operators in parallel; where applicable, we
can further partition an operator into multiple parallel child
operations. Model parallelism refers to partitioning a computational
graph among several devices in order to shorten the time taken by each
training iteration. Data parallelism involves training the same
computational graph on different data, reducing the total number of
iterations and improving training efficiency. We will discuss these
three parallelism methods in Chapter Distributed Training.

## Synchronous and Asynchronous Data Loading

As previously mentioned, a single training iteration of a computational
graph goes through three serial tasks: data loading, data preprocessing,
and model training. Each task is dependent on the output of the previous
one. To schedule the three types of tasks in iterative graph training,
we can use the synchronous and asynchronous mechanisms at the iteration
level.

1. **Synchronous**: Tasks are executed in order, one after the other.
    Each task must wait for, and coordinate with, the preceding one.

2. **Asynchronous**: When a task is complete, the same task in the next
    iteration can be executed immediately.

If the synchronous mechanism is adopted to train the computational graph
shown in Figure :numref:`ch04/ch04-tongbu`, in each iteration, a batch of input
data is loaded, preprocessed, and then passed to the computational graph
for model training and parameter update. Tasks in the next iteration
wait until the current iteration is complete. The synchronous mechanism
wastes computation and communication resources because the data
preprocessing and model training tasks must wait until a batch of data
is completely loaded, and because the I/O channel for data loading is
idle during model training.


:label:`ch04/ch04-tongbu`
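
A minimal sketch of this synchronous iteration loop is shown below; the
`load_batch`, `preprocess`, and `train_step` functions are hypothetical
placeholders for the three stages.

```python
# Placeholder stages of one training iteration.
def load_batch(i):
    return f"batch-{i}"             # stands in for disk/network I/O

def preprocess(batch):
    return batch + "-preprocessed"  # stands in for decoding/augmentation

def train_step(batch):
    print("training on", batch)     # stands in for forward/backward/update

NUM_BATCHES = 8

# Synchronous loop: each stage waits for the previous one, so the I/O
# channel sits idle while the model trains, and vice versa.
for i in range(NUM_BATCHES):
    train_step(preprocess(load_batch(i)))
```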

In the asynchronous setting shown in Figure
:numref:`ch04/ch04-yibu`, after loading and passing a batch of
input data to the subsequent data preprocessing task, the I/O channel
immediately moves on to the next batch without waiting for the current
iteration to complete. In contrast with the synchronous mechanism, the
idle time between data loading, data preprocessing, and model training
in the asynchronous mechanism is notably reduced, thereby shortening the
overall training time with improved execution efficiency.


:label:`ch04/ch04-yibu`
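
Reusing the placeholder functions from the synchronous sketch above,
the following assumption-laden sketch overlaps loading and
preprocessing with training by letting a background thread prefetch
batches into a bounded queue.

```python
import queue
import threading

# load_batch, preprocess, train_step, NUM_BATCHES as defined in the
# previous sketch.
prefetch_queue = queue.Queue(maxsize=2)   # loader may run ahead by 2 batches

def loader():
    for i in range(NUM_BATCHES):
        prefetch_queue.put(preprocess(load_batch(i)))  # blocks if queue is full
    prefetch_queue.put(None)                           # sentinel: no more data

threading.Thread(target=loader, daemon=True).start()

# Training on batch i overlaps with loading/preprocessing of later batches.
while (batch := prefetch_queue.get()) is not None:
    train_step(batch)
```

In practice, frameworks expose this pattern directly; for example,
PyTorch's `DataLoader` performs asynchronous loading in worker
processes when `num_workers > 0`.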

To further shorten the training time and improve the execution
efficiency, we can combine the asynchronous mechanism with parallel
computing, as shown in Figure
:numref:`ch04/ch04-yibubingxing`. On the one hand, the
asynchronous mechanism reduces the model's wait time for data loading
and preprocessing, allowing the model to quickly traverse the entire
dataset. On the other hand, parallel computing increases the batch size
in iterative training, improving the utilization of computing resources.


:label:`ch04/ch04-yibubingxing`