# Functional Programming

In the following, we will discuss the reasons behind the growing trend
of incorporating functional programming into the design of machine
learning frameworks.

## Benefits of Functional Programming

Training is the most critical phase of machine learning, and how
training is expressed depends heavily on the optimizer algorithm. Most
contemporary machine learning tasks use first-order optimizers, favored
for their ease of use. With machine learning advancing at a rapid pace,
both software and hardware are continually updated to keep up.
Consequently, a growing number of researchers are investigating
higher-order optimizers, which are noted for their superior convergence.
Commonly used second-order optimizers, such as the Newton method,
quasi-Newton methods, and AdaHessian, require computing a Hessian matrix
that carries second-order derivative information. This computation
raises two considerable challenges: 1) how to handle such a heavy
computational load efficiently, and 2) how to express higher-order
derivatives in a programming language.

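To see where the load comes from, recall the classic Newton update in
its standard textbook form (independent of any particular framework):

$$\theta_{t+1} = \theta_t - H^{-1}\,\nabla_{\theta}L(\theta_t),
\qquad H = \nabla^2_{\theta}L(\theta_t).$$

For a model with $n$ parameters, the Hessian $H$ has $n^2$ entries, so
both evaluating it and expressing the nested differentiation it requires
become pressing concerns as models grow.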

In recent times, numerous large AI models have been introduced,
including (with the number of parameters noted in parentheses) OpenAI's
GPT-3 (175B) in 2020; PanGu (100B), PanGu-$\alpha$ (200B), Google's
Switch Transformer (1.6T), and WuDao (1.75T) in 2021; along with
Facebook's NLLB-200 (54B) in 2022. The demand for ultra-large model
training keeps escalating, and data parallelism alone cannot meet it.
Model parallelism, in turn, requires manual model partitioning, which is
time-consuming and laborious. Consequently, a main challenge that future
machine learning frameworks must overcome is how to realize automatic
parallelism. At its core, a machine learning model is a representation
of a mathematical model. Hence, the ability to represent machine
learning models succinctly has become a key concern in the design of
programming paradigms for machine learning frameworks.


Recognizing the challenges posed by the practical implementation of
machine learning frameworks, researchers have identified functional
programming as a promising answer. In computer science, functional
programming is a programming paradigm that treats computation as the
evaluation of mathematical functions and avoids state changes and
mutable data. This paradigm harmonizes well with mathematical reasoning.
Neural networks are composed of interconnected nodes, with each node
performing basic mathematical operations. Functional programming
languages let developers express these mathematical operations in code
that closely mirrors their mathematical form, enhancing the readability
and maintainability of programs. At the same time, because functions in
functional languages avoid shared mutable state, concurrency and
parallelism become easier to manage.

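As an illustration of this correspondence, consider a dense layer
written in a functional style. This is only a hedged sketch with
hypothetical names, but it shows how the code mirrors the mathematical
expression $y = \tanh(xW + b)$ without mutating any state:

```python
import numpy as np

# Parameters are passed in explicitly and nothing is modified in place,
# so the function is a direct transcription of y = tanh(xW + b).
def dense(params, x):
    return np.tanh(x @ params["W"] + params["b"])

params = {"W": np.ones((3, 4)), "b": np.zeros(4)}
y = dense(params, np.ones((2, 3)))  # shape (2, 4)
```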

In summary, functional programming is anticipated to confer the
following benefits to machine learning frameworks:

1. It is suited for machine learning scenarios where higher-order
   derivatives are needed.

2. It simplifies the development of parallel programming interfaces.

3. It results in a more concise code representation.

## Framework Support for Functional Programming

Machine learning frameworks increasingly support functional programming.
In 2018, Google released JAX. Unlike traditional machine learning
frameworks, JAX unifies neural network computation and general numerical
computation; its interfaces are compatible with native Python data
science libraries such as NumPy and SciPy. Moreover, JAX provides
distribution, vectorization, higher-order differentiation, and hardware
acceleration through composable transformations written in a functional
programming style, characterized by lambda closures and the absence of
side effects.

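As a minimal sketch of this style (the loss function, its inputs, and
their shapes are illustrative assumptions rather than material from the
text above), the transformations compose as ordinary higher-order
functions:

```python
import jax
import jax.numpy as jnp

# A pure, side-effect-free scalar loss of the parameters w.
def loss(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

w, x, y = jnp.ones(3), jnp.ones((8, 3)), jnp.zeros(8)

grad_fn = jax.grad(loss)              # first-order derivative w.r.t. w
hess_fn = jax.jacfwd(jax.grad(loss))  # Hessian by composing two transforms
per_example = jax.vmap(loss, in_axes=(None, 0, 0))  # vectorization
fast_grad = jax.jit(grad_fn)          # compile for hardware acceleration

g, H = grad_fn(w, x, y), hess_fn(w, x, y)  # shapes (3,) and (3, 3)
```

Because each transformation returns another pure function, they can be
nested freely, which is what makes higher-order differentiation and
automatic parallelization convenient to express.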

In 2020, Huawei introduced MindSpore, whose functional differentiable
programming architecture allows users to focus on the native
mathematical expressions of machine learning models. In 2022, taking
inspiration from Google's JAX, PyTorch launched functorch. Functorch is
essentially a library that provides composable vmap (vectorization) and
autodiff transforms compatible with PyTorch modules and PyTorch
autograd, while retaining good eager-mode performance. It can be
inferred that such composable transforms lay the groundwork for
distributed parallelism over PyTorch static graphs. Code
[\[ch02/code2.4\]](#ch02/code2.4){reference-type="ref"
reference="ch02/code2.4"} gives an example of using functorch.

``` {#ch02/code2.4 caption="Functorch Example" label="ch02/code2.4"}
import torch
from functorch import combine_state_for_ensemble, vmap

# Assumed setup (not in the original snippet): ensemble size, a tiny MLP
# factory, and dummy data shaped [num_models, batch_size, num_features].
num_models, device = 5, "cpu"
MLP = lambda: torch.nn.Sequential(torch.nn.Linear(64, 10), torch.nn.ReLU())
data = torch.randn(num_models, 32, 64)
minibatches = data[:num_models]
models = [MLP().to(device) for _ in range(num_models)]
fmodel, params, buffers = combine_state_for_ensemble(models)  # stacked states
predictions1_vmap = vmap(fmodel, out_dims=1)(params, buffers, minibatches)
```

Functorch introduces *vmap*, short for "vectorized map". Its role is to
adapt functions written for individual inputs so that they can handle
batches of inputs, thereby enabling efficient vectorized computation.
Unlike the batch processing built into standard PyTorch modules, vmap
can make any operation batch-aware without altering the operation's
original structure. Moreover, vmap offers greater flexibility over batch
dimensions, letting users specify which dimension of the inputs and
outputs should be treated as the batch dimension (via the `in_dims` and
`out_dims` arguments), in contrast to standard PyTorch, where the first
dimension is usually assumed to be the batch dimension.

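To make this flexibility concrete, the short sketch below (the function
and tensor shapes are illustrative assumptions) batches a single-example
function and moves the batch dimension with `out_dims`:

```python
import torch
from functorch import vmap

# A function written for a single example: scale a feature vector.
def scale(x, w):
    return x * w

xs, w = torch.randn(8, 3), torch.randn(3)  # 8 examples, shared weights

out0 = vmap(scale, in_dims=(0, None))(xs, w)              # shape (8, 3)
out1 = vmap(scale, in_dims=(0, None), out_dims=1)(xs, w)  # shape (3, 8)
```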

By tracing the development of machine learning frameworks, it becomes
evident that the functional programming paradigm has become increasingly
popular. This can be attributed to functional programming's ability to
express machine learning models intuitively and the convenience it
brings to implementing automatic differentiation, higher-order
differentiation, and parallel execution. Consequently, future machine
learning frameworks are likely to adopt layered frontend interfaces that
are not exclusively designed for machine learning scenarios. Instead,
they will primarily offer differentiable programming in their
abstraction designs, making it easy to develop gradient-based software
for a wide range of applications.