| 1 | +--- |
| 2 | +title: Data Classes |
| 3 | +summary: by Alistair Jones |
| 4 | + |
| 5 | +tags: |
| 6 | + - Data Classes |
| 7 | + - Config |
| 8 | + - Python |
| 9 | + - PySpark |
| 10 | + |
| 11 | +--- |
| 12 | + |
| 13 | +# |
| 14 | + |
| 15 | +!!! tip "TLDR" |
| 16 | + |
| 17 | + - Classes are used for storing data and code that acts upon that data. |
| 18 | + - Functions forget what happens in each call, while the state of a class stores information between calls to the class's methods.
| 19 | + - Data classes are a special type of class that is useful for passing parameters around a pipeline.
| 20 | + |
| 21 | +## Introduction |
| 22 | + |
| 23 | +This is a brief guide to the what, why and how of 'data classes', which are a special type of class in Python.
| 24 | + |
| 25 | +The page provides a high-level overview of classes, aiming to serve as a jumping-off point rather than replicating the vast quantities of [documentation about classes](https://docs.python.org/3/tutorial/classes.html) on the internet. |
| 26 | + |
| 27 | +The primary focus is on data classes and how they can benefit Reproducible Analytical Pipelines. |
| 28 | + |
| 29 | + |
| 30 | +## What is a class? |
| 31 | + |
| 32 | +A class is essentially a reusable template for storing data and code which acts upon that data. |
| 33 | + |
| 34 | +Consider what happens when we call a Python function: usually it will take some input data, do something with that data and possibly return some output. |
| 35 | + |
| 36 | +In general, a function forgets about what happens between calls, so if we called it again with the same arguments we'd get the same result, again and again. |
| 37 | + |
| 38 | +Classes are slightly different - there are two principal concepts: |
| 39 | + |
| 40 | +- The state of a class represents the data it stores - for example, a class representing a person might have 'name', 'dob' and 'address' stored within its state.
| 41 | +- The methods of a class contain code that uses and modifies the state of the class.
| 42 | + |
| 43 | +See [this page](https://www.w3schools.com/python/python_classes.asp) for a tutorial to help with getting started with classes in Python. |
| 44 | + |
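As a minimal sketch of the difference between a function and a class, consider the snippet below. The names `add` and `RunningTotal` are hypothetical and chosen purely for illustration.

```python
def add(a: int, b: int) -> int:
    # A function keeps no memory between calls
    return a + b


class RunningTotal:
    """Hypothetical example class that remembers a total between method calls."""

    def __init__(self) -> None:
        # The state of the class: data stored on the instance
        self.total = 0

    def add(self, amount: int) -> int:
        # A method that uses and modifies the state
        self.total += amount
        return self.total


first = add(2, 3)
second = add(2, 3)  # same arguments, same result as `first`

running_total = RunningTotal()
after_first = running_total.add(5)
after_second = running_total.add(5)  # same argument, but the stored total has grown
```

Calling the function twice with the same arguments gives the same answer, whereas the second call to the method returns a different value because the instance remembered the first call.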
| 45 | + |
| 46 | +## Why should I care about classes? |
| 47 | + |
| 48 | +Classes are a convenient way to program when you need to keep track of state information. |
| 49 | + |
| 50 | +It is perfectly fine not to use them, and indeed in many scenarios they can unnecessarily increase complexity and contribute to [spaghetti code](https://en.wikipedia.org/wiki/Spaghetti_code)!
| 51 | + |
| 52 | +But when you find yourself passing the same data around the same set of functions again and again, then classes might help to reduce that duplication.
| 53 | + |
| 54 | + |
| 55 | +## What is a 'data class'? |
| 56 | + |
| 57 | +A data class is a special type of class in Python. As the name suggests, data classes are typically used as containers for passing data around different parts of a pipeline or program - they don't usually have much logic defined in methods, rather they are used to pass data between functions. |
| 58 | + |
| 59 | +_(Note that here the term "data" is meant broadly - not just the tabular data that you get from your database (though it could be), but also any value, parameter, or variable that you might need to put in your code.)_ |
| 60 | + |
| 61 | +In a Reproducible Analytical Pipeline, we usually have some parameters that need to be passed together to different parts of the pipeline, such as the start and end dates for a publication.
| 62 | + |
| 63 | +If our functions for field definitions, creating tables, etc, all need to access these parameters, one way would be to have the same parameters provided multiple times in function arguments: |
| 64 | + |
| 65 | +```py |
| 66 | +from datetime import date |
| 67 | + |
| 68 | +def my_first_function(start_date: date, end_date: date):
| 69 | +    ...  # some interesting logic
| 70 | +
| 71 | +def my_second_function(start_date: date, end_date: date):
| 72 | +    ...  # some other interesting logic
| 73 | +``` |
| 74 | + |
| 75 | +_(You may note that this example makes use of type hints, which are incredibly useful and necessary for explaining and using data classes. However, for brevity, type hints will not be covered in depth here - for more information, [read this](https://dev.to/dev0928/what-are-type-hints-in-python-3c2k).)_ |
| 76 | + |
| 77 | +For a small number of parameters, and a small number of functions, this is probably fine. But imagine what this starts to look like as the number of parameters increases, as does the number of functions - there starts to be lots of repeated code, which is both harder to read and more prone to errors! |
| 78 | + |
| 79 | +## Example: using data classes to store parameters |
| 80 | + |
| 81 | +With data classes, we can rewrite the above as follows:
| 82 | + |
| 83 | +```py |
| 84 | +import dataclasses
| 85 | +from datetime import date
| 86 | +
| 87 | +@dataclasses.dataclass
| 88 | +class PublicationDates:
| 89 | +    start_date: date
| 90 | +    end_date: date
| 91 | +
| 92 | +def my_first_function(publication_dates: PublicationDates):
| 93 | +    # some interesting logic, using `publication_dates.start_date`
| 94 | +    # and `publication_dates.end_date` to access the two attributes
| 95 | +    ...
| 96 | +
| 97 | +def my_second_function(publication_dates: PublicationDates):
| 98 | +    # some other interesting logic, accessing the attributes
| 99 | +    # via `publication_dates.start_date` and `publication_dates.end_date`
| 100 | +    ...
| 101 | +``` |
| 102 | + |
| 103 | +Notice now that instead of two arguments, we have one - just the `PublicationDates` data class. |
| 104 | + |
| 105 | +Someone who wanted to use these functions could then do so as follows: |
| 106 | + |
| 107 | +```py |
| 108 | +start_date = date(2023, 1, 1) |
| 109 | +end_date = date(2024, 1, 1) |
| 110 | +my_publication_dates = PublicationDates(start_date, end_date) |
| 111 | + |
| 112 | +my_first_result = my_first_function(my_publication_dates) |
| 113 | +my_second_result = my_second_function(my_publication_dates) |
| 114 | +``` |
| 115 | + |
| 116 | +Although the benefits may be limited for this contrived example, we can see that: |
| 117 | + |
| 118 | +1. The intent is clearer when we read our function - rather than just having independent parameters called `start_date` and `end_date` we immediately know that these variables are a pair, representing the publication start and end dates, since they are contained together within the data class. |
| 119 | +2. Since `my_publication_dates` is passed to both functions, we now have a single argument for each function, so the code is easier to read and there is less room for error (e.g. it's no longer possible to mix up `start_date` and `end_date` when calling a function). This doesn't make a huge difference in this example with two functions and two parameters, but imagine what happens as the pipeline gets bigger.
| 120 | + |
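One caveat worth sketching: the two dates can still be swapped at the point where the data class itself is created, if positional arguments are used. Passing them by keyword makes the intent explicit. The snippet below is illustrative only; it also shows the readable representation and the field-by-field equality check that the `dataclass` decorator generates by default.

```python
import dataclasses
from datetime import date


@dataclasses.dataclass
class PublicationDates:
    start_date: date
    end_date: date


# Keyword arguments make it obvious which date is which
my_publication_dates = PublicationDates(
    start_date=date(2023, 1, 1),
    end_date=date(2024, 1, 1),
)

# The decorator generates a readable representation showing each field...
print(my_publication_dates)

# ...and an equality check that compares the field values
same_dates = PublicationDates(date(2023, 1, 1), date(2024, 1, 1))
print(my_publication_dates == same_dates)
```
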
| 121 | +## Validations using data classes |
| 122 | + |
| 123 | +A great feature of data classes is that a special method called `__post_init__` runs immediately after an instance is created. You can use it to run validations or checks, or to derive additional attributes.
| 124 | + |
| 125 | +For example: |
| 126 | + |
| 127 | +```py |
| 128 | +import dataclasses |
| 129 | +from datetime import date |
| 130 | + |
| 131 | +@dataclasses.dataclass |
| 132 | +class PublicationDates: |
| 133 | +    start_date: date
| 134 | +    end_date: date
| 135 | +
| 136 | +    def __post_init__(self):
| 137 | +        # Raise an error if the start date is not before the end date
| 138 | +        assert self.start_date < self.end_date, "Start date was not less than end date!"
| 139 | + |
| 140 | +good_publication_dates = PublicationDates(start_date=date(2022, 1, 1), end_date=date(2023, 1, 1)) # This will be fine |
| 141 | +bad_publication_dates = PublicationDates(start_date=date(2023, 1, 1), end_date=date(2022, 1, 1)) # This will raise an AssertionError |
| 142 | +``` |
| 143 | + |
| 144 | +So now, in addition to the benefits we had previously, our checks and validations run at the time that we define our parameters, rather than individually within each function or in a separate function that we need to remember to call. The logic is built into how the parameters are defined! This also makes it clearer to someone who wants to use these parameters that there are constraints on the values.
| 145 | + |
| 146 | +You don't have to define a `__post_init__` method, but it can be useful in certain situations!
| 147 | + |
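The same method can also derive additional attributes. The sketch below assumes a hypothetical `period_in_days` attribute that is not part of the example above. It also raises a `ValueError` rather than using `assert`, since assertions are skipped when Python is run with the `-O` flag; either approach may be reasonable depending on the pipeline.

```python
import dataclasses
from datetime import date


@dataclasses.dataclass
class PublicationDates:
    start_date: date
    end_date: date
    # Hypothetical derived attribute: init=False keeps it out of the
    # generated __init__, so callers cannot pass it in directly
    period_in_days: int = dataclasses.field(init=False)

    def __post_init__(self):
        # Validate first, then derive the extra attribute from the inputs
        if self.start_date >= self.end_date:
            raise ValueError("Start date must be before end date!")
        self.period_in_days = (self.end_date - self.start_date).days


publication_dates = PublicationDates(start_date=date(2023, 1, 1), end_date=date(2024, 1, 1))
print(publication_dates.period_in_days)
```
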
| 148 | +## Further reading |
| 149 | + |
| 150 | +- [Data classes guide on Data Quest](https://www.dataquest.io/blog/how-to-use-python-data-classes/) |