docs/training_resources/python/data-classes.md
Lines changed: 15 additions & 12 deletions
@@ -33,11 +33,11 @@ A class is essentially a reusable template for storing data and code which acts
33
33
34
34
Consider what happens when we call a Python function: usually it will take some input data, do something with that data and possibly return some output.
35
35
36
-
In general, a function forgets about what happens between calls, so if we called it again with the same arguments we'd get the same result, again and again.
36
+
In general, a function retains nothing between calls, so if we called it again with the same arguments we'd get the same result, again and again.
37
37
38
38
Classes are slightly different - there are two principal concepts:
39
39
40
-
- The state of a class represents the data it stores - for example, a class representing a person might have 'name', 'dob' and 'address' stored within its state
40
+
- The state of a class represents the data it stores - for example, a class representing a 'person' might have 'name', 'date_of_birth' and 'address' stored within its state
41
41
- The methods of a class contain code that uses and modifies the state of a class.
42
42
43
43
See [this page](https://www.w3schools.com/python/python_classes.asp) for a tutorial to help with getting started with classes in Python.
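
To make the two concepts concrete, here is a minimal sketch of a class with state and methods. The class name `Person`, its methods `move_house` and `age_on`, and the example values are hypothetical, chosen only to mirror the 'person' example above.

```python
from datetime import date


class Person:
    """A hypothetical class illustrating state and methods."""

    def __init__(self, name: str, date_of_birth: date, address: str) -> None:
        # State: the data stored on each instance
        self.name = name
        self.date_of_birth = date_of_birth
        self.address = address

    def move_house(self, new_address: str) -> None:
        # A method that modifies the state
        self.address = new_address

    def age_on(self, on_date: date) -> int:
        # A method that uses the state without changing it
        had_birthday = (on_date.month, on_date.day) >= (
            self.date_of_birth.month,
            self.date_of_birth.day,
        )
        return on_date.year - self.date_of_birth.year - (0 if had_birthday else 1)


person = Person("Ada", date(1990, 6, 15), "1 Old Street")
person.move_house("2 New Road")  # the instance remembers this change
```

Unlike a plain function, `person` keeps its new address after `move_house` returns, so later calls see the updated state.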
@@ -47,7 +47,7 @@ See [this page](https://www.w3schools.com/python/python_classes.asp) for a tutor
47
47
48
48
Classes are a convenient way to program when you need to keep track of state information.
49
49
50
-
It is perfectly fine not to use them and indeed for many scenarios they can unnecessarily increase complexity and can contribute towards to[spaghetti code](https://en.wikipedia.org/wiki/Spaghetti_code)!
50
+
It is perfectly fine not to use them and indeed for many scenarios they can unnecessarily increase complexity and contribute towards [spaghetti code](https://en.wikipedia.org/wiki/Spaghetti_code)!
51
51
52
52
But when you find yourself passing the same data around the same set of functions again and again, then classes might help to reduce duplication and avoid repetition.
53
53
@@ -56,9 +56,9 @@ But when you find yourself passing the same data around the same set of function
56
56
57
57
A data class is a special type of class in Python. As the name suggests, data classes are typically used as containers for passing data around different parts of a pipeline or program - they don't usually have much logic defined in methods, rather they are used to pass data between functions.
58
58
59
-
_(Note that here the term "data" is meant broadly - not just the tabular data that you get from your database (though it could be), but also any value, parameter, or variable that you might need to put in your code.)_
59
+
_(Note that here the term "data" is used in a broad sense, which refers not only to tabular data that you read in from a database or file, but also any value, parameter, or variable that you might need to put in your code.)_
60
60
61
-
In a Reproducible Analytical Pipeline, we usual have some parameters that need to be passed together to different parts of the pipeline, such as the start and end dates for a publication.
61
+
In a Reproducible Analytical Pipeline, we usually have some parameters that need to be passed together to different parts of the pipeline, such as the start and end dates for a publication.
62
62
63
63
If our functions for field definitions, creating tables, etc, all need to access these parameters, one way would be to have the same parameters provided multiple times in function arguments:
_(You may note that this example makes use of type hints, which are incredibly useful and necessary for explaining and using data classes. However, for brevity, type hints will not be covered in depth here - for more information, [read this](https://dev.to/dev0928/what-are-type-hints-in-python-3c2k).)_
75
+
_(You may notice that this example makes use of type hints, which are incredibly useful and necessary for explaining and using data classes. However, for brevity, type hints will not be covered in depth here - for more information, [read this](https://dev.to/dev0928/what-are-type-hints-in-python-3c2k).)_
76
76
77
-
For a small number of parameters, and a small number of functions, this is probably fine. But imagine what this starts to look like as the number of parameters increases, as does the number of functions - there starts to be lots of repeated code, which is both harder to read and more prone to errors!
77
+
For a small number of parameters, and a small number of functions, this is probably fine. But imagine what this starts to look like as the number of parameters increases, as does the number of functions - there starts to be lots of repeated code, which is harder to read, more cumbersome to maintain and more prone to errors!
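
As a rough illustration of how this repetition can go wrong, the sketch below passes the same two dates to two functions. The function names `create_field_definitions` and `create_tables` are hypothetical stand-ins, not the guide's actual pipeline functions.

```python
from datetime import date


def create_field_definitions(start_date: date, end_date: date) -> str:
    # Hypothetical pipeline step that needs both dates
    return f"Fields for {start_date} to {end_date}"


def create_tables(start_date: date, end_date: date) -> str:
    # Another hypothetical step that needs the same two dates
    return f"Tables for {start_date} to {end_date}"


start_date = date(2022, 1, 1)
end_date = date(2022, 12, 31)

field_definitions = create_field_definitions(start_date, end_date)
# Positional arguments are easy to swap by accident, and nothing flags it
tables = create_tables(end_date, start_date)
```

Both calls run without complaint, even though the second one has its dates the wrong way round.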
78
78
79
79
## Example: using data classes to store parameters
Although the benefits may be limited for this contrived example, we can see that:
117
117
118
118
1. The intent is clearer when we read our function - rather than just having independent parameters called `start_date` and `end_date` we immediately know that these variables are a pair, representing the publication start and end dates, since they are contained together within the data class.
119
-
2. Since `my_publication_dates` is passed to both functions, we now have a single argument for each function, so the code is easier to read and there is less room for error (e.g. it's not possible to mix up our `start_date` and `end_date`. This doesn't make a huge difference in this example with two functions and two parameters, but imagine what happens as the pipeline gets bigger.
119
+
2. Since `my_publication_dates` is passed to both functions, we now have a single argument for each function, so the code is easier to read and there is less room for error (e.g. it's not possible to mix up our `start_date` and `end_date`). This doesn't make a huge difference in this example with two functions and two parameters, but imagine what happens as the pipeline grows in size and complexity!
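
The two points above can be sketched as follows. This is a minimal illustration under assumptions: `PublicationDates` and `my_publication_dates` follow the names used in this guide, while the two function names and their return values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PublicationDates:
    # The two dates travel together as a named pair
    start_date: date
    end_date: date


def create_field_definitions(publication_dates: PublicationDates) -> str:
    # Hypothetical pipeline step taking a single argument
    return f"Fields from {publication_dates.start_date}"


def create_tables(publication_dates: PublicationDates) -> str:
    # Another hypothetical step taking the same single argument
    return f"Tables up to {publication_dates.end_date}"


my_publication_dates = PublicationDates(
    start_date=date(2022, 1, 1),
    end_date=date(2022, 12, 31),
)

field_definitions = create_field_definitions(my_publication_dates)
tables = create_tables(my_publication_dates)
```

Adding a third parameter later would mean changing only the data class and the functions that use it, not every function signature and call.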
120
120
121
121
## Validations using data classes
122
122
123
-
A great feature of data classes is that after they are created, a special function - called `__postinit__`runs that you can use to do things like run validations or checks, or to derive additional attributes.
123
+
A useful feature of data classes is that after an instance is created, a special method called `__post_init__` is called automatically: you can use this to do things like run validations or checks, or to derive additional attributes for the data class.
bad_publication_dates = PublicationDates(start_date=date(2023, 1, 1), end_date=date(2022, 1, 1)) # This will raise an AssertionError
142
142
```
143
143
144
-
So now, in addition to the benefits we had previously, we can also have checks and validations running at the time that we define our parameters - rather than individually within each function, or via creating another function to do these checks that we need to remember to call - the logic is built into how the parameters are defined! This also makes it clearer to someone who wants to use these parameters that there are constraints on the values.
144
+
So now, in addition to the benefits we had previously, we can also have checks and validations running at the time that we define our parameters. This saves us from adding these checks individually within each function, or from creating a separate checking function that we need to remember to call. Instead, the validation logic is built into how the parameters are defined! This also makes it clearer to someone who wants to use these parameters that there are constraints on the values.
145
145
146
-
You don't have to define a `__postinit__` method but it can be useful in certain situations!
146
+
You don't have to define a `__post_init__` method but it can be useful in certain situations.
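
Here is a minimal sketch that combines a validation with a derived attribute. It assumes a hypothetical extra attribute, `period_days`, and raises a `ValueError` rather than using an assertion, since assertions are skipped when Python runs with the `-O` flag.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class PublicationDates:
    start_date: date
    end_date: date
    # Hypothetical derived attribute: set in __post_init__, so it is
    # excluded from the generated __init__
    period_days: int = field(init=False)

    def __post_init__(self) -> None:
        # Validation: runs automatically after the generated __init__
        if self.start_date > self.end_date:
            raise ValueError("start_date must not be after end_date")
        # Derivation: number of days in the period, counting both ends
        self.period_days = (self.end_date - self.start_date).days + 1


good_dates = PublicationDates(start_date=date(2022, 1, 1), end_date=date(2022, 12, 31))

try:
    PublicationDates(start_date=date(2023, 1, 1), end_date=date(2022, 1, 1))
    validation_failed = False
except ValueError:
    validation_failed = True
```

Because `period_days` is declared with `field(init=False)`, callers cannot pass it in by mistake: it is always computed from the two dates.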
147
+
148
+
### Open-source validation libraries
149
+
For more advanced use cases, there are open-source validation libraries which build on the idea of data classes. One widely used example is [Pydantic](https://docs.pydantic.dev/latest/), which is worth exploring once you've understood the concepts introduced in this guide!
147
150
148
151
## Further reading
149
152
150
-
-[Data classes guide on Data Quest](https://www.dataquest.io/blog/how-to-use-python-data-classes/)
153
+
- [Data classes guide on Dataquest](https://www.dataquest.io/blog/how-to-use-python-data-classes/)