Commit a30865b

Merge pull request #61 from NHSDigital/wd-data-classes-guide
Wd data classes guide
2 parents b97fd6c + d00cf5d commit a30865b

2 files changed: +151 -0 lines changed

Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
---
title: Data Classes
summary: by Alistair Jones

tags:
- Data Classes
- Config
- Python
- PySpark

---

#

!!! tip "TLDR"

    - Classes are used for storing data and code that acts upon that data.
    - Functions forget what happens in each call, while the state of a class stores information between calls to the class's methods.
    - Data classes are a special type of class that is useful for passing parameters around a pipeline.

## Introduction

This is a brief guide to the what, why and how of 'data classes', which are a special type of class in Python.

The page provides a high-level overview of classes, aiming to serve as a jumping-off point rather than replicating the vast quantities of [documentation about classes](https://docs.python.org/3/tutorial/classes.html) on the internet.

The primary focus is on data classes and how they can benefit Reproducible Analytical Pipelines.


## What is a class?

A class is essentially a reusable template for storing data and code which acts upon that data.

Consider what happens when we call a Python function: usually it will take some input data, do something with that data and possibly return some output.

In general, a function forgets about what happens between calls, so if we called it again with the same arguments we'd get the same result, again and again.

Classes are slightly different - there are two principal concepts:

- The state of a class represents the data it stores - for example, a class representing a person might have 'name', 'dob' and 'address' stored within its state.
- The methods of a class contain code that uses and modifies the state of the class.

See [this page](https://www.w3schools.com/python/python_classes.asp) for a tutorial to help with getting started with classes in Python.


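To make the contrast between a function and a class concrete, here is a minimal sketch. The names `add_to_total` and `RunningTotal` are made up for illustration and are not part of any library.

```python
def add_to_total(total: int, amount: int) -> int:
    # A plain function keeps no memory: the caller has to carry the total around.
    return total + amount


class RunningTotal:
    """Hypothetical example class that remembers a total between method calls."""

    def __init__(self) -> None:
        self.total = 0  # the state of the class

    def add(self, amount: int) -> int:
        self.total += amount  # a method that uses and modifies the state
        return self.total


counter = RunningTotal()
counter.add(5)
counter.add(3)  # the object remembers the earlier call, so the total keeps growing
```

Calling `add_to_total(0, 5)` twice gives the same answer both times, whereas each call to `counter.add(...)` builds on the state left behind by the previous one.
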
## Why should I care about classes?

Classes are a convenient way to program when you need to keep track of state information.

It is perfectly fine not to use them, and indeed in many scenarios they can unnecessarily increase complexity and contribute towards [spaghetti code](https://en.wikipedia.org/wiki/Spaghetti_code)!

But when you find yourself passing the same data around the same set of functions again and again, classes might help to reduce that repetition.


## What is a 'data class'?

A data class is a special type of class in Python. As the name suggests, data classes are typically used as containers for passing data around different parts of a pipeline or program - they don't usually have much logic defined in methods, rather they are used to pass data between functions.

_(Note that here the term "data" is meant broadly - not just the tabular data that you get from your database (though it could be), but also any value, parameter, or variable that you might need to put in your code.)_

In a Reproducible Analytical Pipeline, we usually have some parameters that need to be passed together to different parts of the pipeline, such as the start and end dates for a publication.

If our functions for field definitions, creating tables, etc. all need to access these parameters, one way would be to have the same parameters provided multiple times in function arguments:

```py
from datetime import date


def my_first_function(start_date: date, end_date: date):
    # some interesting logic
    ...


def my_second_function(start_date: date, end_date: date):
    # some other interesting logic
    ...
```

_(You may note that this example makes use of type hints, which are incredibly useful and necessary for explaining and using data classes. However, for brevity, type hints will not be covered in depth here - for more information, [read this](https://dev.to/dev0928/what-are-type-hints-in-python-3c2k).)_

For a small number of parameters and a small number of functions, this is probably fine. But imagine what this starts to look like as both the number of parameters and the number of functions grow - there starts to be a lot of repeated code, which is both harder to read and more prone to errors!

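As a rough illustration of how this can go wrong, the sketch below repeats four parameters across two functions. The helpers `build_title` and `build_filename`, and the `region` and `version` parameters, are hypothetical and only serve to show the pattern.

```python
from datetime import date


# Hypothetical helpers: every function repeats the same four parameters.
def build_title(start_date: date, end_date: date, region: str, version: int) -> str:
    return f"{region} v{version}: {start_date} to {end_date}"


def build_filename(start_date: date, end_date: date, region: str, version: int) -> str:
    return f"{region}_{start_date:%Y%m%d}_{end_date:%Y%m%d}_v{version}.csv"


start, end = date(2023, 1, 1), date(2024, 1, 1)

title = build_title(start, end, "England", 2)
# The two dates are swapped by mistake here. Python cannot tell, because both
# arguments are dates, so the call runs and quietly produces a misleading name.
filename = build_filename(end, start, "England", 2)
```

Nothing fails in the second call, which is exactly the problem: the more often the same arguments are typed out, the more chances there are for a slip like this.
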
## Example: using data classes to store parameters

With data classes, we can rewrite the above as follows:

```py
import dataclasses
from datetime import date


@dataclasses.dataclass
class PublicationDates:
    start_date: date
    end_date: date


def my_first_function(publication_dates: PublicationDates):
    # some interesting logic
    # access the start_date attribute via `publication_dates.start_date`
    # access the end_date attribute via `publication_dates.end_date`
    ...


def my_second_function(publication_dates: PublicationDates):
    # some other interesting logic
    # access the start_date attribute via `publication_dates.start_date`
    # access the end_date attribute via `publication_dates.end_date`
    ...
```

Notice that instead of two arguments, each function now takes just one: an instance of the `PublicationDates` data class.

Someone who wanted to use these functions could then do so as follows:

```py
start_date = date(2023, 1, 1)
end_date = date(2024, 1, 1)
my_publication_dates = PublicationDates(start_date, end_date)

my_first_result = my_first_function(my_publication_dates)
my_second_result = my_second_function(my_publication_dates)
```

Although the benefits may be limited for this contrived example, we can see that:

1. The intent is clearer when we read our function - rather than just having independent parameters called `start_date` and `end_date`, we immediately know that these variables are a pair, representing the publication start and end dates, since they are contained together within the data class.
2. Since `my_publication_dates` is passed to both functions, we now have a single argument for each function, so the code is easier to read and there is less room for error (e.g. once the dates are bundled together, they can't be swapped by accident in an individual function call). This doesn't make a huge difference in this example with two functions and two parameters, but imagine what happens as the pipeline gets bigger.

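A short sketch of what the `@dataclasses.dataclass` decorator provides, reusing the `PublicationDates` class from this guide. The variable names `dates_a` and `dates_b` are made up for the example.

```python
import dataclasses
from datetime import date


@dataclasses.dataclass
class PublicationDates:
    start_date: date
    end_date: date


# Keyword arguments make it explicit which date is which, whatever the order.
dates_a = PublicationDates(start_date=date(2023, 1, 1), end_date=date(2024, 1, 1))
dates_b = PublicationDates(end_date=date(2024, 1, 1), start_date=date(2023, 1, 1))

print(dates_a)             # the generated __repr__ shows each field by name
print(dates_a == dates_b)  # the generated __eq__ compares instances field by field
```

The decorator writes `__init__`, `__repr__` and `__eq__` for us, so the class definition stays as short as the list of fields it holds.
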
## Validations using data classes

A great feature of data classes is that immediately after an instance is created, a special method called `__post_init__` runs (if you have defined it). You can use it to run validations or checks, or to derive additional attributes.

For example:

```py
import dataclasses
from datetime import date


@dataclasses.dataclass
class PublicationDates:
    start_date: date
    end_date: date

    def __post_init__(self):
        # Raise an error if the start date is not before the end date
        assert self.start_date < self.end_date, "Start date was not less than end date!"


good_publication_dates = PublicationDates(start_date=date(2022, 1, 1), end_date=date(2023, 1, 1))  # This will be fine
bad_publication_dates = PublicationDates(start_date=date(2023, 1, 1), end_date=date(2022, 1, 1))  # This will raise an AssertionError
```

So now, in addition to the benefits we had previously, our checks and validations run at the time that we define our parameters. We no longer need to repeat them within each function, or write a separate checking function that we must remember to call - the logic is built into how the parameters are defined! This also makes it clearer to someone who wants to use these parameters that there are constraints on the values.

You don't have to define a `__post_init__` method but it can be useful in certain situations!

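The sketch below is one possible variation, not the only way to do it. It uses `__post_init__` (note the underscore between `post` and `init`) both to validate and to derive an extra attribute. The `period_days` field is a hypothetical addition for illustration, and the check raises a `ValueError` rather than using `assert`, because Python skips `assert` statements when run with the `-O` flag.

```python
import dataclasses
from datetime import date


@dataclasses.dataclass
class PublicationDates:
    start_date: date
    end_date: date
    # Hypothetical derived field: init=False keeps it out of the generated
    # __init__, so callers never pass it in themselves.
    period_days: int = dataclasses.field(init=False)

    def __post_init__(self):
        # An explicit exception is raised even when assert statements are disabled.
        if self.start_date >= self.end_date:
            raise ValueError("start_date must be before end_date")
        self.period_days = (self.end_date - self.start_date).days


dates = PublicationDates(start_date=date(2023, 1, 1), end_date=date(2024, 1, 1))
print(dates.period_days)  # 365
```

Functions that receive `dates` can then read `dates.period_days` directly instead of recalculating it each time.
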
## Further reading

- [Data classes guide on Data Quest](https://www.dataquest.io/blog/how-to-use-python-data-classes/)

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -99,6 +99,7 @@ nav:
   - Project structure and packaging: training_resources/python/project-structure-and-packaging.md
   - Using Python f-strings to run SQL queries: training_resources/python/using-f-strings-sql-queries.md
   - Using config files: training_resources/python/config-files.md
+  - Data Classes: training_resources/python/data-classes.md
   - Virtual environments:
     - Why use virtual environments?: training_resources/python/virtual-environments/why-use-virtual-environments.md
     - Venv: training_resources/python/virtual-environments/venv.md
