---
format:
  html:
    code-fold: false
jupyter: python3
---
# OpenScience in Python {#sec-open-science-Python .unnumbered}
Now that you know what computational notebooks are and why we should care about them, let's start using them! This section introduces you to using Python for manipulating tabular data. Please read through it carefully and pay attention to how ideas about manipulating data are translated into code. For this part, you can read directly from the course website, although it is recommended you follow the section interactively by running the code on your own.
Once you have read through, jump to the [Do-It-Yourself](https://pietrostefani.github.io/gds/openscienceDIY.html) section, which will provide you with a challenge that you should complete on your own, and will allow you to put what you have already learnt into practice.
## Data wrangling
Real-world datasets tend to be messy. There is no way around it: datasets have "holes" (missing data), the number of formats in which data can be stored is endless, and the best structure for sharing data is not always the best one for analyzing them, hence the need to wrangle (manipulate, transform and structure) them. As has been correctly pointed out in many outlets ([e.g.](https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=0)), much of the [time](https://twitter.com/BigDataBorat/status/306596352991830016) spent in what is called (Geo-)Data Science goes not only into sophisticated modeling and insight, but also into more basic and less exotic tasks such as obtaining data, processing them, turning them into a shape that makes analysis possible, and exploring them to get to know their basic properties.
In this session, you will use a few real-world datasets and learn how to process them in Python so they can be transformed and manipulated, if necessary, and analyzed. For this, we will introduce some of the fundamental tools of data analysis and scientific computing. We use a prepared dataset that saves us much of the more intricate processing that goes beyond the introductory level the session is aimed at.
In this notebook, we discuss several patterns to clean and structure data properly, including tidying, subsetting, and aggregating; and we finish with some basic visualization. An additional extension presents more advanced tricks to manipulate tabular data.
Before we get our hands data-dirty, let us import all the additional libraries we will need to run the code.
## Loading packages
We will start by loading the core packages for working with tabular data.
```{python}
import os  # This provides several system utilities
import pandas as pd  # This is the workhorse of data munging in Python
```
## Datasets
We will be exploring some demographic characteristics in Liverpool. To do that, we will use a dataset that contains population counts, split by ethnic origin. These counts are aggregated at the Lower Layer Super Output Area level (LSOA from now on). LSOAs are an official Census geography defined by the Office for National Statistics. You can think of them, more or less, as neighbourhoods. Many data products (Census, deprivation indices, etc.) use LSOAs as one of their main geographies.
To do this, we will download a data folder from GitHub called `census2021_ethn`. You should place this in a data folder you will use throughout the course.
**Import census data from csv**
```{python}
census2021 = pd.read_csv("data/census2021_ethn/liv_pop.csv", index_col='GeographyCode')
```
Let us stop for a minute to learn how we have read the file. Here are the main aspects to keep in mind:
- We are using the method `read_csv` from the `pandas` library, which we have imported with the alias `pd`.
- Here the data come from a local csv file, but `read_csv` also accepts a web address, and some datasets are distributed inside packages.
- The argument `index_col` is not strictly necessary but allows us to choose one of the columns as the index of the table. More on indices below.
- We are using `read_csv` because the file we want to read is in the csv format. However, `pandas` allows for many more formats to be read and written. A full list of formats supported may be found [here](https://www.datacamp.com/tutorial/r-data-import-tutorial).
- To ensure we can access the data we have read, we store it in an object that we call `census2021`. We will see more on what we can do with it below but, for now, just keep in mind that it allows us to save the result of `read_csv`.
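
To make the effect of `index_col` concrete, here is a minimal sketch that reads the same small table twice, with and without it. The csv text, its area codes and its values are invented for illustration (they are not the Liverpool data), and the text is read from an in-memory string rather than from a file.

```python
import io

import pandas as pd

# Toy csv text with hypothetical area codes and counts
csv_text = """GeographyCode,Europe,Africa
A001,900,10
A002,1200,25
A003,800,5
"""

# Without index_col, pandas numbers the rows 0, 1, 2, ...
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col, the chosen column labels the rows instead
code_index = pd.read_csv(io.StringIO(csv_text), index_col="GeographyCode")

print(default_index.index.tolist())
print(code_index.index.tolist())

# Rows can now be looked up by their code
print(code_index.loc["A002", "Europe"])
```

Note that once a column becomes the index, it no longer appears among the regular columns of the table.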
::: callout-important
You need to store the data file on your computer, and read it locally. To do that, you can follow these steps:
1. Download the `census2021_ethn` file by right-clicking [on this link](https://github.com/pietrostefani/gds/tree/main/data) and saving the file
2. Place the file in a data folder you have created where you intend to read it.
3. Your folder should have the following structure:
    a. a `gds` folder (where you will save your Quarto `.qmd` documents)
    b. a `data` folder
    c. the `census2021_ethn` folder inside your `data` folder
:::
::: callout-tip
## Download a folder on GitHub
1. First go to <https://download-directory.github.io/>
2. Then copy the address of the folder you need. For example: <https://github.com/pietrostefani/gds/tree/main/data/London>
3. Paste it in the green box and give it a few minutes
4. Check your downloads folder and unzip the file
:::
## Data, sliced and diced
Now we are ready to start playing with and interrogating the dataset! What we have at our fingertips is a table that summarizes, for each of the LSOAs in Liverpool, how many people live in each, by the region of the world where they were born. We call these tables `DataFrame` objects, and they have a lot of functionality built-in to explore and manipulate the data they contain.
**Structure**
Let's start by exploring the structure of a `DataFrame`. We can print it by simply typing its name:
```{python}
census2021
```
Since they represent a table of data, `DataFrame` objects have two dimensions: rows and columns. Each of these is automatically assigned a name in what we will call its index. When printing, the index of each dimension is rendered in bold, as opposed to the standard rendering for the content. In the example above, we can see how the column index is automatically picked up from the .csv file's column names. For rows, we specified when reading the file that we wanted the column `GeographyCode`, so that is used. If we hadn't specified any, `pandas` would have automatically generated a sequence starting at `0` and going all the way to the number of rows minus one. This is the standard structure of a `DataFrame` object, so we will come back to it over and over. Importantly, even when we move to spatial data, our datasets will have a similar structure.
One further feature of these tables is that they can hold columns with different types of data. In our example, this is not used as we have counts (or `int`, for integer, types) for each column. But it is useful to keep in mind that we can combine these with columns that hold other types of data such as categories, text (`str`, for string), dates or, as we will see later in the course, geographic features.
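As a small illustration of columns with different types, the sketch below builds a toy table by hand and asks `pandas` for the type of each column. The column names and values are hypothetical and only serve the example.

```python
import pandas as pd

# Toy table mixing integer, text and date columns (hypothetical values)
mixed = pd.DataFrame(
    {
        "population": [1500, 2300],
        "area_name": ["North", "South"],
        "surveyed": pd.to_datetime(["2021-03-21", "2021-03-22"]),
    },
    index=["A001", "A002"],
)

# One data type is reported per column
print(mixed.dtypes)
```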
**Inspecting**
We can check the top (bottom) X lines of the table by passing X to the method `head` (`tail`). For example, for the top/bottom five lines:
```{python}
census2021.head()  # show the first 5 rows (the default)
census2021.tail()  # show the last 5 rows (the default)
```
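To see how the number passed to `head` and `tail` changes the result, here is a minimal sketch on a toy table with made-up values.

```python
import pandas as pd

# Toy table with six rows (hypothetical values)
toy = pd.DataFrame({"Europe": [10, 20, 30, 40, 50, 60]})

first_three = toy.head(3)  # top 3 rows
last_two = toy.tail(2)     # bottom 2 rows

print(first_three)
print(last_two)
```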
**Summarise**
We can get an overview of the values of the table:
```{python}
census2021.describe()
```
Note how the output is also a `DataFrame` object, so you can do with it the same things you would with the original table (e.g. writing it to a file).
In this case, the summary might be better presented if the table is "transposed":
```{python}
census2021.describe().T
```
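Because the summary is itself a `DataFrame`, it can be queried like any other table. The sketch below shows this on a toy table; the area codes and counts are invented for illustration.

```python
import pandas as pd

# Toy table (hypothetical area codes and counts)
toy = pd.DataFrame(
    {"Europe": [900, 1200, 800], "Africa": [10, 25, 5]},
    index=["A001", "A002", "A003"],
)

# Transposed summary: one row per column of the original table
summary = toy.describe().T

# Pick out individual statistics like any other column
print(summary[["mean", "min", "max"]])

# Find which column has the largest mean
print(summary["mean"].idxmax())
```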
## Columns
**Create new columns**
We can generate new variables by applying operations on existing ones. For example, we can calculate the total population by area. Here are a couple of ways to do it:
Longer, hardcoded:
```{python}
total = census2021['Europe'] + census2021['Africa'] + census2021['Middle East and Asia'] + census2021['The Americas and the Caribbean'] + census2021['Antarctica and Oceania']
# Print the top of the variable
total.head()
```
One shot:
```{python}
census2021['Total_Population'] = census2021.sum(axis=1)
# Print the top of the variable
census2021.head()
```
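Note that we are summing over `axis=1`. In a `DataFrame` object, axis 0 runs down the rows and axis 1 runs across the columns, so `sum(axis=1)` adds up the values across all columns and returns one total per row (here, one per area).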
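The difference between the two axes is easiest to see side by side. This is a minimal sketch on a toy table with made-up area codes and counts. It also illustrates one caveat: `sum(axis=1)` adds every numeric column present, so once a total column exists, running the same sum again would count it too.

```python
import pandas as pd

# Toy table (hypothetical area codes and counts)
toy = pd.DataFrame(
    {"Europe": [900, 1200, 800], "Africa": [10, 25, 5]},
    index=["A001", "A002", "A003"],
)

# axis=1: add up across the columns, one total per row (area)
per_area = toy.sum(axis=1)

# axis=0: add up down the rows, one total per column (region)
per_region = toy.sum(axis=0)

print(per_area)
print(per_region)

# Selecting the source columns explicitly keeps the total reproducible
toy["Total_Population"] = toy[["Europe", "Africa"]].sum(axis=1)

# Summing all columns now also includes the total itself
double_counted = toy.sum(axis=1)
print(double_counted)
```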
A different spin on this is assigning new values: we can generate new variables with scalars, and modify those:
```{python}
# New variable with all ones
census2021['ones'] = 1
census2021.head()
```
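As a sketch of what "modifying" such a column can look like, the example below creates a constant column on a toy table, changes one cell by label, and then rescales the whole column. The data are hypothetical.

```python
import pandas as pd

# Toy table (hypothetical area codes and counts)
toy = pd.DataFrame({"Europe": [900, 1200, 800]}, index=["A001", "A002", "A003"])

# A scalar is repeated for every row
toy["ones"] = 1

# Change a single cell, addressed by row label and column name
toy.loc["A002", "ones"] = 3

# Overwrite the whole column with a transformed version of itself
toy["ones"] = toy["ones"] * 10

print(toy)
```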
**Delete columns**
Permanently deleting variables is also within reach of one command:
```{python}
del census2021['ones']
census2021.head()
```
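When you would rather keep the original table intact, `drop` is an alternative to `del`: it returns a new table without the named columns. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Toy table (hypothetical area codes and counts)
toy = pd.DataFrame(
    {"Europe": [900, 1200], "ones": [1, 1]},
    index=["A001", "A002"],
)

# drop returns a copy without the column and leaves toy unchanged
trimmed = toy.drop(columns=["ones"])

print(list(toy.columns))
print(list(trimmed.columns))
```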
## Queries
**Index-based queries**
Here we explore how we can subset parts of a `DataFrame` if we know exactly which bits we want. For example, suppose we want to extract the total and European population of the first four areas in the table. To do this, we use `loc` with lists of row and column labels:
```{python}
eu_tot_first4 = census2021.loc[['E01006512', 'E01006513', 'E01006514', 'E01006515'], ['Total_Population', 'Europe']]
eu_tot_first4
```
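`loc` selects by label, so the codes have to be typed out. When "the first few rows" is really what we mean, a label slice or the position-based `iloc` gives the same result. The sketch below compares the three on a toy table with invented codes and counts.

```python
import pandas as pd

# Toy table (hypothetical area codes and counts)
toy = pd.DataFrame(
    {
        "Europe": [900, 1200, 800, 950],
        "Total_Population": [910, 1240, 805, 1000],
    },
    index=["A001", "A002", "A003", "A004"],
)
wanted = ["Total_Population", "Europe"]

# By label, listing each row explicitly
by_label = toy.loc[["A001", "A002"], wanted]

# By label slice: both ends of the slice are included
by_slice = toy.loc["A001":"A002", wanted]

# By position: the first two rows, whatever their labels are
by_position = toy.iloc[:2][wanted]

print(by_label.equals(by_slice), by_label.equals(by_position))
```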
**Condition-based queries**
However, sometimes, we do not know exactly which observations we want, but we do know what conditions they need to satisfy (e.g. areas with more than 2,000 inhabitants). For these cases, `DataFrames` support selection based on conditions. Let us see a few examples. Suppose we want to select...
Areas with more than 900 people in total:
```{python}
pop900 = census2021.loc[census2021['Total_Population'] > 900, :]
pop900
```
Areas where there are fewer than 750 Europeans:
```{python}
euro750 = census2021.loc[census2021['Europe'] < 750, :]
euro750
```
Areas with exactly ten people from Antarctica and Oceania:
```{python}
oneOA = census2021.loc[census2021['Antarctica and Oceania'] == 10, :]
oneOA
```
*Pro-tip*: These queries can grow in sophistication with almost no limits.
**Combining queries**
Now all of these queries can be combined with each other, for further flexibility. For example, imagine we want areas with more than 25 people from the Americas and the Caribbean, but fewer than 1,500 in total:
```{python}
ac25_l500 = census2021.loc[(census2021['The Americas and the Caribbean'] > 25) & (census2021['Total_Population'] < 1500), :]
ac25_l500
```
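Conditions are combined with `&` ("and"), `|` ("or") and `~` ("not"). Each condition needs its own parentheses because these operators bind more tightly than comparisons such as `>`. The sketch below runs through them on a toy table with made-up values, and also shows the `query` method as an alternative spelling of the same filter.

```python
import pandas as pd

# Toy table (hypothetical area codes and counts)
toy = pd.DataFrame(
    {"Europe": [900, 1200, 800, 1600], "Africa": [10, 40, 5, 30]},
    index=["A001", "A002", "A003", "A004"],
)
toy["Total_Population"] = toy[["Europe", "Africa"]].sum(axis=1)

# & keeps rows where both conditions hold
both = toy.loc[(toy["Africa"] > 20) & (toy["Total_Population"] < 1500), :]

# | keeps rows where at least one condition holds
either = toy.loc[(toy["Africa"] > 20) | (toy["Total_Population"] < 900), :]

# ~ negates a condition
not_small = toy.loc[~(toy["Total_Population"] < 1000), :]

# query expresses the same filter as `both`, written as a string
same_as_both = toy.query("Africa > 20 and Total_Population < 1500")

print(both.index.tolist())
print(either.index.tolist())
print(not_small.index.tolist())
```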
## Sorting
Among the many operations `DataFrame` objects support, one of the most useful ones is to sort a table based on a given column. For example, imagine we want to sort the table by total population, in descending order:
```{python}
db_pop_sorted = census2021.sort_values('Total_Population', ascending=False)
db_pop_sorted.head()
```
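The sketch below contrasts descending and ascending order on a toy table with invented values. It also shows `nlargest`, a shortcut for the common case of wanting only the top few rows. In every case a new table is returned and the original keeps its order.

```python
import pandas as pd

# Toy table (hypothetical area codes and counts)
toy = pd.DataFrame(
    {"Total_Population": [910, 1240, 805]},
    index=["A001", "A002", "A003"],
)

# Largest first
descending = toy.sort_values("Total_Population", ascending=False)

# Smallest first (the default)
ascending = toy.sort_values("Total_Population")

# Only the two most populated areas
top2 = toy.nlargest(2, "Total_Population")

print(descending.index.tolist())
print(ascending.index.tolist())
print(top2.index.tolist())
```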
## Additional resources
- A good introduction to data manipulation in Python is Wes McKinney's ["Python for Data Analysis"](https://wesmckinney.com/book/)