gds/mapvectorPy.qmd at main · pietrostefani/gds · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
# Lab in Python {#sec-vector-data-Python .unnumbered}

## Choropleths

In this session, we will build on all we have learnt so far about loading and manipulating (spatial) data and apply it to one of the most commonly used forms of spatial analysis: choropleths. Remember these are maps that display the spatial distribution of a variable encoded in a color scheme, also called palette. Although there are many ways in which you can convert the values of a variable into a specific color, we will focus in this context only on a handful of them, in particular:

-   Unique values

-   Equal interval

-   Quantiles

-   Fisher-Jenks

## Installing Packages

Before all this mapping fun, let us get the importing of libraries and data loading out of the way:

```{python}
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd
import numpy as np
from pysal.viz import mapclassify
import seaborn as sns
```

## Data

We will be using World Bank data for this section, looking at World Development Indicators and Education Statistics. We will be focusing on the Middle East and North Africa (MENA). We start by loading the relevant geometries:

```{python}
# Load the GeoJSON file into a GeoDataFrame
mena_sf = gpd.read_file("data/MENA/MENA.geojson")

# Plot the geometry to make sure it looks correct
mena_sf.plot()
plt.show()
```

Don't forget that before you go further, you want to check the `CRS` of the `sf` object as well as the `dataframe`.

::: {.callout-tip collapse="true"}
## Answer

```{python}
# Check the CRS of the GeoDataFrame
print(mena_sf.crs)

# Display the first few rows of the GeoDataFrame
print(mena_sf.head())
```
:::

We then load the `csv` with some World Development Indicators data.

```{python}
world_dev = pd.read_csv("data/MENA/mena_worlddevelop.csv")
```

And join the two objects using the relevant codes.

```{python}
# Join check that the code you are joining the data on is the same.
# First check the GeoDataFrame
mena_sf.head()

# Then the DataFrame
world_dev.head()

## rename code columns so they match
mena_sf.rename(columns={'code_a3': 'Country Code'}, inplace=True)
```

Merge `GeoDataFrame` and `DataFrame`

```{python}
world_dev_gdf = mena_sf.merge(world_dev, on='Country Code')
```

Now we are fully ready to map!

## Unique values

A choropleth for categorical variables simply assigns a different color to every potential value in the series. Variables could be both nominal or ordinal.

**Nominal**: Nominal variables represent categories or labels *without* any inherent order or ranking. The categories are distinct and do not have a natural progression or hierarchy, such as "apple," "banana," and "orange" for fruit types.

**Ordinal** : Ordinal variables represent categories or labels with a meaningful *order* or *ranking*. The relative order or hierarchy among the categories is significant, indicating a clear progression from lower to higher values, such as "low," "medium," and "high" for satisfaction levels.

In Python, creating categorical choropleths is possible with one line of code.

```{python}
world_dev_gdf.plot(
    column="income_group",  # Specifies the column "income_group" to color the plot based on categories
    categorical=True,       # Indicates that the "income_group" column is categorical (not continuous)
    legend=True             # Adds a legend to the plot, showing the different categories of "income_group"
)
# Show the plot
plt.show()
```

::: callout-important
These maps are all a bit rough a need **quite a bit** more work. They are *just* a starting point.
:::

## Equal Interval

If, instead of categorical variables, we want to display the geographical distribution of a continuous phenomenon, we need to select a way to encode each value into a color. One potential solution is applying what is usually called "equal intervals". The intuition of this method is to split the range of the distribution, the difference between the minimum and maximum value, into equally large segments and to assign a different color to each of them according to a palette that reflects the fact that values are ordered.

Creating the choropleth is relatively straightforward in `Python`. For example, to create an equal interval of GDP per capita in 2015 (v_2015).

First we need to prepare the data, going back to our [data wrangling](https://pietrostefani.github.io/gds/openscience.html).

```{python}
# Step 1: Filter rows where Series.Name is "GDP per capita, PPP (current international $)"
filtered_data = world_dev_gdf[world_dev_gdf["Series Name"] == "GDP per capita, PPP (current international $)"]

# Step 2: Remove rows where 'v_2015' is missing (i.e., remove NA values)
filtered_data = filtered_data.dropna(subset=["v_2015"])

# Step 3
filtered_data["v_2015"] = pd.to_numeric(filtered_data["v_2015"], errors='coerce').round()
filtered_data = filtered_data.dropna(subset=["v_2015"])  # Remove rows with NaN in 'v_2015'
filtered_data["v_2015"] = filtered_data["v_2015"].astype(int)

# Step 4: Store the final result in world_dev_filtered
world_dev_filtered = filtered_data

```

An equal interval classification scheme produces a map of 5 classes where each class size is equal, so that each class has an equal range in between the low and high possible value. This allows for the legend to be easily understood by the viewer, since the legend entries are all the same size. However, this scheme does not show data which are skewed towards one side all that well.

```{python}

fig, ax = plt.subplots(figsize=(15, 5))  # Increase the map size

# Plotting the GeoDataFrame `world_dev_filtered` on the `ax` axes
world_dev_filtered.plot(
    column="v_2015",  # Specify the column to be visualized; `v_2015` represents data for the year 2015
    legend=True,  # Add a legend to the map
    scheme="equal_interval",  # Use equal interval classification for the color scheme
    k=7,  # Divide the data into 7 intervals
    cmap="YlGn",  # Use the Yellow-Green colormap for coloring the map
    legend_kwds={
        'loc': 'center left',  # Position the legend in the center left of the map
        'bbox_to_anchor': (1.00, 0.2),  # Anchor the legend at this position (right of the plot)
        'fontsize': 10  # Set the font size of the legend text
    },
    edgecolor='grey',  # Add grey contours around each geographical feature
    linewidth=0.5,  # Adjust the thickness of the contours (optional)
    ax=ax  # Specify the axes object to plot on
)

# Adjust the subplot parameters to give more space for the legend
plt.subplots_adjust(right=0.70)  # Increase right space to accommodate a larger map

# Show the plot
plt.show()

```

Pay attention to the key differences:

-   Instead of specifyig `categorical` as `True`, we replace it by the argument scheme, which we will use for all choropleths that require a continuous classification scheme. In this case, we set it to `equal_interval`.

-   As above, we set the number of colors to 7. Note that we need not pass the bins we calculated above, the plotting method does it itself under the hood for us.

-   As optional arguments, we can change the colourmap to a yellow to green gradient, which is one of the recommended ones by `ColorBrewer` for a sequential palette.

::: callout-tip
## Mapclassify

The way colour maps are scaled can also be manipulated with the `scheme` option (if you have mapclassify installed).

The `scheme` option can be set to any scheme provided by mapclassify (e.g. ‘box_plot’, ‘equal_interval’, ‘fisher_jenks’, ‘fisher_jenks_sampled’, ‘headtail_breaks’, ‘jenks_caspall’, ‘jenks_caspall_forced’, ‘jenks_caspall_sampled’, ‘max_p_classifier’, ‘maximum_breaks’, ‘natural_breaks’, ‘quantiles’, ‘percentiles’, ‘std_mean’ or ‘user_defined’).

Arguments can be passed in classification_kwds dict.

See the mapclassify documentation for further details about these [map classification](https://pysal.org/mapclassify/) schemes.
:::

It is important to understand that equal intervals can first and foremost be visualised on the data distribution.

```{python}
classi = mapclassify.EqualInterval(world_dev_filtered["v_2015"], k=7)
classi
```

Once we have classified the variable, we can check the actual break points where values stop being in one class and become part of the next one:

```{python}
classi.bins
```

```{python}
# Set up the figure
f, ax = plt.subplots(1)
# Plot the kernel density estimation (KDE)
sns.kdeplot(world_dev_filtered["v_2015"], fill=True)
# Add a blue tick for every value at the bottom of the plot (rugs)
sns.rugplot(world_dev_filtered["v_2015"], alpha=0.5)
# Loop over each break point and plot a vertical red line
for cut in classi.bins:
    plt.axvline(cut, color='lightgreen', linewidth=1.25)
# Display image
plt.show()

```

Technically speaking, the figure is created by overlaying a KDE plot with vertical bars for each of the break points. This makes much more explicit the issue highlighted by which the first two bin contain a large amount of observations while the one with top values only encompasses a handful of them.

## Quantiles

One solution to obtain a more balanced classification scheme is using quantiles. This, by definition, assigns the same amount of values to each bin: the entire series is laid out in order and break points are assigned in a way that leaves exactly the same amount of observations between each of them. This "observation-based" approach contrasts with the "value-based" method of equal intervals and, although it can obscure the magnitude of extreme values, it can be more informative in cases with skewed distributions.

The code required to create the choropleth mirrors that needed above for equal intervals:

```{python}

fig, ax = plt.subplots(figsize=(15, 5))  # Increase the map size

# Plotting the GeoDataFrame `world_dev_filtered` on the `ax` axes
world_dev_filtered.plot(
    column="v_2015",  # Specify the column to be visualized; `v_2015` represents data for the year 2015
    legend=True,  # Add a legend to the map
    scheme="quantiles",  # Use quantile interval classification for the color scheme
    k=4,  # Divide the data into 4 intervals
    cmap="YlGn",  # Use the Yellow-Green colormap for coloring the map
    legend_kwds={
        'loc': 'center left',  # Position the legend in the center left of the map
        'bbox_to_anchor': (1.00, 0.2),  # Anchor the legend at this position (right of the plot)
        'fontsize': 10  # Set the font size of the legend text
    },
    edgecolor='grey',  # Add grey contours around each geographical feature
    linewidth=0.5,  # Adjust the thickness of the contours (optional)
    ax=ax  # Specify the axes object to plot on
)

# Adjust the subplot parameters to give more space for the legend
plt.subplots_adjust(right=0.70)  # Increase right space to accommodate a larger map

# Show the plot
plt.show()

```

Note how, in this case, the amount of polygons in each color is by definition much more balanced (almost equal in fact, except for rounding differences). This obscures outlier values, which get blurred by significantly smaller values in the same group, but allows to get more detail in the “most populated” part of the distribution, where instead of only white polygons, we can now discern more variability.

To get further insight into the quantile classification, let’s calculate it with `mapclassify`:

```{python}
classi = mapclassify.Quantiles(world_dev_filtered["v_2015"], k=4)
classi

```

And, similarly, the bins can also be inspected:

```{python}
classi.bins
```

The visualization of the distribution can be generated in a similar way as well:

```{python}
# Set up the figure
f, ax = plt.subplots(1)
# Plot the kernel density estimation (KDE)
sns.kdeplot(world_dev_filtered["v_2015"], fill=True)
# Add a blue tick for every value at the bottom of the plot (rugs)
sns.rugplot(world_dev_filtered["v_2015"], alpha=0.5)
# Loop over each break point and plot a vertical red line
for cut in classi.bins:
    plt.axvline(cut, color='lightgreen', linewidth=1.25)
# Display image
plt.show()
```

## Fisher-Jenks

Equal interval and quantiles are only two examples of very many classification schemes to encode values into colors. As an example of a more sophisticated one, let us create a Fisher-Jenks choropleth.

```{python}

fig, ax = plt.subplots(figsize=(15, 5))  # Increase the map size

# Plotting the GeoDataFrame `world_dev_filtered` on the `ax` axes
world_dev_filtered.plot(
    column="v_2015",  # Specify the column to be visualized; `v_2015` represents data for the year 2015
    legend=True,  # Add a legend to the map
    scheme="fisher_jenks",  # Use quantile interval classification for the color scheme
    k=7,  # Divide the data into 7 intervals
    cmap="YlGn",  # Use the Yellow-Green colormap for coloring the map
    legend_kwds={
        'loc': 'center left',  # Position the legend in the center left of the map
        'bbox_to_anchor': (1.00, 0.2),  # Anchor the legend at this position (right of the plot)
        'fontsize': 10  # Set the font size of the legend text
    },
    edgecolor='grey',  # Add grey contours around each geographical feature
    linewidth=0.5,  # Adjust the thickness of the contours (optional)
    ax=ax  # Specify the axes object to plot on
)

# Adjust the subplot parameters to give more space for the legend
plt.subplots_adjust(right=0.70)  # Increase right space to accommodate a larger map

# Show the plot
plt.show()

```

The same classification can be obtained with a similar approach as before:

```{python}
classi = mapclassify.FisherJenks(world_dev_filtered["v_2015"], k=7)
classi
```

Once we have classified the variable, we can check the actual break points where values stop being in one class and become part of the next one:

```{python}
classi.bins
```

Now let's look at the density plot

```{python}
# Set up the figure
f, ax = plt.subplots(1)
# Plot the kernel density estimation (KDE)
sns.kdeplot(world_dev_filtered["v_2015"], fill=True)
# Add a blue tick for every value at the bottom of the plot (rugs)
sns.rugplot(world_dev_filtered["v_2015"], alpha=0.5)
# Loop over each break point and plot a vertical red line
for cut in classi.bins:
    plt.axvline(cut, color='lightgreen', linewidth=1.25)
# Display image
plt.show()
```

For example, the bin with highest values covers a much wider span that the one with lowest, because there are fewer states in that value range.

You will notice a lot cooler difference once you play around with a larger dataset.

## Zooming into the map

A general map of an entire region, or urban area, can sometimes obscure local patterns because they happen at a much smaller scale that cannot be perceived in the global view. One way to solve this is by providing a focus of a smaller part of the map in a separate figure. Although there are many ways to do this in `R`, the most straightforward one is to define the bounding box.

As an example, let us consider the first `ggplot` map of this Lab:

### Zoom into full map

```{python}
# Setup the figure
f, ax = plt.subplots(1)
# Draw the choropleth
world_dev_gdf.plot(
    column="income_group",
    cmap="YlGn",
    legend=True,
    ax=ax
)
# Redimensionate X and Y axes to desired bounds
ax.set_ylim(30.520606, 36.285000)
ax.set_xlim(30.763478, 40.332570)

# Display image
plt.show()
```

## Additional resources

-   On Drawing beautiful choropleths.html with `Python` and `ggplot` see [here](https://geopandas.org/en/stable/gallery/choropleths.html)

-   If you want to have a look at **Choropleths in Python** have a look at the chapter on choropleth mapping by [Rey, Arribas-Bel and Wolf](https://geographicdata.science/book/notebooks/05_choropleth.html)

-   Some more on mapping [here](https://geopandas.org/en/stable/docs/user_guide/mapping.html)