Commit 626063b

remove trailing whitespaces
1 parent 6800dcb commit 626063b

14 files changed (+409, -409 lines)
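For anyone wanting to reproduce this cleanup locally, a minimal sketch is below; it assumes the book's Markdown sources live under `source/` in the working directory, matching the paths touched in this commit:

```python
# A minimal sketch of this commit's cleanup: strip trailing whitespace
# from every Markdown file under source/, keeping a final newline.
from pathlib import Path

for md_file in Path("source").glob("*.md"):
    lines = md_file.read_text().splitlines()
    md_file.write_text("\n".join(line.rstrip() for line in lines) + "\n")
```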

source/acknowledgements.md

Lines changed: 1 addition & 1 deletion
@@ -58,7 +58,7 @@ We would like to give special thanks to Navya Dahiya and Gloria Ye
 for completing the first round of translation of the R material to Python,
 and to Philip Austin for his leadership and guidance throughout the translation process.
 We also gratefully acknowledge the UBC Open Educational Resources Fund
-and the UBC Department of Statistics for supporting the translation of 
+and the UBC Department of Statistics for supporting the translation of
 the original R textbook and exercises to the Python programming language.

source/authors.md

Lines changed: 9 additions & 9 deletions
@@ -52,7 +52,7 @@ initiatives.
 +++
 
 **[Joel Ostblom](https://joelostblom.com/)** is an Assistant Professor of Teaching in the Department of
-Statistics at the University of British Columbia. 
+Statistics at the University of British Columbia.
 During his PhD, Joel developed a passion for data science and reproducibility
 through the development of quantitative image analysis pipelines for studying
 stem cell and developmental biology. He has since co-created or lead the
@@ -64,12 +64,12 @@ contributions to open source projects and data science learning resources.
 
 +++
 
-**[Lindsey Heagy](https://lindseyjh.ca/)** is an Assistant Professor in the Department of Earth, Ocean, and Atmospheric 
-Sciences and director of the Geophysical Inversion Facility at the University of British Columbia. 
-Her research combines computational methods in numerical simulations, inversions, and machine 
-learning to answer questions about the subsurface of the Earth. Primary applications include 
-mineral exploration, carbon sequestration, groundwater and environmental studies. She 
-completed her BSc at the University of Alberta, her PhD at the University of British Columbia, 
-and held a Postdoctoral research position at the University of California Berkeley prior to 
-starting her current position at UBC. 
+**[Lindsey Heagy](https://lindseyjh.ca/)** is an Assistant Professor in the Department of Earth, Ocean, and Atmospheric
+Sciences and director of the Geophysical Inversion Facility at the University of British Columbia.
+Her research combines computational methods in numerical simulations, inversions, and machine
+learning to answer questions about the subsurface of the Earth. Primary applications include
+mineral exploration, carbon sequestration, groundwater and environmental studies. She
+completed her BSc at the University of Alberta, her PhD at the University of British Columbia,
+and held a Postdoctoral research position at the University of California Berkeley prior to
+starting her current position at UBC.

source/classification1.md

Lines changed: 114 additions & 114 deletions
Large diffs are not rendered by default.

source/classification2.md

Lines changed: 165 additions & 165 deletions
Large diffs are not rendered by default.

source/clustering.md

Lines changed: 13 additions & 13 deletions
@@ -20,7 +20,7 @@ kernelspec:
 
 # get rid of futurewarnings from sklearn kmeans
 import warnings
-warnings.simplefilter(action='ignore', category=FutureWarning) 
+warnings.simplefilter(action='ignore', category=FutureWarning)
 
 from chapter_preamble import *
 ```
@@ -130,7 +130,7 @@ In this chapter we will focus on a data set from
 [the `palmerpenguins` R package](https://allisonhorst.github.io/palmerpenguins/) {cite:p}`palmerpenguins`. This
 data set was collected by Dr. Kristen Gorman and
 the Palmer Station, Antarctica Long Term Ecological Research Site, and includes
-measurements for adult penguins ({numref}`09-penguins`) found near there {cite:p}`penguinpaper`. 
+measurements for adult penguins ({numref}`09-penguins`) found near there {cite:p}`penguinpaper`.
 Our goal will be to use two
 variables&mdash;penguin bill and flipper length, both in millimeters&mdash;to determine whether
 there are distinct types of penguins in our data.
@@ -834,7 +834,7 @@ kmeans
 
 To actually run the K-means clustering, we combine the preprocessor and model object
 in a `Pipeline`, and use the `fit` function. Note that the K-means
-algorithm uses a random initialization of assignments, but since we set 
+algorithm uses a random initialization of assignments, but since we set
 the random seed in the beginning of this chapter, the clustering will be reproducible.
 
 ```{code-cell} ipython3
@@ -848,24 +848,24 @@ penguin_clust
 ```{index} K-means; inertia_, K-means; cluster_centers_, K-means; labels_, K-means; predict
 ```
 
-The fit `KMeans` object&mdash;which is the second item in the 
+The fit `KMeans` object&mdash;which is the second item in the
 pipeline, and can be accessed as `penguin_clust[1]`&mdash;has a lot of information
 that can be used to visualize the clusters, pick K, and evaluate the total WSSD.
-Let's start by visualizing the clusters as a colored scatter plot! In 
-order to do that, we first need to augment our 
-original `penguins` data frame with the cluster assignments. 
-We can access these using the `labels_` attribute of the clustering object 
-("labels" is a common alternative term to "assignments" in clustering), and 
+Let's start by visualizing the clusters as a colored scatter plot! In
+order to do that, we first need to augment our
+original `penguins` data frame with the cluster assignments.
+We can access these using the `labels_` attribute of the clustering object
+("labels" is a common alternative term to "assignments" in clustering), and
 add them to the data frame.
 
 ```{code-cell} ipython3
 penguins["cluster"] = penguin_clust[1].labels_
 penguins
 ```
 
-Now that we have the cluster assignments included in the `penguins` data frame, we can 
+Now that we have the cluster assignments included in the `penguins` data frame, we can
 visualize them as shown in {numref}`cluster_plot`.
-Note that we are plotting the *un-standardized* data here; if we for some reason wanted to 
+Note that we are plotting the *un-standardized* data here; if we for some reason wanted to
 visualize the *standardized* data, we would need to use the `fit` and `transform` functions
 on the `StandardScaler` preprocessor directly to obtain that first.
 As in {numref}`Chapter %s <viz>`,
@@ -937,7 +937,7 @@ For each value of K,
 we create a new KMeans model
 and wrap it in a `scikit-learn` pipeline
 with the preprocessor we created earlier.
-We store the WSSD values in a list that we will use to create a dataframe 
+We store the WSSD values in a list that we will use to create a dataframe
 of both the K-values and their corresponding WSSDs.
 
 ```{note}
@@ -954,7 +954,7 @@ it is always the safest to assign it to a variable name for reuse.
 ks = range(1, 10)
 wssds = [
     make_pipeline(
-        preprocessor, 
+        preprocessor,
         KMeans(n_clusters=k) # Create a new KMeans model with `k` clusters
     ).fit(penguins)[1].inertia_
     for k in ks
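Taken together, the clustering hunks above touch a workflow that can be sketched end to end as follows. This is a hedged reconstruction rather than the book's exact code: the file path and the `StandardScaler` column selection are assumptions, while the pipeline indexing, `labels_`, and `inertia_` usage mirror the hunks.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

penguins = pd.read_csv("data/penguins.csv")  # hypothetical path

# standardize the two clustering variables (column names assumed)
preprocessor = make_column_transformer(
    (StandardScaler(), ["bill_length_mm", "flipper_length_mm"]),
)

# fit K-means inside a pipeline; [1] accesses the fit KMeans step
penguin_clust = make_pipeline(preprocessor, KMeans(n_clusters=3)).fit(penguins)
penguins["cluster"] = penguin_clust[1].labels_  # cluster assignments

# total WSSD (inertia_) for K = 1..9, as in the last hunk above
ks = range(1, 10)
wssds = [
    make_pipeline(preprocessor, KMeans(n_clusters=k)).fit(penguins)[1].inertia_
    for k in ks
]
```

Plotting `wssds` against `ks` then gives the elbow plot used to choose K.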

source/index.md

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ kernelspec:
 
 ![](img/frontmatter/ds-a-first-intro-graphic.jpg)
 
-# Data Science 
+# Data Science
 
 ## *A First Introduction (Python Edition)*

source/inference.md

Lines changed: 2 additions & 2 deletions
@@ -317,7 +317,7 @@ with the `name` parameter:
 ```
 
 Below we put everything together
-and also filter the data frame to keep only the room types 
+and also filter the data frame to keep only the room types
 that we are interested in.
 
 ```{code-cell} ipython3
@@ -776,7 +776,7 @@ How large is "large enough?" Unfortunately, it depends entirely on the problem a
 as a rule of thumb, often a sample size of at least 20 will suffice.
 ```
 
-<!--- 
+<!---
 ```{note}
 If random samples of size $n$ are taken from a population, the sample mean
 $\bar{x}$ will be approximately Normal with mean $\mu$ and standard deviation
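The commented-out note in the second hunk states that sample means are approximately Normal; a small simulation sketch (mine, not from the commit; the population and seed are chosen arbitrarily) illustrates the effect:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
population = rng.exponential(scale=2.0, size=100_000)  # a skewed population
n = 40
sample_means = [rng.choice(population, size=n).mean() for _ in range(1_000)]

# the mean of the sample means sits near the population mean, and their
# spread is close to sigma / sqrt(n)
print(np.mean(sample_means), population.mean())
print(np.std(sample_means), population.std() / np.sqrt(n))
```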

source/intro.md

Lines changed: 10 additions & 10 deletions
@@ -579,13 +579,13 @@ and wrote `pd.read_csv`. The dot means that the thing on the left (`pd`, i.e., t
 thing on the right (the `read_csv` function). In the case of `can_lang.loc[]`, the thing on the left (the `can_lang` data frame)
 *provides* the thing on the right (the `loc[]` operation). In Python,
 both packages (like `pandas`) *and* objects (like our `can_lang` data frame) can provide functions
-and other objects that we access using the dot syntax. 
+and other objects that we access using the dot syntax.
 
 ```{note}
 A note on terminology: when an object `obj` provides a function `f` with the
 dot syntax (as in `obj.f()`), sometimes we call that function `f` a *method* of `obj` or an *operation* on `obj`.
-Similarly, when an object `obj` provides another object `x` with the dot syntax (as in `obj.x`), sometimes we call the object `x` an *attribute* of `obj`. 
-We will use all of these terms throughout the book, as you will see them used commonly in the community. 
+Similarly, when an object `obj` provides another object `x` with the dot syntax (as in `obj.x`), sometimes we call the object `x` an *attribute* of `obj`.
+We will use all of these terms throughout the book, as you will see them used commonly in the community.
 And just because we programmers like to be confusing for no apparent reason: we *don't* use the "method", "operation", or "attribute" terminology
 when referring to functions and objects from packages, like `pandas`. So for example, `pd.read_csv`
 would typically just be referred to as a function, but not as a method or operation, even though it uses the dot syntax.
@@ -665,18 +665,18 @@ a first one&mdash;so fear not and explore! To answer this small
 question-along-the-way, we need to divide each count in the `mother_tongue`
 column by the total Canadian population according to the 2016
 census&mdash;i.e., 35,151,728&mdash;and multiply it by 100. We can perform
-this computation using the code `100 * ten_lang["mother_tongue"] / canadian_population`. 
+this computation using the code `100 * ten_lang["mother_tongue"] / canadian_population`.
 Then to store the result in a new column (or
 overwrite an existing column), we specify the name of the new
-column to create (or old column to modify), then the assignment symbol `=`, 
+column to create (or old column to modify), then the assignment symbol `=`,
 and then the computation to store in that column. In this case, we will opt to
-create a new column called `mother_tongue_percent`. 
+create a new column called `mother_tongue_percent`.
 
 ```{note}
 You will see below that we write the Canadian population in
 Python as `35_151_728`. The underscores (`_`) are just there for readability,
-and do not affect how Python interprets the number. In other words, 
-`35151728` and `35_151_728` are treated identically in Python, 
+and do not affect how Python interprets the number. In other words,
+`35151728` and `35_151_728` are treated identically in Python,
 although the latter is much clearer!
 ```
@@ -695,7 +695,7 @@ ten_lang
 ```
 
 The `ten_lang_percent` data frame shows that
-the ten Aboriginal languages in the `ten_lang` data frame were spoken 
+the ten Aboriginal languages in the `ten_lang` data frame were spoken
 as a mother tongue by between 0.008% and 0.18% of the Canadian population.
 
 ## Combining analysis steps with chaining and multiline expressions
@@ -831,7 +831,7 @@ each language. When you move on to more complicated analyses, this issue only
 gets worse. In contrast, a *visualization* would convey this information in a much
 more easily understood format.
 Visualizations are a great tool for summarizing information to help you
-effectively communicate with your audience, and creating effective data visualizations 
+effectively communicate with your audience, and creating effective data visualizations
 is an essential component of any data
 analysis. In this section we will develop a visualization of the
 ten Aboriginal languages that were most often reported in 2016 as mother tongues in
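For reference, the computation spelled out in these hunks can be written as a complete snippet; the two rows below are hypothetical stand-ins, while the census total and the percentage expression come directly from the text:

```python
import pandas as pd

ten_lang = pd.DataFrame({
    "language": ["Cree, n.o.s.", "Inuktitut"],
    "mother_tongue": [64_050, 35_210],  # hypothetical counts for illustration
})
canadian_population = 35_151_728  # underscores are ignored by Python

ten_lang["mother_tongue_percent"] = (
    100 * ten_lang["mother_tongue"] / canadian_population
)
```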

source/reading.md

Lines changed: 19 additions & 19 deletions
@@ -84,7 +84,7 @@ could live on your computer (*local*) or somewhere on the internet (*remote*).
 The place where the file lives on your computer is referred to as its "path". You can
 think of the path as directions to the file. There are two kinds of paths:
 *relative* paths and *absolute* paths. A relative path indicates where the file is
-with respect to your *working directory* (i.e., "where you are currently") on the computer. 
+with respect to your *working directory* (i.e., "where you are currently") on the computer.
 On the other hand, an absolute path indicates where the file is
 with respect to the computer's filesystem base (or *root*) folder, regardless of where you are working.
@@ -112,7 +112,7 @@ proceeds by listing out the sequence of folders you would have to enter to reach
 So in this case, `happiness_report.csv` would be reached by starting at the root, and entering the `home` folder,
 then the `dsci-100` folder, then the `worksheet_02` folder, and then finally the `data` folder. So its absolute
 path would be `/home/dsci-100/worksheet_02/data/happiness_report.csv`. We can load the file using its absolute path
-as a string passed to the `read_csv` function from `pandas`. 
+as a string passed to the `read_csv` function from `pandas`.
 ```python
 happy_data = pd.read_csv("/home/dsci-100/worksheet_02/data/happiness_report.csv")
 ```
@@ -127,20 +127,20 @@ Note that there is no forward slash at the beginning of a relative path; if we a
 Python would look for a folder named `data` in the root folder of the computer&mdash;but that doesn't exist!
 
 Aside from specifying places to go in a path using folder names (like `data` and `worksheet_02`), we can also specify two additional
-special places: the *current directory* and the *previous directory*. We indicate the current working directory with a single dot `.`, and 
+special places: the *current directory* and the *previous directory*. We indicate the current working directory with a single dot `.`, and
 the previous directory with two dots `..`. So for instance, if we wanted to reach the `bike_share.csv` file from the `worksheet_02` folder, we could
 use the relative path `../tutorial_01/bike_share.csv`. We can even combine these two; for example, we could reach the `bike_share.csv` file using
-the (very silly) path `../tutorial_01/../tutorial_01/./bike_share.csv` with quite a few redundant directions: it says to go back a folder, then open `tutorial_01`, 
+the (very silly) path `../tutorial_01/../tutorial_01/./bike_share.csv` with quite a few redundant directions: it says to go back a folder, then open `tutorial_01`,
 then go back a folder again, then open `tutorial_01` again, then stay in the current directory, then finally get to `bike_share.csv`. Whew, what a long trip!
 
-So which kind of path should you use: relative, or absolute? Generally speaking, you should use relative paths. 
-Using a relative path helps ensure that your code can be run 
+So which kind of path should you use: relative, or absolute? Generally speaking, you should use relative paths.
+Using a relative path helps ensure that your code can be run
 on a different computer (and as an added bonus, relative paths are often shorter&mdash;easier to type!).
 This is because a file's relative path is often the same across different computers, while a
-file's absolute path (the names of 
-all of the folders between the computer's root, represented by `/`, and the file) isn't usually the same 
-across different computers. For example, suppose Fatima and Jayden are working on a 
-project together on the `happiness_report.csv` data. Fatima's file is stored at 
+file's absolute path (the names of
+all of the folders between the computer's root, represented by `/`, and the file) isn't usually the same
+across different computers. For example, suppose Fatima and Jayden are working on a
+project together on the `happiness_report.csv` data. Fatima's file is stored at
 
 ```
 /home/Fatima/project/data/happiness_report.csv
@@ -158,7 +158,7 @@ their different usernames. If Jayden has code that loads the
 `happiness_report.csv` data using an absolute path, the code won't work on
 Fatima's computer. But the relative path from inside the `project` folder
 (`data/happiness_report.csv`) is the same on both computers; any code that uses
-relative paths will work on both! In the additional resources section, 
+relative paths will work on both! In the additional resources section,
 we include a link to a short video on the
 difference between absolute and relative paths.
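Both paths discussed in these hunks can be compared directly in code; this sketch assumes you actually have the file in both locations, which of course only holds on the original machine:

```python
import pandas as pd

# absolute path: breaks on any computer without this exact folder layout
happy_data = pd.read_csv("/home/dsci-100/worksheet_02/data/happiness_report.csv")

# relative path: works wherever the working directory is the project folder
happy_data = pd.read_csv("data/happiness_report.csv")
```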

@@ -382,7 +382,7 @@ Non-Official & Non-Aboriginal languages Amharic 22465 12785 200 33670
 
 ```
 
-Data frames in Python need to have column names. Thus if you read in data 
+Data frames in Python need to have column names. Thus if you read in data
 without column names, Python will assign names automatically. In this example,
 Python assigns the column names `0, 1, 2, 3, 4, 5`.
 To read this data into Python, we specify the first
@@ -1237,7 +1237,7 @@ page = bs4.BeautifulSoup(wiki.content, "html.parser")
 import bs4
 
 # the above cell doesn't actually run; this one does run
-# and loads the html data from a local, static file 
+# and loads the html data from a local, static file
 
 with open("data/canada_wiki.html", "r") as f:
     wiki_hidden = f.read()
@@ -1303,7 +1303,7 @@ Using `requests` and `BeautifulSoup` to extract data based on CSS selectors is
 a very general way to scrape data from the web, albeit perhaps a little bit
 complicated. Fortunately, `pandas` provides the
 [`read_html`](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html)
-function, which is easier method to try when the data 
+function, which is easier method to try when the data
 appear on the webpage already in a tabular format. The `read_html` function takes one
 argument&mdash;the URL of the page to scrape&mdash;and will return a list of
 data frames corresponding to all the tables it finds at that URL. We can see
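As a quick sketch of the `read_html` call this hunk describes (the Wikipedia URL is a stand-in, and `read_html` needs an HTML parser such as `lxml` installed):

```python
import pandas as pd

# returns a list of data frames, one per table found on the page
tables = pd.read_html("https://en.wikipedia.org/wiki/Canada")
print(len(tables))
first_table = tables[0]
```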
@@ -1422,7 +1422,7 @@ endpoint is `https://api.nasa.gov/planetary/apod`. Second, we write `?`, which d
 list of *query parameters* will follow. And finally, we specify a list of
 query parameters of the form `parameter=value`, separated by `&` characters. The NASA
 "Astronomy Picture of the Day" API accepts the parameters shown in
-{numref}`fig:NASA-API-parameters`. 
+{numref}`fig:NASA-API-parameters`.
 
 ```{figure} img/reading/NASA-API-parameters.png
 :name: fig:NASA-API-parameters
@@ -1433,7 +1433,7 @@ along with syntax, default settings, and a description of each.
 
 So for example, to obtain the image of the day
 from July 13, 2023, the API query would have two parameters: `api_key=YOUR_API_KEY`
-and `date=2023-07-13`. Remember to replace `YOUR_API_KEY` with the API key you 
+and `date=2023-07-13`. Remember to replace `YOUR_API_KEY` with the API key you
 received from NASA in your email! Putting it all together, the query will look like the following:
 ```
 https://api.nasa.gov/planetary/apod?api_key=YOUR_API_KEY&date=2023-07-13
@@ -1474,7 +1474,7 @@ you will recognize the same query URL that we pasted into the browser earlier.
 We will then obtain a JSON representation of the
 response using the `json` method.
 
-<!-- we have disabled the below code for reproducibility, with hidden setting 
+<!-- we have disabled the below code for reproducibility, with hidden setting
 of the nasa_data object. But you can reproduce this using the DEMO_KEY key -->
 ```python
 import requests
@@ -1491,14 +1491,14 @@ import json
 with open("data/nasa.json", "r") as f:
     nasa_data = json.load(f)
 # the last entry in the stored data is July 13, 2023, so print that
-nasa_data[-1] 
+nasa_data[-1]
 ```
 
 We can obtain more records at once by using the `start_date` and `end_date` parameters, as
 shown in the table of parameters in {numref}`fig:NASA-API-parameters`.
 Let's obtain all the records between May 1, 2023, and July 13, 2023, and store the result
 in an object called `nasa_data`; now the response
-will take the form of a Python list. Each item in the list will correspond to a single day's record (just like the `nasa_data_single` object), 
+will take the form of a Python list. Each item in the list will correspond to a single day's record (just like the `nasa_data_single` object),
 and there will be 74 items total, one for each day between the start and end dates:
 
 ```python
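Putting these last hunks together, the query can be issued with `requests` along these lines; this is a sketch, with `DEMO_KEY` (NASA's public demo key, mentioned in the hidden comment above) standing in for a personal API key:

```python
import requests

response = requests.get(
    "https://api.nasa.gov/planetary/apod",
    params={
        "api_key": "DEMO_KEY",
        "start_date": "2023-05-01",
        "end_date": "2023-07-13",
    },
)
nasa_data = response.json()  # a list of daily records; 74 for this date range
nasa_data[-1]  # the last record, July 13, 2023
```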
