Commit 626063b

remove trailing whitespaces
1 parent 6800dcb commit 626063b

14 files changed (+409, -409 lines)
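For anyone wanting to reproduce this cleanup locally, a minimal sketch is below; it assumes the book's Markdown sources live under `source/` in the working directory, matching the paths touched in this commit:

```python
# A minimal sketch of this commit's cleanup: strip trailing whitespace
# from every Markdown file under source/, keeping a final newline.
from pathlib import Path

for md_file in Path("source").glob("*.md"):
    lines = md_file.read_text().splitlines()
    md_file.write_text("\n".join(line.rstrip() for line in lines) + "\n")
```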

source/acknowledgements.md

Lines changed: 1 addition & 1 deletion
@@ -58,7 +58,7 @@ We would like to give special thanks to Navya Dahiya and Gloria Ye
 for completing the first round of translation of the R material to Python,
 and to Philip Austin for his leadership and guidance throughout the translation process.
 We also gratefully acknowledge the UBC Open Educational Resources Fund
-and the UBC Department of Statistics for supporting the translation of 
+and the UBC Department of Statistics for supporting the translation of
 the original R textbook and exercises to the Python programming language.

source/authors.md

Lines changed: 9 additions & 9 deletions
@@ -52,7 +52,7 @@ initiatives.
 +++
 
 **[Joel Ostblom](https://joelostblom.com/)** is an Assistant Professor of Teaching in the Department of
-Statistics at the University of British Columbia. 
+Statistics at the University of British Columbia.
 During his PhD, Joel developed a passion for data science and reproducibility
 through the development of quantitative image analysis pipelines for studying
 stem cell and developmental biology. He has since co-created or lead the
@@ -64,12 +64,12 @@ contributions to open source projects and data science learning resources.
 
 +++
 
-**[Lindsey Heagy](https://lindseyjh.ca/)** is an Assistant Professor in the Department of Earth, Ocean, and Atmospheric 
-Sciences and director of the Geophysical Inversion Facility at the University of British Columbia. 
-Her research combines computational methods in numerical simulations, inversions, and machine 
-learning to answer questions about the subsurface of the Earth. Primary applications include 
-mineral exploration, carbon sequestration, groundwater and environmental studies. She 
-completed her BSc at the University of Alberta, her PhD at the University of British Columbia, 
-and held a Postdoctoral research position at the University of California Berkeley prior to 
-starting her current position at UBC. 
+**[Lindsey Heagy](https://lindseyjh.ca/)** is an Assistant Professor in the Department of Earth, Ocean, and Atmospheric
+Sciences and director of the Geophysical Inversion Facility at the University of British Columbia.
+Her research combines computational methods in numerical simulations, inversions, and machine
+learning to answer questions about the subsurface of the Earth. Primary applications include
+mineral exploration, carbon sequestration, groundwater and environmental studies. She
+completed her BSc at the University of Alberta, her PhD at the University of British Columbia,
+and held a Postdoctoral research position at the University of California Berkeley prior to
+starting her current position at UBC.

source/classification1.md

Lines changed: 114 additions & 114 deletions
Large diffs are not rendered by default.

source/classification2.md

Lines changed: 165 additions & 165 deletions
Large diffs are not rendered by default.

source/clustering.md

Lines changed: 13 additions & 13 deletions
@@ -20,7 +20,7 @@ kernelspec:
 
 # get rid of futurewarnings from sklearn kmeans
 import warnings
-warnings.simplefilter(action='ignore', category=FutureWarning) 
+warnings.simplefilter(action='ignore', category=FutureWarning)
 
 from chapter_preamble import *
 ```
@@ -130,7 +130,7 @@ In this chapter we will focus on a data set from
 [the `palmerpenguins` R package](https://allisonhorst.github.io/palmerpenguins/) {cite:p}`palmerpenguins`. This
 data set was collected by Dr. Kristen Gorman and
 the Palmer Station, Antarctica Long Term Ecological Research Site, and includes
-measurements for adult penguins ({numref}`09-penguins`) found near there {cite:p}`penguinpaper`. 
+measurements for adult penguins ({numref}`09-penguins`) found near there {cite:p}`penguinpaper`.
 Our goal will be to use two
 variables&mdash;penguin bill and flipper length, both in millimeters&mdash;to determine whether
 there are distinct types of penguins in our data.
@@ -834,7 +834,7 @@ kmeans
 
 To actually run the K-means clustering, we combine the preprocessor and model object
 in a `Pipeline`, and use the `fit` function. Note that the K-means
-algorithm uses a random initialization of assignments, but since we set 
+algorithm uses a random initialization of assignments, but since we set
 the random seed in the beginning of this chapter, the clustering will be reproducible.
 
 ```{code-cell} ipython3
@@ -848,24 +848,24 @@ penguin_clust
 ```{index} K-means; inertia_, K-means; cluster_centers_, K-means; labels_, K-means; predict
 ```
 
-The fit `KMeans` object&mdash;which is the second item in the 
+The fit `KMeans` object&mdash;which is the second item in the
 pipeline, and can be accessed as `penguin_clust[1]`&mdash;has a lot of information
 that can be used to visualize the clusters, pick K, and evaluate the total WSSD.
-Let's start by visualizing the clusters as a colored scatter plot! In 
-order to do that, we first need to augment our 
-original `penguins` data frame with the cluster assignments. 
-We can access these using the `labels_` attribute of the clustering object 
-("labels" is a common alternative term to "assignments" in clustering), and 
+Let's start by visualizing the clusters as a colored scatter plot! In
+order to do that, we first need to augment our
+original `penguins` data frame with the cluster assignments.
+We can access these using the `labels_` attribute of the clustering object
+("labels" is a common alternative term to "assignments" in clustering), and
 add them to the data frame.
 
 ```{code-cell} ipython3
 penguins["cluster"] = penguin_clust[1].labels_
 penguins
 ```
 
-Now that we have the cluster assignments included in the `penguins` data frame, we can 
+Now that we have the cluster assignments included in the `penguins` data frame, we can
 visualize them as shown in {numref}`cluster_plot`.
-Note that we are plotting the *un-standardized* data here; if we for some reason wanted to 
+Note that we are plotting the *un-standardized* data here; if we for some reason wanted to
 visualize the *standardized* data, we would need to use the `fit` and `transform` functions
 on the `StandardScaler` preprocessor directly to obtain that first.
 As in {numref}`Chapter %s <viz>`,
@@ -937,7 +937,7 @@ For each value of K,
 we create a new KMeans model
 and wrap it in a `scikit-learn` pipeline
 with the preprocessor we created earlier.
-We store the WSSD values in a list that we will use to create a dataframe 
+We store the WSSD values in a list that we will use to create a dataframe
 of both the K-values and their corresponding WSSDs.
 
 ```{note}
@@ -954,7 +954,7 @@ it is always the safest to assign it to a variable name for reuse.
 ks = range(1, 10)
 wssds = [
     make_pipeline(
-        preprocessor, 
+        preprocessor,
         KMeans(n_clusters=k) # Create a new KMeans model with `k` clusters
     ).fit(penguins)[1].inertia_
     for k in ks
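Taken together, the clustering hunks above touch a workflow that can be sketched end to end as follows. This is a hedged reconstruction rather than the book's exact code: the file path and the `StandardScaler` column selection are assumptions, while the pipeline indexing, `labels_`, and `inertia_` usage mirror the hunks.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

penguins = pd.read_csv("data/penguins.csv")  # hypothetical path

# standardize the two clustering variables (column names assumed)
preprocessor = make_column_transformer(
    (StandardScaler(), ["bill_length_mm", "flipper_length_mm"]),
)

# fit K-means inside a pipeline; [1] accesses the fit KMeans step
penguin_clust = make_pipeline(preprocessor, KMeans(n_clusters=3)).fit(penguins)
penguins["cluster"] = penguin_clust[1].labels_  # cluster assignments

# total WSSD (inertia_) for K = 1..9, as in the last hunk above
ks = range(1, 10)
wssds = [
    make_pipeline(preprocessor, KMeans(n_clusters=k)).fit(penguins)[1].inertia_
    for k in ks
]
```

Plotting `wssds` against `ks` then gives the elbow plot used to choose K.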

source/index.md

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ kernelspec:
 
 ![](img/frontmatter/ds-a-first-intro-graphic.jpg)
 
-# Data Science 
+# Data Science
 
 ## *A First Introduction (Python Edition)*

source/inference.md

Lines changed: 2 additions & 2 deletions
@@ -317,7 +317,7 @@ with the `name` parameter:
 ```
 
 Below we put everything together
-and also filter the data frame to keep only the room types 
+and also filter the data frame to keep only the room types
 that we are interested in.
 
 ```{code-cell} ipython3
@@ -776,7 +776,7 @@ How large is "large enough?" Unfortunately, it depends entirely on the problem a
 as a rule of thumb, often a sample size of at least 20 will suffice.
 ```
 
-<!--- 
+<!---
 ```{note}
 If random samples of size $n$ are taken from a population, the sample mean
 $\bar{x}$ will be approximately Normal with mean $\mu$ and standard deviation
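The commented-out note in the second hunk states that sample means are approximately Normal; a small simulation sketch (mine, not from the commit; the population and seed are chosen arbitrarily) illustrates the effect:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
population = rng.exponential(scale=2.0, size=100_000)  # a skewed population
n = 40
sample_means = [rng.choice(population, size=n).mean() for _ in range(1_000)]

# the mean of the sample means sits near the population mean, and their
# spread is close to sigma / sqrt(n)
print(np.mean(sample_means), population.mean())
print(np.std(sample_means), population.std() / np.sqrt(n))
```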

source/intro.md

Lines changed: 10 additions & 10 deletions
@@ -579,13 +579,13 @@ and wrote `pd.read_csv`. The dot means that the thing on the left (`pd`, i.e., t
 thing on the right (the `read_csv` function). In the case of `can_lang.loc[]`, the thing on the left (the `can_lang` data frame)
 *provides* the thing on the right (the `loc[]` operation). In Python,
 both packages (like `pandas`) *and* objects (like our `can_lang` data frame) can provide functions
-and other objects that we access using the dot syntax. 
+and other objects that we access using the dot syntax.
 
 ```{note}
 A note on terminology: when an object `obj` provides a function `f` with the
 dot syntax (as in `obj.f()`), sometimes we call that function `f` a *method* of `obj` or an *operation* on `obj`.
-Similarly, when an object `obj` provides another object `x` with the dot syntax (as in `obj.x`), sometimes we call the object `x` an *attribute* of `obj`. 
-We will use all of these terms throughout the book, as you will see them used commonly in the community. 
+Similarly, when an object `obj` provides another object `x` with the dot syntax (as in `obj.x`), sometimes we call the object `x` an *attribute* of `obj`.
+We will use all of these terms throughout the book, as you will see them used commonly in the community.
 And just because we programmers like to be confusing for no apparent reason: we *don't* use the "method", "operation", or "attribute" terminology
 when referring to functions and objects from packages, like `pandas`. So for example, `pd.read_csv`
 would typically just be referred to as a function, but not as a method or operation, even though it uses the dot syntax.
@@ -665,18 +665,18 @@ a first one&mdash;so fear not and explore! To answer this small
 question-along-the-way, we need to divide each count in the `mother_tongue`
 column by the total Canadian population according to the 2016
 census&mdash;i.e., 35,151,728&mdash;and multiply it by 100. We can perform
-this computation using the code `100 * ten_lang["mother_tongue"] / canadian_population`. 
+this computation using the code `100 * ten_lang["mother_tongue"] / canadian_population`.
 Then to store the result in a new column (or
 overwrite an existing column), we specify the name of the new
-column to create (or old column to modify), then the assignment symbol `=`, 
+column to create (or old column to modify), then the assignment symbol `=`,
 and then the computation to store in that column. In this case, we will opt to
-create a new column called `mother_tongue_percent`. 
+create a new column called `mother_tongue_percent`.
 
 ```{note}
 You will see below that we write the Canadian population in
 Python as `35_151_728`. The underscores (`_`) are just there for readability,
-and do not affect how Python interprets the number. In other words, 
-`35151728` and `35_151_728` are treated identically in Python, 
+and do not affect how Python interprets the number. In other words,
+`35151728` and `35_151_728` are treated identically in Python,
 although the latter is much clearer!
 ```
@@ -695,7 +695,7 @@ ten_lang
 ```
 
 The `ten_lang_percent` data frame shows that
-the ten Aboriginal languages in the `ten_lang` data frame were spoken 
+the ten Aboriginal languages in the `ten_lang` data frame were spoken
 as a mother tongue by between 0.008% and 0.18% of the Canadian population.
 
 ## Combining analysis steps with chaining and multiline expressions
@@ -831,7 +831,7 @@ each language. When you move on to more complicated analyses, this issue only
 gets worse. In contrast, a *visualization* would convey this information in a much
 more easily understood format.
 Visualizations are a great tool for summarizing information to help you
-effectively communicate with your audience, and creating effective data visualizations 
+effectively communicate with your audience, and creating effective data visualizations
 is an essential component of any data
 analysis. In this section we will develop a visualization of the
 ten Aboriginal languages that were most often reported in 2016 as mother tongues in
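For reference, the computation spelled out in these hunks can be written as a complete snippet; the two rows below are hypothetical stand-ins, while the census total and the percentage expression come directly from the text:

```python
import pandas as pd

ten_lang = pd.DataFrame({
    "language": ["Cree, n.o.s.", "Inuktitut"],
    "mother_tongue": [64_050, 35_210],  # hypothetical counts for illustration
})
canadian_population = 35_151_728  # underscores are ignored by Python

ten_lang["mother_tongue_percent"] = (
    100 * ten_lang["mother_tongue"] / canadian_population
)
```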

source/reading.md

Lines changed: 19 additions & 19 deletions
@@ -84,7 +84,7 @@ could live on your computer (*local*) or somewhere on the internet (*remote*).
 The place where the file lives on your computer is referred to as its "path". You can
 think of the path as directions to the file. There are two kinds of paths:
 *relative* paths and *absolute* paths. A relative path indicates where the file is
-with respect to your *working directory* (i.e., "where you are currently") on the computer. 
+with respect to your *working directory* (i.e., "where you are currently") on the computer.
 On the other hand, an absolute path indicates where the file is
 with respect to the computer's filesystem base (or *root*) folder, regardless of where you are working.
@@ -112,7 +112,7 @@ proceeds by listing out the sequence of folders you would have to enter to reach
 So in this case, `happiness_report.csv` would be reached by starting at the root, and entering the `home` folder,
 then the `dsci-100` folder, then the `worksheet_02` folder, and then finally the `data` folder. So its absolute
 path would be `/home/dsci-100/worksheet_02/data/happiness_report.csv`. We can load the file using its absolute path
-as a string passed to the `read_csv` function from `pandas`. 
+as a string passed to the `read_csv` function from `pandas`.
 ```python
 happy_data = pd.read_csv("/home/dsci-100/worksheet_02/data/happiness_report.csv")
 ```
@@ -127,20 +127,20 @@ Note that there is no forward slash at the beginning of a relative path; if we a
 Python would look for a folder named `data` in the root folder of the computer&mdash;but that doesn't exist!
 
 Aside from specifying places to go in a path using folder names (like `data` and `worksheet_02`), we can also specify two additional
-special places: the *current directory* and the *previous directory*. We indicate the current working directory with a single dot `.`, and 
+special places: the *current directory* and the *previous directory*. We indicate the current working directory with a single dot `.`, and
 the previous directory with two dots `..`. So for instance, if we wanted to reach the `bike_share.csv` file from the `worksheet_02` folder, we could
 use the relative path `../tutorial_01/bike_share.csv`. We can even combine these two; for example, we could reach the `bike_share.csv` file using
-the (very silly) path `../tutorial_01/../tutorial_01/./bike_share.csv` with quite a few redundant directions: it says to go back a folder, then open `tutorial_01`, 
+the (very silly) path `../tutorial_01/../tutorial_01/./bike_share.csv` with quite a few redundant directions: it says to go back a folder, then open `tutorial_01`,
 then go back a folder again, then open `tutorial_01` again, then stay in the current directory, then finally get to `bike_share.csv`. Whew, what a long trip!
 
-So which kind of path should you use: relative, or absolute? Generally speaking, you should use relative paths. 
-Using a relative path helps ensure that your code can be run 
+So which kind of path should you use: relative, or absolute? Generally speaking, you should use relative paths.
+Using a relative path helps ensure that your code can be run
 on a different computer (and as an added bonus, relative paths are often shorter&mdash;easier to type!).
 This is because a file's relative path is often the same across different computers, while a
-file's absolute path (the names of 
-all of the folders between the computer's root, represented by `/`, and the file) isn't usually the same 
-across different computers. For example, suppose Fatima and Jayden are working on a 
-project together on the `happiness_report.csv` data. Fatima's file is stored at 
+file's absolute path (the names of
+all of the folders between the computer's root, represented by `/`, and the file) isn't usually the same
+across different computers. For example, suppose Fatima and Jayden are working on a
+project together on the `happiness_report.csv` data. Fatima's file is stored at
 
 ```
 /home/Fatima/project/data/happiness_report.csv
@@ -158,7 +158,7 @@ their different usernames. If Jayden has code that loads the
 `happiness_report.csv` data using an absolute path, the code won't work on
 Fatima's computer. But the relative path from inside the `project` folder
 (`data/happiness_report.csv`) is the same on both computers; any code that uses
-relative paths will work on both! In the additional resources section, 
+relative paths will work on both! In the additional resources section,
 we include a link to a short video on the
 difference between absolute and relative paths.
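Both paths discussed in these hunks can be compared directly in code; this sketch assumes you actually have the file in both locations, which of course only holds on the original machine:

```python
import pandas as pd

# absolute path: breaks on any computer without this exact folder layout
happy_data = pd.read_csv("/home/dsci-100/worksheet_02/data/happiness_report.csv")

# relative path: works wherever the working directory is the project folder
happy_data = pd.read_csv("data/happiness_report.csv")
```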

@@ -382,7 +382,7 @@ Non-Official & Non-Aboriginal languages Amharic 22465 12785 200 33670
 
 ```
 
-Data frames in Python need to have column names. Thus if you read in data 
+Data frames in Python need to have column names. Thus if you read in data
 without column names, Python will assign names automatically. In this example,
 Python assigns the column names `0, 1, 2, 3, 4, 5`.
 To read this data into Python, we specify the first
@@ -1237,7 +1237,7 @@ page = bs4.BeautifulSoup(wiki.content, "html.parser")
 import bs4
 
 # the above cell doesn't actually run; this one does run
-# and loads the html data from a local, static file 
+# and loads the html data from a local, static file
 
 with open("data/canada_wiki.html", "r") as f:
     wiki_hidden = f.read()
@@ -1303,7 +1303,7 @@ Using `requests` and `BeautifulSoup` to extract data based on CSS selectors is
 a very general way to scrape data from the web, albeit perhaps a little bit
 complicated. Fortunately, `pandas` provides the
 [`read_html`](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html)
-function, which is easier method to try when the data 
+function, which is easier method to try when the data
 appear on the webpage already in a tabular format. The `read_html` function takes one
 argument&mdash;the URL of the page to scrape&mdash;and will return a list of
 data frames corresponding to all the tables it finds at that URL. We can see
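As a quick sketch of the `read_html` call this hunk describes (the Wikipedia URL is a stand-in, and `read_html` needs an HTML parser such as `lxml` installed):

```python
import pandas as pd

# returns a list of data frames, one per table found on the page
tables = pd.read_html("https://en.wikipedia.org/wiki/Canada")
print(len(tables))
first_table = tables[0]
```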
@@ -1422,7 +1422,7 @@ endpoint is `https://api.nasa.gov/planetary/apod`. Second, we write `?`, which d
 list of *query parameters* will follow. And finally, we specify a list of
 query parameters of the form `parameter=value`, separated by `&` characters. The NASA
 "Astronomy Picture of the Day" API accepts the parameters shown in
-{numref}`fig:NASA-API-parameters`. 
+{numref}`fig:NASA-API-parameters`.
 
 ```{figure} img/reading/NASA-API-parameters.png
 :name: fig:NASA-API-parameters
@@ -1433,7 +1433,7 @@ along with syntax, default settings, and a description of each.
 
 So for example, to obtain the image of the day
 from July 13, 2023, the API query would have two parameters: `api_key=YOUR_API_KEY`
-and `date=2023-07-13`. Remember to replace `YOUR_API_KEY` with the API key you 
+and `date=2023-07-13`. Remember to replace `YOUR_API_KEY` with the API key you
 received from NASA in your email! Putting it all together, the query will look like the following:
 ```
 https://api.nasa.gov/planetary/apod?api_key=YOUR_API_KEY&date=2023-07-13
@@ -1474,7 +1474,7 @@ you will recognize the same query URL that we pasted into the browser earlier.
 We will then obtain a JSON representation of the
 response using the `json` method.
 
-<!-- we have disabled the below code for reproducibility, with hidden setting 
+<!-- we have disabled the below code for reproducibility, with hidden setting
 of the nasa_data object. But you can reproduce this using the DEMO_KEY key -->
 ```python
 import requests
@@ -1491,14 +1491,14 @@ import json
 with open("data/nasa.json", "r") as f:
     nasa_data = json.load(f)
 # the last entry in the stored data is July 13, 2023, so print that
-nasa_data[-1] 
+nasa_data[-1]
 ```
 
 We can obtain more records at once by using the `start_date` and `end_date` parameters, as
 shown in the table of parameters in {numref}`fig:NASA-API-parameters`.
 Let's obtain all the records between May 1, 2023, and July 13, 2023, and store the result
 in an object called `nasa_data`; now the response
-will take the form of a Python list. Each item in the list will correspond to a single day's record (just like the `nasa_data_single` object), 
+will take the form of a Python list. Each item in the list will correspond to a single day's record (just like the `nasa_data_single` object),
 and there will be 74 items total, one for each day between the start and end dates:
 
 ```python
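Putting these last hunks together, the query can be issued with `requests` along these lines; this is a sketch, with `DEMO_KEY` (NASA's public demo key, mentioned in the hidden comment above) standing in for a personal API key:

```python
import requests

response = requests.get(
    "https://api.nasa.gov/planetary/apod",
    params={
        "api_key": "DEMO_KEY",
        "start_date": "2023-05-01",
        "end_date": "2023-07-13",
    },
)
nasa_data = response.json()  # a list of daily records; 74 for this date range
nasa_data[-1]  # the last record, July 13, 2023
```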
