
Commit 9e68db3

reading pdf fixes
1 parent be4bc5b commit 9e68db3

File tree

1 file changed: +44 -21 lines changed


source/reading.md

Lines changed: 44 additions & 21 deletions
@@ -109,14 +109,16 @@ So in this case, `happiness_report.csv` would be reached by starting at the root
 then the `dsci-100` folder, then the `project3` folder, and then finally the `data` folder. So its absolute
 path would be `/home/dsci-100/project3/data/happiness_report.csv`. We can load the file using its absolute path
 as a string passed to the `read_csv` function from `pandas`.
-```python
+```{code-cell} ipython3
+:tags: ["remove-output"]
 happy_data = pd.read_csv("/home/dsci-100/project3/data/happiness_report.csv")
 ```
 If we instead wanted to use a relative path, we would need to list out the sequence of steps needed to get from our current
 working directory to the file, with slashes `/` separating each step. Since we are currently in the `project3` folder,
 we just need to enter the `data` folder to reach our desired file. Hence the relative path is `data/happiness_report.csv`,
 and we can load the file using its relative path as a string passed to `read_csv`.
-```python
+```{code-cell} ipython3
+:tags: ["remove-output"]
 happy_data = pd.read_csv("data/happiness_report.csv")
 ```
 Note that there is no forward slash at the beginning of a relative path; if we accidentally typed `"/data/happiness_report.csv"`,
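The absolute-versus-relative distinction in this hunk can be sketched with Python's standard `pathlib`. A minimal sketch, reusing the chapter's example `/home/dsci-100/project3` working directory as a pure (non-filesystem) path so nothing need exist on disk:

```python
from pathlib import PurePosixPath

# A relative path is interpreted against the current working directory;
# joining it onto that directory yields the equivalent absolute path.
cwd = PurePosixPath("/home/dsci-100/project3")        # pretend working directory
relative = PurePosixPath("data/happiness_report.csv")

absolute = cwd / relative
print(absolute)                                        # /home/dsci-100/project3/data/happiness_report.csv
print(relative.is_absolute(), absolute.is_absolute())  # False True
```

An absolute path starts at the root `/`, so `is_absolute()` distinguishes the two forms directly.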
@@ -147,13 +149,13 @@ all of the folders between the computer's root, represented by `/`, and the file
 across different computers. For example, suppose Fatima and Jayden are working on a
 project together on the `happiness_report.csv` data. Fatima's file is stored at
 
-```
+```text
 /home/Fatima/project3/data/happiness_report.csv
 ```
 
 while Jayden's is stored at
 
-```
+```text
 /home/Jayden/project3/data/happiness_report.csv
 ```
 
@@ -275,11 +277,13 @@ With this extra information being present at the top of the file, using
 into Python. In the case of this file, Python just prints a `ParserError`
 message, indicating that it wasn't able to read the file.
 
-```python
+```{code-cell} ipython3
+:tags: ["remove-output"]
 canlang_data = pd.read_csv("data/can_lang_meta-data.csv")
 ```
-```text
-ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 6
+```{code-cell} ipython3
+:tags: ["remove-input"]
+print("ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 6")
 ```
 
 ```{index} ParserError
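The failure this hunk documents comes from metadata lines whose field count doesn't match the table below them. A stdlib-only sketch of the same situation, with made-up file contents standing in for `can_lang_meta-data.csv` (the `csv` module plays the role of `pandas.read_csv`; skipping the metadata rows is what pandas' `skiprows` parameter does):

```python
import csv
import io

# Made-up stand-in for a CSV file with three metadata lines above the header;
# a parser expecting the same number of fields on every line trips on them.
raw = """Data source: a made-up census
Date collected: 2020/07/09
Author: a made-up agency
category,language,mothertongue
Aboriginal languages,"Aboriginal languages, n.o.s.",590
Non-Official & Non-Aboriginal languages,Afrikaans,10260
"""

rows = list(csv.reader(io.StringIO(raw)))
print(len(rows[0]))  # 1 -> the first metadata line has a single field
print(rows[3])       # ['category', 'language', 'mothertongue']

# Skipping the three metadata lines (cf. skiprows=3 in pandas.read_csv)
# leaves a clean, rectangular table starting at the header row.
table = rows[3:]
print(len(table))    # 3
```

The error message in the hunk ("Expected 1 fields in line 4, saw 6") is exactly this mismatch: the parser inferred one field from the first metadata line, then hit the six-field header.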
@@ -841,7 +845,8 @@ be able to connect to a database using this information.
 ```{index} ibis; postgres, ibis; connect
 ```
 
-```python
+```{code-cell} ipython3
+:tags: ["remove-output"]
 conn = ibis.postgres.connect(
     database="can_mov_db",
     host="fakeserver.stat.ubc.ca",
@@ -859,12 +864,14 @@ connecting to and working with an SQLite database. For example, we can again use
 ```{index} ibis; list_tables
 ```
 
-```python
+```{code-cell} ipython3
+:tags: ["remove-output"]
 conn.list_tables()
 ```
 
-```text
-["themes", "medium", "titles", "title_aliases", "forms", "episodes", "names", "names_occupations", "occupation", "ratings"]
+```{code-cell} ipython3
+:tags: ["remove-input"]
+print('["themes", "medium", "titles", "title_aliases", "forms", "episodes", "names", "names_occupations", "occupation", "ratings"]')
 ```
 
 We see that there are 10 tables in this database. Let's first look at the
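Since the chapter's connection here is to an SQLite database, the table listing can be sketched with Python's stdlib `sqlite3` against an in-memory database with made-up tables; querying SQLite's `sqlite_master` catalog yields roughly what `list_tables` reports:

```python
import sqlite3

# In-memory SQLite database with two made-up tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (title TEXT, average_rating REAL, num_votes INTEGER)")
conn.execute("CREATE TABLE titles (title TEXT)")

# sqlite_master is SQLite's built-in catalog of schema objects.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]
print(tables)  # ['ratings', 'titles']
```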
@@ -874,16 +881,20 @@ database.
 ```{index} ibis; table
 ```
 
-```python
+```{code-cell} ipython3
+:tags: ["remove-output"]
 ratings_table = conn.table("ratings")
 ratings_table
 ```
 
-```text
+```{code-cell} ipython3
+:tags: ["remove-input"]
+print("""
 AlchemyTable: ratings
   title          string
   average_rating float64
   num_votes      int64
+""")
 ```
 
 ```{index} ibis; []
@@ -892,12 +903,15 @@ AlchemyTable: ratings
 To find the lowest rating that exists in the database, we first need to
 select the `average_rating` column:
 
-```python
+```{code-cell} ipython3
+:tags: ["remove-output"]
 avg_rating = ratings_table[["average_rating"]]
 avg_rating
 ```
 
-```text
+```{code-cell} ipython3
+:tags: ["remove-input"]
+print("""
 r0 := AlchemyTable: ratings
   title          string
   average_rating float64
@@ -906,6 +920,7 @@ r0 := AlchemyTable: ratings
 Selection[r0]
   selections:
     average_rating: r0.average_rating
+""")
 ```
 
 ```{index} database; ordering, ibis; order_by, ibis; head
@@ -914,7 +929,8 @@ Selection[r0]
 Next we use the `order_by` function from `ibis` to order the table by `average_rating`,
 and then the `head` function to select the first row (i.e., the lowest score).
 
-```python
+```{code-cell} ipython3
+:tags: ["remove-output"]
 lowest = avg_rating.order_by("average_rating").head(1)
 lowest.execute()
 ```
@@ -925,7 +941,6 @@ lowest = pd.DataFrame({"average_rating" : [1.0]})
 lowest
 ```
 
-
 We see the lowest rating given to a movie is 1, indicating that it must have
 been a really bad movie...
 
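The `order_by(...).head(1)` pipeline above corresponds to an `ORDER BY ... LIMIT 1` query once executed against the database. A stdlib `sqlite3` sketch of that query, with made-up ratings rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (title TEXT, average_rating REAL, num_votes INTEGER)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [("A", 7.5, 120), ("B", 1.0, 45), ("C", 9.1, 300)],  # made-up rows
)

# order_by("average_rating") then head(1) -> ORDER BY ... LIMIT 1 in SQL.
lowest = conn.execute(
    "SELECT average_rating FROM ratings ORDER BY average_rating LIMIT 1"
).fetchone()
print(lowest[0])  # 1.0
```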
@@ -1250,7 +1265,8 @@ page we want to scrape by providing its URL in quotations to the `requests.get`
 function. This function obtains the raw HTML of the page, which we then
 pass to the `BeautifulSoup` function for parsing:
 
-```python
+```{code-cell} ipython3
+:tags: ["remove-output"]
 import requests
 import bs4
 
@@ -1338,7 +1354,8 @@ below that `read_html` found 17 tables on the Wikipedia page for Canada.
 ```{index} read function; read_html
 ```
 
-```python
+```{code-cell} ipython3
+:tags: ["remove-output"]
 canada_wiki_tables = pd.read_html("https://en.wikipedia.org/wiki/Canada")
 len(canada_wiki_tables)
 ```
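`read_html` works by locating `<table>` elements in the page's HTML. A minimal stdlib sketch of that discovery step using `html.parser`, on a made-up snippet rather than the live Wikipedia page:

```python
from html.parser import HTMLParser

# Count <table> start tags -- the elements read_html would try to parse.
class TableCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.count += 1

page = "<html><body><table></table><p>text</p><table></table></body></html>"
counter = TableCounter()
counter.feed(page)
print(counter.count)  # 2
```

`read_html` goes further by converting each such element into a DataFrame, but the first step is this kind of tag discovery.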
@@ -1514,7 +1531,8 @@ response using the `json` method.
 
 <!-- we have disabled the below code for reproducibility, with hidden setting
 of the nasa_data object. But you can reproduce this using the DEMO_KEY key -->
-```python
+```{code-cell} ipython3
+:tags: ["remove-output"]
 import requests
 
 nasa_data_single = requests.get(
@@ -1539,7 +1557,8 @@ in an object called `nasa_data`; now the response
 will take the form of a Python list. Each item in the list will correspond to a single day's record (just like the `nasa_data_single` object),
 and there will be 74 items total, one for each day between the start and end dates:
 
-```python
+```{code-cell} ipython3
+:tags: ["remove-output"]
 nasa_data = requests.get(
     "https://api.nasa.gov/planetary/apod?api_key=YOUR_API_KEY&start_date=2023-05-01&end_date=2023-07-13"
 ).json()
@@ -1548,6 +1567,10 @@ len(nasa_data)
 
 ```{code-cell} ipython3
 :tags: [remove-input]
+# need to secretly re-load the nasa data again because the above running code destroys it
+# see PR 341 for why we need to do things this way (essentially due to PDF build)
+with open("data/nasa.json", "r") as f:
+    nasa_data = json.load(f)
 len(nasa_data)
 ```
 
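The hidden cell added in this hunk re-reads `nasa_data` from a cached JSON file so the rendered output stays reproducible during the PDF build. The cache-then-reload pattern it uses, with a made-up stand-in for the API response and a temporary file in place of `data/nasa.json`:

```python
import json
import os
import tempfile

# Made-up stand-in for the cached API response: one dict per day.
records = [{"date": "2023-05-01", "title": "A"}, {"date": "2023-05-02", "title": "B"}]

path = os.path.join(tempfile.mkdtemp(), "nasa.json")
with open(path, "w") as f:
    json.dump(records, f)       # cache the response once

with open(path, "r") as f:
    nasa_data = json.load(f)    # later cells reload the cached copy

print(len(nasa_data))  # 2
```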