Commit e2f19b9
minor polish
1 parent 936c7a8

1 file changed: source/reading.md (+34 −29 lines)
@@ -1241,32 +1241,33 @@ variable—which we then parse using `BeautifulSoup` and store in the

`page` variable. Next, we pass the CSS selectors we obtained from
SelectorGadget to the `select` method of the `page` object. Make sure to
surround the selectors with quotation marks; `select` expects that argument to be
a string. We store the result of the `select` function in the `population_nodes`
variable. Note that `select` returns a list; below we slice the list to
print only the first 5 elements for clarity.

```{code-cell} ipython3
population_nodes = page.select(
    "td:nth-child(8) , td:nth-child(6) , td:nth-child(4) , .mw-parser-output div td:nth-child(2)"
)
population_nodes[:5]
```
Each of the items in the `population_nodes` list is a *node* from the HTML document that matches the CSS
selectors you specified. A *node* is an HTML tag pair (e.g., `<td>` and `</td>`
which defines the cell of a table) combined with the content stored between the
tags. For our CSS selector `td:nth-child(6)`, an example node that would be
selected would be:

```html
<td style="text-align:left;">
<a href="/wiki/London,_Ontario" title="London, Ontario">London</a>
</td>
```
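The idea of selecting nodes can be tried on a small standalone snippet. The sketch below is an illustration only: the HTML string is a tiny made-up fragment, not the full Wikipedia page stored in the `page` variable above.

```python
from bs4 import BeautifulSoup

# A tiny hand-made HTML fragment for illustration only; in the chapter,
# the `page` variable holds the parsed Wikipedia article instead.
html = """
<table><tr>
  <td style="text-align:left;">
    <a href="/wiki/London,_Ontario" title="London, Ontario">London</a>
  </td>
  <td>422324</td>
</tr></table>
"""
snippet = BeautifulSoup(html, "html.parser")

# select returns a list of matching nodes (tag pairs plus their contents)
nodes = snippet.select("td")
print(len(nodes))                   # 2 nodes match the "td" selector
print(nodes[0].get_text().strip())  # the text content of the first node
```

Here the simple selector `"td"` matches both cells; the more elaborate selectors from SelectorGadget work the same way, just with stricter matching rules.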

Next, we extract the meaningful data&mdash;in other words, we get rid of the
HTML code syntax and tags&mdash;from the nodes using the `get_text` function.
In the case of the example node above, the `get_text` function returns `"London"`.
Once again we show only the first 5 elements for clarity.

```{code-cell} ipython3
[row.get_text() for row in population_nodes[:5]]
```
@@ -1291,8 +1292,8 @@ Using `requests` and `BeautifulSoup` to extract data based on CSS selectors is

a very general way to scrape data from the web, albeit perhaps a little bit
complicated. Fortunately, `pandas` provides the
[`read_html`](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html)
function, which is an easier method to try when the data
appear on the webpage already in a tabular format. The `read_html` function takes one
argument&mdash;the URL of the page to scrape&mdash;and will return a list of
data frames corresponding to all the tables it finds at that URL. We can see
below that `read_html` found 17 tables on the Wikipedia page for Canada.
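As a rough sketch of how `read_html` behaves, here it is applied to a small hand-written HTML table (rather than the live Wikipedia page) so that the example is self-contained:

```python
from io import StringIO
import pandas as pd

# A minimal hand-made HTML table; read_html scans the HTML for <table>
# elements and returns one data frame per table it finds.
html = StringIO("""
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>London</td><td>422324</td></tr>
  <tr><td>Halifax</td><td>403131</td></tr>
</table>
""")
tables = pd.read_html(html)
print(len(tables))  # 1 table found in this fragment
print(tables[0])
```

Passing the real Wikipedia URL instead of the `StringIO` fragment works the same way, except that many tables are found and you must pick out the one you want from the returned list.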
@@ -1358,8 +1359,8 @@ The James Webb Space Telescope's NIRCam image of the Rho Ophiuchi molecular cloud

+++

First, you will need to visit the [NASA APIs page](https://api.nasa.gov/) and generate an API key.
Note that a valid email address is required to
associate with the key. The signup form looks something like {numref}`fig:NASA-API-signup`.
After filling out the basic information, you will receive the token via email.
Make sure to store the key in a safe place, and keep it private.
@@ -1400,7 +1401,7 @@ That should be more than enough for our purposes in this section.

#### Accessing the NASA API

The NASA API is what is known as an *HTTP API*: this is a particularly common
kind of API, where you can obtain data simply by accessing a
particular URL as if it were a regular website. To make a query to the NASA
API, we need to specify three things. First, we specify the URL *endpoint* of
the API, which is simply a URL that helps the remote server understand which
@@ -1422,7 +1423,7 @@ along with syntax, default settings, and a description of each.

So for example, to obtain the image of the day
from July 13, 2023, the API query would have two parameters: `api_key=YOUR_API_KEY`
and `date=2023-07-13`.

```
https://api.nasa.gov/planetary/apod?api_key=YOUR_API_KEY&date=2023-07-13
```
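The query string is just URL-encoded key–value pairs joined by `&`, so it can also be assembled programmatically. A minimal sketch using Python's standard library (with the same placeholder key as above):

```python
from urllib.parse import urlencode

endpoint = "https://api.nasa.gov/planetary/apod"
params = {
    "api_key": "YOUR_API_KEY",  # placeholder; substitute your own key
    "date": "2023-07-13",
}
# urlencode joins the parameters with & and percent-encodes any
# characters that are not safe to appear in a URL
query_url = endpoint + "?" + urlencode(params)
print(query_url)
# https://api.nasa.gov/planetary/apod?api_key=YOUR_API_KEY&date=2023-07-13
```

Building the URL this way avoids typos in hand-written query strings, and scales nicely when a query has many parameters.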
@@ -1481,7 +1482,8 @@ with open("data/nasa.json", "r") as f:

nasa_data[-1]
```

We can obtain more records at once by using the `start_date` and `end_date` parameters, as
shown in the table of parameters in {numref}`fig:NASA-API-parameters`.
Let's obtain all the records between May 1, 2023, and July 13, 2023, and store the result
in an object called `nasa_data`; now the response
will take the form of a Python list, with one dictionary item similar to the above
@@ -1500,10 +1502,13 @@ len(nasa_data)

```

For further data processing using the techniques in this book, you'll need to turn this list of dictionaries
into a `pandas` data frame. Here we will extract the `date`, `title`, `copyright`, and `url` variables
from the JSON data, and construct a `pandas` DataFrame using the extracted information.

```{note}
Understanding this code is not required for the remainder of the textbook. It is included for those
readers who would like to parse JSON data into a `pandas` data frame in their own data analyses.
```

```{code-cell} ipython3
data_dict = {
```
@@ -1522,15 +1527,15 @@ nasa_df = pd.DataFrame(data_dict)

nasa_df
```
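The same list-of-dictionaries-to-data-frame pattern can be sketched in a self-contained way. The two records below are made up to mimic the shape of APOD JSON entries, and `.get` is used because some records may lack a field (for example, `copyright`):

```python
import pandas as pd

# Two made-up records standing in for items of nasa_data
records = [
    {"date": "2023-05-01", "title": "A", "copyright": "X",
     "url": "http://example.com/a"},
    {"date": "2023-05-02", "title": "B",
     "url": "http://example.com/b"},  # no copyright field
]

# Build one column per variable of interest; .get fills missing keys with None
data_dict = {
    key: [rec.get(key) for rec in records]
    for key in ["date", "title", "copyright", "url"]
}
df = pd.DataFrame(data_dict)
print(df.shape)  # (2, 4)
```

Missing values surface as `NaN`/`None` in the resulting data frame, which `pandas` handles gracefully in later processing.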

Success&mdash;we have created a small data set using the NASA
API! This data is also quite different from what we obtained from web scraping;
the extracted information is readily available in a JSON format, as opposed to raw
HTML code (although not *every* API will provide data in such a nice format).
From this point onward, the `nasa_df` data frame is stored on your
machine, and you can play with it to your heart's content. For example, you can use
`pandas.to_csv` to save it to a file and `pandas.read_csv` to read it into Python again later;
and after reading the next few chapters you will have the skills to
do even more interesting things! If you decide that you want
to ask any of the various NASA APIs for more data
(see [the list of awesome NASA APIs here](https://api.nasa.gov/)
for more examples of what is possible), just be mindful as usual about how much
