@@ -1241,32 +1241,33 @@ variable—which we then parse using `BeautifulSoup` and store in the
`page` variable. Next, we pass the CSS selectors we obtained from
SelectorGadget to the `select` method of the `page` object. Make sure to
surround the selectors with quotation marks; `select` expects that argument to be
- a string. The method then selects *nodes* from the HTML document that match the CSS
+ a string. We store the result of the `select` function in the `population_nodes`
+ variable. Note that `select` returns a list; below we slice the list to
+ print only the first 5 elements for clarity.
+
+ ```{code-cell} ipython3
+ population_nodes = page.select(
+     "td:nth-child(8) , td:nth-child(6) , td:nth-child(4) , .mw-parser-output div td:nth-child(2)"
+ )
+ population_nodes[:5]
+ ```
+
+ Each of the items in the `population_nodes` list is a *node* from the HTML document that matches the CSS
selectors you specified. A *node* is an HTML tag pair (e.g., `<td>` and `</td>`,
which defines the cell of a table) combined with the content stored between the
tags. For our CSS selector `td:nth-child(6)`, an example node that would be
selected is:

```html
- <td style="text-align:left;background:#f0f0f0;">
+ <td style="text-align:left;">
<a href="/wiki/London,_Ontario" title="London, Ontario">London</a>
</td>
```

- We store the result of the `select` function in the `population_nodes`
- variable. Note that it returns a list; we slice the list to only print the
- first 5 elements.
-
- ```{code-cell} ipython3
- population_nodes = page.select(
-     "td:nth-child(8) , td:nth-child(6) , td:nth-child(4) , .mw-parser-output div td:nth-child(2)"
- )
- population_nodes[:5]
- ```
-
- Next we extract the meaningful data&mdash;in other words, we get rid of the
+ Next, we extract the meaningful data&mdash;in other words, we get rid of the
HTML code syntax and tags&mdash;from the nodes using the `get_text` function.
In the case of the example node above, the `get_text` function returns `"London"`.
+ Once again we show only the first 5 elements for clarity.

```{code-cell} ipython3
[row.get_text() for row in population_nodes[:5]]
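To see what `get_text` does in isolation, the example `<td>` node above can be parsed on its own. This is a minimal sketch, not part of the chapter's pipeline; it assumes the `bs4` package is installed and reuses the HTML string from the example node:

```python
from bs4 import BeautifulSoup

# the example node from above, as a standalone HTML string
html = (
    '<td style="text-align:left;">'
    '<a href="/wiki/London,_Ontario" title="London, Ontario">London</a>'
    '</td>'
)

# parse the string and select the <td> node
node = BeautifulSoup(html, "html.parser").select("td")[0]

# get_text strips all tags and returns only the text content
node.get_text()  # returns "London"
```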
@@ -1291,8 +1292,8 @@ Using `requests` and `BeautifulSoup` to extract data based on CSS selectors is
a very general way to scrape data from the web, albeit perhaps a little bit
complicated. Fortunately, `pandas` provides the
[`read_html`](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html)
- function, which is easier method to try when you know the data are tabular, and
- appear on the webpage as an HTML table. The `read_html` function takes one
+ function, which is an easier method to try when the data
+ appear on the webpage already in a tabular format. The `read_html` function takes one
argument&mdash;the URL of the page to scrape&mdash;and will return a list of
data frames corresponding to all the tables it finds at that URL. We can see
below that `read_html` found 17 tables on the Wikipedia page for Canada.
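To see how `read_html` behaves without fetching a live page, here is a minimal sketch that parses an inline HTML table instead of a URL; the table contents are made up for illustration, and parsing HTML requires a parser backend (e.g., `lxml`) to be installed:

```python
from io import StringIO
import pandas as pd

# a tiny HTML document containing a single table (illustrative data only)
html = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Toronto</td><td>2794356</td></tr>
  <tr><td>London</td><td>422324</td></tr>
</table>
"""

# read_html returns a list of data frames, one per table it finds
tables = pd.read_html(StringIO(html))
len(tables)  # 1
tables[0]    # data frame with City and Population columns
```

Note that the `<th>` cells in the first row are automatically used as the column names of the resulting data frame.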
@@ -1358,8 +1359,8 @@ The James Webb Space Telescope's NIRCam image of the Rho Ophiuchi molecular clou

+++

- First, you will need to visit the [NASA APIs page](https://api.nasa.gov/) and generate an API key
- if you do not already have one. Note that a valid email address is required to
+ First, you will need to visit the [NASA APIs page](https://api.nasa.gov/) and generate an API key.
+ Note that a valid email address is required to
associate with the key. The signup form looks something like {numref}`fig:NASA-API-signup`.
After filling out the basic information, you will receive the token via email.
Make sure to store the key in a safe place, and keep it private.
@@ -1400,7 +1401,7 @@ That should be more than enough for our purposes in this section.

#### Accessing the NASA API

The NASA API is what is known as an *HTTP API*: this is a particularly common
- (and simple!) kind of API, where you can obtain data simply by accessing a
+ kind of API, where you can obtain data simply by accessing a
particular URL as if it were a regular website. To make a query to the NASA
API, we need to specify three things. First, we specify the URL *endpoint* of
the API, which is simply a URL that helps the remote server understand which
@@ -1422,7 +1423,7 @@ along with syntax, default settings, and a description of each.

So for example, to obtain the image of the day
from July 13, 2023, the API query would have two parameters: `api_key=YOUR_API_KEY`
- and `date=2023-07-13`:
+ and `date=2023-07-13`.
```
https://api.nasa.gov/planetary/apod?api_key=YOUR_API_KEY&date=2023-07-13
```
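A query URL like this can also be assembled programmatically rather than by hand. Here is a minimal sketch using only the Python standard library; the endpoint and parameter names come from the NASA documentation, and `YOUR_API_KEY` is a placeholder for a real key:

```python
from urllib.parse import urlencode

# the APOD endpoint and the two query parameters from the example above
endpoint = "https://api.nasa.gov/planetary/apod"
params = {"api_key": "YOUR_API_KEY", "date": "2023-07-13"}

# urlencode joins parameters with & and separates names from values with =
url = endpoint + "?" + urlencode(params)
url  # 'https://api.nasa.gov/planetary/apod?api_key=YOUR_API_KEY&date=2023-07-13'
```

Building the URL this way also takes care of escaping any characters in parameter values that are not allowed to appear literally in a URL.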
@@ -1481,7 +1482,8 @@ with open("data/nasa.json", "r") as f:
nasa_data[-1]
```

- We can obtain more records at once by using the `start_date` and `end_date` parameters.
+ We can obtain more records at once by using the `start_date` and `end_date` parameters, as
+ shown in the table of parameters in {numref}`fig:NASA-API-parameters`.
Let's obtain all the records between May 1, 2023, and July 13, 2023, and store the result
in an object called `nasa_data`; now the response
will take the form of a Python list, with one dictionary item similar to the above
@@ -1500,10 +1502,13 @@ len(nasa_data)
```

For further data processing using the techniques in this book, you'll need to turn this list of dictionaries
- into a `pandas` data frame.
- these items For the demonstration purpose, let's only use a
- few variables of interest: `created_at`, `user.screen_name`, `retweeted`,
- and `full_text`, and construct a `pandas` DataFrame using the extracted information.
+ into a `pandas` data frame. Here we will extract the `date`, `title`, `copyright`, and `url` variables
+ from the JSON data, and construct a `pandas` DataFrame using the extracted information.
+
+ ```{note}
+ Understanding this code is not required for the remainder of the textbook. It is included for those
+ readers who would like to parse JSON data into a `pandas` data frame in their own data analyses.
+ ```

```{code-cell} ipython3
data_dict = {
@@ -1522,15 +1527,15 @@ nasa_df = pd.DataFrame(data_dict)
nasa_df
```

- Success! We have created a small data set using the NASA
+ Success&mdash;we have created a small data set using the NASA
API! This data is also quite different from what we obtained from web scraping;
- the extracted information can be easily converted into a `pandas` data frame
- (although not *every* API will provide data in such a nice format).
+ the extracted information is readily available in a JSON format, as opposed to raw
+ HTML code (although not *every* API will provide data in such a nice format).
From this point onward, the `nasa_df` data frame is stored on your
machine, and you can play with it to your heart's content. For example, you can use
`pandas.to_csv` to save it to a file and `pandas.read_csv` to read it into Python again later;
and after reading the next few chapters you will have the skills to
- do even more interesting things. If you decide that you want
+ do even more interesting things! If you decide that you want
to ask any of the various NASA APIs for more data
(see [the list of awesome NASA APIs here](https://api.nasa.gov/)
for more examples of what is possible), just be mindful as usual about how much