Python is a powerful programming language that can be used for a variety of tasks, including analyzing data from a CSV file. We'll go over how to use Python to import data and run an analysis on it. We'll be using the <a href="https://pandas.pydata.org/" target="_blank">pandas</a> library, a popular data analysis tool for Python.
<a href="https://www.amazon.com/gp/bestsellers/books" target="_blank">Amazon Best Sellers</a> are updated every hour. The actual list is made of 100 books, but the data we're working with features just the top 50 books. 📖
## The Dataset
In this tutorial, we will work with a CSV (comma-separated values) file that features some fun data about the top 50 best selling books on Amazon from 2009 to 2019 (provided by <a href="https://www.kaggle.com/datasets/sootersaalu/amazon-top-50-bestselling-books-2009-2019?resource=download" target="_blank">Kaggle</a>).
**Note**: If you don't have a Kaggle account, you can also download it <a href="https://github.com/codedex-io/projects/blob/main/projects/analyze-spreadsheet-data-with-pandas-chatgpt/amazon-best-sellers-analysis/bestsellers.csv" target="_blank">here</a> in our GitHub:
<img src="https://raw.githubusercontent.com/codedex-io/projects/main/projects/analyze-spreadsheet-data-with-pandas-chatgpt/file_download_btn_github.png" alt="File Download Button on GitHub" />
The **.csv** file contains 550 books. Here are the seven columns:

- **Name**: the title of the book
- **Author**: the author of the book
- **User Rating**: the book's Amazon user rating (out of 5)
- **Reviews**: the number of written user reviews
- **Price**: the price of the book (in US dollars)
- **Year**: the year the book appeared on the best sellers list
- **Genre**: whether the book is fiction or non-fiction
## Step 2: Import pandas and Load the Spreadsheet
Next, we need to import the pandas library and load the data into our Python program.
Download the **bestsellers.csv** file and add it to **amazon-best-sellers-analysis**, the same folder that contains your **main.py** file.
To read CSV files, we'll use the `.read_csv()` function provided by pandas. Then we will save this data to a new `df` variable:
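A minimal sketch of that step, assuming **bestsellers.csv** sits next to **main.py** as described above:

```py
import pandas as pd

# Read the spreadsheet into a DataFrame
df = pd.read_csv('bestsellers.csv')
```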
Using methods from our `df` DataFrame object, we can get a glimpse of which authors have the most books on the Amazon Best Sellers list.
This can be done by selecting the `'Author'` column data and using the `value_counts()` method. We can assign this to an `author_counts` variable:
```py
author_counts = df['Author'].value_counts()
```

Congratulations! We've made it to the end of the tutorial! 🎊
We were able to harness the power of Python libraries like pandas to analyze data from a CSV file. Specifically, we did the following:
- Imported book data about the top 50 books on Amazon from 2009 to 2019.
- Explored and cleaned the data with DataFrame methods.
- Exported the modified data to a new CSV file.
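That last step, as a rough sketch (the output file name here is just an assumption):

```py
# Export the cleaned DataFrame to a new CSV file, dropping the index column
df.to_csv('bestsellers_clean.csv', index=False)
```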
View the full source for this project <a href="https://github.com/codedex-io/projects/blob/main/projects/analyze-spreadsheet-data-with-pandas-chatgpt/amazon-best-sellers-analysis/main.py" target="_blank">here</a>.
Also, check out further resources to learn more about data analysis with Python.
---

No matter where you are on your journey to mastering data science, it's always helpful to practice the basics of finding, cleaning, and analyzing real-world datasets. Back in 2020, COVID-19 sent many of us into quarantine, and while its long-term impact is still relatively unknown, we can reference a handful of public datasets to begin to scratch the surface.
In this project tutorial, we'll be analyzing a dataset gathered from the 2022 [U.S. Census](https://data.census.gov/) covering geographic relocation roughly two years after quarantine.
<RoundedImage
link="https://i.imgur.com/QSycenX.gif"
description="U.S. Census Data Analysis"
/>
We will begin to test our assumptions and answer some basic questions about various demographic groups using SciPy, NumPy, Pandas, and some basic working knowledge of statistics.
The questions include:
- Is there a difference in mobility patterns between those that moved within their home state versus across state lines, in New York and California in particular?
- And do trends vary amongst citizenship status?
- Is there a difference in those same patterns amongst educational status between the Northeast (New Jersey, Pennsylvania, Rhode Island, Vermont, etc.) and the South (Georgia, Maryland, Virginia, etc.)?
- What about marital status across conservative divisions like the South Atlantic (Washington D.C., Georgia, Florida, North Carolina, etc.) and the Mountain States (Colorado, Wyoming, Nevada, Arizona, etc.)? Do we notice a difference in geographic mobility there as well?
As you can see below, the original data provided by [census.gov](https://data.census.gov/) doesn't arrive in an analysis-ready format:
<RoundedImage
link="https://i.imgur.com/uvbRfkQ.png"
description="U.S. Census Data Analysis"
/>
<RoundedImage
link="https://i.imgur.com/nxdFv8j.png"
description="U.S. Census Data Analysis"
/>
When this happens, it's helpful to have some basic data preparation skills. While this isn't typically a requirement for using the SciPy package or conducting basic statistical analysis, you can look at each step we took to clean and structure the data by referencing the source code [here](https://colab.research.google.com/drive/1ujk1u0TWqlNolFwv9-rUNMjaghZuLLZK).
## About the Clean Datasets
The source code cranks out multiple categories of the same data, including information on the total population in 2022:
- those that moved within the same county and/or state
- those that moved between states
For the categories listed, each dataset contains the following columns:
<RoundedImage
link="https://i.imgur.com/dzkXTSC.gif"
description="U.S. Census Data Analysis"
/>
### Geographical Data
- **Geography ID**: a unique identifier used to reference specific geographic areas
- **Census Tract**: a small, relatively permanent subdivision of a county
- **State**: the state in which the Census Tract is located
- **County**: the county within the state in which the Census Tract resides
- **Region**: the broader geographic area in which the state or county is located, typically referring to one of four major regions: Northeast, Midwest, South, or West
When conducting an exploratory analysis, we first want to make sure that our data is suitable for the statistical tests we plan to run.
Generally speaking, most data science models abide by what we call parametric assumptions, which refer to normal distribution of a fixed set of parameters. In our particular case, those parameters include, but are not limited to, the columns we listed above. The three parametric assumptions are independence, normality, and homogeneity of variances.
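As a quick illustration (not part of the original analysis), SciPy can test two of these assumptions directly on simulated data:

```python
from scipy import stats
import numpy as np

# Simulated samples purely for illustration
rng = np.random.default_rng(42)
sample_a = rng.normal(loc=10, scale=2, size=100)
sample_b = rng.normal(loc=10, scale=2, size=100)

# Shapiro-Wilk tests the normality assumption for a single sample
print(stats.shapiro(sample_a))

# Levene's test checks homogeneity of variances across groups
print(stats.levene(sample_a, sample_b))
```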
Additionally, traditional **A/B testing** typically utilizes one of two methods: either a **chi-squared test** (which looks for dependence between two categorical variables) or a **t-test** (which looks for a statistically significant difference between the averages of two groups) to validate what we refer to as the null hypothesis (the assumption that there is no relationship or difference between two patterns of behavior).
For this tutorial, we'll be running t-tests.
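Here's the general shape of a t-test in SciPy, using made-up numbers rather than the census data:

```python
from scipy import stats

# Two hypothetical groups of observations
group_a = [12.1, 11.8, 12.5, 12.0, 11.9]
group_b = [12.8, 13.1, 12.9, 13.4, 12.7]

# Independent two-sample t-test
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("t-statistic:", t_stat)
print("p-value:", p_value)  # a small p-value suggests a significant difference in means
```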
## Getting Started
To get started, you'll need the following [datasets](https://drive.google.com/drive/folders/1xO33dvJV_RySl77y2W-7lxIvBW7PUoEg?usp=sharing) and a copy of [this Google Colab notebook](https://colab.research.google.com/drive/1GWiNXPVuRTORqEBNFV7zpTGZD_yeprNt?usp=sharing).
Feel free to manually upload the CSVs to the notebook if you don't already see them:
<RoundedImage
link="https://i.imgur.com/Iz1PLIY.png"
description="U.S. Census Data Analysis"
/>
First, we'll import the necessary packages.
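The Colab notebook has the authoritative list, but at a minimum that means something like:

```python
import pandas as pd
import numpy as np
from scipy import stats
```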
```python
# `v` is the path to one of the census CSVs (defined earlier in the notebook)
variant = pd.read_csv(v)
# variant.head()
```
## Let's Explore
Let's begin by manually creating an empty dataframe (table) based on each level of detail (County, State, Division, and Region) listed by the U.S. Census.
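As a sketch, that could look like the following (the table names are assumptions that mirror the four levels):

```python
# One empty table per level of detail
county = pd.DataFrame()
state = pd.DataFrame()
division = pd.DataFrame()
region = pd.DataFrame()
```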
```python
state["Relocated Between States"] = variant.groupby("State")["Total Population"].sum()
state.head()
```
Comparing California residents to those from New York only, **is there a significant difference in mobility between those that relocated within the same** area (in this case, state) **versus those that moved across state lines?**
We'll use the `.loc[]` method to search for the two states and extract the summed values that we calculated in the exercise above.
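A sketch of that lookup, assuming the grouped table is indexed by state name as in the snippet above:

```python
# Pull each state's total for a side-by-side comparison
ca = state.loc["California", "Relocated Between States"]
ny = state.loc["New York", "Relocated Between States"]
print(ca, ny)
```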
The p-value is much higher in this instance, suggesting that we can be only 62% certain that there was a difference in mobility amongst immigrants between the two states.
Now what about when comparing U.S. citizens only?
```python
cny3 = pd.DataFrame()
cny3["Total U.S. Citizens (Native)"] = d.groupby("State")["Total US Citizens (Native)"].sum()
```

So what have we learned? We've learned that:
- No, there does not appear to be a difference in those same patterns amongst educational status between the Northeast (New Jersey, Pennsylvania, Rhode Island, Vermont, etc.) and the South (Georgia, Maryland, Virginia, D.C., etc.).
- No, there also does not appear to be a difference across marital status for conservative divisions like the South Atlantic (Washington D.C., Georgia, Florida, North Carolina, etc.) and the Mountain States (Colorado, Wyoming, Nevada, Arizona, etc.) either.
Why does this matter? It matters because it demonstrates that there's actually a sound and scientific method for answering these questions when they come up. Feel free to try your hand at doing the same the next time you run into an interesting dataset! Or, consider ways you can examine how mobility influences local economies, or even how it impacts the environment.