Commit 071b618

update requisites
1 parent 0924a8e commit 071b618

File tree

44 files changed

+1028
-1714
lines changed


projects/50-terminal-project-ideas-using-python/50-terminal-project-ideas-using-python.mdx

Lines changed: 5 additions & 0 deletions
@@ -7,6 +7,11 @@ description: A list of 50 terminal project ideas to implement in your choice of
 published: live
 header: https://raw.githubusercontent.com/codedex-io/projects/main/projects/50-terminal-project-ideas-using-python/header.png
 bannerImage: https://raw.githubusercontent.com/codedex-io/projects/main/projects/50-terminal-project-ideas-using-python/header.png
+prerequisites: Python fundamentals
+versions: Python 3.10
+courses:
+  - python
+readTime: 20
 tags:
   - beginner
   - python

projects/analyze-spreadsheet-data-with-pandas-chatgpt/analyze-spreadsheet-data-with-pandas-chatgpt.mdx

Lines changed: 27 additions & 35 deletions
@@ -7,52 +7,40 @@ description: Learn how to import and analyze Amazon data with pandas, a Python l
 published: live
 header: https://raw.githubusercontent.com/codedex-io/projects/main/projects/analyze-spreadsheet-data-with-pandas-chatgpt/header.png
 bannerImage: https://raw.githubusercontent.com/codedex-io/projects/main/projects/analyze-spreadsheet-data-with-pandas-chatgpt/header.png
+readTime: 60
+prerequisites: Python fundamentals
+versions: Python 3.9.6, pandas 2.0.1
+courses:
+  - python
 tags:
   - intermediate
   - python
 ---
 
-<BannerImage
-  link="https://raw.githubusercontent.com/codedex-io/projects/main/projects/analyze-spreadsheet-data-with-pandas-chatgpt/header.png"
-  description="Title Image"
-  uid={true}
-  cl="for-sidebar"
-/>
-
-# Analyze Best Selling Amazon Books with Pandas
-
-<AuthorAvatar
-  author_name="Grace Peters"
-  author_avatar="/images/projects/authors/grace-peters-chatgpt.png"
-  username="gracepeters"
-  uid={true}
-/>
-
-<BannerImage
-  link="https://raw.githubusercontent.com/codedex-io/projects/main/projects/analyze-spreadsheet-data-with-pandas-chatgpt/header.png"
-  description="Title Image"
-  uid={true}
-/>
-
-**Prerequisites:** Python
-**Versions:** Python 3.9.6, pandas 2.0.1
-**Read Time:** 60 minutes
-
 ## Introduction
 
 Python is a powerful programming language that can be used for a variety of tasks, including analyzing data from a CSV file. We'll go over how to use Python to import data and run an analysis on it. We'll be using the <a href="https://pandas.pydata.org/" target="_blank">pandas</a> library, a popular data analysis tool for Python.
 
-<a href="https://www.amazon.com/gp/bestsellers/books" target="_blank">Amazon Best Sellers</a> are updated every hour. The actual list is made of 100 books, but the data we're working with features just the top 50 books. 📖
+<a href="https://www.amazon.com/gp/bestsellers/books" target="_blank">
+  Amazon Best Sellers
+</a> are updated every hour. The actual list is made of 100 books, but the data we're
+working with features just the top 50 books. 📖
 
 ## The Dataset
 
 In this tutorial, we will work with a CSV (comma-separated values) file that features some fun data about the top 50 best selling books on Amazon from 2009 to 2019 (provided by <a href="https://www.kaggle.com/datasets/sootersaalu/amazon-top-50-bestselling-books-2009-2019?resource=download" target="_blank">Kaggle</a>).
 
-<img src="https://raw.githubusercontent.com/codedex-io/projects/main/projects/analyze-spreadsheet-data-with-pandas-chatgpt/best-sellers-csv-data.png" alt="Best seller CSV data" />
+<img
+  src="https://raw.githubusercontent.com/codedex-io/projects/main/projects/analyze-spreadsheet-data-with-pandas-chatgpt/best-sellers-csv-data.png"
+  alt="Best seller CSV data"
+/>
 
 **Note**: If you don't have a Kaggle account, you can also download it <a href="https://github.com/codedex-io/projects/blob/main/projects/analyze-spreadsheet-data-with-pandas-chatgpt/amazon-best-sellers-analysis/bestsellers.csv" target="_blank">here</a> in our GitHub:
 
-<img src="https://raw.githubusercontent.com/codedex-io/projects/main/projects/analyze-spreadsheet-data-with-pandas-chatgpt/file_download_btn_github.png" alt="File Download Button on GitHub" />
+<img
+  src="https://raw.githubusercontent.com/codedex-io/projects/main/projects/analyze-spreadsheet-data-with-pandas-chatgpt/file_download_btn_github.png"
+  alt="File Download Button on GitHub"
+/>
 
 The **.csv** file contains 550 books. Here are the seven columns:

@@ -91,7 +79,7 @@ import pandas as pd
 ## Step 2: Import pandas and Load the Spreadsheet
 
 Next, we need to import the pandas library and load the data into our Python program.
-Download the **bestsellers.csv** file and add it to the same folder as your **main.py** file, **amazon-best-sellers-analysis**.
+Download the **bestsellers.csv** file and add it to the same folder as your **main.py** file, **amazon-best-sellers-analysis**.
 
 To read CSV files, we'll use the `.read_csv()` function provided by pandas. Then we will save this data to a new `df` variable:

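The loading step above (its code block is elided between hunks) can be sketched as follows. This is a minimal, self-contained sketch: an in-memory sample stands in for **bestsellers.csv**, and the sample rows and column names are invented for illustration (the real file has 550 rows across seven columns).

```py
import io
import pandas as pd

# Hypothetical stand-in for bestsellers.csv so the snippet runs on its own
sample_csv = io.StringIO(
    "Name,Author,User Rating,Reviews,Price,Year,Genre\n"
    "Book A,Jane Doe,4.7,17350,8,2016,Non Fiction\n"
    "Book B,Jane Doe,4.6,21424,6,2017,Fiction\n"
    "Book C,John Roe,4.8,5013,12,2018,Fiction\n"
)

# In the project this would be: df = pd.read_csv('bestsellers.csv')
df = pd.read_csv(sample_csv)

# Count how many rows each author has, most frequent first
# (the tutorial applies this same pattern later)
author_counts = df['Author'].value_counts()
print(author_counts.head())
```

With the real file in place, only the `read_csv()` argument changes; the rest of the analysis is identical.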
@@ -199,7 +187,7 @@ Here are a few examples:
 
 Using methods from our `df` DataFrame object, we can get a glimpse of which authors have the most books on the Amazon Best Sellers list.
 
-This can be done by selecting the `'Author'` column data and using the `value_counts()` method. We can assign this to an `author_counts` variable:
+This can be done by selecting the `'Author'` column data and using the `value_counts()` method. We can assign this to an `author_counts` variable:
 
 ```py
 author_counts = df['Author'].value_counts()
@@ -266,13 +254,17 @@ Congratulations! We've made it to the end of the tutorial! 🎊
 
 We were able to harness the power of Python libraries like pandas to analyze data from a CSV file. Specifically, we did the following:
 
-- Imported book data about the top 50 books on Amazon from 2009 to 2019.
+- Imported book data about the top 50 books on Amazon from 2009 to 2019.
 - Explored and cleaned the data with DataFrame methods.
-- Exported the modified data to a new CSV file.
+- Exported the modified data to a new CSV file.
 
 View the full source for this project <a href="https://github.com/codedex-io/projects/blob/main/projects/analyze-spreadsheet-data-with-pandas-chatgpt/amazon-best-sellers-analysis/main.py" target="_blank">here</a>.
 
 Also, check out the following resources to learn more about data analysis with Python:
 
-- <a href="https://pandas.pydata.org/docs/" target="_blank">pandas documentation</a>
-- <a href="https://dataanalysispython.readthedocs.io/en/latest" target="_blank">Data Analysis in Python (Read the Docs)</a>
+- <a href="https://pandas.pydata.org/docs/" target="_blank">
+  pandas documentation
+  </a>
+- <a href="https://dataanalysispython.readthedocs.io/en/latest" target="_blank">
+  Data Analysis in Python (Read the Docs)
+  </a>
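The export step summarized above ("exported the modified data to a new CSV file") can be sketched with `DataFrame.to_csv()`. The dataframe contents and the output file name here are illustrative, not taken from the tutorial.

```py
import pandas as pd

# Small stand-in dataframe (contents are invented)
df = pd.DataFrame({"Name": ["Book A"], "Author": ["Jane Doe"]})

# index=False keeps pandas' row index out of the output file
df.to_csv("bestsellers_clean.csv", index=False)

# Read it back to confirm the round trip
check = pd.read_csv("bestsellers_clean.csv")
print(check.shape)
```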

projects/analyze-us-census-data-with-scipy/analyze-us-census-data-with-scipy.mdx

Lines changed: 18 additions & 41 deletions
@@ -6,54 +6,31 @@ datePublished: 2025-01-13
 published: live
 description: Learn how to analyze U.S. census data with SciPy
 header: https://firebasestorage.googleapis.com/v0/b/codedex-io.appspot.com/o/assets%2Findex%2F12423.png?alt=media&token=721aaaaa-f431-438e-bd19-c3f6a97afb41
+readTime: 45
+prerequisites: <a href="/python">Python</a>, <a href="/numpy">NumPy</a>, SciPy
+versions: Python 3
 tags:
   - intermediate
   - python
 ---
 
-<BannerImage
-  link="https://firebasestorage.googleapis.com/v0/b/codedex-io.appspot.com/o/assets%2Findex%2F12423.png?alt=media&token=721aaaaa-f431-438e-bd19-c3f6a97afb41"
-  description="Title Image"
-  uid={true}
-  cl="for-sidebar"
-/>
-
-# Analyze U.S. Census Data with SciPy
-
-<AuthorAvatar
-  author_name=""
-  author_avatar="/images/projects/authors/"
-  username=""
-  uid={true}
-/>
-
-<BannerImage
-  link="https://firebasestorage.googleapis.com/v0/b/codedex-io.appspot.com/o/assets%2Findex%2F12423.png?alt=media&token=721aaaaa-f431-438e-bd19-c3f6a97afb41"
-  description="Banner"
-  uid={true}
-/>
-
-**Prerequisites**: Python, NumPy, SciPy
-**Version**: Python 3
-**Read Time**: 45 minutes
-
 ## Introduction
 
-No matter where you are on your journey to mastering data science, it's always helpful to practice the basics of finding, cleaning, and analyzing real-world datasets. Back in 2020, COVID-19 sent us many of us into quarantine and while its long-term impact is still relatively unknown, we can reference a handful of public datasets to begin to scratch the surface.
+No matter where you are on your journey to mastering data science, it's always helpful to practice the basics of finding, cleaning, and analyzing real-world datasets. Back in 2020, COVID-19 sent us many of us into quarantine and while its long-term impact is still relatively unknown, we can reference a handful of public datasets to begin to scratch the surface.
 
 In this project tutorial, we'll be analyzing a dataset gathered from the 2022 [U.S. Census](https://data.census.gov/) covering geographic relocation roughly two years after quarantine.
 
 <RoundedImage
   link="https://i.imgur.com/QSycenX.gif"
-  description="U.S. Census Data Analysis"
+  description="U.S. Census Data Analysis"
 />
 
 We will begin to test our assumptions and answer some basic questions about various demographic groups using SciPy, NumPy, Pandas, and some basic working knowledge of statistics.
 
 The questions include:
 
 - Is there a difference in mobility patterns between those that moved within their home state versus across states lines in New York and California in particular?
-- And do trends vary amongst citizenship status?
+- And do trends vary amongst citizenship status?
 - Is there a difference in those same patterns amongst educational status between the Northeast (New Jersey, Pennsylvania, Rhode Island, Vermont, etc.) and the South (Georgia, Maryland, Virginia, etc.)?
 - What about marital status across conservative divisions like the South Atlantic (Washington D.C., Georgia, Florida, North Carolina, etc.) and the Mountain States (Colorado, Wyoming, Nevada, Arizona, etc.)? Do we notice a difference in geographic mobility there as well?

@@ -63,19 +40,18 @@ As you can see below, the original data provided by [census.gov](https://data.ce
 
 <RoundedImage
   link="https://i.imgur.com/uvbRfkQ.png"
-  description="U.S. Census Data Analysis"
+  description="U.S. Census Data Analysis"
 />
 <RoundedImage
   link="https://i.imgur.com/nxdFv8j.png"
-  description="U.S. Census Data Analysis"
+  description="U.S. Census Data Analysis"
 />
 
-
 When this happens, it's helpful to have some basic data preparation skills. While this isn't typically a requirement for using the SciPy package or conducting basic statistical analysis, you can look at each step we took to clean and structure the data by referencing the source code [here](https://colab.research.google.com/drive/1ujk1u0TWqlNolFwv9-rUNMjaghZuLLZK).
 
 ## About the Clean Datasets
 
-The source code cranks out multiple categories of the same data, including information on the total population in 2022:
+The source code cranks out multiple categories of the same data, including information on the total population in 2022:
 
 - those that moved within the same county and/or state
 - those that moved between states
@@ -85,13 +61,13 @@ For the categories listed, each dataset contains the following columns, which ar
 
 <RoundedImage
   link="https://i.imgur.com/dzkXTSC.gif"
-  description="U.S. Census Data Analysis"
+  description="U.S. Census Data Analysis"
 />
 
 ### Geographical Data
 
 - **Geography ID**: a unique identifier used to reference specific geographic areas
-- **Census Tract**: a small, relatively permanent subdivision of a county
+- **Census Tract**: a small, relatively permanent subdivision of a county
 - **State**: the state in which the Census Tract is located
 - **County**: the county within the state in which the Census Tract resides
 - **Region**: the broader geographic area in which the state or county is located, typically referring to one of four major regions: Northeast, Midwest, South, or West
@@ -128,11 +104,10 @@ When conducting an exploratory analysis, we first want to make sure that our dat
 
 Generally speaking, most data science models abide by what we call parametric assumptions, which refer to normal distribution of a fixed set of parameters. In our particular case, those parameters include, but are not limited to, the columns we listed above. The three parametric assumptions are independence, normality, and homogeneity of variances.
 
-Additionally, traditional **A/B testing** typically utilizes one of two methods: either a **chi-squared** (which looks for dependence between two categorical variables) or a **t-test** (which looks for a statistically significant difference between the averages of two groups) to validate what we refer to as the null hypothesis (which is the assumption that there is no relationship or comparison between two patterns of behavior).
+Additionally, traditional **A/B testing** typically utilizes one of two methods: either a **chi-squared** (which looks for dependence between two categorical variables) or a **t-test** (which looks for a statistically significant difference between the averages of two groups) to validate what we refer to as the null hypothesis (which is the assumption that there is no relationship or comparison between two patterns of behavior).
 
 For this tutorial, we'll be running t-tests.
 
-
 ## Getting Started
 
 To get started, you'll need the following [datasets](https://drive.google.com/drive/folders/1xO33dvJV_RySl77y2W-7lxIvBW7PUoEg?usp=sharing) and a copy of [this Google Colab notebook](https://colab.research.google.com/drive/1GWiNXPVuRTORqEBNFV7zpTGZD_yeprNt?usp=sharing).
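The t-test workflow this tutorial relies on can be sketched with `scipy.stats.ttest_ind`, the two-sample (independent) t-test. The numbers below are invented for illustration and are not census data.

```python
import numpy as np
from scipy import stats

# Two hypothetical samples of relocation counts for two groups
group_a = np.array([120, 135, 110, 150, 128])
group_b = np.array([118, 132, 115, 145, 130])

# Test against the null hypothesis that the two group means are equal
t_stat, p_value = stats.ttest_ind(group_a, group_b)

print("t-statistic:", t_stat)
print("p-value:", p_value)

# A large p-value (e.g. > 0.05) means we fail to reject the null
# hypothesis, i.e. no significant difference between the group means
```

This is the same call the notebook uses later, just applied to toy arrays instead of the grouped census columns.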
@@ -141,7 +116,7 @@ Feel free to manually upload the CSVs to the notebook if you don't already see t
 
 <RoundedImage
   link="https://i.imgur.com/Iz1PLIY.png"
-  description="U.S. Census Data Analysis"
+  description="U.S. Census Data Analysis"
 />
 
 First we'll begin by importing the necessary packages:
@@ -167,7 +142,6 @@ variant = pd.read_csv(v)
 # variant.head()
 ```
 
-
 ## Let's Explore
 
 Let's begin by manually creating an empty dataframe (table) based on each level of detail (County, State, Division, and Region) listed by the U.S. Census.
@@ -190,7 +164,7 @@ state["Relocated Between States"] = variant.groupby("State")["Total Population"]
 state.head()
 ```
 
-Comparing California residents to those from New York only, **is there a significant difference in mobility between those that relocated within the same** area (in this case, state) **versus those that moved across state lines?**
+Comparing California residents to those from New York only, **is there a significant difference in mobility between those that relocated within the same** area (in this case, state) **versus those that moved across state lines?**
 
 We'll use the `.loc[]` method to search for the two states and extract the summed values that we calculated in the exercise above.
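The groupby-then-lookup pattern in this hunk can be sketched on a toy frame. The state names mirror the tutorial, but the population counts are invented stand-ins for the census `variant` dataset.

```python
import pandas as pd

# Toy stand-in for the census dataframe
variant = pd.DataFrame({
    "State": ["California", "California", "New York", "New York"],
    "Total Population": [100, 50, 80, 40],
})

# Sum population by state, as in the tutorial's groupby step
state = pd.DataFrame()
state["Relocated Between States"] = variant.groupby("State")["Total Population"].sum()

# .loc[] with isin() extracts just the two states of interest
cny = state.loc[state.index.isin(["California", "New York"])]
print(cny)
```

`groupby("State")` makes the state names the index of the result, which is why the later `.loc[...isin(...)]` lookups filter on `state.index` rather than a column.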

@@ -237,6 +211,7 @@ print("p-value:", p_value)
 The p-value is much higher in this instance, suggesting that we can be only 62% certain that there was a difference in mobility amongst immigrants between the two states.
 
 Now what about when comparing U.S. citizens only?
+
 ```python
 cny3 = pd.DataFrame()
 cny3["Total U.S. Citizens (Native)"] = d.groupby("State")["Total US Citizens (Native)"].sum()
@@ -267,6 +242,7 @@ region["Bachelor's Degree"] = control.groupby("Region")["Bachelor's Degree"].sum
 nem = region.loc[region.index.isin(["Northeast", "South"])]
 # nem
 ```
+
 ```python
 t_stat, p_value = stats.ttest_ind(nem["High School Graduate (or its Equivalency)"], nem["Bachelor's Degree"])
 

@@ -285,6 +261,7 @@ division["Married"] = control.groupby("Division")["Married"].sum()
 sam = division.loc[division.index.isin(["South Atlantic", "Mountain"])]
 # sam
 ```
+
 ```python
 t_stat, p_value = stats.ttest_ind(sam["Never Married"], sam["Married"])
 

@@ -309,7 +286,7 @@ So what have we learned?? We've learned that:
 - No, there does not appear to be a difference in those same patterns amongst educational status between the Northeast (New Jersey, Pennsylvania, Rhode Island, Vermont, etc.) and the South (Georgia, Maryland, Virginia, D.C., etc.).
 - No, there also does not appear to be a difference across marital status for conservative divisions like the South Atlantic (Washington D.C., Georgia, Florida, North Carolina, etc.) and the Mountain States (Colorado, Wyoming, Nevada, Arizona, etc.) either.
 
-Why does this matter? It matters because it demonstrates that there's actually a sound and scientific method for answering these questions when they come up. Feel free to try your hand at doing the same the next time you run into an interesting dataset! Or, consider ways you can examine how mobility influences local economies, or even how it impacts the environment.
+Why does this matter? It matters because it demonstrates that there's actually a sound and scientific method for answering these questions when they come up. Feel free to try your hand at doing the same the next time you run into an interesting dataset! Or, consider ways you can examine how mobility influences local economies, or even how it impacts the environment.
 
 Thanks for coding with us!

0 commit comments
