|
| 1 | +--- |
| 2 | +title: Analyze U.S. Census Data with SciPy |
| 3 | +author: |
| 4 | +uid: |
| 5 | +# datePublished: |
| 6 | +published: false |
| 7 | +description: |
| 8 | +header: https://raw.githubusercontent.com/codedex-io/projects/main/projects/analyze-us-census-data-with-scipy/header.gif |
| 9 | +tags: |
| 10 | + - intermediate |
| 11 | + - python |
| 12 | +--- |
| 13 | + |
| 14 | +<BannerImage |
| 15 | + link="https://raw.githubusercontent.com/codedex-io/projects/main/projects/analyze-us-census-data-with-scipy/header.gif" |
| 16 | + description="Title Image" |
| 17 | + uid={true} |
| 18 | + cl="for-sidebar" |
| 19 | +/> |
| 20 | + |
| 21 | +# Analyze U.S. Census Data with SciPy |
| 22 | + |
| 23 | +<AuthorAvatar |
| 24 | + author_name="" |
| 25 | + author_avatar="/images/projects/authors/" |
| 26 | + username="" |
| 27 | + uid={true} |
| 28 | +/> |
| 29 | + |
| 30 | +<BannerImage |
| 31 | + link="https://raw.githubusercontent.com/codedex-io/projects/main/projects/analyze-us-census-data-with-scipy/header.gif" |
| 32 | + description="Banner" |
| 33 | + uid={true} |
| 34 | +/> |
| 35 | + |
| 36 | +**Prerequisites**: Intermediate Python, SciPy |
| 37 | +**Version**: Python 3 |
| 38 | +**Read Time**: X minutes |
| 39 | + |
| 40 | +## Introduction |
| 41 | + |
| 42 | +No matter where you are on your journey to mastering data science, it's always helpful to practice the basics of finding, cleaning, and analyzing real-world datasets. Back in 2020, COVID-19 sent many of us into quarantine, and while its long-term impact is still relatively unknown, we can reference a handful of public datasets to begin to scratch the surface. |
| 43 | + |
| 44 | +In this tutorial, we'll be analyzing a dataset gathered from the 2022 [U.S. Census](https://data.census.gov/) covering geographic relocation roughly two years after quarantine. |
| 45 | + |
| 46 | +<RoundedImage |
| 47 | + link="https://i.imgur.com/QSycenX.gif" |
| 48 | + description="U.S. Census Data Analysis" |
| 49 | +/> |
| 50 | + |
| 51 | +We will begin to test our assumptions and answer some basic questions about various demographic groups using SciPy, NumPy, Pandas, and some basic working knowledge of statistics, including the following: |
| 52 | + |
| 53 | +- Is there a difference in mobility patterns between those who moved within their home state versus across state lines in New York and California in particular? |
| 54 | + - And do trends vary amongst citizenship status? |
| 55 | +- Is there a difference in those same patterns amongst educational status between the Northeast (New Jersey, Pennsylvania, Rhode Island, Vermont, etc.) and the South (Georgia, Maryland, Virginia, D.C., etc.)? |
| 56 | +- What about marital status across conservative divisions like the South Atlantic (Washington D.C., Georgia, Florida, North Carolina, etc.) and the Mountain States (Colorado, Wyoming, Nevada, Arizona, etc.)? Do we notice a difference in geographic mobility there as well? |
| 57 | + |
| 58 | +## Cleaning Raw Data |
| 59 | + |
| 60 | +As you can see below, the original data provided by census.gov contains two separate CSVs, one with the raw data and another with metadata that contains details of what each column represents. |
| 61 | + |
| 62 | +<RoundedImage |
| 63 | + link="https://i.imgur.com/uvbRfkQ.png" |
| 64 | + description="U.S. Census Data Analysis" |
| 65 | +/> |
| 66 | +<RoundedImage |
| 67 | + link="https://i.imgur.com/nxdFv8j.png" |
| 68 | + description="U.S. Census Data Analysis" |
| 69 | +/> |
| 70 | + |
| 71 | + |
| 72 | +When this happens, it's helpful to have some basic data preparation skills. While this isn't typically a requirement for using the SciPy package or conducting basic statistical analysis, you can look at each step we took to clean and structure the data by referencing the source code [here](https://colab.research.google.com/drive/1ujk1u0TWqlNolFwv9-rUNMjaghZuLLZK). |
| 73 | + |
| 74 | +## About the Clean Datasets |
| 75 | + |
| 76 | +The source code cranks out multiple categories of the same data, including information on the total population in 2022: |
| 77 | + |
| 78 | +- those that moved within the same county and/or state |
| 79 | +- those that moved between states |
| 80 | +- those who moved from abroad. |
| 81 | + |
| 82 | +For the categories listed, each dataset contains the following columns, which are all characterized by the levels of detail noted below: |
| 83 | + |
| 84 | +<RoundedImage |
| 85 | + link="https://i.imgur.com/dzkXTSC.gif" |
| 86 | + description="U.S. Census Data Analysis" |
| 87 | +/> |
| 88 | + |
| 89 | +### Geographical Data |
| 90 | + |
| 91 | +Geography ID: a unique identifier used to reference specific geographic areas |
| 92 | + |
| 93 | +Census Tract: a small, relatively permanent subdivision of a county |
| 94 | + |
| 95 | +State: the state in which the Census Tract is located |
| 96 | + |
| 97 | +County: the county within the state in which the Census Tract resides |
| 98 | + |
| 99 | +Region: the broader geographic area in which the state or county is located, typically referring to one of four major regions: Northeast, Midwest, South, or West |
| 100 | + |
| 101 | +Division: a sub-region within a Census Bureau-defined region, used for more detailed geographic analysis |
| 102 | + |
| 103 | +Total Population: the total number of people residing in a specific Census Tract |
| 104 | + |
| 105 | +### Citizenship Status |
| 106 | + |
| 107 | +Total U.S. Citizens (Native): the total number of individuals who are U.S. citizens by birth |
| 108 | + |
| 109 | +Total U.S. Citizens (Naturalized): the total number of individuals who have obtained U.S. citizenship through the naturalization process after being born in another country |
| 110 | + |
| 111 | +Total Non-Citizens: the total number of individuals who are not U.S. citizens, including legal immigrants, visa holders, and undocumented individuals |
| 112 | + |
| 113 | +### Marital Status |
| 114 | + |
| 115 | +Married: the total number of individuals who are legally married at the time of the census |
| 116 | + |
| 117 | +Never Married: the total number of individuals who have never been legally married |
| 118 | + |
| 119 | +Separated: the total number of individuals who are legally married but currently living apart from their spouse due to marital separation |
| 120 | + |
| 121 | +Divorced: the total number of individuals who have been legally divorced |
| 122 | + |
| 123 | +Widowed: the total number of individuals who have lost their spouse and have not remarried |
| 124 | + |
| 125 | +### Educational Attainment |
| 126 | + |
| 127 | +Less than a High School Graduate: the total number of individuals who have not completed high school or its equivalent |
| 128 | + |
| 129 | +High School Graduate (or its Equivalency): the total number of individuals who have completed high school or obtained an equivalent diploma, such as a GED |
| 130 | + |
| 131 | +Some College or Associate's Degree: the total number of individuals who have attended college or earned an Associate's Degree but have not completed a Bachelor's Degree |
| 132 | + |
| 133 | +Bachelor's Degree: the total number of individuals who have earned a Bachelor's Degree, typically after completing four years of undergraduate education at a university or college |
| 134 | + |
| 135 | +Graduate or Professional Degree: the total number of individuals who have earned a Master's Degree, Doctoral Degree (Ph.D.), or other professional degrees such as a Law Degree (J.D.) or Medical Degree (M.D.) |
| 136 | + |
| 137 | +In this tutorial, we'll use SciPy to run some analysis and find out whether there are statistically significant differences in relocation patterns for each group - but first, let’s review the basics. |
| 138 | + |
| 139 | +## Some Basic Stats |
| 140 | + |
| 141 | +When conducting an exploratory analysis, we first want to make sure that our data abides by the [underlying assumptions](https://www.pythonfordatascience.org/parametric-assumptions-python/) of whatever method we use. This is extremely important because it can make or break the credibility of our work in the long term, and that’s a huge no-no for the data science industry. |
| 142 | + |
| 143 | +Generally speaking, many statistical tests abide by what we call parametric assumptions, meaning they assume the data follow a known distribution (typically the normal distribution) described by a fixed set of parameters such as the mean and variance. In our particular case, the data being tested include, but are not limited to, the columns we listed above. The three parametric assumptions are independence, normality, and homogeneity of variances. |
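
Two of the three assumptions can be checked numerically. The sketch below is illustrative only: it uses synthetic samples rather than the Census data, and the names `group_a` and `group_b` are hypothetical. It applies SciPy's Shapiro-Wilk test for normality and Levene's test for equal variances. Independence cannot be tested this way; it depends on how the data were collected.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for two groups of counts (NOT the Census data).
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=100, scale=15, size=200)
group_b = rng.normal(loc=105, scale=15, size=200)

# Normality: Shapiro-Wilk's null hypothesis is that the sample is normally distributed.
shapiro_a = stats.shapiro(group_a)
shapiro_b = stats.shapiro(group_b)

# Homogeneity of variances: Levene's null hypothesis is that the variances are equal.
levene_result = stats.levene(group_a, group_b)

print("Shapiro p-values:", shapiro_a.pvalue, shapiro_b.pvalue)
print("Levene p-value:", levene_result.pvalue)
```

In both tests a small p-value is evidence against the assumption, so a large p-value here is the reassuring outcome.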
| 144 | + |
| 145 | +Additionally, traditional A/B testing typically uses one of two methods: a chi-squared test (which looks for dependence between two categorical variables) or a t-test (which looks for a statistically significant difference between the averages of two groups). Either one tests what we refer to as the null hypothesis, which is the assumption that there is no relationship or difference between the two groups being compared. |
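
To make the distinction concrete, here is a minimal sketch of both tests on made-up numbers (none of these values come from the Census files). `stats.chi2_contingency` takes a table of counts for categorical variables, while `stats.ttest_ind` takes two samples of a numeric measurement.

```python
import numpy as np
from scipy import stats

# Chi-squared: a 2x2 table of made-up counts
# (rows: two groups, columns: moved / did not move).
observed = np.array([[30, 70],
                     [45, 55]])
chi2, chi2_p, dof, expected = stats.chi2_contingency(observed)

# t-test: two made-up samples of a numeric measurement.
sample_a = np.array([12.1, 11.8, 13.0, 12.6, 11.9, 12.4])
sample_b = np.array([13.2, 13.5, 12.9, 13.8, 13.1, 13.4])
t_stat, t_p = stats.ttest_ind(sample_a, sample_b)

print("chi-squared p-value:", chi2_p)
print("t-test p-value:", t_p)
```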
| 146 | + |
| 147 | +For this tutorial, we'll be running t-tests. |
| 148 | + |
| 149 | +## Getting Started |
| 150 | + |
| 151 | +To get started, you'll need the following [datasets](https://drive.google.com/drive/folders/1xO33dvJV_RySl77y2W-7lxIvBW7PUoEg?usp=sharing) and a copy of [this Google Colab notebook](https://colab.research.google.com/drive/1GWiNXPVuRTORqEBNFV7zpTGZD_yeprNt?usp=sharing). |
| 152 | + |
| 153 | +Feel free to manually upload the CSVs to the notebook if you don't already see them embedded in your copy. |
| 154 | + |
| 155 | +<RoundedImage |
| 156 | + link="https://i.imgur.com/Iz1PLIY.png" |
| 157 | + description="U.S. Census Data Analysis" |
| 158 | +/> |
| 159 | + |
| 160 | +First we'll begin by importing the necessary packages: |
| 161 | + |
| 162 | +```python |
| 163 | +import pandas as pd |
| 164 | +import numpy as np |
| 165 | +from scipy import stats |
| 166 | +``` |
| 167 | + |
| 168 | +Next, we'll load the CSV files to their own dataframes using pandas. |
| 169 | + |
| 170 | +```python |
| 171 | +c = "/content/moved_same_state.csv" |
| 172 | +v = "/content/moved_between_states.csv" |
| 173 | +``` |
| 174 | +```python |
| 175 | +control = pd.read_csv(c) |
| 176 | +variant = pd.read_csv(v) |
| 177 | + |
| 178 | +#control.head() |
| 179 | +#variant.head() |
| 180 | +``` |
| 181 | + |
| 182 | +## Let's Explore |
| 183 | + |
| 184 | +Let's begin by manually creating an empty dataframe (table) based on each level of detail (County, State, Division, and Region) listed by the U.S. Census. |
| 185 | + |
| 186 | +```python |
| 187 | +county = pd.DataFrame() |
| 188 | +state = pd.DataFrame() |
| 189 | +division = pd.DataFrame() |
| 190 | +region = pd.DataFrame() |
| 191 | +``` |
| 192 | + |
| 193 | +Now let's complete a simple pandas exercise. Sum the total number of people that relocated within the U.S. in both the control and variant groups at the state level. |
| 194 | + |
| 195 | +<RoundedImage |
| 196 | + link="https://i.imgur.com/8LlsAhS.png" |
| 197 | + description="U.S. Census Data Analysis" |
| 198 | +/> |
| 199 | + |
| 200 | + |
| 201 | +```python |
| 202 | +state["Relocated Within State"] = control.groupby("State")["Total Population"].sum() |
| 203 | +state["Relocated Between States"] = variant.groupby("State")["Total Population"].sum() |
| 204 | + |
| 205 | +state.head() |
| 206 | +``` |
| 207 | +Comparing California residents to those from New York only, **is there a significant difference in mobility between those who relocated within the same area (in this case, state) versus those who moved across state lines?** |
| 208 | + |
| 209 | +We'll use the `.loc` indexer to look up the two states and extract the summed values that we calculated in the exercise above. |
| 210 | +```python |
| 211 | +cny = state.loc[["California", "New York"]] |
| 212 | + |
| 213 | +cny |
| 214 | +``` |
| 215 | +<RoundedImage |
| 216 | + link="https://i.imgur.com/IR9CX8c.png" |
| 217 | + description="U.S. Census Data Analysis" |
| 218 | +/> |
| 219 | + |
| 220 | + |
| 221 | +```python |
| 222 | +t_stat, p_value = stats.ttest_ind(cny["Relocated Within State"], cny["Relocated Between States"]) |
| 223 | + |
| 224 | +print("t-statistic:", t_stat) |
| 225 | +print("p-value:", p_value) |
| 226 | +``` |
| 227 | + |
| 228 | +A p-value of ~0.20 means that, if there were truly no difference between within-state and between-state relocation totals for these two states, we would still see a gap at least this large about 20% of the time. That is suggestive, but it is not the same as being 80% certain that a difference exists. |
| 229 | + |
| 230 | +The common threshold for statistical significance is a p-value below 0.05, which corresponds to a 95% confidence level, so this result does not qualify as statistically significant. |
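
The threshold comparison can be sketched as a tiny helper. The function name `is_significant` and its `alpha` parameter are hypothetical, introduced here only for illustration; they are not part of SciPy.

```python
def is_significant(p_value, alpha=0.05):
    """Return True when the p-value falls below the chosen significance level."""
    return p_value < alpha


print(is_significant(0.20))  # False: 0.20 is above the 0.05 threshold
print(is_significant(0.03))  # True: 0.03 is below the 0.05 threshold
```

Passing a different `alpha` (for example `0.01`) applies a stricter threshold without changing the rest of the analysis.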
| 231 | + |
| 232 | +Comparing prior residents only (meaning those that moved within state lines), was there a significant difference in mobility amongst immigrants between the two states? |
| 233 | + |
| 234 | +This time, instead of summing the values before we index the states we want to look at, we'll filter the dataset so that we are running calculations on only the specific categories that we want to test. |
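
As a rough sketch of that filter-then-aggregate pattern, the example below uses a toy frame with made-up values in place of the real `control` dataset. Note that the boolean mask is built from the same frame it is used to index; a mask taken from a different frame would be aligned on a different set of rows.

```python
import pandas as pd

# A toy frame standing in for the control dataset (values are made up).
toy = pd.DataFrame({
    "State": ["California", "New York", "Texas", "California", "New York"],
    "Total Non-Citizens": [10, 20, 30, 40, 50],
})

# Build the mask from the same frame it will index, then filter.
mask = toy["State"].isin(["California", "New York"])
filtered = toy[mask]

# Aggregate only the rows that survived the filter.
totals = filtered.groupby("State")["Total Non-Citizens"].sum()
print(totals)
```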
| 235 | + |
| 236 | +```python |
| 237 | +d = control[(control["State"] == "California") | (control["State"] == "New York")] |
| 238 | + |
| 239 | +cny2 = pd.DataFrame() |
| 240 | +cny2["Total U.S. Citizens (Naturalized)"] = d.groupby("State")["Total US Citizens (Naturalized)"].sum() |
| 241 | +cny2["Total Non-Citizens"] = d.groupby("State")["Total Non-Citizens"].sum() |
| 242 | + |
| 243 | +cny2 |
| 244 | +``` |
| 245 | + |
| 246 | +```python |
| 247 | +t_stat, p_value = stats.ttest_ind(cny2["Total U.S. Citizens (Naturalized)"], cny2["Total Non-Citizens"]) |
| 248 | + |
| 249 | +print("t-statistic:", t_stat) |
| 250 | +print("p-value:", p_value) |
| 251 | +``` |
| 252 | + |
| 253 | +The p-value is much higher in this instance (roughly 0.38), so the evidence for a difference in mobility amongst immigrants between the two states is even weaker. |
| 254 | + |
| 255 | +Now what about when comparing U.S. citizens only? |
| 256 | +```python |
| 257 | +cny3 = pd.DataFrame() |
| 258 | +cny3["Total U.S. Citizens (Native)"] = d.groupby("State")["Total US Citizens (Native)"].sum() |
| 259 | +cny3["Total U.S. Citizens (Naturalized)"] = cny2["Total U.S. Citizens (Naturalized)"] |
| 260 | + |
| 261 | +cny3 |
| 262 | +``` |
| 263 | + |
| 264 | +```python |
| 265 | +t_stat, p_value = stats.ttest_ind(cny3["Total U.S. Citizens (Native)"], cny3["Total U.S. Citizens (Naturalized)"]) |
| 266 | + |
| 267 | +print("t-statistic:", t_stat) |
| 268 | +print("p-value:", p_value) |
| 269 | +``` |
| 270 | + |
| 271 | +The p-value is even higher in this instance (roughly 0.65), so we cannot reject the null hypothesis: the data show no statistically significant difference in domestic mobility between the two states amongst U.S. citizens. |
| 272 | + |
| 273 | +## Additional Questions |
| 274 | + |
| 275 | +When comparing the Northeast to the South, is there a difference between the total number of high school graduates that relocated compared to those with bachelor's degrees? |
| 276 | + |
| 277 | +From this point on, we'll reuse the methods from the above section. |
| 278 | + |
| 279 | +```python |
| 280 | +region["High School Graduate (or its Equivalency)"] = control.groupby("Region")["High School Graduate (or its Equivalency)"].sum() |
| 281 | +region["Bachelor's Degree"] = control.groupby("Region")["Bachelor's Degree"].sum() |
| 282 | + |
| 283 | +nem = region.loc[region.index.isin(["Northeast", "South"])] |
| 284 | +#nem |
| 285 | +``` |
| 286 | +```python |
| 287 | +t_stat, p_value = stats.ttest_ind(nem["High School Graduate (or its Equivalency)"], nem["Bachelor's Degree"]) |
| 288 | + |
| 289 | +print("t-statistic:", t_stat) |
| 290 | +print("p-value:", p_value) |
| 291 | +``` |
| 292 | + |
| 293 | +Lastly, let's compare marital status in more conservative divisions like the South Atlantic (Washington D.C., Georgia, Florida, North Carolina, etc.) and the Mountain States (Colorado, Wyoming, Nevada, Arizona, etc.). |
| 294 | + |
| 295 | +Amongst the control group (meaning those that moved within state lines), did those who have never married relocate more often than those who are married? |
| 296 | + |
| 297 | +```python |
| 298 | +division["Never Married"] = control.groupby("Division")["Never Married"].sum() |
| 299 | +division["Married"] = control.groupby("Division")["Married"].sum() |
| 300 | + |
| 301 | +sam = division.loc[division.index.isin(["South Atlantic", "Mountain"])] |
| 302 | +#sam |
| 303 | +``` |
| 304 | +```python |
| 305 | +t_stat, p_value = stats.ttest_ind(sam["Never Married"], sam["Married"]) |
| 306 | + |
| 307 | +print("t-statistic:", t_stat) |
| 308 | +print("p-value:", p_value) |
| 309 | +``` |
| 310 | + |
| 311 | +Now answer the same exact question at the county level using two counties that you know of and follow the same formula as above. |
| 312 | + |
| 313 | +```python |
| 314 | +county["Never Married"] = control.groupby("County")["Never Married"].sum() |
| 315 | +county["Married"] = control.groupby("County")["Married"].sum() |
| 316 | + |
| 317 | +#home = county.loc[county.index.isin(["Your Home county", "Home County 2"])] |
| 318 | +``` |
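
One possible way to finish the exercise is sketched below. It assumes a toy frame in place of the real `control` data, and the county names ("County A", "County B", "County C") are placeholders, not real entries from the CSV. Keep in mind that with only two values per group, as in the state-level comparisons, the t-test has very little statistical power.

```python
import pandas as pd
from scipy import stats

# Toy stand-in for the control dataset; county names and counts are made up.
toy_control = pd.DataFrame({
    "County": ["County A", "County A", "County B", "County B", "County C"],
    "Never Married": [120, 80, 60, 90, 40],
    "Married": [200, 150, 110, 140, 70],
})

# Sum both marital-status columns per county in one step.
toy_county = toy_control.groupby("County")[["Never Married", "Married"]].sum()

# Keep only the two counties being compared.
home = toy_county.loc[toy_county.index.isin(["County A", "County B"])]

t_stat, p_value = stats.ttest_ind(home["Never Married"], home["Married"])

print("t-statistic:", t_stat)
print("p-value:", p_value)
```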
| 319 | + |
| 320 | +## Conclusion |
| 321 | + |
| 322 | +So what have we learned? We've learned that: |
| 323 | + |
| 324 | +- There may be a difference in mobility patterns between those who moved within their home state versus across state lines in New York and California, but with a p-value of roughly 0.20 it falls short of the 0.05 threshold for statistical significance. Citizenship status seems to bear little to no relationship to those patterns. |
| 325 | +- No, there does not appear to be a difference in those same patterns amongst educational status between the Northeast (New Jersey, Pennsylvania, Rhode Island, Vermont, etc.) and the South (Georgia, Maryland, Virginia, D.C., etc.). |
| 326 | +- No, there also does not appear to be a difference across marital status for conservative divisions like the South Atlantic (Washington D.C., Georgia, Florida, North Carolina, etc.) and the Mountain States (Colorado, Wyoming, Nevada, Arizona, etc.) either. |
| 327 | + |
| 328 | +Why does this matter? It matters because it demonstrates that there's actually a sound and scientific method for answering these questions when they come up. Feel free to try your hand at doing the same the next time you run into an interesting dataset. |
| 329 | + |
| 330 | +Thanks for coding with us! |