NSAPH-Data-Processing/us_census_time_series_extractor

US Census Bureau Time Series Extractor

Introduction

This repository streamlines the extraction of time series from American Community Survey 5-Year Data (ACS5), American Community Survey 1-Year Data (ACS1) and Decennial Census (SF1).

Data Description

American Community Survey 5-Year Data (ACS5)

  • Time Coverage: 2007 - 2018
  • Population: All 50 states, the District of Columbia, Puerto Rico, and other U.S. territories.
  • Geographical Coverage: nation, all states (including DC and Puerto Rico), all metropolitan areas, all congressional districts (116th Congress), all counties, all places, all tracts and block groups.
  • ZCTA coverage: available
  • Data Source: The American Community Survey (ACS) is an ongoing survey that provides data every year -- giving communities the current information they need to plan investments and services. The ACS covers a broad range of topics about social, economic, demographic, and housing characteristics of the U.S. population.

American Community Survey 1-Year Data (ACS1)

  • Time Coverage: 2005 - 2020
  • Population: All 50 states, the District of Columbia, Puerto Rico, and other U.S. territories.
  • Geographical Coverage: available for the nation, all 50 states, the District of Columbia, Puerto Rico, every congressional district, every metropolitan area, and all counties and places with populations of 65,000 or more.
  • No ZCTA coverage
  • Data Source: The American Community Survey (ACS) is an ongoing survey that provides data every year -- giving communities the current information they need to plan investments and services. The ACS covers a broad range of topics about social, economic, demographic, and housing characteristics of the U.S. population.

Decennial Census

  • Time Coverage: 2000 and 2010
  • Geographical Coverage: Summary File 1 (SF 1) is released as individual files for each of the 50 states, the District of Columbia, and Puerto Rico, and for the United States.
  • Data Source: Summary File 1 (SF 1) contains the data compiled from the questions asked of all people and about every housing unit. Population items include sex, age, race, Hispanic or Latino origin, household relationship, household type, household size, family type, family size, and group quarters.
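The coverage listed above can be captured in a small lookup, which is handy for rejecting out-of-range requests before calling the API. This is a minimal sketch: the `COVERAGE` dictionary and `is_covered` function are hypothetical names, not part of this repository, and the year ranges are copied from the descriptions above.

```python
# Coverage figures copied from the Data Description section.
# The structure and names are illustrative, not the repository's own.
COVERAGE = {
    "acs5": {"years": range(2007, 2019), "zcta": True},
    "acs1": {"years": range(2005, 2021), "zcta": False},
    "dec": {"years": (2000, 2010), "zcta": None},  # ZCTA coverage not stated above
}


def is_covered(survey: str, year: int) -> bool:
    """Return True if `year` falls within the documented coverage of `survey`."""
    if survey not in COVERAGE:
        raise ValueError(f"unknown survey: {survey!r}")
    return year in COVERAGE[survey]["years"]


print(is_covered("acs5", 2018))
print(is_covered("dec", 2005))
```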

Data Dictionary

| Column Name | Description |
| --- | --- |
| survey | Type of survey. Takes one of the following values: dec, acs1, acs5 |
| year | Year of the variable estimates |
| geographic level | ID of the geography at a given geographic level. The column takes the name of the geographic level (zcta, county, state) |
| Various sociodemographic columns | Columns containing variables related to sociodemographic characteristics, such as population distribution by age group, ethnic composition, housing statistics, and other demographic and socioeconomic variables. The column name is the name of the variable |
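To illustrate how these columns fit together, the sketch below pulls one time series out of a few rows shaped like the data dictionary. The sample rows, the `population` column name, and the `series_for` helper are all invented for illustration; the real output files may differ in format and variable names.

```python
import csv
import io

# Hypothetical county-level rows; the variable column and all values are made up.
SAMPLE = """survey,year,county,population
acs5,2017,01001,100
acs5,2018,01001,110
acs1,2018,01001,105
"""


def series_for(rows, survey, geo_col, geo_id, variable):
    """Return {year: value} for one geography and one survey, ordered by year."""
    selected = [r for r in rows if r["survey"] == survey and r[geo_col] == geo_id]
    selected.sort(key=lambda r: int(r["year"]))
    return {int(r["year"]): float(r[variable]) for r in selected}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
series = series_for(rows, "acs5", "county", "01001", "population")
print(series)  # {2017: 100.0, 2018: 110.0}
```

Filtering on `survey` matters because the same year and geography can appear once per survey.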

Repository Content

The repository contains:

Data Lineage

  • Data Source: The primary data source for this project is the American Community Survey 5-Year Data (ACS5), which is publicly available and maintained by the U.S. Census Bureau.

  • Extraction: We leverage the Census API to efficiently extract data.

  • Processing & Final Dataset: We transform the subset of variables obtained from the API and generate datasets of selected sociodemographic concepts.
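As a rough illustration of the extraction step, the sketch below builds a Census Data API query URL without sending any request. The `build_census_url` helper and `DATASET_PATHS` mapping are hypothetical and not taken from this repository. `B01001_001E` (total population) is used only as an example variable, and the API's geography names may differ from the `geo_type` values used by the pipeline.

```python
from urllib.parse import urlencode

# Dataset path segments of the public Census Data API for the three surveys.
DATASET_PATHS = {"acs5": "acs/acs5", "acs1": "acs/acs1", "dec": "dec/sf1"}


def build_census_url(survey, year, variables, geo_type, api_key=None):
    """Build a query URL requesting `variables` for every geography of `geo_type`."""
    params = {"get": ",".join(["NAME", *variables]), "for": f"{geo_type}:*"}
    if api_key:
        params["key"] = api_key
    # Keep commas, colons and asterisks readable instead of percent-encoding them.
    query = urlencode(params, safe=",:*")
    return f"https://api.census.gov/data/{year}/{DATASET_PATHS[survey]}?{query}"


url = build_census_url("acs5", 2018, ["B01001_001E"], "county")
print(url)
```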

Processing Rules

ACS5 estimates summarize data collected over a 5-year period, so there is no single obvious year to attach to each value. We assign each ACS5 estimate the last year of its 5-year period; for example, the 2014-2018 estimates are labeled 2018.
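The year-assignment rule can be written as a one-line mapping. The helper below is a hypothetical illustration of that rule, not the repository's implementation.

```python
def acs5_label_year(period: str) -> int:
    """Map a 5-year period such as '2014-2018' to its label year (the last year)."""
    start, end = (int(part) for part in period.split("-"))
    if end - start != 4:
        raise ValueError(f"not a 5-year period: {period!r}")
    return end


print(acs5_label_year("2014-2018"))  # 2018
```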

American Community Survey 1-Year Data and Hispanic Variable

Hispanic data was incorporated starting with the 2009 ACS 1-year estimates and is therefore not available for the years 2005-2008 in the ACS 1-Year series.
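A consumer of the ACS1 series may want to skip the years without Hispanic data. The sketch below encodes only the rule stated above; the function and constant names are hypothetical.

```python
# First ACS 1-year release that includes Hispanic data, per the rule above.
FIRST_ACS1_YEAR_WITH_HISPANIC = 2009


def acs1_hispanic_available(year: int) -> bool:
    """Return True if ACS1 Hispanic variables exist for `year`."""
    return year >= FIRST_ACS1_YEAR_WITH_HISPANIC


usable_years = [y for y in range(2005, 2012) if acs1_hispanic_available(y)]
print(usable_years)  # [2009, 2010, 2011]
```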

Run

Pipeline

You can run the pipeline steps manually or run the snakemake pipeline described in the Snakefile.

Run the pipeline steps manually to fetch a single variable

The census variable codes that are used to create the time series are defined in the yaml file stored in conf/variables/<config_yaml>. The data paths attached to the pipeline are defined in conf/datapaths/<config_yaml>.

Set the key params in the config.

For example:

  - datapaths: datapaths
  - variables: core

```bash
# Create and activate the conda environment
conda env create -f requirements.yml
conda activate census_series

# Set your Census API key
export CENSUS_API_KEY='your_api_key_here'

# Create the data directory paths
python src/create_datapaths.py

# Execute the main script
PYTHONPATH=. python src/fetch_variables.py variable=pop_native geo_type=county survey=acs1
```
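Because the steps above depend on the `CENSUS_API_KEY` environment variable, it can help to fail early when the key is missing or still set to the placeholder. The helper below is a hypothetical sketch of such a check; how the repository's scripts actually read the key is not shown here.

```python
import os


def get_census_api_key(env=None):
    """Return the Census API key from the environment, or raise if it is unset."""
    env = os.environ if env is None else env
    key = env.get("CENSUS_API_KEY", "").strip()
    # Treat the placeholder from the instructions as "not set".
    if not key or key == "your_api_key_here":
        raise RuntimeError("CENSUS_API_KEY is not set; export it before running the pipeline")
    return key


# Passing a mapping explicitly makes the check easy to try without touching os.environ.
print(get_census_api_key({"CENSUS_API_KEY": "example-key"}))
```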

Run snakemake pipeline

The census variable codes that are used to create the time series are defined in the yaml file stored in conf/variables/<config_yaml>. The data paths attached to the pipeline are defined in conf/datapaths/<config_yaml>.

Set the key params in the config.

For example:

  - datapaths: core_cannon
  - variables: core

To extract all variables and merge them, use the snakemake workflow.

```bash
# Create and activate the conda environment
conda env create -f requirements.yml
conda activate census_series

# Set your Census API key
export CENSUS_API_KEY='your_api_key_here'
export PYTHONPATH='.'

# Create the data directory paths
python src/create_datapaths.py

# Execute the Snakemake pipeline
snakemake --cores 1 # select number of cores
```


Dockerized Pipeline

For an isolated and reproducible environment, the pipeline is also dockerized. To run the Dockerized task, decide in which folder you want the output files to be stored (<output_path>) and run:

```bash
# Run the Dockerized pipeline
docker pull nsaph/census_series:latest
docker run -v <output_path>:/app/data/output/ --env CENSUS_API_KEY=<your_api_key_here> nsaph/census_series:latest
```

Note: Remember to replace <your_api_key_here> with your actual Census API key.

If you want to build your own container, run:

```bash
# To build the Docker image
docker build -t census_series .
```

For a multi-platform build:

```bash
docker buildx build --platform linux/amd64,linux/arm64 -t nsaph/census_series:<version> . --push
```

Building the image yourself is optional, since a prebuilt image is available as nsaph/census_series:latest.
