US Census Bureau Time Series Extractor

Introduction
Data Description
Data Dictionary
Repository Content
Data Lineage
Processing Rules
Run

Introduction

This repository streamlines the extraction of time series from American Community Survey 5-Year Data (ACS5), American Community Survey 1-Year Data (ACS1) and Decennial Census (SF1).

Data Description

American Community Survey 5-Year Data (ACS5)

Time Coverage: 2007 - 2018
Population: All 50 states, the District of Columbia, Puerto Rico, and other U.S. territories.
Geographical Coverage: nation, all states (including DC and Puerto Rico), all metropolitan areas, all congressional districts (116th Congress), all counties, all places, all tracts and block groups.
ZCTA Coverage: available
Data Source: The American Community Survey (ACS) is an ongoing survey that provides data every year, giving communities the current information they need to plan investments and services. The ACS covers a broad range of topics about social, economic, demographic, and housing characteristics of the U.S. population.

American Community Survey 1-Year Data (ACS1)

Time Coverage: 2005 - 2020
Population: All 50 states, the District of Columbia, Puerto Rico, and other U.S. territories.
Geographical Coverage: the nation, all 50 states, the District of Columbia, Puerto Rico, every congressional district, every metropolitan area, and all counties and places with populations of 65,000 or more.
ZCTA Coverage: not available
Data Source: The American Community Survey (ACS) is an ongoing survey that provides data every year, giving communities the current information they need to plan investments and services. The ACS covers a broad range of topics about social, economic, demographic, and housing characteristics of the U.S. population.

Decennial Census

Time Coverage: 2000 and 2010
Geographical Coverage: Summary File 1 (SF 1) is released as individual files for each of the 50 states, the District of Columbia, and Puerto Rico, and for the United States.
Data Source: Summary File 1 (SF 1) contains the data compiled from the questions asked of all people and about every housing unit. Population items include sex, age, race, Hispanic or Latino origin, household relationship, household type, household size, family type, family size, and group quarters.

Data Dictionary

| Column Name | Description |
| --- | --- |
| `survey` | Type of survey. Takes one of the following values: `dec`, `acs1`, `acs5`. |
| `year` | Year of the variable estimates. |
| geographic level | ID of the geography at a given geographic level. The column is named after the geographic level (`zcta`, `county`, `state`). |
| Various sociodemographic columns | Columns containing sociodemographic variables such as population distribution by age group, ethnic composition, housing statistics, and other demographic and socioeconomic variables. Each column is named after its variable. |
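
As a rough illustration of the layout described above, the sketch below checks that a single record carries a valid `survey` code, an integer `year`, and exactly one geographic-level column. The function name `check_record` is hypothetical and not part of this repository, and the sample values are placeholders rather than real estimates.

```python
# Hypothetical helper, not part of the repository: validates one output record
# against the column layout described in the data dictionary.
VALID_SURVEYS = {"dec", "acs1", "acs5"}
GEO_LEVELS = {"zcta", "county", "state"}


def check_record(record: dict) -> str:
    """Return the name of the geographic-level column, or raise ValueError."""
    if record.get("survey") not in VALID_SURVEYS:
        raise ValueError(f"unknown survey: {record.get('survey')!r}")
    if not isinstance(record.get("year"), int):
        raise ValueError("year must be an integer")
    geo_columns = GEO_LEVELS.intersection(record)
    if len(geo_columns) != 1:
        raise ValueError("expected exactly one geographic-level column")
    return next(iter(geo_columns))


# Placeholder record: the value under `pop_native` is illustrative only.
example = {"survey": "acs1", "year": 2015, "county": "25017", "pop_native": 0}
geo_column = check_record(example)
```

All remaining keys in the record are treated as sociodemographic variable columns.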

Repository Content

The repository contains:

conf/: configuration files.
src/fetch_variables.py: the main script for querying Census API.
requirements.yml: conda environment setup.

Data Lineage

Data Source: The primary data source for this project is the American Community Survey 5-Year Data (ACS5), which is publicly available and maintained by the U.S. Census Bureau.
Extraction: We use the Census API to extract the data.
Processing & Final Dataset: We transform the subset of variables obtained from the API and generate datasets of selected sociodemographic concepts.
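
To make the extraction step more concrete, here is a minimal sketch of how a Census API request URL can be assembled for the three surveys. It is not the code in `src/fetch_variables.py`; the function `build_census_url` and the mapping `DATASET_PATHS` are assumptions made for illustration, and the variable `B01003_001E` (total population) is only an example. The `for` clause uses the Census API's own geography names, so a short code such as `zcta` would need to be translated first.

```python
from urllib.parse import urlencode

# Dataset path segments on api.census.gov for the survey codes used in this README.
DATASET_PATHS = {
    "acs5": "acs/acs5",
    "acs1": "acs/acs1",
    "dec": "dec/sf1",
}


def build_census_url(survey, year, variables, geo_type, api_key=None):
    """Build a request URL for all geographies of one type (hypothetical helper)."""
    base = f"https://api.census.gov/data/{year}/{DATASET_PATHS[survey]}"
    params = {"get": ",".join(variables), "for": f"{geo_type}:*"}
    if api_key:
        params["key"] = api_key
    # Keep ',', ':' and '*' unescaped so the URL stays readable.
    return f"{base}?{urlencode(params, safe=',:*')}"


url = build_census_url("acs5", 2018, ["NAME", "B01003_001E"], "county")
```

The API key can be read from the `CENSUS_API_KEY` environment variable and passed as `api_key`.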

Processing Rules

ACS5 estimates cover 5-year periods, so there is no obvious single year to attach to each value. Here we assign the last year of the 5-year period to each ACS5 estimate.
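
The rule above can be sketched as a small conversion between a 5-year period and the single year assigned to it. The function names are hypothetical and chosen only for this example.

```python
def acs5_period_to_year(period: str) -> int:
    """Map a period such as '2014-2018' to the year assigned to it (its last year)."""
    start, end = (int(part) for part in period.split("-"))
    if end - start != 4:
        raise ValueError(f"not a 5-year period: {period!r}")
    return end


def acs5_year_to_period(year: int) -> str:
    """Inverse mapping: the 5-year period that ends in the given year."""
    return f"{year - 4}-{year}"


assigned_year = acs5_period_to_year("2014-2018")
```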

American Community Survey 1-Year Data and Hispanic Variable

Hispanic data was incorporated starting with the 2009 ACS 1-year estimates and is therefore not available for the years 2005-2008 in the ACS 1-Year data.
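
One way to encode this restriction is to derive the list of ACS1 years to query from the variable type. This is a sketch under the coverage stated in this README (ACS1: 2005 - 2020), and the names below are hypothetical.

```python
ACS1_YEARS = range(2005, 2021)   # ACS1 time coverage stated above: 2005 - 2020
HISPANIC_FIRST_ACS1_YEAR = 2009  # first ACS1 year with Hispanic data


def acs1_years_for(variable_is_hispanic: bool) -> list[int]:
    """Return the ACS1 years for which a variable can be requested."""
    first = HISPANIC_FIRST_ACS1_YEAR if variable_is_hispanic else ACS1_YEARS.start
    return [year for year in ACS1_YEARS if year >= first]


hispanic_years = acs1_years_for(True)
```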

Run

Pipeline

You can run the pipeline steps manually or run the snakemake pipeline described in the Snakefile.

Run the pipeline steps manually to fetch a single variable

The census variable codes that are used to create the time series are defined in the yaml file stored in conf/variables/<config_yaml>. The data paths attached to the pipeline are defined in conf/datapaths/<config_yaml>.

Set the key parameters in the config

For example:

  - datapaths: datapaths
  - variables: core

```bash
# Create and activate the conda environment
conda env create -f requirements.yml
conda activate census_series

# Set your Census API key
export CENSUS_API_KEY='your_api_key_here'

# Create the data directory paths
python src/create_datapaths.py

# Execute the main script
PYTHONPATH=. python src/fetch_variables.py variable=pop_native geo_type=county survey=acs1
```
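
When fetching one variable manually for several surveys and geographic levels, the command lines can be generated rather than typed by hand. The sketch below only builds the argument lists; the helper `fetch_commands` is hypothetical, and whether every survey and geography combination is supported by the repository's configuration is an assumption. The combination `acs1` with `zcta` is skipped because ACS1 has no ZCTA coverage.

```python
from itertools import product

SURVEYS = ["acs1", "acs5", "dec"]
GEO_TYPES = ["county", "state", "zcta"]


def fetch_commands(variable: str) -> list[list[str]]:
    """Build one fetch_variables.py command per survey and geographic level."""
    commands = []
    for survey, geo_type in product(SURVEYS, GEO_TYPES):
        if survey == "acs1" and geo_type == "zcta":
            continue  # ACS1 is not published at the ZCTA level
        commands.append([
            "python", "src/fetch_variables.py",
            f"variable={variable}", f"geo_type={geo_type}", f"survey={survey}",
        ])
    return commands


commands = fetch_commands("pop_native")
```

Each command can then be executed with `subprocess.run(cmd, env={**os.environ, "PYTHONPATH": "."}, check=True)`, which mirrors the `PYTHONPATH=.` prefix used above.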

Run snakemake pipeline

The census variable codes that are used to create the time series are defined in the yaml file stored in conf/variables/<config_yaml>. The data paths attached to the pipeline are defined in conf/datapaths/<config_yaml>.

Set the key parameters in the config

For example:

  - datapaths: core_cannon
  - variables: core

To extract all variables and merge them, use the snakemake workflow.

```bash
# Create and activate the conda environment
conda env create -f requirements.yml
conda activate census_series

# Set your Census API key
export CENSUS_API_KEY='your_api_key_here'
export PYTHONPATH='.'
```

Create the data directory paths

```bash
python src/create_datapaths.py
```

Execute the Snakemake pipeline

```bash
snakemake --cores 1  # adjust the number of cores as needed
```


### Dockerized Pipeline

For an isolated and reproducible environment, the pipeline is also dockerized. To run the Dockerized task, choose the folder where the output files should be stored (<output_path>) and run:

```bash
# Run the Dockerized pipeline
docker pull nsaph/census_series:latest
docker run -v <output_path>:/app/data/output/ --env CENSUS_API_KEY=<your_api_key_here> nsaph/census_series:latest
```

Note: Remember to replace <your_api_key_here> with your actual Census API key.

If you want to build your own container, try:

```bash
# To build the Docker image
docker build -t census_series .
```

For a multiplatform build:

```bash
docker buildx build --platform linux/amd64,linux/arm64 -t nsaph/census_series:<version> . --push
```

This step is optional, since a prebuilt image is available as nsaph/census_series:latest.
