Air pollution is an issue with potentially serious repercussions for the health of individuals and populations. To educate the community about air pollutants and their effects on health, the EPA (Environmental Protection Agency) developed the Air Quality Index (AQI), which rates local air quality on a scale of six categories: good, moderate, unhealthy for sensitive groups, unhealthy, very unhealthy, and hazardous. The specific pollutants assessed are ozone, particle pollution (particulate matter), carbon monoxide, nitrogen dioxide, and sulfur dioxide.
Research on the health effects of exposure to these pollutants is ongoing. One area of interest is the effect of air pollution on unborn children. Research suggests that high concentrations of pollutants are harmful to developing fetuses, and even relatively low concentrations have been shown to have moderate negative effects. For example, results from multiple studies show that mothers who are exposed to particle pollution are more likely to give birth to babies of low birth weight. Low birth weight, in turn, is a risk factor for infant mortality and has implications for health and development that can be lifelong.
Results of these studies are sparking alarm among environmental science, medical, and education specialists alike. Because of this, our team was hired by a task force composed of administrators from each of these communities to confirm these findings and shed additional light on the situation. The goal of the current project, therefore, is to highlight the effects of the various pollutants on the percentage of births per county that are of low birth weight. With our findings, we hope to underscore the importance of implementing mitigation measures to protect the youngest members of society.
To do this, we collected, merged, and analyzed data from the CDC (Centers for Disease Control and Prevention) and the EPA to search for possible links. We then built classification models to predict whether each United States county has a “high rate” of low birth weight births. We calculated the current mean rate of low birth weight excluding at-risk populations (6.88%) and then created a boolean feature indicating whether each location had a high rate for the year the data was collected. Our baseline is 58.6%, the share of observations in the ‘normal’ birth weight category. In the evaluation process, we used accuracy and false negative rate as our metrics, and we define a successful model as one that maximizes the former and minimizes the latter.
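The target feature and the two metrics described above can be sketched as follows. This is a minimal illustration, not the project's code: the column name `lbw_rate`, the helper names, and the choice of "strictly above the 6.88% mean" as the cutoff for a high rate are all assumptions.

```python
import pandas as pd

MEAN_LBW_RATE = 6.88  # percent, the mean rate quoted in this README


def add_high_rate_flag(df, rate_col="lbw_rate", threshold=MEAN_LBW_RATE):
    """Return a copy with a boolean 'high_rate' column.

    Assumption: 'high' means strictly above the mean rate.
    """
    out = df.copy()
    out["high_rate"] = out[rate_col] > threshold
    return out


def accuracy_and_fnr(y_true, y_pred):
    """Accuracy and false negative rate, FN / (FN + TP), for boolean labels."""
    y_true = pd.Series(list(y_true), dtype=bool)
    y_pred = pd.Series(list(y_pred), dtype=bool)
    tp = int((y_true & y_pred).sum())   # high-rate counties correctly flagged
    fn = int((y_true & ~y_pred).sum())  # high-rate counties the model missed
    accuracy = float((y_true == y_pred).mean())
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    return accuracy, fnr


# Toy data (made up for illustration only)
counties = pd.DataFrame(
    {"county": ["A", "B", "C", "D"], "lbw_rate": [5.9, 7.4, 6.88, 9.1]}
)
flagged = add_high_rate_flag(counties)
acc, fnr = accuracy_and_fnr(flagged["high_rate"], [False, True, False, False])
```

The false negative rate matters here because a false negative is a county with a high rate of low birth weight that the model fails to flag.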
Air Quality Index - A Guide to Air Quality and Your Health
https://www.airnow.gov/sites/default/files/2018-04/aqi_brochure_02_14_0.pdf
Air Pollution, Stress Contribute to Low-birth-weight Babies for L.A. Latinas, Study Finds
https://news.usc.edu/202917/stress-air-pollution-and-babies_latinas/#:~:text=Conclusions%20from%20study%20on%20stress,%2Fm3)%20of%20PM2.
Using the Air Quality Index
https://www.airnow.gov/aqi/aqi-basics/using-air-quality-index/
Air Quality Service API
https://aqs.epa.gov/aqsweb/documents/data_api.html
CDC Natality
https://wonder.cdc.gov/natality.html
—
Can we use data collected from EPA and CDC websites to predict a higher prevalence of low birth weights in United States counties based on the pollutants present?
The acquisition and cleaning of our data were extensive processes that are described below.
The air quality data was collected from the EPA through the Air Quality System (AQS) API. It takes the form of individual CSV files generated by the AQS_query.py script in our project repo. The script requires an email and API key from aqs.epa.gov. Because of the large amount of data to collect, the task was split among our project team members, each of whom collected a different set of years.
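As a rough sketch of what a request to the AQS API looks like, the snippet below builds a URL for the annual summary endpoint and splits a range of years among team members. It is not the contents of AQS_query.py: the helper names, the pollutant selection, and the round-robin split are assumptions, and the email and key shown are placeholders. No network call is made.

```python
from urllib.parse import urlencode

# Annual summary by state endpoint, as listed in the AQS Data API documentation
AQS_ANNUAL_BY_STATE = "https://aqs.epa.gov/data/api/annualData/byState"

# AQS parameter codes for the pollutants named in this README (assumed selection)
PARAM_CODES = {
    "ozone": "44201",
    "pm25": "88101",
    "co": "42101",
    "no2": "42602",
    "so2": "42401",
}


def build_annual_query(email, key, state_fips, year, param_codes):
    """Build the request URL for one state and one calendar year."""
    query = {
        "email": email,
        "key": key,
        "param": ",".join(param_codes),
        "bdate": f"{year}0101",  # begin and end dates fall in the same year
        "edate": f"{year}1231",
        "state": state_fips,
    }
    return f"{AQS_ANNUAL_BY_STATE}?{urlencode(query, safe=',@')}"


def split_years(years, n_members):
    """Hypothetical helper: deal years out round-robin to n team members."""
    years = list(years)
    return [years[i::n_members] for i in range(n_members)]


url = build_annual_query(
    "user@example.com", "testkey", "06", 2019,
    [PARAM_CODES["ozone"], PARAM_CODES["pm25"]],
)
chunks = split_years(range(2006, 2010), 2)
```

The URL can then be fetched with any HTTP client, and the response written out as a CSV file per state and set of years.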
The script generated one CSV file per state for each set of years; we read these files in and concatenated them. After analyzing the content, we dropped the majority of the 57 columns present and retained only the 8 columns needed for the current project. Next, we reshaped the data so that the AQI pollutants are columns and each value is matched with the year/state/county it describes, creating a complete dataset for modeling.
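The concatenate, trim, and reshape steps can be sketched with pandas as below. The column names (`parameter`, `arithmetic_mean`, and so on) are hypothetical stand-ins, and only five columns are kept here for brevity rather than the eight retained in the project.

```python
import pandas as pd

# Hypothetical subset of columns to keep; the project retained 8 of 57
KEEP = ["year", "state", "county", "parameter", "arithmetic_mean"]


def combine_and_reshape(frames):
    """Stack per-state frames, trim columns, and pivot pollutants to columns."""
    long = pd.concat(frames, ignore_index=True)[KEEP]
    wide = long.pivot_table(
        index=["year", "state", "county"],
        columns="parameter",
        values="arithmetic_mean",
        aggfunc="mean",  # averages multiple sensors in the same county-year
    ).reset_index()
    wide.columns.name = None  # drop the leftover 'parameter' axis label
    return wide


# Toy frames standing in for two per-state CSV files (made-up values)
ca = pd.DataFrame({
    "year": [2019, 2019],
    "state": ["California", "California"],
    "county": ["Alameda", "Alameda"],
    "parameter": ["Ozone", "Carbon monoxide"],
    "arithmetic_mean": [0.03, 0.4],
    "extra": [1, 2],
})
tx = pd.DataFrame({
    "year": [2019],
    "state": ["Texas"],
    "county": ["Travis"],
    "parameter": ["Ozone"],
    "arithmetic_mean": [0.04],
    "extra": [3],
})
wide = combine_and_reshape([ca, tx])
```

Counties with no reading for a pollutant end up with a missing value in that column, which is one way the incomplete coverage noted elsewhere in this README shows up.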
The annual AQI summary data was downloaded from aqs.epa.gov as CSV files containing annual data for the years 2006-2021. We combined these CSV files into a single dataframe for use in this project.
We used the CDC's WONDER tool to export two files, one for low birth weight births and one for all births, and read them into a notebook. For both files, we filtered out any births with maternal risk factors to avoid confounding influences. Then, we renamed certain columns to make them easier to use and dropped unneeded columns such as notes, footnotes, and year code. We then merged the CDC dataframes into a single dataframe that contains one entry for each year/county combination that has both the low birth weight and all birth weight data. After this, we performed further cleaning of the state and county information, such as dropping unidentified counties.
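A minimal sketch of the CDC merge, assuming hypothetical column names (`births`, `notes`) and assuming the rate is computed as low birth weight births divided by all births:

```python
import pandas as pd

KEYS = ["year", "state", "county"]


def merge_cdc(low_bw, all_bw):
    """Inner-join the two exports and compute the low birth weight rate."""
    low = low_bw.rename(columns={"births": "lbw_births"})[KEYS + ["lbw_births"]]
    total = all_bw.rename(columns={"births": "all_births"})[KEYS + ["all_births"]]
    # An inner join keeps only year/county pairs present in both files
    merged = low.merge(total, on=KEYS, how="inner")
    # Drop the catch-all rows for unidentified counties
    unidentified = merged["county"].str.contains("Unidentified", case=False, na=False)
    merged = merged[~unidentified].copy()
    merged["lbw_rate"] = 100 * merged["lbw_births"] / merged["all_births"]
    return merged.reset_index(drop=True)


# Toy exports (made-up counts)
low_bw = pd.DataFrame({
    "year": [2019, 2019, 2019],
    "state": ["CA", "CA", "TX"],
    "county": ["Alameda County", "Unidentified Counties", "Travis County"],
    "births": [70, 30, 50],
    "notes": [None, None, None],
})
all_bw = pd.DataFrame({
    "year": [2019, 2019],
    "state": ["CA", "CA"],
    "county": ["Alameda County", "Unidentified Counties"],
    "births": [1000, 500],
    "notes": [None, None],
})
cdc = merge_cdc(low_bw, all_bw)
```

In this toy example only Alameda County survives: Travis County has no all-births row, and the unidentified row is dropped.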
Generally, we found that the AQS data has more detail; however, fewer counties have a large set of pollutant data. Because of this, we created two merged datasets: one from the AQS API data and one from the AQI Annual Summary data. In both, each year of air quality data (AQS or AQI) is paired with the CDC data for the following year.
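One way to pair each year of air quality data with the following year's CDC data is to shift the air quality year forward by one before joining. The sketch below assumes hypothetical column names and an inner join; the project's notebook may do this differently.

```python
import pandas as pd

KEYS = ["year", "state", "county"]


def merge_with_following_year(air, cdc):
    """Join air quality from year Y to birth data from year Y + 1."""
    shifted = air.rename(columns={"year": "air_year"})
    shifted["year"] = shifted["air_year"] + 1  # the CDC year this row pairs with
    return shifted.merge(cdc, on=KEYS, how="inner")


# Toy inputs (made-up values)
air = pd.DataFrame({
    "year": [2018, 2019],
    "state": ["CA", "CA"],
    "county": ["Alameda", "Alameda"],
    "ozone": [0.030, 0.032],
})
cdc = pd.DataFrame({
    "year": [2018, 2019, 2020],
    "state": ["CA", "CA", "CA"],
    "county": ["Alameda", "Alameda", "Alameda"],
    "lbw_rate": [6.9, 7.0, 6.5],
})
merged = merge_with_following_year(air, cdc)
```

Keeping `air_year` alongside `year` makes the one-year lag explicit in the merged file.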
After cleaning the data as discussed above, two files were created that were the foundation for the Exploratory Data Analysis (EDA) and modeling. These files are available in the data directory of the project repository.
- annual_aqi_clean.csv: contains the annual summary of count of days by AQI health categories as well as count of days where some of the main pollutants were the most prevalent, combined with the low birth weight data including the calculated rate and indication of whether the rate is high.
- aqs_by_county_clean.csv: contains more specific sensor measurements by county summarized by year, combined with the low birth weight data including the calculated rate and indication of whether the rate is high.
The various steps of our process are contained in the following notebooks and scripts:
- AQS_query.py: script used to collect the sensor data from the Air Quality System
- 01_data_collect_and_clean.ipynb: notebook for collecting and collating data as well as cleaning, merging, and exporting the cleaned files
- 02_EDA.ipynb: notebook exploring the cleaned data files created from the data collection notebook including various visualizations
- 03_Modeling.ipynb: notebook containing the modeling carried out including evaluation of the final model
We used pandas, scikit-learn, NumPy, Matplotlib, seaborn, and TensorFlow.
Our best working model is a neural network classifier. It is saved in the data/models directory with the reference _7537. The modeling notebook includes code for loading it, so its performance can be checked without retraining the model. Our final accuracy is 75.37%, an improvement of 40% over baseline prediction. Our false negative rate is 20%.
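The "40%" improvement can be read in more than one way. The short calculation below compares three common readings using the 58.6% baseline and 75.37% accuracy reported in this README. The reading that matches the reported figure is the relative reduction in error, but that is an inference from the numbers, not something stated by the project.

```python
baseline_acc = 0.586   # baseline accuracy reported in this README
model_acc = 0.7537     # final model accuracy reported in this README

# Reading 1: absolute gain in accuracy (percentage points)
point_gain = model_acc - baseline_acc

# Reading 2: gain relative to the baseline accuracy
relative_gain = point_gain / baseline_acc

# Reading 3: share of the baseline's errors that the model removes
error_reduction = point_gain / (1 - baseline_acc)

print(f"absolute gain:   {point_gain:.4f}")
print(f"relative gain:   {relative_gain:.3f}")
print(f"error reduction: {error_reduction:.3f}")
```

Under these readings the gain is about 16.8 percentage points, about 28.6% relative to baseline accuracy, or about 40.5% fewer errors than the baseline.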
We sought to describe a link between air quality in one year and the rate of low birth weight in the following year. The biggest challenge we faced was incomplete data. In spite of this, our results do show a connection, although it is small. To create a more predictive model in the future, we will need more robust air quality data.
- Attempt to make a more complete data set for pollutants using daily or other data
- Bring a more specific time element to the analysis (i.e. 9 months before a target month)
- Build a model for interpretability to understand links between specific air pollutants and LBW
- Recommend continued research & efforts to reduce air pollution