-
Website Review: Kaggle – Global Air Pollution Data
🔗 https://www.kaggle.com/datasets/sazidthe1/global-air-pollution-data
This dataset can be a valuable asset for the data collection phase of our research question: How do respiratory and cardiovascular health outcomes vary across global cities categorized by their dominant air pollutant?
✅ Why this dataset is useful:
1. Covers Multiple Cities Worldwide: The dataset includes air quality data from many global cities, which aligns with our objective to assess and compare urban air pollution.
2. Categorized by Pollutant Type: Key pollutants like PM2.5, PM10, NO₂, SO₂, CO, and O₃ are tracked, helping us determine the dominant pollutant in each city.
3. Time Series Format: Pollution data is provided over time, allowing us to analyze trends and correlate with health outcomes (hospital admissions, mortality rates, etc.).
4. Complements Health Data Sources: While this dataset focuses on pollution levels, it can be linked with external health data (e.g., Global Burden of Disease, WHO databases) to explore respiratory and cardiovascular impacts.
5. Structured and Ready to Use: Data is provided in CSV format, making it easy to integrate into our processing pipeline.
-
💡 Possible Backup if We Can’t Find City-Level Health Data
Hey! Just wanted to share something I came across while searching around. I hadn’t heard of it before, but it might help if we don’t find direct or consistent health data at the city level.
It’s called the Global Burden of Disease (GBD) study, led by IHME. It compiles disease estimates (for asthma, COPD, stroke, heart disease, etc.) and lets you compare across countries or regions using standardized health metrics like:
DALYs (Disability-Adjusted Life Years)
YLLs (Years of Life Lost)
Deaths per 100,000, etc.
On their interactive tool (VizHub) on the website, we can:
Choose the exact metric we want (DALYs, YLLs, or whatever)
Select the specific disease category (e.g. cardiovascular or respiratory diseases)
Filter directly by risk factor, like air pollution, including ambient PM2.5, ozone, and more
BUT the latest full dataset is from 2021, which seems to be the most recent release.
Here's the link:
🔗 GBD VizHub (explore and download): https://vizhub.healthdata.org/gbd-results/
Might be worth keeping this in mind if we hit a dead end on city-level raw health data; it could still help! I’ll keep researching tomorrow and explore available datasets and sources.
-
WHO Air Quality Database
A comprehensive dataset with over 40,000 records of PM2.5, PM10, and NO₂ concentrations across global cities. The data represents annual mean concentrations collected between 2010 and 2022 from official, government-verified monitoring stations.
✅ Pros
-
To be honest, I haven’t found anything truly convenient so far; most of the datasets were either incomplete or behind a paywall. So I suggest that we either narrow our focus to specific countries (which makes it easier to find reliable health data), or go with Linah’s idea of using the Global Burden of Disease (GBD) dataset. I’ll keep searching and will share any updates I find.
-
I found a great report titled "State of Global Air 2024 – A Special Report on Global Exposure to Air Pollution and Its Health Impacts, with a Focus on Children’s Health". Although I was excited to see "2024" in the title, the data it presents actually covers the years 1990 to 2021. The report is a collaborative effort between the Health Effects Institute and the Global Burden of Disease project. It provides a comprehensive global overview of air pollution exposure (PM2.5, NO₂, ozone) and its impact on respiratory and cardiovascular health, with a special focus on children under five.
🔗 https://www.stateofglobalair.org/resources/report/state-global-air-report-2024
-
Hi team! 👋 Falaq and I wanted to share an update based on the datasets we've found so far, and we’d like your feedback and thoughts. It's not about changing the main research question or objectives.
The only shift is in the timeframe, from a strict 2024 focus to a range from 2018 to 2021, which fits better with the available data (like IHME’s GBD and the WHO pollution data that @Adamx090 shared).
Why the change in years?
We chose 2018–2021 on purpose, not just as a workaround, but to:
Capture real health and pollution trends over time
Include the impact of COVID-19, which is a major shift worth analyzing
Broaden our scope and add value, instead of focusing on a single (unavailable) year
We also wanted to acknowledge that while these years might sound like ancient history, they actually let us study meaningful shifts before, during, and after the pandemic.
Suggested question:
⭕ Possible Extension: Predicting 2024
Since we can get 2024 pollution concentrations, how about building a basic prediction model to forecast 2024 health outcomes based on trends from 2018–2021? This would be clearly marked as exploratory, but it adds a forward-looking piece to our project that could be really interesting.
Data we have so far (I'll share the files on Slack):
WHO ambient air pollution data
IHME GBD estimates (2018–2021) (for DALYs/deaths by disease and risk factor)
-
Hi @linahKhayri, well done you guys, great idea you have here. I just went through the dataset proposed by @Adamx090 myself. Your suggestions are valid, and I strongly agree with you. This may be our only viable course of action at this point, considering the challenges we've been facing lately with data sourcing.
I have an addition though: how about we work with a wider range of the dataset, say 2010–2021 from @Adamx090's proposed dataset, and adjust your filter to cover the same year range? Then we build a model that predicts a variable, like you have suggested, since models tend to perform better when they are built on larger datasets. I am yet to fully grasp the nature of your own dataset (IHME GBD estimates), so I'll reserve further comments for now.
I also feel our research question may become overly complicated if we factor in the COVID-19 pandemic. Simplicity is sometimes key, hence I suggest we stick to our current question. Just a suggestion though, let's hear what others have to say. Well done again.
-
@linahKhayri and @FalaqMajeed Thank you for considering this thoughtful addition to our research. Adding a 2024 prediction component can be a valuable and forward-looking extension to our project, as long as we clearly mark it as exploratory. This transparency ensures that our findings are interpreted as preliminary insights rather than definitive conclusions. The benefits are clear: we can leverage the availability of recent PM2.5 data, apply simple regression or trend analysis methods, and generate a basic forecast of health outcomes (e.g., DALYs or mortality) for 2024. This not only enriches our project with a future-oriented perspective but also enhances its relevance for policymakers by offering insights into potential upcoming health risks linked to air pollution trends.
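To make the "simple regression or trend analysis" idea concrete, here is a minimal sketch of a straight-line trend fitted to four yearly values and extrapolated to 2024. The function name `linear_trend_forecast` and the DALY numbers are hypothetical placeholders, not real GBD figures, and a least-squares line is only one of several possible trend methods.

```python
import numpy as np

def linear_trend_forecast(years, values, target_year):
    """Fit a least-squares line to (years, values) and extrapolate to target_year."""
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    # Center the years so the fit is numerically well conditioned.
    center = years.mean()
    slope, intercept = np.polyfit(years - center, values, deg=1)
    return slope * (target_year - center) + intercept

# Hypothetical DALY rates per 100,000 for one country (illustration only).
years = [2018, 2019, 2020, 2021]
daly_rate = [1200.0, 1180.0, 1160.0, 1140.0]

forecast_2024 = linear_trend_forecast(years, daly_rate, 2024)
print(f"Exploratory 2024 estimate: {forecast_2024:.1f} DALYs per 100,000")
```

With only four points per country, a forecast like this says little beyond "the recent trend continues", which is one more reason to label it exploratory.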
-
Another suggestion from my side: instead of using 2024 pollutant data, we could use EPA’s AQS (Air Quality System) data for the years 2018–2021 to align directly with the available health outcome data.
✅ Potential Advantages:
Temporal consistency: Health and pollution data would cover the exact same years, improving the reliability of our comparisons.
Data quality: EPA data is regularly validated and publicly accessible, with high-resolution daily and annual summaries.
Geographic coverage: Enables city- or county-level analysis within the U.S., which could support detailed insights.
This trade-off might make our models stronger for U.S.-based analysis and help maintain internal consistency across the dataset. Since the EPA data covers only the United States, this approach would be most suitable if the study focuses solely on U.S. cities. You can explore and download the data here: https://aqs.epa.gov/aqsweb/airdata/download_files.html#Daily
Alternatively, we could also consider using the datasets shared by @Adamx090, which may offer broader geographic coverage or complementary data. https://www.who.int/publications/m/item/who-ambient-air-quality-database-%28update-jan-2024%29
-
Review of the Study: Impact of Air Pollution on Human Health
📎 Kaggle Notebook Link
This study uses publicly available data to examine the health effects, particularly mortality, of air pollution across countries, with a focus on PM2.5 and indoor/outdoor pollution sources.
📊 Available Datasets
✅ Pros
Covers both indoor and outdoor air pollution-related deaths, focusing on respiratory and cardiovascular causes.
Includes breakdowns by year and age group, useful for limited trend and vulnerability analysis.
Provided in CSV format; clean structure allows merging with external datasets like WHO or GBD.
Enables analysis of pollution effects by region or country and classification of dominant pollutant types.
❌ Cons
All data is aggregated at the country or regional level; city-specific analysis is not possible directly.
Focuses only on mortality. Hospital admissions, disease prevalence, or chronic condition data are not included.
Other urban pollutants like NO₂ and O₃ are underrepresented or missing, limiting full pollutant classification.
🌍 How This Content Supports Our Project
Research Question:
The dataset fits a proxy strategy—grouping cities by national pollution profiles—to compensate for missing city-level data.
Strong coverage of mortality outcomes related to air pollution, especially respiratory and cardiovascular-related deaths.
Enables high-level trend and demographic analysis using year and age group filters.
While focused mostly on PM2.5, the dataset distinguishes between indoor and outdoor pollution effects.
GBD (Global Burden of Disease) data is publicly accessible and highly compatible with this dataset. Provided by the Institute for Health Metrics and Evaluation (IHME), GBD data can be accessed through the GBD Results tool. Users can filter by country, year, disease type (e.g., COPD, ischemic heart disease), age group, and sex, and download the results in CSV format.
📂 Available metrics include:
Mortality rates from air pollution-attributable diseases
Years of Life Lost (YLL) due to respiratory and cardiovascular conditions
Prevalence and incidence of specific diseases related to air quality
📌 This makes it feasible to link pollution exposure data with relevant health outcomes for statistical analysis such as correlation or regression modeling at the country level.
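As a rough illustration of the country-level linking described above, the sketch below joins a pollution table and a health table on country and year, then computes a Pearson correlation. The column names (`country`, `year`, `pm25`, `deaths_per_100k`) and all values are assumptions made up for the example; real WHO and GBD exports use their own headers and would need renaming first.

```python
import pandas as pd

# Hypothetical values for illustration only; not real WHO or GBD figures.
pollution = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "year": [2019, 2019, 2019, 2019],
    "pm25": [10.0, 20.0, 30.0, 40.0],
})
health = pd.DataFrame({
    "country": ["A", "B", "C", "E"],
    "year": [2019, 2019, 2019, 2019],
    "deaths_per_100k": [15.0, 22.0, 35.0, 50.0],
})

# An inner join keeps only country-year pairs present in both sources,
# so countries D and E drop out here.
merged = pollution.merge(health, on=["country", "year"], how="inner")

# Pearson correlation between exposure and outcome across the matched rows.
r = merged["pm25"].corr(merged["deaths_per_100k"])
print(merged)
print(f"Pearson r = {r:.3f}")
```

Checking how many rows survive the join is worth doing on real data, since mismatched country names between sources silently shrink the sample.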
-
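This discussion is dedicated to coordinating the data collection phase for our research question:
How do respiratory and cardiovascular health outcomes vary across global cities categorized by their dominant air pollutant?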
Objective:
To identify, collect, and organize relevant datasets that will enable us to:
Assess air quality indicators across multiple global cities.
Determine the dominant pollutant in each city (e.g., PM2.5, NO₂, O₃).
Link pollution levels to respiratory and cardiovascular health outcomes such as hospital admissions, mortality rates, or disease prevalence.
Key Tasks:
Compile a list of reliable data sources (WHO, OpenAQ, Global Burden of Disease, local health departments, etc.)
Identify target cities or regions to include in the study.
Gather air pollution data (preferably categorized by pollutant type and time).
Collect health outcome data for respiratory and cardiovascular conditions.
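One possible way to operationalize "dominant pollutant" once city-level annual means are collected is sketched below: divide each concentration by its WHO 2021 annual guideline level and pick the pollutant with the largest ratio. This definition is an assumption for illustration, not something the team has agreed on; the function name `dominant_pollutant`, the column names, and the city values are hypothetical. O₃ is left out because its WHO guideline is a peak-season value rather than an annual mean.

```python
import pandas as pd

# WHO 2021 air quality guideline levels for annual means, in µg/m³.
WHO_ANNUAL_GUIDELINE = {"pm25": 5.0, "pm10": 15.0, "no2": 10.0}

def dominant_pollutant(df):
    """Return, per row, the pollutant with the highest concentration-to-guideline ratio."""
    # Dividing by a Series aligns on column names, so each pollutant
    # is scaled by its own guideline value.
    ratios = df[list(WHO_ANNUAL_GUIDELINE)].div(pd.Series(WHO_ANNUAL_GUIDELINE))
    return ratios.idxmax(axis=1)

# Hypothetical annual mean concentrations (µg/m³), for illustration only.
cities = pd.DataFrame(
    {
        "pm25": [25.0, 6.0, 8.0],
        "pm10": [40.0, 60.0, 20.0],
        "no2": [20.0, 15.0, 45.0],
    },
    index=["City X", "City Y", "City Z"],
)

result = dominant_pollutant(cities)
print(result)
```

Scaling by a guideline avoids always picking PM10 simply because its raw concentrations tend to be numerically larger. Cities with missing measurements would need separate handling before this step.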