A Machine Learning algorithm to detect ADHD among female patients based on the fMRI and socio-economic data

Kulieshova/ADHD-prediction

ADHD and sex classifier

🎯 Project Highlights

  • Built a multioutput classification model using XGBoost to identify gender-specific and ADHD-related brain connectivity patterns
  • Achieved an F1 score of 0.6962 and a ranking of 310 out of 540 participating teams on the final Kaggle Leaderboard
  • Implemented data balancing using SMOTE, ADASYN, and SMOTEENN to optimize results within compute constraints

🔗 WiDS Datathon 2025 | Kaggle Competition Page


👩🏽‍💻 Setup & Execution

Clone the repo by pasting this into the terminal:

git clone git@github.com:kulieshova/ADHD-prediction.git

Using the terminal, cd into the ADHD-prediction/ folder

cd ADHD-prediction/

Download the data from the WiDS Kaggle page

Run the cells inside XGB model.ipynb


🏗️ Project Overview

The WiDS Datathon 2025 on Kaggle is a core component of Break Through Tech’s Spring AI Studio. In this initiative, teams of four BTT students apply the advanced machine learning techniques they mastered during the program’s ML course to a real-world challenge.

The Kaggle challenge focuses on building robust predictive models for ADHD diagnosis while simultaneously uncovering subtle, sex-specific neural patterns by analyzing differences in brain connectivity and activation between males and females.

The project has significant implications for ADHD research and clinical care. By integrating diverse data modalities and uncovering nuanced, sex-specific neural patterns, the developed models could fundamentally transform how ADHD is diagnosed and managed. In particular, identifying distinct fMRI signatures that differentiate between male and female patients offers a pathway to address the persistent underdiagnosis of ADHD in women, paving the way for more personalized and effective treatment strategies.


📊 Data Exploration

  • Structure of the Data:

    • Train Categorical: (1213, 10)
    • Train Connectome: (1213, 19901)
    • Train Quantitative: (1213, 19)
    • Train Solutions: (1213, 3)
    • Test Categorical: (304, 10)
    • Test Connectome: (304, 19901)
    • Test Quantitative: (304, 19)
  • Handling Missing Values:

    • For numerical features, missing values were imputed with the mean.
    • For categorical features, missing values were replaced with the string "unknown".
  • Data Exploration and Preprocessing Approaches:

    • Verified the shapes of the categorical, connectome, and quantitative tables and checked that participant IDs matched across them.
    • Conducted exploratory data analysis to assess the distribution, range, and anomalies within each feature set.
    • Examined the connectome data for consistency and sparsity given its high dimensionality.
  • Data Balancing Techniques:

    • To address class imbalances in both sex (male/female) and ADHD diagnosis (ADHD/no ADHD), we undersampled overpopulated classes and applied SMOTE to underpopulated classes so that every ADHD × sex group ended up with 250 samples. This approach ensured a more balanced distribution across all categories, ultimately contributing to improved model performance.
    • Class counts by group (ADHD/non-ADHD × male/female):
    • Original distribution: {'1_0': 581, '1_1': 250, '0_0': 216, '0_1': 166}
    • After balancing: {'0_0': 250, '0_1': 250, '1_0': 250, '1_1': 250}
    • We also explored class weighting, ADASYN, and SMOTEENN; SMOTE proved most effective for this particular dataset.
  • Challenges and Assumptions:

    • Imbalance in Data Examples:
      The dataset exhibited an imbalance between male and female participants (with almost twice as much data representing male participants), which contributed to lower accuracy in sex prediction models.
    • Data Integration:
      Aligning participant IDs across multiple data types required careful preprocessing to maintain data integrity.
    • Computational Constraints:
      The large size of the connectome data (around 20,000 features) posed challenges in terms of computation and memory usage during initial exploration and model training.
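
The ID alignment and missing-value handling described above could look roughly like the sketch below. It is not the notebook's actual code: the `participant_id` column name, the helper names, and the toy tables are assumptions made for illustration, and categorical columns are assumed to be stored as strings.

```python
import numpy as np
import pandas as pd


def merge_on_id(frames, id_col="participant_id"):
    """Join tables on the ID column after checking that every table has the same IDs."""
    expected_ids = set(frames[0][id_col])
    for frame in frames[1:]:
        if set(frame[id_col]) != expected_ids:
            raise ValueError("participant IDs differ between tables")
    merged = frames[0]
    for frame in frames[1:]:
        # validate= raises if an ID is duplicated on either side
        merged = merged.merge(frame, on=id_col, how="inner", validate="one_to_one")
    return merged


def impute(df, id_col="participant_id"):
    """Mean-impute numeric columns and fill gaps in other columns with "unknown"."""
    out = df.copy()
    feature_cols = [c for c in out.columns if c != id_col]
    num_cols = out[feature_cols].select_dtypes(include="number").columns.tolist()
    cat_cols = [c for c in feature_cols if c not in num_cols]
    if num_cols:
        out[num_cols] = out[num_cols].fillna(out[num_cols].mean())
    for col in cat_cols:
        out[col] = out[col].astype("object").fillna("unknown")
    return out


# Toy stand-ins for the quantitative and categorical tables (hypothetical values)
quant = pd.DataFrame({"participant_id": ["a", "b", "c"], "score": [1.0, np.nan, 3.0]})
cat = pd.DataFrame({"participant_id": ["c", "a", "b"], "site": ["x", None, "y"]})

merged = impute(merge_on_id([quant, cat]))
print(merged)
```

When the same idea is applied to a train/test pair, computing the column means on the training table and reusing them for the test table avoids leaking test statistics into preprocessing.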

Feature distribution in training data (visualizations):
[Image: Gender distribution]
[Image: ADHD distribution]
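
To make the balancing arithmetic concrete, the sketch below first computes how many rows each group gains or loses when moving from the original counts to 250 per group, then balances a toy dataset the same way. The interpolation helper is a simplified, NumPy-only stand-in for SMOTE written for illustration; it is an assumption, not the project's code, which would more likely rely on a library implementation such as the one in imbalanced-learn.

```python
import numpy as np

ORIGINAL = {"1_0": 581, "1_1": 250, "0_0": 216, "0_1": 166}  # counts from the README
TARGET = 250


def resampling_plan(counts, target):
    """Positive values are rows to synthesize, negative values are rows to drop."""
    return {label: target - n for label, n in counts.items()}


def smote_like(rows, n_new, k, rng):
    """Simplified SMOTE-style step: interpolate between a row and a near neighbour."""
    if len(rows) < 2:
        raise ValueError("need at least two rows to interpolate")
    k = min(k, len(rows) - 1)
    # Squared Euclidean distances via the Gram matrix (avoids an n x n x d array)
    sq = np.sum(rows ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * rows @ rows.T
    np.fill_diagonal(d2, np.inf)  # a row is never its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]
    base = rng.integers(0, len(rows), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return rows[base] + gap * (rows[picked] - rows[base])


def balance(X, labels, target, k=5, seed=0):
    """Undersample groups above `target`, synthesize rows for groups below it."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [], []
    for label in np.unique(labels):
        rows = X[labels == label]
        if len(rows) >= target:
            rows = rows[rng.choice(len(rows), size=target, replace=False)]
        else:
            rows = np.vstack([rows, smote_like(rows, target - len(rows), k, rng)])
        X_parts.append(rows)
        y_parts.append(np.full(len(rows), label))
    return np.vstack(X_parts), np.concatenate(y_parts)


plan = resampling_plan(ORIGINAL, TARGET)
print(plan)

# Toy data with the same kind of imbalance, balanced to 6 rows per group
toy_counts = {"1_0": 12, "1_1": 6, "0_0": 5, "0_1": 4}
toy_labels = np.concatenate([np.full(n, lab) for lab, n in toy_counts.items()])
toy_X = np.random.default_rng(0).normal(size=(len(toy_labels), 3))
X_bal, y_bal = balance(toy_X, toy_labels, target=6)
balanced_counts = dict(zip(*np.unique(y_bal, return_counts=True)))
print(balanced_counts)
```

Read against the README's counts, the plan shows one group shrinking by 331 rows, one staying at 250, and two growing by 34 and 84 rows.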

🧠 Model Development

  • XGBoost with SMOTE oversampling and linear objective
    Kaggle score: 0.69326
  • LightGBM
    Kaggle score: 0.41
  • CatBoost
    Kaggle score: 0.34
  • Feature selection and hyperparameter tuning via XGBClassifier with parameters:
    • random_state = 9
    • n_estimators = 100
    • learning_rate = 0.5
    • max_depth = 10
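
The parameter list above maps onto `XGBClassifier` keyword arguments as sketched below. The helper name `build_xgb_classifier` is hypothetical, and any settings not listed in this README (for example the objective) are left at the library defaults here, which may differ from the notebook.

```python
# Parameters as listed in this README
XGB_PARAMS = {
    "random_state": 9,
    "n_estimators": 100,
    "learning_rate": 0.5,
    "max_depth": 10,
}


def build_xgb_classifier(**overrides):
    """Return an unfitted XGBClassifier using the listed parameters (hypothetical helper)."""
    # Imported lazily so the configuration can be inspected without xgboost installed
    from xgboost import XGBClassifier

    return XGBClassifier(**{**XGB_PARAMS, **overrides})


print(XGB_PARAMS)
```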

The Right Model, the Right Method

Our initial modeling strategy involved predicting both ADHD_Outcome and Sex_F simultaneously using a multi-output classifier.

This approach underperformed, as it failed to capture the distinct statistical structure of each target and often conflated their signals — particularly problematic given the correlation between sex and ADHD status in the dataset.

Midway through the project, we pivoted to training separate models for each target, which markedly improved performance for the imbalanced and complex ADHD classification task.

Why Separating the Targets Helped

  • Distinct feature dependencies: ADHD and sex are predicted by different feature subsets; joint modeling diluted the signal for ADHD.
  • Reduced overfitting: The joint model overfit on the easier Sex_F prediction, compromising ADHD accuracy.
  • Task-specific optimization: Separate models allowed for tuning loss functions, metrics, and class balancing strategies tailored to each target.
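
The separate-model setup can be sketched as one estimator per target, each built by its own factory so that settings can differ between targets. Everything below is illustrative: the helper names are hypothetical, and `MajorityClassifier` is a trivial stand-in used only so the example runs without extra dependencies. In practice each factory would return something like an `XGBClassifier`.

```python
import numpy as np


class MajorityClassifier:
    """Stand-in estimator for the demo: always predicts the most frequent training label."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.label_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.label_)


def fit_per_target(X, Y, factories):
    """Train one independent model per target.

    Y maps target name -> label array; factories maps target name -> a
    zero-argument callable returning an unfitted estimator with fit/predict.
    """
    return {target: factories[target]().fit(X, y) for target, y in Y.items()}


def predict_per_target(models, X):
    """Collect each model's predictions under its target name."""
    return {target: model.predict(X) for target, model in models.items()}


# Toy data: six participants, two features (hypothetical values)
X = np.zeros((6, 2))
Y = {
    "ADHD_Outcome": np.array([1, 1, 1, 0, 1, 0]),
    "Sex_F": np.array([0, 0, 1, 0, 0, 1]),
}
factories = {"ADHD_Outcome": MajorityClassifier, "Sex_F": MajorityClassifier}

models = fit_per_target(X, Y, factories)
preds = predict_per_target(models, X)
print(preds)
```

Because each target has its own factory, class balancing, hyperparameters, and evaluation metrics can be chosen per target instead of being shared, which is the flexibility the list above describes.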

📈 Results & Key Findings

For XGB model:

  • Overall Kaggle score: 0.69326
  • Cross-validation accuracy: [image]

Findings in training data (visualizations):
[Image: 25 most predictive features for ADHD]
[Image: 25 most predictive features for gender]
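
A ranking like the ones in these figures can be produced by sorting per-feature scores, as in the sketch below. It assumes the scores are available as a plain array aligned with the feature names (for a fitted tree model this would typically come from an attribute such as `feature_importances_`); the function name is hypothetical, and the demo reuses four scores quoted in the Impact Narrative section.

```python
import numpy as np


def top_features(names, scores, k=25):
    """Return the k highest-scoring (name, score) pairs, best first."""
    scores = np.asarray(scores, dtype=float)
    if len(names) != len(scores):
        raise ValueError("names and scores must have the same length")
    order = np.argsort(scores)[::-1][:k]
    return [(names[i], float(scores[i])) for i in order]


# Scores quoted in this README for the ADHD target, deliberately out of order
names = [
    "MRI_Track_Scan_Location_3",
    "SDQ_SDQ_Hyperactivity",
    "MRI_Track_Scan_Location_1",
    "SDQ_SDQ_Conduct_Problems",
]
scores = [0.1853, 0.6567, 0.2001, 0.3615]

top_two = top_features(names, scores, k=2)
print(top_two)
```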

🖼️ Impact Narrative

1. Brain Activity Patterns and Gender Differences

Our project integrated fMRI data with behavioral and socio-demographic measures to pinpoint the neural signatures of ADHD—and how these signatures differ between males and females. Here are some specifics from our feature analysis:

🧠 ADHD-Related fMRI Markers:

  • Features like MRI_Track_Scan_Location_1 (coefficient ≈ 0.2001) and MRI_Track_Scan_Location_3 (≈ 0.1853) indicate that spatial brain activation patterns in regions tied to attention and executive function are predictive of ADHD. These markers suggest disruptions in networks such as the default mode and prefrontal circuits.

🧑‍🧑‍🧒‍🧒 Behavioral and Socio-Demographic Influences:

  • The top predictor for ADHD outcome was SDQ_SDQ_Hyperactivity (≈ 0.6567), highlighting the strong link between hyperactivity symptoms and ADHD. Other behavioral measures like SDQ_SDQ_Conduct_Problems (≈ 0.3615) also emerged as key predictors. Socio-demographic features (e.g., Basic_Demos_Enroll_Year_2016 and Basic_Demos_Enroll_Year_2018) further underline how contextual factors might modulate ADHD expression.

👧 Gender-Specific Differences:

  • When predicting sex, our model identified distinct features. For instance, Barratt_Barratt_P1_Edu_21 (≈ 0.2661) suggests that educational or related socio-economic factors differ significantly between genders. Additionally, several fMRI-derived features—such as 158throw_191thcolumn (≈ 0.2478)—imply that subtle variations in brain connectivity or activation patterns are instrumental in differentiating males from females. Behavioral markers like SDQ_SDQ_Emotional_Problems (≈ 0.2381) and SDQ_SDQ_Prosocial (≈ 0.2322) further suggest that emotional regulation and social behavior may manifest differently, hinting that female ADHD patients might exhibit unique neural and behavioral profiles compared to males.

2. Contributions to ADHD Research and Clinical Care

By combining detailed fMRI analyses with behavioral and socio-demographic data, our work advances ADHD research and clinical practice in several meaningful ways:

🧑‍⚕️ Refined Diagnostic Models:

  • The identification of both shared and gender-specific predictors (e.g., hyperactivity and fMRI scan locations for ADHD, versus education and connectome features such as 158throw_191thcolumn for sex) supports the development of more precise, individualized diagnostic criteria. This is particularly crucial for recognizing ADHD in females, who are frequently underdiagnosed due to subtler or different symptom presentations.

💊 Personalized Intervention Strategies:

  • With clear evidence that brain activity patterns and behavioral traits vary by gender, clinicians can tailor intervention strategies. For example, treatments for female patients might focus more on emotional regulation and cognitive control if their neuroimaging data suggest distinct activation in relevant brain regions.

🚀 Next Steps & Future Improvements

Limitations

One of the primary challenges we encountered was the limitation in computational and technical resources. The large dataset significantly slowed down the hyperparameter tuning process, meaning we couldn't explore the full range of parameter configurations efficiently. This constraint forced us to compromise on the depth of hyperparameter optimization, potentially leaving some performance gains on the table.

Future Improvements with More Resources:

With additional time and better computational infrastructure (e.g., access to high-performance computing clusters or GPUs), we would:

  • 🛠️ Expand Hyperparameter Tuning: Implement a more extensive grid search or advanced optimization techniques such as Bayesian optimization. This would allow for a more thorough exploration of the hyperparameter space, potentially yielding a more optimal model.

  • 🙌 Enhance Model Scalability: Run more complex and resource-intensive models, such as end-to-end deep learning approaches that work directly on raw fMRI data, capturing intricate spatial and temporal patterns that might be missed with our current feature-engineered approach.

  • 📊 Optimize Data Processing Pipelines: Invest in more efficient data preprocessing and model training pipelines to better handle large datasets, reducing training times and enabling more iterative experimentation.
