A Data-Driven Framework for Predicting Chronic Hepatitis B through Feature Prioritization from a Sparse Clinical Dataset

Overview

This repository contains the complete code and analysis for a study that predicts whether acute Hepatitis B (AHB) progresses to chronic infection. The project addresses a common clinical informatics challenge: a high-dimensional dataset drawn from a small patient cohort.

The core of this work is an end-to-end machine learning pipeline written in Julia. The pipeline processes raw clinical data, trains an XGBoost classifier, and extracts interpretable feature importances to identify the biomarkers most predictive of disease outcome. Two results support the model: its high cross-validated accuracy, and its independent rediscovery of biomarkers known to be clinically relevant to liver function and disease progression.

Key Features

- End-to-End Data Pipeline: From raw CSVs to final publication-ready results and figures.
- Robust Data Cleaning: A comprehensive workflow to handle missing data, inconsistent formatting, and complex column headers.
- Predictive Modeling: Uses an XGBoost classifier, evaluated with Leave-One-Out Cross-Validation (LOOCV).
- Feature Importance Analysis: Identifies and ranks the clinical factors that are most predictive of the outcome.
- Exploratory Data Analysis (EDA): Generates descriptive statistics and visualizations to characterize the patient cohort.
- Comparative Analysis: Compares biomarkers across three distinct clinical states: Acute-Resolved, Acute-to-Chronic, and Established-Chronic.
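
As an illustration of the data-cleaning step, here is a minimal sketch of header and cell normalization. It is not taken from the repository's scripts: the helper names `sanitize_header` and `parse_cell`, the raw header `"SGPT/ALT"`, and the rule of replacing every non-alphanumeric character with an underscore are assumptions, chosen only because they produce names shaped like the feature names reported in this README.

```julia
# Hypothetical helper: turn a raw clinical column header into a safe identifier.
# Every character outside A-Z, a-z, 0-9 becomes "_", then a cohort suffix is appended.
function sanitize_header(raw::AbstractString, cohort::AbstractString)
    cleaned = replace(strip(raw), r"[^A-Za-z0-9]" => "_")
    return string(cleaned, "_", cohort)
end

# Hypothetical helper: parse a cell to Float64, mapping blanks and
# unparseable entries to `missing` instead of raising an error.
function parse_cell(cell::AbstractString)
    value = tryparse(Float64, strip(cell))
    return value === nothing ? missing : value
end

println(sanitize_header("SGPT/ALT", "AHB"))  # assumed raw header, for illustration
println(parse_cell(" 42.5 "))
println(parse_cell("NA"))
```

Mapping unparseable cells to `missing` keeps the gaps explicit, so later steps can decide how to handle them.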

Key Results

The project successfully developed a predictive model and identified a core set of biomarkers.

Model Performance

The XGBoost model achieved a leave-one-out cross-validated accuracy of 90.9% in predicting patient outcomes, indicating a strong predictive signal within the dataset.
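
For readers unfamiliar with leave-one-out cross-validation, the following sketch shows the evaluation loop in plain Julia. It is an assumption-laden illustration, not the code in analysis.jl: a nearest-centroid classifier stands in for XGBoost so the example needs no packages beyond the standard library, all function names are hypothetical, and the data are synthetic values invented for the demonstration.

```julia
using Statistics

# Stand-in model (NOT XGBoost): a nearest-centroid classifier.
# Fitting stores the mean feature vector of each class label.
function fit_centroids(X::Matrix{Float64}, y::Vector{Int})
    return Dict(c => vec(mean(X[y .== c, :], dims=1)) for c in unique(y))
end

# Predict the label whose centroid is closest (squared Euclidean distance).
function predict_centroid(centroids::AbstractDict, x::Vector{Float64})
    best_label, best_dist = 0, Inf
    for (label, c) in centroids
        d = sum(abs2, x .- c)
        if d < best_dist
            best_label, best_dist = label, d
        end
    end
    return best_label
end

# Leave-one-out cross-validation: hold out each row once, train on the rest,
# and report the fraction of held-out rows predicted correctly.
function loocv_accuracy(X::Matrix{Float64}, y::Vector{Int})
    n = size(X, 1)
    correct = 0
    for i in 1:n
        train = setdiff(1:n, i)
        model = fit_centroids(X[train, :], y[train])
        correct += predict_centroid(model, X[i, :]) == y[i]
    end
    return correct / n
end

# Synthetic toy data (not from the study): two well-separated groups.
X = [1.0 1.0; 1.2 0.9; 0.8 1.1; 5.0 5.0; 5.2 4.9; 4.8 5.1]
y = [0, 0, 0, 1, 1, 1]
println("LOOCV accuracy: ", loocv_accuracy(X, y))
```

Each patient is scored by a model that never saw that patient, which is why LOOCV is a common choice when the cohort is too small to set aside a separate test set.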

Predictive Biomarkers

Feature importance analysis identified six key predictors. The model's independent discovery of established liver function markers validates the robustness of the framework.

| Rank | Feature | Importance Score |
|------|---------|------------------|
| 1 | `SGPT_ALT_AHB` | 2.26573 |
| 2 | `ARC_IgM_Core_S_Co__AHB` | 1.10977 |
| 3 | `TB_AHB` | 0.982959 |
| 4 | `Age_AHB` | 0.981654 |
| 5 | `Pre_S2_AHB` | 0.671907 |
| 6 | `PT_AHB` | 0.516444 |

The full table is saved in xgboost_important_features.csv.
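
To make the raw scores easier to compare, the sketch below rescales the six values listed above so that they sum to 1. This is only arithmetic on the published numbers. The README does not state which importance metric the scores represent, so the resulting shares describe relative weight among these six features and nothing more.

```julia
# Importance scores copied from the table in this README.
importance = [
    "SGPT_ALT_AHB"           => 2.26573,
    "ARC_IgM_Core_S_Co__AHB" => 1.10977,
    "TB_AHB"                 => 0.982959,
    "Age_AHB"                => 0.981654,
    "Pre_S2_AHB"             => 0.671907,
    "PT_AHB"                 => 0.516444,
]

# Divide each score by the total so the shares sum to 1.
total = sum(last, importance)
shares = [name => score / total for (name, score) in importance]

for (name, share) in shares
    println(rpad(name, 24), round(100 * share; digits=1), " %")
end
```

On this scale `SGPT_ALT_AHB` carries roughly a third of the combined score of the six listed features.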

Visualizations

Below are examples of the exploratory and comparative plots generated by the analysis scripts.

Figures: Initial ALT by Outcome; Viral Load Across Groups.

(Note: Ensure the generated .png files are in the repository for these images to display.)

Getting Started

Prerequisites

Julia (v1.6 or later)
The following data files in the root directory:
- AHB_data.csv
- CHABE_data.csv

Installation & Setup

Clone the repository:

git clone [your-repository-url]
cd [your-repository-name]

Launch Julia:
```
julia
```
Activate the project environment and install dependencies: Inside the Julia REPL, press ] to enter the package manager.
```
(@v1.x) pkg> activate .
(.) pkg> add DataFrames CSV Statistics MLJ CategoricalArrays ScientificTypes XGBoost Plots StatsPlots
(.) pkg> # Press backspace to exit the package manager
```
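The `add` command above installs the listed packages into the project environment. If the repository includes Project.toml and Manifest.toml, running `instantiate` at the same `pkg>` prompt instead installs the exact package versions those files record.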
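
A quick way to confirm the environment declares everything the scripts need is to compare Project.toml against the package list from the `add` command. The sketch below does that with Julia's standard-library TOML parser. The helper name `missing_deps` is hypothetical, and the check only inspects the `[deps]` table; it does not verify that the packages are actually installed.

```julia
using TOML

# Package names taken from the `add` command in the setup instructions.
const REQUIRED = ["DataFrames", "CSV", "Statistics", "MLJ", "CategoricalArrays",
                  "ScientificTypes", "XGBoost", "Plots", "StatsPlots"]

# Return the required packages that are absent from the [deps] table
# of the given Project.toml. A missing file counts as "nothing declared".
function missing_deps(project_file::AbstractString, required=REQUIRED)
    isfile(project_file) || return copy(required)
    deps = get(TOML.parsefile(project_file), "deps", Dict{String,Any}())
    return [pkg for pkg in required if !haskey(deps, pkg)]
end

# Run from the repository root; an empty list means every package is declared.
println(missing_deps("Project.toml"))
```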

Usage

The analysis is divided into three main scripts. They should be run in the following order.

The `--project=.` flag in each command makes Julia use the project environment set up during installation.

1. Main Predictive Analysis: This script runs the core XGBoost model, evaluates its performance, and extracts the feature importances.
```
julia --project=. analysis.jl
```
This will generate xgboost_important_features.csv.

2. Exploratory Data Analysis (EDA): This script generates the descriptive statistics and plots for the acute cohort.
```
julia --project=. eda.jl
```
This will generate descriptive_statistics.csv and several *.png plots.

3. Comparative Analysis: This script combines the AHB and CHABE cohorts to produce comparative visualizations.
```
julia --project=. comparative_analysis.jl
```
This will generate the comparative_*.png plots.
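
The three steps can also be chained from a single Julia session. The following is a minimal sketch under stated assumptions: `run_in_order` is a hypothetical helper that is not part of the repository, and it assumes the scripts sit in the directory passed as `dir` and can be launched with `--project=.` from there.

```julia
# Script names taken from the usage steps, in the order they should run.
const SCRIPTS = ["analysis.jl", "eda.jl", "comparative_analysis.jl"]

# Hypothetical helper: run each script in its own Julia process, in order.
function run_in_order(scripts=SCRIPTS; dir::AbstractString=".")
    # Fail early if any script is missing, before running anything.
    absent = [s for s in scripts if !isfile(joinpath(dir, s))]
    isempty(absent) || error("missing scripts: " * join(absent, ", "))
    for script in scripts
        println("running ", script)
        # `run` throws if the script exits with a non-zero status,
        # so later steps never start after a failed one.
        run(Cmd(`$(Base.julia_cmd()) --project=. $script`; dir=dir))
    end
    return nothing
end
```

Calling `run_in_order()` from the repository root runs the full sequence. Each script runs in a separate process, which mirrors the command-line usage.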

Project Structure

├── AHB_data.csv
├── CHABE_data.csv
├── analysis.jl
├── eda.jl
├── comparative_analysis.jl
├── README.md
├── Project.toml
├── Manifest.toml

Citation

Manuscript in communication.

If you use this work, please cite the upcoming publication:

*(Author(s)). (Year). "A Data-Driven Framework for Predicting Chronic Hepatitis B through Feature Prioritization from a Sparse Clinical Dataset".*
