A Unified Framework for Predicting snoRNA-Disease Associations through Logistic Regression and Gradient Boosting
Ummay Maria Muna, Shanta Biswas, Riasat Azim
Small nucleolar RNAs (snoRNAs), a class of small RNA molecules that guide chemical modifications of other RNAs, take part in diverse biological processes. Dysfunction of snoRNAs contributes significantly to the onset and progression of complex diseases. However, conventional experimental methods are laborious and costly, which impedes the identification of snoRNA-disease associations. To address this, we propose GBDT-LR, a model that combines gradient boosting decision trees (GBDT) with logistic regression (LR). GBDT-LR uses k-means clustering to screen negative samples, extracts distinctive features with GBDT, and feeds them into an LR model to predict association scores. The approach achieves 93% accuracy and an ROC AUC of 88%. By integrating available data and tools, this computational strategy efficiently predicts unknown associations between diseases and snoRNAs. The results indicate that machine learning, with GBDT for feature extraction followed by LR for prediction, has significant potential for predicting associations between non-coding RNAs and complex diseases.
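The pipeline described in the abstract (k-means screening of negative samples, GBDT feature extraction, LR scoring) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes scikit-learn, uses synthetic feature vectors in place of real snoRNA-disease pair features, and every name and setting in it (`screen_negatives`, `leaf_indices`, the cluster count, the tree settings) is hypothetical. The leaf-index encoding is the common way to couple GBDT with LR and is assumed here, since the abstract does not spell out how the features are extracted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder


def screen_negatives(unlabeled, n_needed, n_clusters=5, seed=0):
    """Draw an equal share of negatives from each k-means cluster of unlabeled pairs.

    Hypothetical helper: may return fewer than n_needed rows if a cluster is small.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(unlabeled)
    rng = np.random.default_rng(seed)
    per_cluster = n_needed // n_clusters
    chosen = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        take = min(per_cluster, members.size)
        if take > 0:
            chosen.extend(rng.choice(members, size=take, replace=False))
    return unlabeled[np.array(chosen, dtype=int)]


def leaf_indices(model, data):
    """Index of the leaf each sample reaches in every tree, shape (n_samples, n_trees)."""
    # For binary problems apply() returns shape (n_samples, n_estimators, 1).
    return model.apply(data)[:, :, 0]


# Synthetic stand-ins (assumption): known associations vs. unlabeled pairs.
rng = np.random.default_rng(0)
positives = rng.normal(loc=0.5, size=(150, 8))
unlabeled = rng.normal(loc=-0.5, size=(900, 8))

# Step 1: screen negatives so the training set is roughly balanced.
negatives = screen_negatives(unlabeled, n_needed=len(positives))
X = np.vstack([positives, negatives])
y = np.concatenate([np.ones(len(positives)), np.zeros(len(negatives))])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# Step 2: fit the GBDT and one-hot encode the leaf each sample falls into.
gbdt = GradientBoostingClassifier(n_estimators=30, max_depth=3, random_state=0).fit(X_tr, y_tr)
encoder = OneHotEncoder(handle_unknown="ignore").fit(leaf_indices(gbdt, X_tr))

# Step 3: logistic regression on the encoded leaves gives the association score.
lr = LogisticRegression(max_iter=1000).fit(encoder.transform(leaf_indices(gbdt, X_tr)), y_tr)
scores = lr.predict_proba(encoder.transform(leaf_indices(gbdt, X_te)))[:, 1]
auc = roc_auc_score(y_te, scores)
print(f"ROC AUC on synthetic data: {auc:.3f}")
```

In this sketch the GBDT and the LR are fit on the same training split for brevity; fitting the LR on a separate held-out split is a common variation that reduces overfitting to the tree structure. The printed score reflects only the synthetic data and says nothing about the results reported above.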