Final code project for the Big Data Engineering course of the Master's in Computational Biology (UPM). Its purpose is to train several Spark-based image classification models that predict pneumonia from a chest X-ray image dataset. In addition, this repository checks whether the candidate model is scalable, using a high-level Python interface built on BigDL-DLlib.
Example of pneumonia, retrieved from Kermany et al., 2018
The code of this project was originally designed to run in Google Colab. If you prefer a plain Python environment, you can convert the Jupyter Notebook into a Python file with the nbconvert package. The dependencies needed to run the code in Pneumonia_Identification_Big_Data_Final_Project.ipynb are listed in requirements.txt and can be installed as follows:
pip install -r requirements.txt
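For the notebook-to-script conversion mentioned above, nbconvert offers the command `jupyter nbconvert --to script <notebook>.ipynb`. As a rough illustration of what that conversion amounts to, the sketch below uses only the standard library instead of nbconvert: it reads the notebook JSON, keeps the code cells, and comments out Colab shell escapes such as `!pip ...`, which are not valid Python. The function name `notebook_to_script` and the output file name in the usage note are hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def notebook_to_script(notebook_path, script_path):
    """Write the code cells of a Jupyter notebook to a plain Python file.

    Hypothetical helper (not part of this repository); returns the number
    of code cells written.
    """
    notebook = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # The notebook format stores a cell either as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        lines = []
        for line in source.splitlines():
            # Shell escapes (!pip ...) and magics (%...) only work inside
            # Jupyter/Colab, so they are kept as comments.
            if line.lstrip().startswith(("!", "%")):
                line = "# " + line
            lines.append(line)
        chunks.append("\n".join(lines))
    Path(script_path).write_text("\n\n".join(chunks) + "\n", encoding="utf-8")
    return len(chunks)
```

A call such as `notebook_to_script("Pneumonia_Identification_Big_Data_Final_Project.ipynb", "pneumonia_identification.py")` would then produce a script in which the install commands have to be run separately from the shell.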
To install the prerelease version of BigDL-DLlib for Spark 3, run the following line in Colab:
!pip -qq install bigdl-spark3
Alternatively, install the wheel directly:
!pip install https://sourceforge.net/projects/analytics-zoo/files/dllib-py-spark3/bigdl_dllib_spark3-0.14.0b20211107-py3-none-manylinux1_x86_64.whl
The code already automates downloading the images into the working directory, but a fraction of the data usable for training can still be found in the folders of the repository. The image dataset is composed of three folders, train, test and val, each containing two subfolders: NORMAL, with images from normal patients, and PNEUMONIA, with images from patients with pneumonia.
| Version | Date | License | Dataset Folders | Citation | Source | Acquired from |
|---|---|---|---|---|---|---|
| v.2.0 | 06/01/2018 | CC BY 4.0 | (test,train,val) | Kermany, D. S., Goldbaum, M., Cai, W., Valentim, C. C. S., Liang, H., Baxter, S. L., McKeown, A., Yang, G., Wu, X., Yan, F., Dong, J., Prasadha, M. K., Pei, J., Ting, M. Y. L., Zhu, J., Li, C., Hewett, S., Dong, J., Ziyar, I., … Zhang, K. (2018). Identifying Medical Diagnoses and Treatable Diseases by Image-Based Deep Learning. Cell, 172(5), 1122-1131.e9. https://doi.org/10.1016/j.cell.2018.02.010 | Mendeley Data | Kaggle |
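The train/test/val layout with nested NORMAL and PNEUMONIA folders can be checked with a few lines of standard-library Python. This is a minimal sketch, not code from the notebook: the function name `count_images`, the root folder name used in the usage note, and the set of accepted image suffixes are all assumptions.

```python
from pathlib import Path

SPLITS = ("train", "test", "val")
CLASSES = ("NORMAL", "PNEUMONIA")
# Assumed image suffixes; adjust to the files actually present.
IMAGE_SUFFIXES = {".jpeg", ".jpg", ".png"}


def count_images(dataset_root):
    """Count image files for every (split, class) folder under dataset_root.

    Missing folders are reported with a count of 0 instead of raising.
    """
    root = Path(dataset_root)
    counts = {}
    for split in SPLITS:
        for label in CLASSES:
            folder = root / split / label
            if folder.is_dir():
                n = sum(
                    1
                    for p in folder.iterdir()
                    if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
                )
            else:
                n = 0
            counts[(split, label)] = n
    return counts
```

For example, `count_images("chest_xray")` (with `chest_xray` standing in for wherever the images were downloaded) returns a dictionary keyed by pairs such as `("train", "PNEUMONIA")`, which makes class imbalance or an incomplete download easy to spot before training.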
The code covers the image embedding preprocessing stages and the training and evaluation of the ML models in the pipeline.
| Version | Date | Script | Description |
|---|---|---|---|
| v.3 | 6/02/2023 | Pneumonia_Identification_Big_Data_Final_Project.ipynb | Pneumonia identification (all preprocessing, model training and evaluation stages) |
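To make the evaluation stage concrete, the sketch below computes common binary classification metrics from true and predicted labels in plain Python. It is only an illustration under stated assumptions: the README does not say which metrics the notebook reports, the label encoding (1 for PNEUMONIA, 0 for NORMAL) is assumed, and `binary_metrics` is a hypothetical helper rather than a function from the notebook or from Spark.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary classifier.

    Assumes labels are encoded so that `positive` marks the PNEUMONIA class.
    Ratios with a zero denominator are reported as 0.0.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Recall on the PNEUMONIA class is usually the figure to watch in a screening setting, since a missed case (false negative) tends to be costlier than a false alarm.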
The requirements for running the code are listed in requirements.txt.
