A robust, modular, and production-ready platform for solar power system data analysis, machine learning, and prediction. Built with ZenML, Streamlit, MLflow, and a rich Python data science stack, this project enables end-to-end workflows from data ingestion and EDA to model training, deployment, and inference, all with experiment tracking and a user-friendly web interface.
- Training Pipeline: Data ingestion, missing value handling, feature engineering, outlier detection, model training, evaluation, and model saving.
- Deployment Pipeline: Loads trained models, processes new data, and generates predictions for deployment scenarios.
- Inference Pipeline: Dynamically loads models and produces predictions on new data, supporting batch inference.
- Data Ingestion: Supports CSV and ZIP files, with extensible ingestion logic.
- Missing Value Handling: Multiple strategies (drop, mean, median, mode, constant, KNN, CatBoost, categorical fill).
- Feature Engineering: Categorical encoding, new feature creation (e.g., power, area).
- Outlier Detection: Z-score and IQR-based methods.
- Model Building: Supports Linear Regression, Random Forest, XGBoost, CatBoost.
- Model Evaluation: Regression metrics with MLflow logging.
- Model Saving/Loading: Robust serialization and MLflow model registry integration.
- Prediction & Saving: Batch predictions and artifact logging.
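To make the outlier-detection step concrete, here is a minimal stdlib-only sketch of the two methods the pipeline supports (z-score and IQR). The function names and sample data are illustrative, not the project's actual API:

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

readings = [10, 11, 12, 11, 10, 12, 11, 95]
print(iqr_outliers(readings))  # → [95]
```

In the real pipeline these operations run on pandas columns, but the cutoff logic is the same.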
- EDA Tab: Upload CSVs and perform:
  - Missing value analysis (counts, heatmaps)
  - Univariate, bivariate, and multivariate analysis (histograms, boxplots, scatterplots, correlation heatmaps, pairplots)
- Prediction Tab: Upload data, run the full inference pipeline, and download predictions.
- Home & About Tabs: Project overview and author information.
- Production-Grade Logging: All user actions, errors, and pipeline events are logged for traceability.
- Modular EDA code in `Analysis/AnalyzeSrc/` for:
  - Univariate, bivariate, and multivariate analysis
  - Missing value visualization
  - Encoding strategies and data inspection
- All model metrics, artifacts, and predictions are logged and tracked for reproducibility and comparison.
- All pipeline and model parameters are managed via a single `config.py` dataclass for easy customization.
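A centralized dataclass config might look like the sketch below. The field names and defaults here are illustrative assumptions; the project's actual `SystemConfig` in `config.py` defines the real parameters:

```python
from dataclasses import dataclass, field

@dataclass
class SystemConfig:
    # Illustrative fields only -- the real config.py defines the actual parameters.
    data_path: str = "data/solar.csv"
    model_type: str = "random_forest"
    feature_columns: list = field(default_factory=lambda: ["irradiance", "temperature"])
    target_column: str = "power_output"
    model_output_path: str = "artifacts/model.joblib"

# Override any parameter at construction time:
config = SystemConfig(model_type="xgboost")
print(config.model_type)  # → xgboost
```

Because every step reads from one object, changing a data path or model type in one place propagates through the whole pipeline.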
- Modular codebase with clear separation of concerns.
- Logging in every process (steps, pipelines, app).
- Ready for cloud or on-prem deployment.
- ZenML: Orchestrates modular, reproducible ML pipelines.
- Streamlit: Interactive web UI for EDA and prediction.
- MLflow: Experiment tracking, model registry, and artifact logging.
- scikit-learn, XGBoost, CatBoost, LightGBM: Model training and evaluation.
- pandas, numpy, matplotlib, seaborn, plotly: Data manipulation and visualization.
- joblib: Model serialization.
- colorama: Terminal color support.
- Other: Jupyter, Pillow, python-dateutil, threadpoolctl, etc.
See `requirements.txt` for the full list.
```
.
├── App/                    # Streamlit web app (EDA, prediction, UI)
│   ├── app.py
│   ├── eda.py
│   └── predict.py
├── Steps/                  # ZenML pipeline steps (modular ML logic)
│   ├── DataIngestionStep.py
│   ├── FeatureEngineeringStep.py
│   ├── HandleMissingValueStep.py
│   ├── OutlierDetectionStep.py
│   ├── ModelBuildingStep.py
│   ├── ModelEvaluationStep.py
│   ├── ModelSaverStep.py
│   ├── ModelLoaderStep.py
│   ├── PredictionStep.py
│   ├── PredictionsSaverStep.py
│   ├── DynamicModelLoaderStep.py
│   └── SplitFeaturesTargetStep.py
├── Pipelines/              # ZenML pipeline orchestrations
│   ├── TrainingPipeline.py
│   ├── InferencePipeline.py
│   └── DeploymentPipeline.py
├── Src/                    # Core ML/data logic (feature engineering, ingestion, etc.)
├── Analysis/AnalyzeSrc/    # Advanced EDA utilities
├── config.py               # Centralized configuration (SystemConfig)
├── run_training.py         # Script to run the training pipeline
├── run_inference.py        # Script to run the inference pipeline
├── run_deployment.py       # Script to run the deployment pipeline
├── requirements.txt
└── README.md
```
```bash
pip install -r requirements.txt
```

```bash
python run_training.py
```

- Uses parameters from `config.py`.
- Trains and saves a model to `artifacts/model.joblib`.
```bash
python run_inference.py --data_path path/to/input.csv --feature_columns col1,col2,...
```

- Produces predictions in `artifacts/predictions.csv`.
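The prediction-saving step boils down to writing the input rows plus a prediction column to CSV. A stdlib-only sketch, with an illustrative function name and dummy values standing in for real model output:

```python
import csv

def save_predictions(rows, predictions, path):
    """Write input rows plus a 'prediction' column to a CSV file."""
    fieldnames = list(rows[0].keys()) + ["prediction"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row, pred in zip(rows, predictions):
            writer.writerow({**row, "prediction": pred})

rows = [{"irradiance": 800, "temperature": 25},
        {"irradiance": 650, "temperature": 30}]
preds = [412.5, 338.1]  # stand-in for model.predict(features)
save_predictions(rows, preds, "predictions.csv")
```

The project's own step works on pandas DataFrames and also logs the file as an MLflow artifact, but the output format is the same idea.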
```bash
python run_deployment.py
```

- Loads a trained model and generates predictions on test data.
```bash
cd App
streamlit run app.py
```

- Explore EDA and make predictions via the web UI.
- EDA Tab: Upload data, visualize missing values, distributions, relationships, and correlations.
- Prediction Tab: Upload new data, run the full inference pipeline, and download results.
- Logging: All actions and errors are logged to `App/app.log` for easy debugging.
- Pipeline Parameters: Edit `config.py` to change data paths, model types, feature columns, and more.
- Add Steps: Extend the `Steps/` directory with new ZenML steps for custom logic.
- EDA: Add or modify EDA modules in `Analysis/AnalyzeSrc/`.
- All model metrics, artifacts, and predictions are logged with MLflow.
- To view the MLflow UI:

```bash
mlflow ui
```

Then open http://localhost:5000 in your browser.
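The regression metrics that the evaluation step logs can be computed as in this stdlib-only sketch (the function name and sample values are illustrative, not the project's API):

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R^2 for a batch of predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot
    return {"mae": mae, "rmse": math.sqrt(mse), "r2": r2}

m = regression_metrics([3.0, 5.0, 7.0], [2.5, 5.0, 7.5])
print(m["r2"])  # → 0.9375
```

In the pipeline these values would then be recorded with calls like `mlflow.log_metric("rmse", ...)` so runs can be compared in the MLflow UI.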
ZenML provides a built-in dashboard to visualize, monitor, and manage your pipelines, steps, and artifacts.
To launch the ZenML dashboard, simply run:
```bash
zenml up
```

This will start the ZenML dashboard locally. Open your browser and go to http://localhost:8237 to:
- View all pipeline runs and their statuses
- Inspect step outputs, artifacts, and logs
- Monitor experiment lineage and metadata
- Manage stacks, orchestrators, and more
The dashboard is a powerful tool for tracking your ML workflow and debugging pipeline executions.
- App logs: `App/app.log`
- Pipeline logs: `Pipelines/pipeline.log`
- Step logs: `Steps/step.log`
- All major actions, transitions, and errors are traceable for robust monitoring and debugging.
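A per-module file logger of the kind described above can be set up with the standard `logging` module. This is a hedged sketch (the helper name and format string are assumptions, not the project's exact code):

```python
import logging

def get_logger(name, log_file):
    """Configure a logger that writes timestamped records to a file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated import
        handler = logging.FileHandler(log_file)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    return logger

logger = get_logger("app", "app.log")
logger.info("EDA tab opened")
```

Giving each layer (app, pipelines, steps) its own named logger and file is what keeps the three log files listed above cleanly separated.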
- Strategy Pattern: Used extensively for modularizing logic in data ingestion, missing value handling, feature engineering, outlier detection, model building, and EDA. This allows easy swapping and extension of algorithms and behaviors at runtime.
- Factory Pattern: Used for creating data ingestors based on file type, enabling scalable and maintainable data ingestion logic.
- Modularization & Separation of Concerns: The codebase is organized into clear modules (Steps, Pipelines, Src, Analysis) to ensure each component has a single responsibility and can be developed, tested, and maintained independently.
- Production-Grade Logging: All major actions, errors, and pipeline events are logged for traceability and debugging.
- Centralized Configuration: All parameters and settings are managed via a single config file for easy customization and reproducibility.
- Experiment Tracking: All model metrics, artifacts, and predictions are logged with MLflow for reproducibility and comparison.
- Extensibility: The use of abstract base classes and modular steps makes it easy to add new features or algorithms.
- Reproducibility: Pipelines, experiment tracking, and configuration management ensure results can be reliably reproduced.
- Clear Documentation: The project is well-documented with docstrings, comments, and this README.
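The Factory Pattern mentioned above can be illustrated with a minimal sketch: a factory picks a `DataIngestor` implementation from the file extension. Class and function names here are illustrative, and the `ingest` bodies are placeholders for the real pandas/zipfile logic:

```python
from abc import ABC, abstractmethod

class DataIngestor(ABC):
    @abstractmethod
    def ingest(self, path: str):
        """Load data from `path` and return it."""

class CSVIngestor(DataIngestor):
    def ingest(self, path: str):
        return f"reading CSV from {path}"  # placeholder for pd.read_csv(path)

class ZipIngestor(DataIngestor):
    def ingest(self, path: str):
        return f"extracting and reading {path}"  # placeholder for zipfile handling

def ingestor_factory(path: str) -> DataIngestor:
    """Pick an ingestor based on the file extension."""
    if path.endswith(".csv"):
        return CSVIngestor()
    if path.endswith(".zip"):
        return ZipIngestor()
    raise ValueError(f"Unsupported file type: {path}")

print(ingestor_factory("data/solar.zip").ingest("data/solar.zip"))
```

Supporting a new format (say, Parquet) then means adding one subclass and one branch in the factory, with no changes to the calling pipeline code.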
Below are key images illustrating the system's architecture, pipelines, and results:
Contributions are welcome! Please open issues or pull requests for improvements, bug fixes, or new features.
THAMIZHARASU SARAVANAN
GitHub Profile
This project is licensed under the MIT License.
Enjoy robust, modular, and production-ready solar data analysis and ML!







