Mathematics Misconception Analysis Pipeline is an end-to-end data engineering and analytics solution designed to process, store, and analyze educational datasets. The project focuses on identifying students' mathematical misconceptions to improve personalized learning systems.
This repository implements a robust ETL (Extract, Transform, Load) workflow that ingests raw data, manages it via a relational database (MySQL), and performs advanced feature engineering (including TF-IDF vectorization) to prepare datasets for Machine Learning (ML) and Large Language Model (LLM) training.
Designed with scalability and reproducibility in mind, the architecture leverages Docker for containerization, Kubernetes manifests for orchestration, and GitHub Actions for CI/CD automation.
- Automated ETL Pipeline: Seamlessly imports CSV data into a structured MySQL database using SQLAlchemy.
- Advanced Feature Engineering:
  - Text Analysis: TF-IDF vectorization for processing textual misconception data.
  - Categorical Encoding: Label encoding for constructs and subjects.
  - Data Cleaning: Automated handling of missing values and normalization.
- Containerized Environment: Fully Dockerized application ensuring consistency across development, testing, and production environments.
- Cloud-Ready Deployment: Includes Kubernetes configurations (deployment.yaml, service.yaml) for scalable orchestration.
- CI/CD Integration: Automated workflows via GitHub Actions for code quality checks and build verification.
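The feature engineering steps listed above can be illustrated with a minimal sketch. It is not the repository's actual feature_engineering.py: the sample rows and the column names (misconception_text, subject) are assumptions made for illustration, and only standard pandas and scikit-learn calls are used.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample rows; the column names are assumptions, not the real schema.
df = pd.DataFrame({
    "misconception_text": [
        "Adds numerators and denominators when adding fractions",
        None,
        "Believes multiplication always makes a number bigger",
    ],
    "subject": ["Fractions", "Fractions", "Multiplication"],
})

# Data cleaning: replace missing text with an empty string, then normalize case and spacing.
df["misconception_text"] = (
    df["misconception_text"].fillna("").str.lower().str.strip()
)

# Text analysis: TF-IDF turns each text into a sparse vector of weighted terms.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(df["misconception_text"])

# Categorical encoding: map each subject label to an integer id.
encoder = LabelEncoder()
df["subject_id"] = encoder.fit_transform(df["subject"])

print(tfidf_matrix.shape)          # (rows, vocabulary size)
print(df["subject_id"].tolist())   # integer id per row
```

In this sketch the row with missing text becomes an all-zero TF-IDF vector rather than being dropped; whether the real pipeline fills or removes such rows is not stated here.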
- Core: Python 3.9+, Pandas, NumPy
- Machine Learning: Scikit-learn (TF-IDF, Preprocessing)
- Database: MySQL, SQLite, SQLAlchemy (ORM)
- DevOps & Infrastructure: Docker, Kubernetes (K8s), GitHub Actions
- Version Control: Git
├── .github/workflows/ # CI/CD Pipeline configurations
├── scripts/ # Core logic and ETL scripts
│ ├── database_operations/ # DB connection & raw SQL queries
│ ├── feature_engineering.py # ML preprocessing logic
│ ├── load_data.py # Data ingestion modules
│ ├── preprocess.py # Cleaning & normalization
│ └── ...
├── Dockerfile # Container configuration
├── deployment.yaml # Kubernetes Deployment manifest
├── service.yaml # Kubernetes Service manifest
├── pipeline.py # Main entry point for the pipeline
├── requirements.txt # Python dependencies
└── README.md # Project documentation
- Docker & Docker Compose (Recommended)
- Python 3.9+ (For manual execution)
- MySQL Server (If running locally without Docker)
Build and run the containerized pipeline to ensure all dependencies and database connections are isolated.
# Build the Docker image
docker build -t math-misconception-pipeline .
# Run the container
docker run --env-file .env math-misconception-pipeline
- Clone the repository:
git clone https://github.com/mostafa-kermaninia/Misconceptions-in-mathematics.git
cd Misconceptions-in-mathematics
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows use: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Configuration:
Create a .env file in the root directory to store your database credentials:
DB_HOST=localhost
DB_USER=root
DB_PASSWORD=yourpassword
DB_NAME=DataScience_DB
- Run the pipeline:
python pipeline.py
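As a rough illustration of how the DB_* settings from the configuration step might be turned into a database URL for SQLAlchemy, consider the sketch below. The helper name build_db_url and the mysql+pymysql driver prefix are assumptions, not taken from the repository. Python does not read a .env file on its own: `docker run --env-file` injects the values as environment variables, while a manual run would need them exported or loaded by a helper such as python-dotenv.

```python
import os
from urllib.parse import quote_plus


def build_db_url(env=os.environ):
    """Build a SQLAlchemy-style MySQL URL from DB_* variables (hypothetical helper)."""
    host = env.get("DB_HOST", "localhost")
    user = env["DB_USER"]
    # Percent-encode the password so characters such as '@' or '/' stay valid in a URL.
    password = quote_plus(env["DB_PASSWORD"])
    name = env["DB_NAME"]
    # "mysql+pymysql" is an assumed dialect/driver pair; the project may use another driver.
    return f"mysql+pymysql://{user}:{password}@{host}/{name}"


# Example values mirroring the .env keys above, passed as a plain dict for illustration.
example_env = {
    "DB_HOST": "localhost",
    "DB_USER": "root",
    "DB_PASSWORD": "p@ss/word",
    "DB_NAME": "DataScience_DB",
}
url = build_db_url(example_env)
print(url)
```

The resulting string could then be passed to `sqlalchemy.create_engine`, provided the chosen driver is installed.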
The pipeline follows a modular architecture:
- Ingestion: Raw CSV files are read and validated.
- Storage: Data is normalized and stored in the relational database.
- Processing:
  - load_data.py: Retrieves fresh data from the DB.
  - preprocess.py: Cleans text and handles null values.
  - feature_engineering.py: Applies TF-IDF to the 'Misconception Analysis' text fields.
- Output: Processed DataFrames ready for model training.
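The flow through the stages above can be sketched with the standard library alone. This is a simplified illustration, not the project's code: an in-memory SQLite database stands in for MySQL, and the table name, column names, sample CSV, and function names are all hypothetical. Dropping rows with empty text is one possible way to handle nulls, chosen here only for brevity.

```python
import csv
import io
import sqlite3

# Hypothetical raw CSV; the real dataset's columns may differ.
RAW_CSV = """question_id,subject,misconception_text
1,Fractions,  Adds numerators  AND denominators
2,Algebra,
3,Fractions,Thinks multiplication always makes numbers bigger
"""


def ingest(csv_text):
    """Ingestion: parse the CSV and validate that required columns exist."""
    reader = csv.DictReader(io.StringIO(csv_text))
    required = {"question_id", "subject", "misconception_text"}
    missing = required - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)


def store(conn, rows):
    """Storage: write the parsed rows into a relational table."""
    conn.execute(
        "CREATE TABLE misconceptions ("
        "question_id INTEGER PRIMARY KEY, subject TEXT, misconception_text TEXT)"
    )
    conn.executemany(
        "INSERT INTO misconceptions VALUES (?, ?, ?)",
        [(int(r["question_id"]), r["subject"], r["misconception_text"]) for r in rows],
    )
    conn.commit()


def load(conn):
    """Processing, step 1: read the stored rows back from the database."""
    query = (
        "SELECT question_id, subject, misconception_text "
        "FROM misconceptions ORDER BY question_id"
    )
    return conn.execute(query).fetchall()


def preprocess(records):
    """Processing, step 2: lowercase, collapse whitespace, drop empty texts."""
    cleaned = []
    for question_id, subject, text in records:
        text = " ".join((text or "").lower().split())
        if text:
            cleaned.append((question_id, subject, text))
    return cleaned


conn = sqlite3.connect(":memory:")
store(conn, ingest(RAW_CSV))
processed = preprocess(load(conn))
conn.close()
print(processed)
```

Keeping each stage as a separate function mirrors the modular layout described above and lets a stage be tested or swapped independently, for example replacing the SQLite connection with a MySQL one.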
Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a feature branch (git checkout -b feature/AmazingFeature).
- Commit your changes (git commit -m 'Add some AmazingFeature').
- Push to the branch (git push origin feature/AmazingFeature).
- Open a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
Author: Mostafa Kermaninia