Mathematics Misconception Analysis Pipeline

📋 Overview

Mathematics Misconception Analysis Pipeline is an end-to-end data engineering and analytics solution designed to process, store, and analyze educational datasets. The project focuses on identifying students' mathematical misconceptions to improve personalized learning systems.

This repository implements a robust ETL (Extract, Transform, Load) workflow that ingests raw data, manages it via a relational database (MySQL), and performs advanced feature engineering (including TF-IDF vectorization) to prepare datasets for Machine Learning (ML) and Large Language Model (LLM) training.

Designed with scalability and reproducibility in mind, the architecture leverages Docker for containerization, Kubernetes manifests for orchestration, and GitHub Actions for CI/CD automation.

🚀 Key Features

Automated ETL Pipeline: Imports CSV data into a structured MySQL database using SQLAlchemy.
Advanced Feature Engineering:
- Text Analysis: TF-IDF vectorization for processing textual misconception data.
- Categorical Encoding: Label encoding for constructs and subjects.
- Data Cleaning: Automated handling of missing values and normalization.
Containerized Environment: Fully Dockerized application ensuring consistency across development, testing, and production environments.
Cloud-Ready Deployment: Includes Kubernetes (deployment.yaml, service.yaml) configurations for scalable orchestration.
CI/CD Integration: Automated workflows via GitHub Actions for code quality checks and build verification.
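The feature engineering steps listed above (missing-value handling, TF-IDF vectorization, and label encoding) can be sketched roughly as follows. This is a minimal illustration, not the code in feature_engineering.py: the sample rows, the field names `subject` and `misconception`, and the helper `build_features` are all hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample rows; the real dataset's column names may differ.
rows = [
    {"subject": "Algebra", "misconception": "Adds denominators when adding fractions"},
    {"subject": "Number", "misconception": None},
    {"subject": "Algebra", "misconception": "Multiplies exponents when adding powers"},
]


def build_features(rows):
    """Return a TF-IDF matrix for the text field and integer ids for the subject."""
    # Data cleaning: replace missing text with an empty string and normalize case.
    texts = [(r["misconception"] or "").strip().lower() for r in rows]
    subjects = [r["subject"] for r in rows]

    # Text analysis: one sparse row per document, one column per vocabulary term.
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(texts)

    # Categorical encoding: map each distinct subject to an integer id.
    encoder = LabelEncoder()
    subject_ids = encoder.fit_transform(subjects)
    return tfidf, subject_ids, vectorizer, encoder


tfidf, subject_ids, vectorizer, encoder = build_features(rows)
print(tfidf.shape)             # (number of rows, vocabulary size)
print(list(encoder.classes_))  # subjects in sorted order
```

Keeping the fitted `vectorizer` and `encoder` objects around matters in practice: the same vocabulary and label mapping need to be reused when transforming new data, otherwise the feature columns will not line up with the training set.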

🛠 Tech Stack

Core: Python 3.9+, Pandas, NumPy
Machine Learning: Scikit-learn (TF-IDF, Preprocessing)
Database: MySQL, SQLite, SQLAlchemy (ORM)
DevOps & Infrastructure: Docker, Kubernetes (K8s), GitHub Actions
Version Control: Git

📂 Project Structure

├── .github/workflows/    # CI/CD Pipeline configurations
├── scripts/              # Core logic and ETL scripts
│   ├── database_operations/  # DB connection & raw SQL queries
│   ├── feature_engineering.py # ML preprocessing logic
│   ├── load_data.py          # Data ingestion modules
│   ├── preprocess.py         # Cleaning & normalization
│   └── ...
├── Dockerfile            # Container configuration
├── deployment.yaml       # Kubernetes Deployment manifest
├── service.yaml          # Kubernetes Service manifest
├── pipeline.py           # Main entry point for the pipeline
├── requirements.txt      # Python dependencies
└── README.md             # Project documentation

⚡ Getting Started

Prerequisites

Docker & Docker Compose (Recommended)
Python 3.9+ (For manual execution)
MySQL Server (If running locally without Docker)

Option 1: Run with Docker (Recommended)

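Build and run the containerized pipeline so that all dependencies are isolated from the host. The run command reads the database credentials from a .env file in the current directory (see Configuration under Option 2).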

# Build the Docker image
docker build -t math-misconception-pipeline .

# Run the container
docker run --env-file .env math-misconception-pipeline

Option 2: Manual Installation

Clone the repository:

git clone https://github.com/mostafa-kermaninia/Misconceptions-in-mathematics.git
cd Misconceptions-in-mathematics

Create a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows use: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Configuration: Create a .env file in the root directory to store your database credentials:

DB_HOST=localhost
DB_USER=root
DB_PASSWORD=yourpassword
DB_NAME=DataScience_DB
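How the scripts consume these values is not shown in this README. One plausible approach, assuming the variables are exported into the process environment (for example by `docker run --env-file`), is to assemble a SQLAlchemy-style connection URL from them. The helper name `build_db_url` and the `mysql+pymysql` driver prefix below are assumptions, not taken from the project.

```python
import os
from urllib.parse import quote_plus


def build_db_url(env=os.environ):
    """Assemble a SQLAlchemy-style MySQL URL from the DB_* variables.

    Hypothetical helper: the driver prefix "mysql+pymysql" is an assumption;
    use whichever MySQL driver the project actually installs.
    """
    user = env["DB_USER"]
    # Escape characters such as '@' or ':' that would otherwise break the URL.
    password = quote_plus(env["DB_PASSWORD"])
    host = env.get("DB_HOST", "localhost")
    name = env["DB_NAME"]
    return f"mysql+pymysql://{user}:{password}@{host}/{name}"


example_env = {
    "DB_HOST": "localhost",
    "DB_USER": "root",
    "DB_PASSWORD": "p@ss",
    "DB_NAME": "DataScience_DB",
}
print(build_db_url(example_env))
```

Passing the environment as a parameter keeps the function easy to test without touching real credentials.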

Run the pipeline:

python pipeline.py

📊 Workflow Architecture

The pipeline follows a modular architecture:

Ingestion: Raw CSV files are read and validated.
Storage: Data is normalized and stored in the relational database.
Processing:
- load_data.py: Retrieves fresh data from the DB.
- preprocess.py: Cleans text and handles null values.
- feature_engineering.py: Applies TF-IDF to the 'Misconception Analysis' text fields.

Output: Processed DataFrames ready for model training.
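The ingestion, storage, loading, and preprocessing stages can be illustrated with a small self-contained sketch. To run without a database server, it swaps in the standard library's `sqlite3` and `csv` modules for the project's MySQL and SQLAlchemy setup; the table name, column names, and function names are hypothetical, and feature engineering is left out.

```python
import csv
import io
import sqlite3

# Hypothetical raw data; the real CSV schema may differ.
RAW_CSV = """question_id,subject,misconception
1,Algebra,Adds denominators when adding fractions
2,Number,
3,Algebra,  MULTIPLIES   exponents
"""


def ingest(csv_text):
    """Ingestion: parse CSV rows and keep only rows that have an id."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        {"question_id": int(r["question_id"]),
         "subject": r["subject"],
         "misconception": r["misconception"]}
        for r in rows if r["question_id"]
    ]


def store(conn, rows):
    """Storage: write the validated rows into a relational table."""
    conn.execute(
        "CREATE TABLE questions ("
        "question_id INTEGER PRIMARY KEY, subject TEXT, misconception TEXT)"
    )
    conn.executemany(
        "INSERT INTO questions VALUES (:question_id, :subject, :misconception)",
        rows,
    )
    conn.commit()


def load(conn):
    """Processing, step 1: read the stored rows back from the database."""
    cur = conn.execute(
        "SELECT question_id, subject, misconception "
        "FROM questions ORDER BY question_id"
    )
    return cur.fetchall()


def preprocess(records):
    """Processing, step 2: lowercase text, collapse whitespace, fill nulls."""
    return [
        (qid, subject, " ".join((text or "").lower().split()))
        for qid, subject, text in records
    ]


conn = sqlite3.connect(":memory:")
store(conn, ingest(RAW_CSV))
processed = preprocess(load(conn))
for record in processed:
    print(record)
```

Each stage takes plain data in and returns plain data out, which mirrors the modular split described above and lets any one stage be tested or replaced on its own.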

🤝 Contributing

Contributions are welcome! Please follow these steps:

Fork the repository.
Create a feature branch (git checkout -b feature/AmazingFeature).
Commit your changes (git commit -m 'Add some AmazingFeature').
Push to the branch (git push origin feature/AmazingFeature).
Open a Pull Request.

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

Author: Mostafa Kermaninia
