An end-to-end deep learning project for automatic text summarization using transformer models (T5). The project features a complete MLOps pipeline with data ingestion, validation, transformation, model training, evaluation, and deployment using FastAPI and Streamlit.
- Transformer-Based Summarization: Uses T5 model for state-of-the-art text summarization
- Complete ML Pipeline: Modular pipeline architecture covering all stages of ML workflow
- RESTful API: FastAPI backend for model serving
- Interactive UI: Streamlit-based frontend for easy interaction
- ROUGE Metrics: Comprehensive evaluation using ROUGE scores
- Configurable: YAML-based configuration for easy experimentation
- Scalable Architecture: Clean code structure following software engineering best practices
```
├── src/textSummaizer/
│   ├── components/              # Core ML components
│   │   ├── data_ingestion.py
│   │   ├── data_validation.py
│   │   ├── data_transformation.py
│   │   ├── model_trainer.py
│   │   └── model_eval.py
│   ├── pipeline/                # Training and prediction pipelines
│   │   ├── stage01_data_ingestion.py
│   │   ├── stage02_data_validation.py
│   │   ├── stage03_data_transformation.py
│   │   ├── stage04_model_trainer.py
│   │   ├── stage05_mode_eval.py
│   │   └── prediction.py
│   ├── config/                  # Configuration management
│   ├── entity/                  # Data classes for configs
│   ├── constants/               # Project constants
│   ├── utils/                   # Utility functions
│   └── logging/                 # Custom logging setup
├── config/
│   └── config.yaml              # Main configuration file
├── params.yaml                  # Model hyperparameters
├── app.py                       # FastAPI application
├── ui.py                        # Streamlit UI
├── main.py                      # Training pipeline execution
├── requirements.txt             # Python dependencies
└── research/                    # Jupyter notebooks for experimentation
```
- Python 3.8 or higher
- pip package manager
- (Optional) CUDA-compatible GPU for faster training
1. Clone the repository

   ```bash
   git clone https://github.com/Arshp-svg/Text-Summarization.git
   cd Text-Summarization
   ```

2. Create a virtual environment

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```
Execute the complete training pipeline:
```bash
python main.py
```

This will run all stages:
- Stage 1: Data Ingestion (downloads SAMSum dataset)
- Stage 2: Data Validation (validates dataset schema)
- Stage 3: Data Transformation (tokenizes and prepares data)
- Stage 4: Model Training (fine-tunes T5 model)
- Stage 5: Model Evaluation (computes ROUGE metrics)
Start the FastAPI backend:
```bash
python app.py
```

The API will be available at http://localhost:8080
API Endpoints:
- `GET /`: Redirects to the API documentation
- `GET /train`: Triggers model training
- `GET /predict?text=<your_text>`: Returns a summary of the input text
View interactive API docs at: http://localhost:8080/docs
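Because `/predict` takes the input text as a query parameter, it must be URL-encoded before being sent. A small sketch using only the standard library (the helper name is illustrative, not part of the project):

```python
from urllib.parse import urlencode

# Hypothetical helper: builds a /predict request URL, URL-encoding the text
# so spaces and punctuation survive the query string.
def build_predict_url(text: str, host: str = "http://localhost:8080") -> str:
    return f"{host}/predict?{urlencode({'text': text})}"

url = build_predict_url("Alice: Are we still on for lunch? Bob: Yes, at noon.")
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client while `app.py` is running.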
In a new terminal window:
```bash
streamlit run ui.py
```

Access the UI at http://localhost:8501
- Base Model: T5 (Text-to-Text Transfer Transformer)
- Task: Abstractive Text Summarization
- Dataset: SAMSum (Dialogue Summarization)
- Tokenizer: T5 Tokenizer
- Max Input Length: 1024 tokens
- Max Output Length: 256 tokens
- Evaluation Metrics: ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum
Configure hyperparameters in params.yaml:
- Learning rate
- Batch size
- Number of epochs
- Weight decay
- Gradient accumulation steps
- Beam search parameters
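A `params.yaml` covering these knobs might look roughly like this (the key names below are assumptions for illustration; check the repository's actual file for the exact schema):

```yaml
TrainingArguments:
  num_train_epochs: 1
  per_device_train_batch_size: 2
  learning_rate: 2e-5
  weight_decay: 0.01
  gradient_accumulation_steps: 16
  num_beams: 4
```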
Defines paths and configurations for each pipeline stage:
- Data ingestion settings
- Model paths
- Artifact directories
- Tokenizer configurations
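As a rough sketch, a stage entry in `config.yaml` typically maps names to paths like the following (keys and values here are illustrative assumptions, and `<dataset-url>` is a placeholder, not the real source URL):

```yaml
artifacts_root: artifacts

data_ingestion:
  root_dir: artifacts/data_ingestion
  source_URL: <dataset-url>
  local_data_file: artifacts/data_ingestion/data.zip
  unzip_dir: artifacts/data_ingestion
```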
Contains model hyperparameters and training settings.
The project uses the SAMSum Corpus - a dataset containing messenger-like conversations with abstractive summaries.
- Training samples: ~14,700
- Validation samples: ~800
- Test samples: ~800
When making changes to the project, follow this workflow:

1. Update `config/config.yaml` (if adding new paths/configs)
2. Update `params.yaml` (if changing hyperparameters)
3. Update entity classes in `src/textSummaizer/entity/`
4. Update the configuration manager in `src/textSummaizer/config/configuration.py`
5. Update components in `src/textSummaizer/components/`
6. Update pipeline stages in `src/textSummaizer/pipeline/`
7. Update `main.py` (if adding new stages)
8. Update `app.py` (if adding new API endpoints)
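The entity classes mentioned in the workflow are typically frozen dataclasses that carry one stage's validated settings from the configuration manager into its component. A minimal sketch (class and field names are assumptions, not the project's actual definitions):

```python
from dataclasses import dataclass
from pathlib import Path

# Illustrative entity class: an immutable record of one stage's settings,
# populated by the configuration manager from config.yaml.
@dataclass(frozen=True)
class DataIngestionConfig:
    root_dir: Path
    source_url: str
    local_data_file: Path

cfg = DataIngestionConfig(
    root_dir=Path("artifacts/data_ingestion"),
    source_url="https://example.com/samsum.zip",  # placeholder URL
    local_data_file=Path("artifacts/data_ingestion/data.zip"),
)
print(cfg.root_dir)
```

Freezing the dataclass means a stage cannot accidentally mutate shared configuration at runtime.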
The model is evaluated using ROUGE metrics:
- ROUGE-1: Unigram overlap
- ROUGE-2: Bigram overlap
- ROUGE-L: Longest common subsequence
- ROUGE-Lsum: Summary-level ROUGE-L
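For intuition, ROUGE-1 F1 boils down to unigram overlap between the generated and reference summaries. A minimal pure-Python sketch (the project itself uses a full ROUGE implementation, which also handles stemming and bootstrapping):

```python
from collections import Counter

# Toy ROUGE-1 F1: harmonic mean of unigram precision and recall.
def rouge1_f1(prediction: str, reference: str) -> float:
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("bob meets alice at noon", "alice meets bob at noon for lunch")
print(round(score, 4))  # → 0.8333
```

ROUGE-2 follows the same pattern over bigrams, and ROUGE-L replaces counting with a longest-common-subsequence match.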
Results are saved to `artifacts/model_evaluation/metrics.csv`
Build and run using Docker:
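The repository's actual Dockerfile is not reproduced here, but a minimal sketch that would support the commands below might look like this (base image and entrypoint are assumptions):

```dockerfile
FROM python:3.8-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8080
CMD ["python", "app.py"]
```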
```bash
docker build -t text-summarizer .
docker run -p 8080:8080 text-summarizer
```

- Deep Learning: PyTorch, Transformers (Hugging Face)
- Web Framework: FastAPI
- Frontend: Streamlit
- Data Processing: Pandas, Datasets
- Evaluation: ROUGE Score, SacreBLEU
- Configuration: PyYAML, python-box
- Logging: Custom logging module
- Data Ingestion: Downloads and extracts SAMSum dataset
- Data Validation: Validates required files and schema
- Data Transformation: Tokenizes text and prepares training data
- Model Trainer: Fine-tunes T5 model with specified parameters
- Model Evaluation: Computes ROUGE metrics on test set
Each stage is encapsulated in a pipeline class for easy execution and maintenance.
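The stage-per-class pattern can be sketched as follows (class and method names are illustrative, not the project's actual API; the real stages wire in the configuration manager and their component):

```python
from typing import List

# Illustrative base class: every stage exposes a main() entry point.
class PipelineStage:
    name = "base"

    def main(self) -> str:
        raise NotImplementedError

class DataIngestionStage(PipelineStage):
    name = "Data Ingestion"

    def main(self) -> str:
        # Real stage: download and extract the dataset.
        return f"{self.name} complete"

class ModelTrainerStage(PipelineStage):
    name = "Model Training"

    def main(self) -> str:
        # Real stage: fine-tune the model with params.yaml settings.
        return f"{self.name} complete"

def run_pipeline(stages: List[PipelineStage]) -> List[str]:
    # main.py-style orchestration: run each stage in order, logging progress.
    results = []
    for stage in stages:
        print(f">>> stage {stage.name} started")
        results.append(stage.main())
        print(f">>> stage {stage.name} finished")
    return results

results = run_pipeline([DataIngestionStage(), ModelTrainerStage()])
```

Adding a new stage then only requires a new class plus one line in the orchestrator, which is what keeps `main.py` short.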
Contributions are welcome! Please follow these steps:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
Author: Arshp-svg
Email: arshpatel213@gmail.com
GitHub: @Arshp-svg
This project is open-source and available under the MIT License.
- Hugging Face for the Transformers library
- SAMSum dataset creators
- FastAPI and Streamlit communities
- Add support for multiple summarization models
- Implement document summarization (longer texts)
- Add multilingual support
- Deploy to cloud platforms (AWS/GCP/Azure)
- Add user authentication
- Implement model versioning and A/B testing
- Add support for custom datasets
**⭐ If you find this project helpful, please give it a star!**