Skip to content

Arshp-svg/Text-Summarization

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

8 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

๐Ÿ“ Text Summarization - End-to-End NLP Project

Python 3.8+ FastAPI Streamlit Transformers

An end-to-end deep learning project for automatic text summarization using transformer models (T5). The project features a complete MLOps pipeline with data ingestion, validation, transformation, model training, evaluation, and deployment using FastAPI and Streamlit.


๐ŸŒŸ Features

  • Transformer-Based Summarization: Uses T5 model for state-of-the-art text summarization
  • Complete ML Pipeline: Modular pipeline architecture covering all stages of ML workflow
  • RESTful API: FastAPI backend for model serving
  • Interactive UI: Streamlit-based frontend for easy interaction
  • ROUGE Metrics: Comprehensive evaluation using ROUGE scores
  • Configurable: YAML-based configuration for easy experimentation
  • Scalable Architecture: Clean code structure following software engineering best practices

๐Ÿ—๏ธ Project Architecture

โ”œโ”€โ”€ src/textSummaizer/
โ”‚   โ”œโ”€โ”€ components/          # Core ML components
โ”‚   โ”‚   โ”œโ”€โ”€ data_ingestion.py
โ”‚   โ”‚   โ”œโ”€โ”€ data_validation.py
โ”‚   โ”‚   โ”œโ”€โ”€ data_transformation.py
โ”‚   โ”‚   โ”œโ”€โ”€ model_trainer. py
โ”‚   โ”‚   โ””โ”€โ”€ model_eval.py
โ”‚   โ”œโ”€โ”€ pipeline/            # Training and prediction pipelines
โ”‚   โ”‚   โ”œโ”€โ”€ stage01_data_ingestion.py
โ”‚   โ”‚   โ”œโ”€โ”€ stage02_data_validation.py
โ”‚   โ”‚   โ”œโ”€โ”€ stage03_data_transformation.py
โ”‚   โ”‚   โ”œโ”€โ”€ stage04_model_trainer.py
โ”‚   โ”‚   โ”œโ”€โ”€ stage05_mode_eval.py
โ”‚   โ”‚   โ””โ”€โ”€ prediction. py
โ”‚   โ”œโ”€โ”€ config/              # Configuration management
โ”‚   โ”œโ”€โ”€ entity/              # Data classes for configs
โ”‚   โ”œโ”€โ”€ constants/           # Project constants
โ”‚   โ”œโ”€โ”€ utils/               # Utility functions
โ”‚   โ””โ”€โ”€ logging/             # Custom logging setup
โ”œโ”€โ”€ config/
โ”‚   โ””โ”€โ”€ config.yaml          # Main configuration file
โ”œโ”€โ”€ params.yaml              # Model hyperparameters
โ”œโ”€โ”€ app.py                   # FastAPI application
โ”œโ”€โ”€ ui. py                    # Streamlit UI
โ”œโ”€โ”€ main.py                  # Training pipeline execution
โ”œโ”€โ”€ requirements.txt         # Python dependencies
โ””โ”€โ”€ research/                # Jupyter notebooks for experimentation

๐Ÿš€ Quick Start

Prerequisites

  • Python 3.8 or higher
  • pip package manager
  • (Optional) CUDA-compatible GPU for faster training

Installation

  1. Clone the repository

    git clone https://github.com/Arshp-svg/Text-Summarization.git
    cd Text-Summarization
  2. Create a virtual environment

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt

๐Ÿ’ป Usage

1. Training the Model

Execute the complete training pipeline:

python main.py

This will run all stages:

  • Stage 1: Data Ingestion (downloads SAMSum dataset)
  • Stage 2: Data Validation (validates dataset schema)
  • Stage 3: Data Transformation (tokenizes and prepares data)
  • Stage 4: Model Training (fine-tunes T5 model)
  • Stage 5: Model Evaluation (computes ROUGE metrics)

2. Running the API Server

Start the FastAPI backend:

python app.py

The API will be available at http://localhost:8080

API Endpoints:

  • GET /: Redirects to API documentation
  • GET /train: Triggers model training
  • GET /predict? text=<your_text>: Returns summary for input text

View interactive API docs at: http://localhost:8080/docs

3. Running the Streamlit UI

In a new terminal window:

streamlit run ui.py

Access the UI at http://localhost:8501


๐Ÿ“Š Model Details

  • Base Model: T5 (Text-to-Text Transfer Transformer)
  • Task: Abstractive Text Summarization
  • Dataset: SAMSum (Dialogue Summarization)
  • Tokenizer: T5 Tokenizer
  • Max Input Length: 1024 tokens
  • Max Output Length: 256 tokens
  • Evaluation Metrics: ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum

Training Parameters

Configure hyperparameters in params.yaml:

  • Learning rate
  • Batch size
  • Number of epochs
  • Weight decay
  • Gradient accumulation steps
  • Beam search parameters

๐Ÿ”ง Configuration

config/config.yaml

Defines paths and configurations for each pipeline stage:

  • Data ingestion settings
  • Model paths
  • Artifact directories
  • Tokenizer configurations

params.yaml

Contains model hyperparameters and training settings.


๐Ÿ“ Dataset

The project uses the SAMSum Corpus - a dataset containing messenger-like conversations with abstractive summaries.

  • Training samples: ~14,700
  • Validation samples: ~800
  • Test samples: ~800

๐ŸŽฏ Development Workflow

When making changes to the project, follow this workflow:

  1. Update config/config.yaml (if adding new paths/configs)
  2. Update params.yaml (if changing hyperparameters)
  3. Update entity classes in src/textSummaizer/entity/
  4. Update configuration manager in src/textSummaizer/config/configuration.py
  5. Update components in src/textSummaizer/components/
  6. Update pipeline stages in src/textSummaizer/pipeline/
  7. Update main.py (if adding new stages)
  8. Update app.py (if adding new API endpoints)

๐Ÿ“ˆ Model Evaluation

The model is evaluated using ROUGE metrics:

  • ROUGE-1: Unigram overlap
  • ROUGE-2: Bigram overlap
  • ROUGE-L: Longest common subsequence
  • ROUGE-Lsum: Summary-level ROUGE-L

Results are saved to artifacts/model_evaluation/metrics. csv


๐Ÿณ Docker Support

Build and run using Docker:

docker build -t text-summarizer .
docker run -p 8080:8080 text-summarizer

๐Ÿ› ๏ธ Technologies Used

  • Deep Learning: PyTorch, Transformers (Hugging Face)
  • Web Framework: FastAPI
  • Frontend: Streamlit
  • Data Processing: Pandas, Datasets
  • Evaluation: ROUGE Score, SacreBLEU
  • Configuration: PyYAML, python-box
  • Logging: Custom logging module

๐Ÿ“ Project Structure Highlights

Modular Components

  • Data Ingestion: Downloads and extracts SAMSum dataset
  • Data Validation: Validates required files and schema
  • Data Transformation: Tokenizes text and prepares training data
  • Model Trainer: Fine-tunes T5 model with specified parameters
  • Model Evaluation: Computes ROUGE metrics on test set

Pipeline Architecture

Each stage is encapsulated in a pipeline class for easy execution and maintenance.


๐Ÿค Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

๐Ÿ“ง Contact

Author: Arshp-svg
Email: arshpatel213@gmail.com
GitHub: @Arshp-svg


๐Ÿ“„ License

This project is open-source and available under the MIT License.


๐Ÿ™ Acknowledgments

  • Hugging Face for the Transformers library
  • SAMSum dataset creators
  • FastAPI and Streamlit communities

๐Ÿ”ฎ Future Enhancements

  • Add support for multiple summarization models
  • Implement document summarization (longer texts)
  • Add multilingual support
  • Deploy to cloud platforms (AWS/GCP/Azure)
  • Add user authentication
  • Implement model versioning and A/B testing
  • Add support for custom datasets

**โญ If you find this project helpful, please give it a star! **

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors