An end-to-end deep learning project for automatic text summarization using transformer models (T5). The project features a complete MLOps pipeline with data ingestion, validation, transformation, model training, evaluation, and deployment using FastAPI and Streamlit.
- Transformer-Based Summarization: Uses T5 model for state-of-the-art text summarization
- Complete ML Pipeline: Modular pipeline architecture covering all stages of ML workflow
- RESTful API: FastAPI backend for model serving
- Interactive UI: Streamlit-based frontend for easy interaction
- ROUGE Metrics: Comprehensive evaluation using ROUGE scores
- Configurable: YAML-based configuration for easy experimentation
- Scalable Architecture: Clean code structure following software engineering best practices
```
├── src/textSummaizer/
│   ├── components/              # Core ML components
│   │   ├── data_ingestion.py
│   │   ├── data_validation.py
│   │   ├── data_transformation.py
│   │   ├── model_trainer.py
│   │   └── model_eval.py
│   ├── pipeline/                # Training and prediction pipelines
│   │   ├── stage01_data_ingestion.py
│   │   ├── stage02_data_validation.py
│   │   ├── stage03_data_transformation.py
│   │   ├── stage04_model_trainer.py
│   │   ├── stage05_mode_eval.py
│   │   └── prediction.py
│   ├── config/                  # Configuration management
│   ├── entity/                  # Data classes for configs
│   ├── constants/               # Project constants
│   ├── utils/                   # Utility functions
│   └── logging/                 # Custom logging setup
├── config/
│   └── config.yaml              # Main configuration file
├── params.yaml                  # Model hyperparameters
├── app.py                       # FastAPI application
├── ui.py                        # Streamlit UI
├── main.py                      # Training pipeline execution
├── requirements.txt             # Python dependencies
└── research/                    # Jupyter notebooks for experimentation
```
- Python 3.8 or higher
- pip package manager
- (Optional) CUDA-compatible GPU for faster training
1. Clone the repository

   ```bash
   git clone https://github.com/Arshp-svg/Text-Summarization.git
   cd Text-Summarization
   ```

2. Create a virtual environment

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```
Execute the complete training pipeline:
```bash
python main.py
```

This will run all stages:
- Stage 1: Data Ingestion (downloads SAMSum dataset)
- Stage 2: Data Validation (validates dataset schema)
- Stage 3: Data Transformation (tokenizes and prepares data)
- Stage 4: Model Training (fine-tunes T5 model)
- Stage 5: Model Evaluation (computes ROUGE metrics)
Start the FastAPI backend:
```bash
python app.py
```

The API will be available at http://localhost:8080
API Endpoints:
- `GET /`: Redirects to the API documentation
- `GET /train`: Triggers model training
- `GET /predict?text=<your_text>`: Returns a summary of the input text
View interactive API docs at: http://localhost:8080/docs
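Because `/predict` takes the input text as a query parameter, it must be URL-encoded before being sent. A small sketch using only the standard library (the helper name is illustrative, not part of the project):

```python
from urllib.parse import urlencode

# Hypothetical helper: builds a /predict request URL, URL-encoding the text
# so spaces and punctuation survive the query string.
def build_predict_url(text: str, host: str = "http://localhost:8080") -> str:
    return f"{host}/predict?{urlencode({'text': text})}"

url = build_predict_url("Alice: Are we still on for lunch? Bob: Yes, at noon.")
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client while `app.py` is running.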
In a new terminal window:
```bash
streamlit run ui.py
```

Access the UI at http://localhost:8501
- Base Model: T5 (Text-to-Text Transfer Transformer)
- Task: Abstractive Text Summarization
- Dataset: SAMSum (Dialogue Summarization)
- Tokenizer: T5 Tokenizer
- Max Input Length: 1024 tokens
- Max Output Length: 256 tokens
- Evaluation Metrics: ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum
Configure hyperparameters in params.yaml:
- Learning rate
- Batch size
- Number of epochs
- Weight decay
- Gradient accumulation steps
- Beam search parameters
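A `params.yaml` covering these knobs might look roughly like this (the key names below are assumptions for illustration; check the repository's actual file for the exact schema):

```yaml
TrainingArguments:
  num_train_epochs: 1
  per_device_train_batch_size: 2
  learning_rate: 2e-5
  weight_decay: 0.01
  gradient_accumulation_steps: 16
  num_beams: 4
```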
Defines paths and configurations for each pipeline stage:
- Data ingestion settings
- Model paths
- Artifact directories
- Tokenizer configurations
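As a rough sketch, a stage entry in `config.yaml` typically maps names to paths like the following (keys and values here are illustrative assumptions, and `<dataset-url>` is a placeholder, not the real source URL):

```yaml
artifacts_root: artifacts

data_ingestion:
  root_dir: artifacts/data_ingestion
  source_URL: <dataset-url>
  local_data_file: artifacts/data_ingestion/data.zip
  unzip_dir: artifacts/data_ingestion
```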
Contains model hyperparameters and training settings.
The project uses the SAMSum Corpus - a dataset containing messenger-like conversations with abstractive summaries.
- Training samples: ~14,700
- Validation samples: ~800
- Test samples: ~800
When making changes to the project, follow this workflow:

1. Update `config/config.yaml` (if adding new paths/configs)
2. Update `params.yaml` (if changing hyperparameters)
3. Update entity classes in `src/textSummaizer/entity/`
4. Update the configuration manager in `src/textSummaizer/config/configuration.py`
5. Update components in `src/textSummaizer/components/`
6. Update pipeline stages in `src/textSummaizer/pipeline/`
7. Update `main.py` (if adding new stages)
8. Update `app.py` (if adding new API endpoints)
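The entity classes mentioned in the workflow are typically frozen dataclasses that carry one stage's validated settings from the configuration manager into its component. A minimal sketch (class and field names are assumptions, not the project's actual definitions):

```python
from dataclasses import dataclass
from pathlib import Path

# Illustrative entity class: an immutable record of one stage's settings,
# populated by the configuration manager from config.yaml.
@dataclass(frozen=True)
class DataIngestionConfig:
    root_dir: Path
    source_url: str
    local_data_file: Path

cfg = DataIngestionConfig(
    root_dir=Path("artifacts/data_ingestion"),
    source_url="https://example.com/samsum.zip",  # placeholder URL
    local_data_file=Path("artifacts/data_ingestion/data.zip"),
)
print(cfg.root_dir)
```

Freezing the dataclass means a stage cannot accidentally mutate shared configuration at runtime.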
The model is evaluated using ROUGE metrics:
- ROUGE-1: Unigram overlap
- ROUGE-2: Bigram overlap
- ROUGE-L: Longest common subsequence
- ROUGE-Lsum: Summary-level ROUGE-L
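For intuition, ROUGE-1 F1 boils down to unigram overlap between the generated and reference summaries. A minimal pure-Python sketch (the project itself uses a full ROUGE implementation, which also handles stemming and bootstrapping):

```python
from collections import Counter

# Toy ROUGE-1 F1: harmonic mean of unigram precision and recall.
def rouge1_f1(prediction: str, reference: str) -> float:
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("bob meets alice at noon", "alice meets bob at noon for lunch")
print(round(score, 4))  # → 0.8333
```

ROUGE-2 follows the same pattern over bigrams, and ROUGE-L replaces counting with a longest-common-subsequence match.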
Results are saved to `artifacts/model_evaluation/metrics.csv`
Build and run using Docker:
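The repository's actual Dockerfile is not reproduced here, but a minimal sketch that would support the commands below might look like this (base image and entrypoint are assumptions):

```dockerfile
FROM python:3.8-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8080
CMD ["python", "app.py"]
```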
```bash
docker build -t text-summarizer .
docker run -p 8080:8080 text-summarizer
```

- Deep Learning: PyTorch, Transformers (Hugging Face)
- Web Framework: FastAPI
- Frontend: Streamlit
- Data Processing: Pandas, Datasets
- Evaluation: ROUGE Score, SacreBLEU
- Configuration: PyYAML, python-box
- Logging: Custom logging module
- Data Ingestion: Downloads and extracts SAMSum dataset
- Data Validation: Validates required files and schema
- Data Transformation: Tokenizes text and prepares training data
- Model Trainer: Fine-tunes T5 model with specified parameters
- Model Evaluation: Computes ROUGE metrics on test set
Each stage is encapsulated in a pipeline class for easy execution and maintenance.
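The stage-per-class pattern can be sketched as follows (class and method names are illustrative, not the project's actual API; the real stages wire in the configuration manager and their component):

```python
from typing import List

# Illustrative base class: every stage exposes a main() entry point.
class PipelineStage:
    name = "base"

    def main(self) -> str:
        raise NotImplementedError

class DataIngestionStage(PipelineStage):
    name = "Data Ingestion"

    def main(self) -> str:
        # Real stage: download and extract the dataset.
        return f"{self.name} complete"

class ModelTrainerStage(PipelineStage):
    name = "Model Training"

    def main(self) -> str:
        # Real stage: fine-tune the model with params.yaml settings.
        return f"{self.name} complete"

def run_pipeline(stages: List[PipelineStage]) -> List[str]:
    # main.py-style orchestration: run each stage in order, logging progress.
    results = []
    for stage in stages:
        print(f">>> stage {stage.name} started")
        results.append(stage.main())
        print(f">>> stage {stage.name} finished")
    return results

results = run_pipeline([DataIngestionStage(), ModelTrainerStage()])
```

Adding a new stage then only requires a new class plus one line in the orchestrator, which is what keeps `main.py` short.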
Contributions are welcome! Please follow these steps:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
Author: Arshp-svg
Email: arshpatel213@gmail.com
GitHub: @Arshp-svg
This project is open-source and available under the MIT License.
- Hugging Face for the Transformers library
- SAMSum dataset creators
- FastAPI and Streamlit communities
- Add support for multiple summarization models
- Implement document summarization (longer texts)
- Add multilingual support
- Deploy to cloud platforms (AWS/GCP/Azure)
- Add user authentication
- Implement model versioning and A/B testing
- Add support for custom datasets
**⭐ If you find this project helpful, please give it a star!**