The Disaster Response Intelligence System (DRIS) is an NLP-driven machine learning pipeline designed to automatically classify real-time social media streams (Twitter) during emergency events. The system filters noise by distinguishing between legitimate disaster reports (e.g., "Fire at the metro station") and metaphorical usage of disaster terminology (e.g., "This mixtape is on fire").
The system follows a modular micro-service pattern for the training pipeline:
- Ingestion Layer: Loads raw CSV data containing tweet text and keyword metadata.
- Preprocessing Layer:
  - Text Normalization: Lowercasing and regex stripping of URLs and handles.
  - Linguistic Cleaning: Stopword removal and WordNet lemmatization.
- Feature Engineering Layer (Parallel Processing):
  - Branch A (Semantic): TF-IDF vectorization (n-grams 1-2) to capture context.
  - Branch B (Statistical): Custom extraction of meta-features:
    - Tweet Length
    - Number of Hashtags
    - Number of Mentions
    - Presence of URLs
- Modeling Layer:
  - Uses a `GradientBoostingClassifier` (ensemble method) for high predictive performance.
  - Optimized via `GridSearchCV` with 3-fold cross-validation.
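
The preprocessing and statistical branches described above can be illustrated with a minimal, standard-library-only sketch. The function names, regular expressions, and the tiny `STOPWORDS` set are assumptions made for illustration; the actual pipeline uses NLTK's stopword corpus and WordNet lemmatization, which this sketch leaves out.

```python
import re

# Hypothetical, deliberately tiny stopword list (the real pipeline uses NLTK's corpus).
STOPWORDS = {"a", "an", "the", "is", "at", "on", "this", "of", "in", "to"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def normalize(text: str) -> str:
    """Lowercase, strip URLs and handles, then drop stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    tokens = re.findall(r"[a-z0-9]+", text)
    return " ".join(t for t in tokens if t not in STOPWORDS)


def meta_features(text: str) -> dict:
    """Compute the four Branch B features from the raw, uncleaned tweet."""
    return {
        "length": len(text),
        "n_hashtags": len(HASHTAG_RE.findall(text)),
        "n_mentions": len(MENTION_RE.findall(text)),
        "has_url": int(bool(URL_RE.search(text))),
    }


tweet = "Fire at the metro station! @CityAlerts #fire http://t.co/abc"
print(normalize(tweet))      # fire metro station fire
print(meta_features(tweet))
```

Note that the meta-features are taken from the raw text, since normalization removes the URLs and handles they count.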
Technology stack:
- Runtime: Python 3.9+
- Core ML: scikit-learn (Pipeline, ColumnTransformer, GradientBoosting)
- NLP: NLTK (corpus management, lemmatization)
- Data Manipulation: pandas / NumPy
- Serialization: Joblib
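
As a rough sketch of how these components might fit together, the example below wires a `ColumnTransformer` (TF-IDF plus pass-through meta-features) into a `Pipeline` with a `GradientBoostingClassifier`, tunes it with `GridSearchCV`, and round-trips the result through Joblib. The toy tweets, the column names, the parameter grid, and the file name `dris_model.joblib` are all assumptions for illustration, not the project's actual data or configuration.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in for the training CSV: 1 = real disaster, 0 = metaphorical use.
disaster = [
    "Fire at the metro station, people evacuating #fire",
    "Flood waters rising near the bridge, roads closed",
    "Earthquake felt downtown, buildings damaged http://t.co/x",
    "Wildfire spreading toward the highway @CityAlerts",
    "Explosion reported at the factory, several injured",
    "Tornado touched down near the school #tornado",
]
metaphor = [
    "This mixtape is on fire",
    "My inbox is a total disaster today",
    "That concert was the bomb @bestfriend",
    "Flooded with birthday messages, thank you all",
    "She absolutely killed it on stage #legend",
    "This burger is an explosion of flavor",
]
df = pd.DataFrame({"text": disaster + metaphor, "target": [1] * 6 + [0] * 6})

# Branch B: simple meta-features computed from the raw text.
df["length"] = df["text"].str.len()
df["n_hashtags"] = df["text"].str.count("#")
df["n_mentions"] = df["text"].str.count("@")
df["has_url"] = df["text"].str.contains("http", regex=False).astype(int)

meta_cols = ["length", "n_hashtags", "n_mentions", "has_url"]
features = ColumnTransformer([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2)), "text"),  # Branch A
    ("meta", "passthrough", meta_cols),                      # Branch B
])
pipeline = Pipeline([
    ("features", features),
    ("clf", GradientBoostingClassifier(random_state=0)),
])

# Small hypothetical grid, scored on F1 with 3-fold cross-validation.
param_grid = {"clf__n_estimators": [20, 40], "clf__learning_rate": [0.05, 0.1]}
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)

X, y = df.drop(columns="target"), df["target"]
search.fit(X, y)
print(search.best_params_)

# Persist the best pipeline and reload it, as a deployment step might.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "dris_model.joblib"
    joblib.dump(search.best_estimator_, model_path)
    loaded = joblib.load(model_path)

preds = loaded.predict(X)
```

Keeping vectorization inside the pipeline means each cross-validation fold fits TF-IDF on its own training split only, which avoids leaking vocabulary statistics from the held-out fold.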
The primary optimization metric is the F1 score (the harmonic mean of precision and recall) rather than raw accuracy, because false negatives (real disaster reports classified as noise) are critical failures in a disaster context.
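
A small worked example may help show why accuracy alone can mislead here. The labels below are invented for illustration: ten tweets, two of which are real disasters. A model that never flags anything reaches the same accuracy as one that catches half of the disasters, but their F1 scores differ sharply.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # two real disasters
y_never = [0] * 10                        # never flags a disaster
y_some = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]   # one hit, one miss, one false alarm

print(accuracy(y_true, y_never), precision_recall_f1(y_true, y_never)[2])  # 0.8 0.0
print(accuracy(y_true, y_some), precision_recall_f1(y_true, y_some)[2])    # 0.8 0.5
```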
Planned enhancements:
- Integration with the Twitter Firehose API for real-time streaming.
- Deployment as a REST API using FastAPI.
- Containerization via Docker.