A lightweight yet powerful ML pipeline to automatically detect whether text is AI-generated or human-written.
Ideal for content moderation, academic integrity checks, and blog/article verification.
- Overview
- Demo
- Tech Stack
- Dataset
- Project Structure
- How It Works
- Getting Started
- Model Performance
- Usage
- Author
With the rapid rise of AI-generated content, distinguishing between human and machine-written text has become increasingly important. This project builds a binary text classifier using classical NLP and machine learning techniques to tackle this challenge.
Key highlights:
- Cleans and preprocesses raw text data
- Converts text to numerical features using TF-IDF Vectorization
- Trains a Logistic Regression classifier to separate AI-generated from human-written text
- Saves the trained model and vectorizer as `.pkl` files for production reuse
- Includes an `app.py` script for real-time inference, with no retraining needed
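The train-then-save flow in the highlights above could look roughly like the sketch below. It is a minimal illustration, not the notebook's actual code: the four toy sentences, their labels, and the `max_iter` setting are assumptions, and the files are written to a temporary directory so the real `logreg_model.pkl` and `tfidf_vectorizer.pkl` are not overwritten.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus; the project trains on its CSV dataset instead.
texts = [
    "honestly i missed the bus again and got soaked in the rain",
    "my grandma makes the best soup, no recipe, just vibes",
    "In conclusion, it is important to consider multiple perspectives.",
    "Overall, this approach offers several key benefits and considerations.",
]
labels = ["Human", "Human", "AI", "AI"]

# Fit the TF-IDF vocabulary and the classifier on the training text.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)
model = LogisticRegression(max_iter=1000)  # max_iter is an assumed setting
model.fit(features, labels)

# Persist both artifacts; inference needs the fitted vectorizer as well as the model.
out_dir = Path(tempfile.mkdtemp())
joblib.dump(model, out_dir / "logreg_model.pkl")
joblib.dump(vectorizer, out_dir / "tfidf_vectorizer.pkl")

# Reload and predict without retraining.
reloaded_model = joblib.load(out_dir / "logreg_model.pkl")
reloaded_vectorizer = joblib.load(out_dir / "tfidf_vectorizer.pkl")
sample = reloaded_vectorizer.transform(["i missed the bus again"])
print(reloaded_model.predict(sample)[0])
```

Saving the vectorizer alongside the model matters because new text has to be mapped onto the same vocabulary and IDF weights the classifier was trained on.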
Enter text: The mitochondria is the powerhouse of the cell...
🔍 Prediction: HUMAN ✅
📊 Confidence: 91.4%
| Category | Tools |
|---|---|
| Language | Python 3.8+ |
| ML & NLP | scikit-learn, TF-IDF, Logistic Regression |
| Data Handling | pandas, numpy |
| Visualization | matplotlib, seaborn |
| Model Persistence | joblib / pickle |
| Environment | Jupyter Notebook, VS Code |
| Field | Description |
|---|---|
| File | `ai_human_content_detection_dataset.csv` |
| `text` | Input text sample |
| `label` | Target class: AI or Human |
| Split | 80% Train / 20% Test |
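The 80/20 split in the table above might be produced along these lines. This is a sketch under assumptions: a small in-memory DataFrame stands in for the CSV, the label strings `"AI"` and `"Human"` are guessed from the table, and `stratify` and `random_state` are illustrative choices the project may not use.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# In the project this would likely be:
#   df = pd.read_csv("ai_human_content_detection_dataset.csv")
# A hypothetical stand-in with the same two columns keeps the sketch self-contained.
df = pd.DataFrame(
    {
        "text": [f"sample text {i}" for i in range(10)],
        "label": ["AI", "Human"] * 5,
    }
)

# Hold out 20% for testing; stratifying keeps the AI/Human ratio equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"],
    df["label"],
    test_size=0.2,
    stratify=df["label"],
    random_state=42,
)

print(f"train: {len(X_train)} rows, test: {len(X_test)} rows")
```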
ai-vs-human-content-detector-2025/
│
├── 📓 AI_vs_Human_Content_Detection.IPYNB # Main notebook: EDA, preprocessing, training & evaluation
├── 📊 ai_human_content_detection_dataset.csv # Labeled dataset (AI & Human samples)
├── 🤖 logreg_model.pkl # Saved Logistic Regression model
├── 🔤 tfidf_vectorizer.pkl # Saved TF-IDF vectorizer
├── 🚀 app.py # Inference script — load model & predict on new text
├── 📋 requirements.txt # Python dependencies
└── 📄 README.md # Project documentation
Raw Text
│
▼
┌─────────────────────────────┐
│ Text Preprocessing │
│ • Lowercasing │
│ • Remove punctuation │
│ • Strip extra whitespace │
│ • (Optional) Stopwords │
└─────────────────────────────┘
│
▼
┌─────────────────────────────┐
│ TF-IDF Vectorization │
│ Converts text → numbers │
└─────────────────────────────┘
│
▼
┌─────────────────────────────┐
│ Logistic Regression │
│ Binary Classifier │
│ AI vs Human │
└─────────────────────────────┘
│
▼
Prediction + Confidence Score
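The stages in the diagram above can be sketched end to end as follows. The `preprocess` helper, the `STOPWORDS` subset, and the toy training sentences are hypothetical, and bundling everything in a scikit-learn `Pipeline` is one possible packaging; the project itself saves the vectorizer and model as separate files.

```python
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

STOPWORDS = {"the", "a", "an", "is", "of"}  # illustrative subset, not a full list


def preprocess(text, remove_stopwords=False):
    """Lowercase, drop ASCII punctuation, collapse whitespace, optionally drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()  # split() with no arguments also collapses repeated whitespace
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)


# Hypothetical toy corpus standing in for the real dataset.
texts = [
    "honestly i missed the bus again and got soaked in the rain",
    "my grandma makes the best soup, no recipe, just vibes",
    "In conclusion, it is important to consider multiple perspectives.",
    "Overall, this approach offers several key benefits and considerations.",
]
labels = ["Human", "Human", "AI", "AI"]

# TF-IDF applies preprocess() to each document before tokenizing it.
pipeline = make_pipeline(
    TfidfVectorizer(preprocessor=preprocess),
    LogisticRegression(),
)
pipeline.fit(texts, labels)

# Confidence here is the probability of the most likely class.
probs = pipeline.predict_proba(["The mitochondria is the powerhouse of the cell..."])[0]
best = probs.argmax()
label, confidence = pipeline.classes_[best], float(probs[best])
print(f"Prediction: {label} | Confidence: {confidence:.1%}")
```

With only four training sentences the prediction itself is meaningless; the point is the shape of the flow from raw text to a label and a probability.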
git clone https://github.com/Musawir456/ai-vs-human-content-detector-2025.git
cd ai-vs-human-content-detector-2025
python -m venv venv
# Windows
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate
pip install -r requirements.txt
jupyter notebook "AI_vs_Human_Content_Detection.IPYNB"
python app.py

| Metric | Score |
|---|---|
| Accuracy | ~XX% |
| Precision | ~XX% |
| Recall | ~XX% |
| F1-Score | ~XX% |
📝 Update this table with your actual evaluation results after training.
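One way to produce the numbers for this table is with `sklearn.metrics`, as in the sketch below. The `y_true` and `y_pred` lists are made-up placeholders, and treating `"AI"` as the positive class is an assumption; in practice these would be the test labels and the model's predictions on the 20% hold-out set.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder labels for illustration only; replace with y_test and model.predict(X_test).
y_true = ["AI", "AI", "Human", "Human", "AI", "Human"]
y_pred = ["AI", "Human", "Human", "Human", "AI", "AI"]

accuracy = accuracy_score(y_true, y_pred)

# Precision, recall and F1 are reported for the class chosen as positive ("AI" here).
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label="AI"
)

# Print rows in the same layout as the table above.
for name, value in [
    ("Accuracy", accuracy),
    ("Precision", precision),
    ("Recall", recall),
    ("F1-Score", f1),
]:
    print(f"| {name} | {value:.1%} |")
```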
Once the model is trained, use app.py to predict on any new text:
import joblib

# Load the artifacts saved during training
model = joblib.load("logreg_model.pkl")
vectorizer = joblib.load("tfidf_vectorizer.pkl")

# Vectorize the new text with the fitted TF-IDF vocabulary, then classify it
text = ["Your sample text goes here..."]
features = vectorizer.transform(text)
prediction = model.predict(features)
print(f"Prediction: {prediction[0]}")

⭐ If you found this project useful, please give it a star! ⭐
Made with ❤️ by Abdul Musawir