NetPlag-Stream is a real-time plagiarism detection system built using Big Data technologies. It combines Spark Structured Streaming, HDFS, and Elasticsearch to continuously analyze academic documents and detect similarities against a reference corpus.
- Real-time streaming detection using Spark Structured Streaming.
- Persistent storage and shared artifacts on HDFS.
- Fast, queryable results via Elasticsearch with a Flask dashboard.
```
Documents → Spark Streaming → TF–IDF → Cosine Similarity
                                            ↓
                                     HDFS (storage)
                                            ↓
                                Elasticsearch (indexing)
                                            ↓
                                Dashboard (visualization)
```
Key technologies:
- Apache Spark
- HDFS
- Elasticsearch
- Flask
- Docker
Prerequisites
- Docker Desktop
- Python 3.11
- Java 17
- Start the services:

```
docker-compose up -d
Start-Sleep -Seconds 30   # PowerShell: give the containers time to come up
docker-compose ps
```

- Exit HDFS safe mode (if needed):

```
docker exec namenode hdfs dfsadmin -safemode leave
```

- Install Python dependencies:

```
pip install -r requirements.txt
```

- Create the HDFS structure:

```
python scripts/0_migrate_to_hdfs.py
```

- Fast-batch migrate the local corpus into HDFS (optional, recommended):

```
.\migrate_fast.ps1
```

- Build the TF–IDF model and reference vectors:

```
python scripts/1_batch_init.py
```

- Run the full streaming pipeline (automated):

```
python scripts/8_full_streamprocess.py
```

Or run the pieces manually:

```
python scripts/2_streaming_app.py          # streaming process
python scripts/4_plagiarism_analysis.py    # batch analysis
python scripts/6_elasticsearch_indexer.py  # index results
```

This section summarizes the main pipeline stages and the evidence of their execution.
- Infrastructure & HDFS initialization
The scripts/0_migrate_to_hdfs.py script creates a strict HDFS layout that
separates raw data (/data) from processed artifacts (/storage). This step
also validates Spark connectivity with the Dockerized NameNode.
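A minimal sketch of this step, using Spark's JVM gateway to reach the Hadoop FileSystem; the NameNode URI and the exact directory names are assumptions (the real values live in config/hdfs_config.py):

```python
from pyspark.sql import SparkSession

# NameNode URI and directory names are illustrative; see config/hdfs_config.py.
spark = (SparkSession.builder
         .appName("netplag-hdfs-init")
         .config("spark.hadoop.fs.defaultFS", "hdfs://namenode:9000")
         .getOrCreate())

# Reaching the FileSystem through the JVM gateway doubles as a
# connectivity check against the Dockerized NameNode.
jvm = spark._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
for d in ["/netplag/data/corpus_initial",
          "/netplag/data/stream_input",
          "/netplag/storage/models",
          "/netplag/storage/reports"]:
    fs.mkdirs(jvm.org.apache.hadoop.fs.Path(d))   # no-op if it already exists
print("HDFS layout ready")
```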
- Corpus migration and ingestion
The migrate_fast.ps1 script moves large numbers of small files into HDFS via
efficient batch transfers (Docker volume mounts), reducing ingestion time.
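The same batch-transfer idea expressed in Python for illustration (migrate_fast.ps1 itself is PowerShell; the container and directory names are assumptions): copy the whole corpus into the NameNode container once, then issue a single recursive put instead of one transfer per file.

```python
import subprocess

LOCAL_CORPUS = "data/corpus_initial"   # local folder full of small files
HDFS_PARENT = "/netplag/data"          # corpus lands under this directory

# One docker cp for the whole directory instead of one transfer per file...
subprocess.run(
    ["docker", "cp", LOCAL_CORPUS, "namenode:/tmp/corpus_initial"],
    check=True)

# ...then a single recursive put on the HDFS side.
subprocess.run(
    ["docker", "exec", "namenode",
     "hdfs", "dfs", "-put", "-f", "/tmp/corpus_initial", HDFS_PARENT],
    check=True)
```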
- Batch preprocessing and TF–IDF model training
The scripts/1_batch_init.py script reads the reference corpus, performs text
cleaning and tokenization, computes IDF weights, and saves the model and the
reference vectors to storage on HDFS.
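A hedged sketch of what this batch step could look like with the Spark ML API; the column names, HDFS paths, and exact cleaning stages are assumptions, while numFeatures=5000 matches the configuration shown later in this README:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, IDF, RegexTokenizer, StopWordsRemover
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("netplag-batch-init").getOrCreate()

# One row per corpus file: (path, full text).
corpus = spark.sparkContext.wholeTextFiles(
    "hdfs://namenode:9000/netplag/data/corpus_initial").toDF(["path", "text"])

pipeline = Pipeline(stages=[
    RegexTokenizer(inputCol="text", outputCol="words", pattern="\\W+"),
    StopWordsRemover(inputCol="words", outputCol="filtered"),
    HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=5000),
    IDF(inputCol="rawFeatures", outputCol="features"),
])

model = pipeline.fit(corpus)
vectors = model.transform(corpus).select("path", "features")

# Persist both artifacts so the streaming job can reuse them.
model.write().overwrite().save("hdfs://namenode:9000/netplag/storage/models/tfidf")
vectors.write.mode("overwrite").parquet(
    "hdfs://namenode:9000/netplag/storage/reference_vectors")
```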
- Real-time streaming detection
scripts/2_streaming_app.py runs as a long-lived process that watches the
stream_input folder. Incoming files are vectorized with the precomputed IDF
and compared (cosine similarity) against the reference vectors. High-similarity
results are immediately flagged (e.g., is_plagiarism: true).
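A condensed sketch of this loop under the same assumptions: the reference vectors are broadcast once, and each micro-batch is scored inside foreachBatch (the real script also persists results; showing them stands in for that here).

```python
from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

PLAGIARISM_THRESHOLD = 0.7   # see Configuration below

spark = SparkSession.builder.appName("netplag-streaming").getOrCreate()

# Artifacts produced by the batch step (paths are assumptions).
model = PipelineModel.load("hdfs://namenode:9000/netplag/storage/models/tfidf")
reference = spark.read.parquet(
    "hdfs://namenode:9000/netplag/storage/reference_vectors").collect()

# Broadcast the reference vectors once so each micro-batch scores locally,
# with no shuffle (the corpus must fit in executor memory).
ref_bc = spark.sparkContext.broadcast([r["features"] for r in reference])

@F.udf(returnType=DoubleType())
def best_score(features):
    """Highest cosine similarity against any reference vector."""
    best = 0.0
    for ref in ref_bc.value:
        denom = float(features.norm(2)) * float(ref.norm(2))
        if denom:
            best = max(best, float(features.dot(ref)) / denom)
    return best

def score_batch(batch_df, batch_id):
    (model.transform(batch_df)
     .withColumn("similarity_score", best_score("features"))
     .withColumn("is_plagiarism", F.col("similarity_score") > PLAGIARISM_THRESHOLD)
     .select("path", "similarity_score", "is_plagiarism")
     .show(truncate=False))   # the real script persists these results instead

incoming = (spark.readStream.format("text")
            .option("wholetext", "true")          # one row per incoming file
            .load("hdfs://namenode:9000/netplag/data/stream_input")
            .withColumn("path", F.input_file_name())
            .withColumnRenamed("value", "text"))

query = (incoming.writeStream
         .foreachBatch(score_batch)
         .option("checkpointLocation",
                 "hdfs://namenode:9000/netplag/storage/checkpoints/streaming")
         .trigger(processingTime="5 seconds")
         .start())
query.awaitTermination()
```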
- Consolidated analysis and statistics
scripts/4_plagiarism_analysis.py aggregates streaming results, computes
statistics (means, distributions), and writes Parquet summaries for traceability.
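An illustrative shape for that aggregation, assuming the streaming results are JSON records with similarity_score and is_plagiarism fields (field names and paths are assumptions):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("netplag-analysis").getOrCreate()

# Assumed shape: JSON records with similarity_score and is_plagiarism fields.
results = spark.read.json("hdfs://namenode:9000/netplag/storage/results")

summary = results.agg(
    F.count("*").alias("documents"),
    F.avg("similarity_score").alias("mean_score"),
    F.expr("percentile_approx(similarity_score, 0.5)").alias("median_score"),
    F.sum(F.col("is_plagiarism").cast("int")).alias("flagged"),
)

# Parquet keeps the summaries compact and queryable for traceability.
summary.write.mode("overwrite").parquet(
    "hdfs://namenode:9000/netplag/storage/reports/summary")
```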
- Indexing to Elasticsearch
scripts/6_elasticsearch_indexer.py pushes JSON results into Elasticsearch
using the Bulk API so results become searchable via the dashboard.
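A minimal sketch of bulk indexing with the official Python client; the input file name is a placeholder, while the plagiarism_reports index and the 1,000-docs batch size come from elsewhere in this README:

```python
import json
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")

# One JSON object per line, e.g. fetched from HDFS with `hdfs dfs -get`.
with open("plagiarism_cases.json") as f:
    docs = [json.loads(line) for line in f]

# Generator of bulk actions; sent 1000 docs per request.
actions = ({"_index": "plagiarism_reports", "_source": doc} for doc in docs)
ok, errors = bulk(es, actions, chunk_size=1000)
print(f"indexed {ok} documents, {len(errors)} errors")
```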
- Dashboard (global + detailed views)
The Flask dashboard queries Elasticsearch and provides both summary metrics and detailed traces for each suspicious document pairing.
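For illustration, a single Flask endpoint in that spirit (the route name, field names, and the 0.7 cutoff are assumptions, not the dashboard's actual API):

```python
from elasticsearch import Elasticsearch
from flask import Flask, jsonify

app = Flask(__name__)
es = Elasticsearch("http://localhost:9200")

@app.route("/api/suspicious")
def suspicious():
    # Top pairings above the 0.7 threshold, highest scores first.
    resp = es.search(
        index="plagiarism_reports",
        query={"range": {"similarity_score": {"gte": 0.7}}},
        sort=[{"similarity_score": "desc"}],
        size=20)
    return jsonify([hit["_source"] for hit in resp["hits"]["hits"]])

if __name__ == "__main__":
    app.run(port=5000)
```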
- TF–IDF vectorization with cosine similarity.
- Suggested thresholds:
  - similarity > 0.7: potential plagiarism
  - similarity > 0.8: strong similarity
  - similarity > 0.9: near-identical copy
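A self-contained toy example of the scoring rule (plain Python lists stand in for the SparseVectors used in production):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length TF-IDF vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def verdict(score, thresholds=((0.9, "near-identical copy"),
                               (0.8, "strong similarity"),
                               (0.7, "potential plagiarism"))):
    # Check the highest threshold first, falling through to "no flag".
    for t, label in thresholds:
        if score > t:
            return label
    return "no flag"

print(verdict(cosine([0.2, 0.0, 0.9], [0.1, 0.0, 0.8])))  # near-identical copy
```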
- config/hdfs_config.py – HDFS / Spark settings
- config/elasticsearch_config.py – Elasticsearch settings
- scripts/0_migrate_to_hdfs.py – create the HDFS tree
- scripts/1_batch_init.py – build the TF–IDF model & reference vectors
- scripts/2_streaming_app.py – streaming detection
- scripts/4_plagiarism_analysis.py – batch analysis
- scripts/6_elasticsearch_indexer.py – indexer
- scripts/8_full_streamprocess.py – end-to-end runner
```
# List HDFS files
docker exec namenode hdfs dfs -ls /netplag/data/corpus_initial

# Show used space
docker exec namenode hdfs dfs -du -h /netplag

# Copy a file out of HDFS
docker exec namenode hdfs dfs -get /netplag/storage/reports/plagiarism_cases.json
```

```
# Check the indices
curl "http://localhost:9200/_cat/indices?v"

# Count indexed documents
curl "http://localhost:9200/plagiarism_reports/_count"

# Search documents
curl "http://localhost:9200/plagiarism_reports/_search?q=similarity_score:>0.8"
```

- Automated monitoring of new publications
- Plagiarism detection across submitted articles
- Real-time alerts on suspicious similarities
- Batch analysis of student documents
- Comparison with bibliographic corpus
- Generation of detailed reports
- Pre-publication checks
- Detection of uncredited reuse
- Source traceability
Edit scripts/2_streaming_app.py or scripts/8_full_streamprocess.py:

```
# Approx. line 80
PLAGIARISM_THRESHOLD = 0.7   # change to 0.6 or 0.8

# Approx. line 50
TRIGGER_INTERVAL = "5 seconds"   # change to "10 seconds"
```

Edit scripts/1_batch_init.py:

```
# Approx. line 60
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=5000)
# change numFeatures to 10000 for higher precision
```

- Reference corpus: 500+ documents
- Streaming: 10 documents every 5 seconds
- Average latency: < 10 seconds per batch
- Throughput: ~120 documents/minute
- Accuracy: ~85% (threshold 0.7)
- Sparse vectors (SparseVector) to save memory
- Broadcast of the reference corpus (avoids a shuffle)
- Parquet storage (high compression)
- Bulk Elasticsearch indexing (1000 docs/batch)
- HDFS checkpointing for fault tolerance
- "TF-IDF: A Statistical Interpretation" - Salton & McGill (1983)
- "Cosine Similarity in Information Retrieval" - Baeza-Yates (1999)
- "Plagiarism Detection: A Survey" - Alzahrani et al. (2012)
Potential Publication:
"NetPlag-Stream: A Real-Time Distributed Architecture for Academic Plagiarism Detection using Spark Streaming and Delta Lake"
Research Directions:
- Real-time Big Data architectures for scientific monitoring
- Optimization of large-scale similarity computation
- Semantic detection using transformers (BERT)
- Incremental management of TF-IDF models
Developed as part of a Big Data project on plagiarism detection in a distributed architecture.
- Bellmir Yahya
- Ismaili Ayman
- Ait Abdou Ayman
- Chegdati Chouaib