A streamlined pipeline for detecting anomalies in HDFS log data using OpenAI embeddings stored directly in Qdrant vector database.
Visualization of the stored Qdrant log embeddings:
- Slides for the talk are available at Semantic Anomaly Detection with Vector Search.
-
Setup Environment
pip install -r requirements.txt
-
Configure Environment Variables Create
.envfile:OPENAI_API_KEY=your_openai_api_key QDRANT_URL=your_qdrant_cloud_url QDRANT_API_KEY=your_qdrant_api_key DETECTION_METHOD=distance # or "dbscan" -
Run Pipeline
python main.py
Run components separately if needed:
python ingest.py # Download and balance HDFS data
python embed_and_ingest.py # Create embeddings and upload to Qdrant
python detect_distance.py # Distance-based anomaly detection
python detect_dbscan.py # DBSCAN clustering detectiondistance_detection_results.htmlordbscan_detection_results.html- Interactive visualizationsdata/balanced_dataset.txt- Balanced HDFS dataset (2000 samples, 80% normal)
- Distance-based: Uses k-NN distances with auto-tuned threshold
- DBSCAN: Uses density-based clustering with k-distance optimization
Both methods use t-SNE for 2D visualization and achieve ~40-50% precision with ~30-35% recall.
