diff --git a/docs/immune/DATASET_REPORT.md b/docs/immune/DATASET_REPORT.md new file mode 100644 index 000000000..e232909d7 --- /dev/null +++ b/docs/immune/DATASET_REPORT.md @@ -0,0 +1,168 @@ +# Network Event Summarization Dataset for Slips IDS + +## Table of Contents + +- [1. Task description](#1-task-description) +- [2. Limitations](#2-limitations) + - [Hardware Constraints](#hardware-constraints) + - [Scope Constraints](#scope-constraints) +- [3. Dataset Generation Workflow](#3-dataset-generation-workflow) + - [Stage 1: Incident Sampling](#stage-1-incident-sampling) + - [Stage 2: Structural Analysis](#stage-2-structural-analysis) + - [Stage 3: Multi-Model LLM Analysis](#stage-3-multi-model-llm-analysis) + - [Stage 4: Dataset Correlation](#stage-4-dataset-correlation) + - [Dataset Extension](#dataset-extension) + - [Workflow Diagram](#workflow-diagram) + - [Event Grouping Strategy](#event-grouping-strategy) + - [Additional Optimizations](#additional-optimizations) + - [Dataset Structure](#dataset-structure) + +## 1. Task description + +Develop a dataset for network security event summarization to be integrated with the Slips Immune system, optimized for deployment on low-resource hardware such as the Raspberry Pi 5. This dataset will be used to fine-tune compact language models capable of generating concise and actionable summaries of security incidents from raw Slips alert data, enabling real-time threat analysis in resource-constrained environments. + +## 2. 
Limitations + +### Hardware Constraints +- **Platform**: Raspberry Pi 5 with limited RAM and processing power +- **Model Size**: Only small language models (1.5B-3B parameters) are viable on target hardware +- **Real-time Processing**: Target 10-15 seconds per incident on RPi5 with Ollama requires aggressive token optimization + +### Scope Constraints +- **Alert Format**: Analysis currently limited to Slips alert format; generalization to other IDS outputs requires format adaptation +- **Token Budget**: Input and output tokens must be minimized to enable real-time inference on resource-constrained hardware (~2000 tokens max) +- **Output Constraints**: Summaries must be concise (150-300 tokens) while maintaining security context + +## 3. Dataset Generation Workflow + +The dataset generation process consists of four stages, each implemented as Python scripts with shell wrappers that simplify execution, handle argument validation, and automate file naming. This modular design enables flexible experimentation with different models and configurations while maintaining reproducibility. + +**Detailed documentation**: See [README_dataset_summary_workflow.md](README_dataset_summary_workflow.md) for complete pipeline specifications and advanced usage. 
+ +### Stage 1: Incident Sampling +Extract security incidents from Slips `alerts.json` logs with category labels (Malware/Normal): + +```bash +./sample_dataset.sh 20 my_dataset --category malware --seed 42 +``` + +**Output**: `my_dataset.jsonl` (JSONL format with incidents and events) + +### Stage 2: Structural Analysis +Generate DAG-based chronological analysis of incident events: + +```bash +./generate_dag_analysis.sh my_dataset.jsonl +``` + +**Output**: `my_dataset.dag.json` (incident metadata + event timeline) + +### Stage 3: Multi-Model LLM Analysis +Query multiple language models with optimized prompts: + +```bash +# GPT-4o-mini (baseline) +./generate_llm_analysis.sh my_dataset.jsonl --model gpt-4o-mini \ + --group-events --behavior-analysis + +# Qwen2.5:3b (target model) +./generate_llm_analysis.sh my_dataset.jsonl --model qwen2.5:3b \ + --base-url http://10.147.20.102:11434/v1 --group-events --behavior-analysis + +# Qwen2.5:1.5b (minimal model) +./generate_llm_analysis.sh my_dataset.jsonl --model qwen2.5:1.5b \ + --base-url http://10.147.20.102:11434/v1 --group-events --behavior-analysis +``` + +**Outputs**: Model-specific JSON files with `summary` and `behavior_analysis` fields + +### Stage 4: Dataset Correlation +Merge all analyses into unified dataset by incident ID: + +```bash +python3 correlate_incidents.py my_dataset.*.json \ + --jsonl my_dataset.jsonl -o final_dataset.json +``` + +**Output**: `final_dataset.json` (consolidated dataset with all analyses) + +### Dataset Extension + +To expand existing datasets without regeneration, use `merge_datasets.py` to combine multiple correlated datasets with automatic deduplication: + +```bash +# Generate new samples with different seed +./sample_dataset.sh 20 extension --category malware --seed 99 + +# Run full analysis pipeline on extension +./generate_dag_analysis.sh extension.jsonl +./generate_llm_analysis.sh extension.jsonl --model qwen2.5:3b --group-events --behavior-analysis + +# Correlate extension data 
+python3 correlate_incidents.py extension.*.json --jsonl extension.jsonl -o extension_dataset.json + +# Merge with existing dataset (removes duplicates by incident_id) +python3 merge_datasets.py final_dataset.json extension_dataset.json -o final_dataset_v2.json +``` + +This approach enables incremental dataset growth while maintaining consistency across all analysis fields. + +### Workflow Diagram + +``` +Raw Slips Logs (alerts.json) + ↓ +[sample_dataset.py] → incidents.jsonl + ↓ + ├─→ [alert_dag_parser.py] → incidents.dag.json + ├─→ [alert_dag_parser_llm.py + GPT-4o-mini] → incidents.llm.gpt-4o-mini.json + ├─→ [alert_dag_parser_llm.py + Qwen2.5:3b] → incidents.llm.qwen2.5.json + └─→ [alert_dag_parser_llm.py + Qwen2.5:1.5b] → incidents.llm.qwen2.5.1.5b.json + ↓ +[correlate_incidents.py] → final_dataset.json +``` + +### Event Grouping Strategy + +The `--group-events` optimization reduces token count through pattern normalization: + +1. **Pattern Normalization**: Replaces variable components in event descriptions with placeholders + - IPv4 addresses → `<IP>` + - Port numbers → `<PORT>` (handles formats: `443/TCP`, `port: 80`) + - Standalone numbers → `<NUM>` + +2. **Pattern-Based Grouping**: Groups events with identical normalized patterns + - Example: "Connection to 192.168.1.5:443" + "Connection to 10.0.2.15:443" → single pattern "Connection to `<IP>`:`<PORT>`" + - Preserves count, time range, and sample values (first 5 unique IPs/ports) per group + +3. **Token Reduction**: + - 103 events: 3,522 → 976 tokens (72% reduction) + - 4,604 events: ~50,000 → 1,897 tokens (96% reduction) + +4. 
**Information Loss Analysis**: + - **Lost**: Individual timestamps (only ranges), complete IP/port lists (max 5 samples), exact event sequence, duplicate frequency tracking + - **Retained**: Semantic patterns, event counts, representative samples, temporal context, protocol details, attack patterns + - **Impact**: Small incidents (~28% loss), large incidents (~90-95% loss, mostly repetitive data) + - **Justification**: Enables LLM summarization on RPi5; alternative is inability to process large incidents + +### Additional Optimizations + +**Dual-Prompt Analysis** (`--behavior-analysis`): Generates both severity-filtered summaries and structured technical flow analysis, providing richer training signals for model fine-tuning. + +**Severity Filtering Strategy**: The dual-prompt approach implements intelligent filtering to manage token budgets: +- Prioritizes high-threat evidence in summaries for focused incident assessment +- May omit low-confidence events to reduce token consumption +- Balanced by generating both severity-filtered summaries and comprehensive behavior analysis +- Trade-off: Enables complete incident coverage while maintaining concise outputs suitable for resource-constrained deployment + +**Multi-Model Evaluation**: Compares GPT-4o (quality baseline), GPT-4o-mini, Qwen2.5:3b (target deployment), and Qwen2.5:1.5b (minimal viable model) to assess performance-resource trade-offs. + +### Dataset Structure + +Each incident in the final dataset contains: +- **Metadata**: incident_id, category, source_ip, timewindow, threat_level +- **DAG Analysis**: Chronological event timeline with threat scores +- **LLM Summaries**: Model-specific severity assessments +- **Behavior Analysis**: Structured network flow descriptions + +Token efficiency enables deployment on Raspberry Pi 5 while maintaining security analysis quality suitable for real-time intrusion detection. 
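
The event grouping strategy described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind `--group-events`: the placeholder tokens (`<IP>`, `<PORT>`, `<NUM>`), the regular expressions, and the `description`/`timestamp` field names are all assumptions chosen for the example.

```python
import re

# Hypothetical placeholder tokens; the real pipeline may use different names.
IP_TOKEN, PORT_TOKEN, NUM_TOKEN = "<IP>", "<PORT>", "<NUM>"

IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
IP_PORT_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}:\d{1,5}\b")
PORT_PROTO_RE = re.compile(r"\b\d{1,5}/(TCP|UDP)\b", re.IGNORECASE)
PORT_LABEL_RE = re.compile(r"\b(port:\s*)\d{1,5}\b", re.IGNORECASE)
NUMBER_RE = re.compile(r"\b\d+\b")


def normalize(description: str) -> str:
    """Replace variable parts of an event description with placeholders."""
    text = IP_PORT_RE.sub(f"{IP_TOKEN}:{PORT_TOKEN}", description)
    text = IPV4_RE.sub(IP_TOKEN, text)
    text = PORT_PROTO_RE.sub(lambda m: f"{PORT_TOKEN}/{m.group(1)}", text)
    text = PORT_LABEL_RE.sub(lambda m: m.group(1) + PORT_TOKEN, text)
    # Any digits still left are treated as standalone numbers.
    return NUMBER_RE.sub(NUM_TOKEN, text)


def group_events(events: list, max_samples: int = 5) -> dict:
    """Group events by normalized pattern, keeping count, time range and sample IPs."""
    groups = {}
    for event in events:
        pattern = normalize(event["description"])
        group = groups.setdefault(
            pattern,
            {
                "count": 0,
                "first": event["timestamp"],
                "last": event["timestamp"],
                "sample_ips": [],
            },
        )
        group["count"] += 1
        group["first"] = min(group["first"], event["timestamp"])
        group["last"] = max(group["last"], event["timestamp"])
        # Keep only the first few unique IPs as representative samples.
        for ip in IPV4_RE.findall(event["description"]):
            if ip not in group["sample_ips"] and len(group["sample_ips"]) < max_samples:
                group["sample_ips"].append(ip)
    return groups
```

Under these assumptions, the two "Connection to ..." events from the example collapse into a single group with a count of 2, a time range, and both IPs kept as samples. The individual timestamps and the exact event order are lost, which matches the information loss described above.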
diff --git a/docs/immune/DATASET_RISK_REPORT.md b/docs/immune/DATASET_RISK_REPORT.md new file mode 100644 index 000000000..6debd0c0b --- /dev/null +++ b/docs/immune/DATASET_RISK_REPORT.md @@ -0,0 +1,155 @@ +# Network Event Cause & Risk Analysis Dataset for Slips IDS + +## Table of Contents + +- [1. Task Description](#1-task-description) +- [2. Relationship to Summarization Workflow](#2-relationship-to-summarization-workflow) +- [3. Dataset Generation Workflow](#3-dataset-generation-workflow) + - [Workflow Overview](#workflow-overview) + - [Stage 3: Multi-Model Cause & Risk Analysis](#stage-3-multi-model-cause--risk-analysis) + - [Stage 4: Dataset Correlation](#stage-4-dataset-correlation) + - [Dataset Structure](#dataset-structure) +- [4. Use Cases and Applications](#4-use-cases-and-applications) + +## 1. Task Description + +Develop a dataset for **root cause analysis and risk assessment** of network security incidents from Slips IDS alerts. This complementary workflow focuses on structured security analysis rather than event summarization, providing: + +1. **Cause Analysis** - Categorized incident attribution (Malicious Activity / Legitimate Activity / Misconfigurations) +2. **Risk Assessment** - Structured evaluation (Risk Level / Business Impact / Investigation Priority) + +**Target Deployment**: Same hardware constraints as [summarization workflow](DATASET_REPORT.md#2-limitations) (Raspberry Pi 5, 1.5B-3B parameter models). + +## 2. 
Relationship to Summarization Workflow + +Both workflows share identical **Stages 1-2** (incident sampling and DAG generation) but diverge in LLM analysis approach: + +| Aspect | Summarization Workflow | Risk Analysis Workflow | +|--------|------------------------|------------------------| +| **Documentation** | [DATASET_REPORT.md](DATASET_REPORT.md) | This document | +| **Detailed Guide** | [README_dataset_summary_workflow.md](README_dataset_summary_workflow.md) | [README_dataset_risk_workflow.md](README_dataset_risk_workflow.md) | +| **Analysis Script** | `generate_llm_analysis.sh` | `generate_cause_risk_analysis.sh` | +| **Correlation Script** | `correlate_incidents.py` | `correlate_risks.py` | +| **Output Fields** | `summary` + `behavior_analysis` | `cause_analysis` + `risk_assessment` | +| **LLM Prompts** | 2 per incident (event summarization + behavior patterns) | 2 per incident (cause attribution + risk scoring) | +| **Primary Use Case** | Incident timeline reconstruction, behavior pattern identification | Root cause analysis, threat prioritization, SOC decision support | + +**Recommendation**: Generate both datasets from the same sampled incidents to enable comparative analysis and multi-task model training. + +## 3. Dataset Generation Workflow + +### Workflow Overview + +**Stages 1-2** (Sampling + DAG): See [DATASET_REPORT.md §3](DATASET_REPORT.md#3-dataset-generation-workflow) - identical to summarization workflow. 
+ +**Quick commands:** +```bash +# Stage 1: Sample 100 incidents +./sample_dataset.sh 100 my_dataset --seed 42 + +# Stage 2: Generate DAG analysis +./generate_dag_analysis.sh datasets/my_dataset.jsonl +``` + +### Stage 3: Multi-Model Cause & Risk Analysis + +Query LLMs with dual prompts for cause attribution and risk assessment: + +```bash +# GPT-4o-mini (recommended baseline) +./generate_cause_risk_analysis.sh datasets/my_dataset.jsonl \ + --model gpt-4o-mini --group-events + +# Qwen2.5:3b (target deployment model) +./generate_cause_risk_analysis.sh datasets/my_dataset.jsonl \ + --model qwen2.5:3b \ + --base-url http://10.147.20.102:11434/v1 --group-events +``` + +**Output Structure** (per incident): +```json +{ + "cause_analysis": "**Possible Causes:**\n\n**1. Malicious Activity:**\n• Port scanning indicates reconnaissance...\n\n**2. Legitimate Activity:**\n• Could be network monitoring tools...\n\n**3. Misconfigurations:**\n• Firewall allowing unrestricted scanning...\n\n**Conclusion:** Most likely malicious reconnaissance activity.", + + "risk_assessment": "**Risk Level:** High\n\n**Justification:** Active scanning + C2 connections...\n\n**Business Impact:** Potential data breach or service disruption...\n\n**Likelihood of Malicious Activity:** High - Systematic attack pattern...\n\n**Investigation Priority:** Immediate - Block source IP and investigate." 
+} +``` + +### Stage 4: Dataset Correlation + +Merge all analyses (DAG + LLM cause/risk assessments) by incident ID: + +```bash +python3 correlate_risks.py datasets/my_dataset.*.json \ + --jsonl datasets/my_dataset.jsonl \ + -o datasets/final_dataset_risk.json +``` + +### Dataset Structure + +Final output contains merged analyses with model-specific risk assessments: + +```json +{ + "total_incidents": 100, + "incidents": [ + { + "incident_id": "uuid", + "category": "Malware", + "source_ip": "192.168.1.113", + "timewindow": "5", + "timeline": "2024-04-05 16:53:07 to 16:53:50", + "threat_level": 15.36, + "event_count": 4604, + "dag_analysis": "• 16:53 - 222 horizontal port scans [HIGH]\n...", + "cause_risk_gpt_4o_mini": { + "cause_analysis": "**1. Malicious Activity:** Reconnaissance scanning...", + "risk_assessment": "**Risk Level:** High\n**Justification:**..." + }, + "cause_risk_gpt_4o": { ... }, + "cause_risk_qwen2_5": { ... } + } + ] +} +``` + +**Key differences from summarization dataset**: +- `cause_risk_*` fields replace `llm_*` fields +- Structured 3-category cause analysis (vs. free-form summary) +- 5-field risk assessment framework (vs. behavior flow description) + +## 4. Use Cases and Applications + +### Security Operations Center (SOC) +- **Automated Triage**: Risk level + investigation priority for alert queue sorting +- **Incident Attribution**: Distinguish malicious attacks from misconfigurations +- **Resource Allocation**: Business impact assessment for team assignments + +### Model Training Applications +- **Classification Tasks**: Train models to categorize incidents (malicious/legitimate/misconfiguration) +- **Risk Scoring**: Fine-tune models for threat level prediction +- **Decision Support**: Generate actionable recommendations (block/monitor/investigate) + +### Dataset Comparison +Use both workflows together: +- **Summarization**: "What happened?" (temporal sequences, behavior patterns) +- **Risk Analysis**: "Why did it happen?" + "How urgent?" 
(attribution, prioritization) + +**Combined Training Strategy**: +```bash +# Generate both datasets from same incidents +./generate_llm_analysis.sh datasets/my_dataset.jsonl --model qwen2.5:3b --group-events --behavior-analysis +./generate_cause_risk_analysis.sh datasets/my_dataset.jsonl --model qwen2.5:3b --group-events + +# Correlate separately +python3 correlate_incidents.py datasets/my_dataset.*.json --jsonl datasets/my_dataset.jsonl -o summary_dataset.json +python3 correlate_risks.py datasets/my_dataset.*.json --jsonl datasets/my_dataset.jsonl -o risk_dataset.json + +# Multi-task training: Merge datasets and train single model on both tasks +``` + +--- + +**For detailed implementation**: See [README_dataset_risk_workflow.md](README_dataset_risk_workflow.md) +**For workflow comparison**: See [WORKFLOWS_OVERVIEW.md](WORKFLOWS_OVERVIEW.md) (if available) +**For evaluation methods**: See [LLM_EVALUATION_GUIDE.md](LLM_EVALUATION_GUIDE.md) diff --git a/docs/immune/Immune.md b/docs/immune/Immune.md index b54692a6c..378d7b8f0 100644 --- a/docs/immune/Immune.md +++ b/docs/immune/Immune.md @@ -2,17 +2,41 @@ This is the main guide to the documentation related to the changes done to Slips as part of incorporating the immunology ideas +### Architecture - [Main Architecture of Slips Immune](https://stratospherelinuxips.readthedocs.io/en/develop/immune/immune_architecture.html) + +### - [Research RPI Limitations](https://stratospherelinuxips.readthedocs.io/en/develop/immune/research_rpi_limitations_and_define_acceptable_performance_benchmarks.html) - [Slips Compatibility In The RPI](https://stratospherelinuxips.readthedocs.io/en/develop/immune/reimplement_slips_features_incompatible_with_the_rpi.html) - [Installing Slips On the RPI](https://stratospherelinuxips.readthedocs.io/en/develop/immune/installing_slips_in_the_rpi.html) - [LLM Research and Selection](https://stratospherelinuxips.readthedocs.io/en/develop/immune/research_and_selection_of_llm_candidates.html) - [LLM RPI 
Performance](https://stratospherelinuxips.readthedocs.io/en/develop/immune/research_rpi_llm_performance.html) -- [LLM RPI Finetuning Frameworks](https://stratospherelinuxips.readthedocs.io/en/develop/immune/finetuning_frameworks_rpi_5.html) -- [LLM Summarization Dataset](https://stratospherelinuxips.readthedocs.io/en/develop/immune/summary_dataset.html) + +### Security & Network Configuration + - [ARP Poisoning](https://stratospherelinuxips.readthedocs.io/en/develop/immune/arp_poisoning.html) - [ARP Poisoning Risks](https://stratospherelinuxips.readthedocs.io/en/develop/immune/arp_poisoning_risks.html) - [Blocking with Slips as an Access Point](https://stratospherelinuxips.readthedocs.io/en/develop/immune/blocking_in_slips.html) - [IDS-in-the-middle Traffic routing](https://stratospherelinuxips.readthedocs.io/en/develop/immune/ids_in_the_middle_traffic_routing.html) - [RPI Failover Mechanisms](https://stratospherelinuxips.readthedocs.io/en/develop/immune/failover_mechanisms.html) + +### Datasets & LLM Training + +**Overview Documents:** +- [Dataset Generation Workflows Overview](https://stratospherelinuxips.readthedocs.io/en/develop/immune/WORKFLOWS_OVERVIEW.html) - Quick comparison of summarization vs. 
risk workflows +- [Summarization Dataset Report](https://stratospherelinuxips.readthedocs.io/en/develop/immune/DATASET_REPORT.html) - Event summarization and behavior analysis +- [Risk Analysis Dataset Report](https://stratospherelinuxips.readthedocs.io/en/develop/immune/DATASET_RISK_REPORT.html) - Root cause and risk assessment + +**Detailed Workflow Guides:** +- [Summarization Workflow Implementation](https://stratospherelinuxips.readthedocs.io/en/develop/immune/README_dataset_summary_workflow.html) - Step-by-step guide for generating summarization datasets +- [Risk Analysis Workflow Implementation](https://stratospherelinuxips.readthedocs.io/en/develop/immune/README_dataset_risk_workflow.html) - Step-by-step guide for generating risk datasets +- [Alert DAG Parser Documentation](https://stratospherelinuxips.readthedocs.io/en/develop/immune/README_alert_dag.html) - DAG structural analysis reference + +**Dataset Evaluation (LLM-as-a-judge):** +- [LLM Evaluation Guide](https://stratospherelinuxips.readthedocs.io/en/develop/immune/LLM_EVALUATION_GUIDE.html) - How to evaluate and compare LLM models +- [Summarization Evaluation Results](https://stratospherelinuxips.readthedocs.io/en/develop/immune/summary_report.html) - Performance metrics for summarization models +- [Risk Analysis Evaluation Results](https://stratospherelinuxips.readthedocs.io/en/develop/immune/risk_summary.html) - Performance metrics for risk assessment models + +**LLM Finetuning:** +- [LLM RPI Finetuning Frameworks](https://stratospherelinuxips.readthedocs.io/en/develop/immune/finetuning_frameworks_rpi_5.html) diff --git a/docs/immune/LLM_EVALUATION_GUIDE.md b/docs/immune/LLM_EVALUATION_GUIDE.md new file mode 100644 index 000000000..5ee29970d --- /dev/null +++ b/docs/immune/LLM_EVALUATION_GUIDE.md @@ -0,0 +1,415 @@ +# LLM Evaluation Framework Guide + +## Overview + +Evaluate four LLMs (GPT-4o, GPT-4o-mini, Qwen2.5 1.5B, Qwen2.5) on security incident summarization using GPT-4o as a network security 
analyst judge. + +- 50-sample evaluation (10 Normal + 40 Malware) +- Comparative ranking (1-4) with scores (1-10) +- Interactive HTML dashboard +- Cost: ~$5 for 50 evaluations + +--- + +## Quick Start + +### Setup + +```bash +pip install openai python-dotenv +export OPENAI_API_KEY="sk-your-key-here" +``` + +### Run Evaluation + +```bash +# Summarization workflow +./run_evaluation_summary.sh + +# Risk analysis workflow +./run_evaluation_risk.sh + +# OR manual steps +python3 datasets/create_evaluation_sample.py +python3 evaluate_summaries.py +python3 analyze_results.py +python3 generate_dashboard.py + +# View results +cat results/summary_report.md +firefox results/summary_dashboard.html +``` + +### Test First + +```bash +# Single incident test (~$0.10) +python3 test_evaluation.py +``` + +--- + +## Components + +### 1. Dataset Sampler +```bash +python3 datasets/create_evaluation_sample.py [--size 50] [--seed 42] +``` +Output: `datasets/summary_sample.json` + +### 2. Judge Evaluation +```bash +python3 evaluate_summaries.py [--judge gpt-4o] [--input FILE] [--output FILE] +``` +Output: `results/summary_results.json` + +### 3. Results Analysis +```bash +python3 analyze_results.py [--results FILE] [--summary FILE] [--csv FILE] +``` +Output: `results/summary_report.md` (Markdown), `results/summary_data.csv` + +### 4. 
Dashboard Generator +```bash +python3 generate_dashboard.py [--results FILE] [--sample FILE] [--output FILE] +``` +Output: `results/summary_dashboard.html` + +**All scripts support `--help` for full options.** + +--- + +## Example Workflows + +### Standard Evaluation (50 samples, ~$5) +```bash +python3 datasets/create_evaluation_sample.py +python3 evaluate_summaries.py +python3 analyze_results.py +python3 generate_dashboard.py +firefox results/summary_dashboard.html +``` + +### Budget Evaluation (GPT-4o-mini judge, ~$1) +```bash +python3 datasets/create_evaluation_sample.py +python3 evaluate_summaries.py --judge gpt-4o-mini +python3 analyze_results.py +``` + +### Large-Scale Evaluation (100 samples, ~$10) +```bash +python3 datasets/create_evaluation_sample.py --size 100 -o datasets/sample_100.json +python3 evaluate_summaries.py -i datasets/sample_100.json -o results/results_100.json +python3 analyze_results.py -r results/results_100.json -s results/summary_100.md +python3 generate_dashboard.py -r results/results_100.json -s datasets/sample_100.json +``` + +### Compare Judge Models +```bash +python3 datasets/create_evaluation_sample.py + +# GPT-4o judge +python3 evaluate_summaries.py --judge gpt-4o -o results/eval_gpt4o.json +python3 analyze_results.py -r results/eval_gpt4o.json -s results/summary_gpt4o.md + +# GPT-4o-mini judge +python3 evaluate_summaries.py --judge gpt-4o-mini -o results/eval_mini.json +python3 analyze_results.py -r results/eval_mini.json -s results/summary_mini.md + +# Compare +diff results/summary_gpt4o.md results/summary_mini.md +``` + +--- + +## Dashboard Features + +- **Summary Metrics**: Total incidents, top performer, win rate +- **Win Rate Chart**: Bar chart comparing models +- **Position Distribution**: 1st/2nd/3rd/4th place frequency +- **Category Performance**: Malware vs Normal breakdown +- **Head-to-Head Matrix**: Pairwise win rates +- **Incident Browser**: Searchable table with expandable details + +**Interactive:** Click rows for 
full details, search/filter, dark/light theme + +--- + +## Understanding Results + +### Example Output +``` +Rank Model Avg Pos Avg Score Win Rate +1 GPT-4o 1.8 8.5 45.0% +2 Qwen2.5 1.5B 2.3 7.2 28.0% +3 GPT-4o-mini 2.7 6.8 18.0% +4 Qwen2.5 3.2 5.9 9.0% +``` + +- **Win Rate**: Percentage of incidents in which the model was ranked #1 +- **Avg Position**: 1-4 scale (lower is better) +- **Avg Score**: 1-10 scale (higher is better) + +--- + +## Evaluation Criteria + +The judge (GPT-4o acting as a network security analyst) evaluates each summary on: +1. Accuracy of threat identification +2. Completeness of critical events +3. Clarity and readability +4. Actionability for incident response +5. Professional quality for SOC + +Output: Rankings + scores + justification + +--- + +## Tips + +### Cost Optimization +- Test with `test_evaluation.py` first (~$0.10) +- Use `--judge gpt-4o-mini` for testing (80% cheaper) +- Start with 25-50 samples + +### Organization +```bash +mkdir -p experiments/exp01 +python3 datasets/create_evaluation_sample.py -o experiments/exp01/sample.json +python3 evaluate_summaries.py -i experiments/exp01/sample.json -o experiments/exp01/results.json +# ... continue workflow in exp01/ +``` + +### Batch Processing +```bash +for size in 25 50 100; do + python3 datasets/create_evaluation_sample.py --size "$size" -o "datasets/sample_${size}.json" + python3 evaluate_summaries.py -i "datasets/sample_${size}.json" -o "results/eval_${size}.json" +done +``` + +--- + +## Troubleshooting + +**API Key:** +```bash +echo $OPENAI_API_KEY +export OPENAI_API_KEY="sk-..." 
+``` + +**Rate Limits:** If the API returns rate-limit errors, add a `time.sleep(2)` delay between judge requests in `evaluate_summaries.py` + +**Dashboard:** Loads Chart.js and Bootstrap from a CDN, so viewing it requires internet access + +--- + +## Files Generated + +``` +datasets/summary_sample.json # 50 sampled incidents +results/summary_results.json # Judge rankings +results/summary_report.md # Markdown report +results/summary_data.csv # Spreadsheet data +results/summary_dashboard.html # Interactive visualization +``` + +--- + +## Dataset Composition + +**Summarization Dataset:** +- Normal: 10 samples (20% - all available) +- Malware: 40 samples (80%) +- Event count: 24 - 7,322 (avg: 1,518) +- Stratified by complexity + +**Risk Analysis Dataset:** +- Normal: 18 samples (36% - all available) +- Malware: 32 samples (64%) +- Models: GPT-4o, GPT-4o-mini, Qwen2.5, Qwen2.5 3B +- Fields: cause_analysis + risk_assessment + +--- + +## Risk Analysis Evaluation + +An alternative workflow that evaluates **Cause & Risk analysis** outputs using the LLM-as-judge methodology. + +This workflow evaluates how well LLMs perform root cause analysis and risk assessment for security incidents. For dataset generation, see [README_dataset_risk_workflow.md](README_dataset_risk_workflow.md). 
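
As a rough illustration of what this evaluation produces: each model receives five criterion scores (1-5) per incident, which are averaged per incident and then summarized per model as mean ± standard deviation. The sketch below assumes the JSON layout shown in the example output later in this section. The function names are hypothetical, and this is not the code of `analyze_results.py`.

```python
from statistics import mean, stdev

# Criterion keys as they appear in the example results JSON in this guide.
CRITERIA = (
    "cause_identification",
    "evidence_reasoning",
    "risk_level",
    "business_impact",
    "investigation_priority",
)


def incident_average(evaluation: dict) -> float:
    """Average the five 1-5 criterion scores of one model on one incident."""
    return float(mean(evaluation[criterion]["score"] for criterion in CRITERIA))


def summarize(results: list) -> dict:
    """Return {model: (mean, std dev)} of per-incident averages across incidents."""
    per_model = {}
    for incident in results:
        for model, evaluation in incident["model_evaluations"].items():
            per_model.setdefault(model, []).append(incident_average(evaluation))
    return {
        model: (mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
        for model, scores in per_model.items()
    }
```

With the example scores from this guide (4, 5, 4, 3, 4), `incident_average` returns 4.0, matching the `average_score` field in the sample results.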
+ +### Prerequisites + +**Input Requirements:** +- Cause & Risk dataset with multiple model analyses (`.cause_risk.*.json` files) +- Correlated final dataset (from `correlate_risks.py`) +- Judge model access (GPT-4o recommended for evaluation quality) + +**Environment:** +- `OPENAI_API_KEY` set for judge model +- Python packages: `openai`, `python-dotenv` + +### Quick Start + +```bash +# Automated workflow +./run_evaluation_risk.sh + +# OR manual steps +python3 datasets/create_risk_sample.py +python3 evaluate_risk.py +python3 analyze_results.py --results results/risk_results.json --summary results/risk_summary.md --csv results/risk_data.csv +python3 generate_dashboard.py --results results/risk_results.json --output results/risk_dashboard.html +``` + +### Evaluation Components + +**Script: `evaluate_risk.py`** +- Reads correlated dataset with `cause_analysis` and `risk_assessment` fields +- Extracts analyses from different models (GPT-4o, GPT-4o-mini, Qwen, etc.) +- Queries judge model to score each analysis on 5 criteria +- Outputs structured JSON with scores and justifications + +**Script: `analyze_results.py`** +- Aggregates scores across models and criteria +- Generates statistical summaries (mean, median, std dev) +- Creates markdown report with model comparison tables +- Supports both summarization and risk evaluation results + +**Script: `generate_dashboard.py`** +- Creates interactive HTML dashboard +- Visualizes score distributions per model and criterion +- Side-by-side comparisons of model outputs +- Incident-level detail views with full evidence + +### Judge Criteria (Security Risk Analyst Perspective) + +The judge model evaluates each analysis on **5 criteria** (1-5 scale): + +1. **Cause Identification Accuracy** (1-5) + - Correctly categorizes as Malicious / Legitimate / Misconfiguration + - Identifies specific attack techniques or benign operational causes + - Distinguishes between intentional malicious activity and system misconfigurations + +2. 
**Evidence-Based Reasoning** (1-5) + - Analysis grounded in actual events from DAG evidence + - Logical connection between observed behavior and proposed causes + - Avoids speculation unsupported by evidence + +3. **Risk Level Accuracy** (1-5) + - Appropriate risk classification (Critical / High / Medium / Low) + - Risk level justified by actual threat severity + - Considers both likelihood and impact + +4. **Business Impact Assessment** (1-5) + - Realistic evaluation of potential business consequences + - Specific impact types (data breach, service disruption, compliance violation) + - Appropriate scope and severity of impact description + +5. **Investigation Priority** (1-5) + - Actionable prioritization (Immediate / High / Medium / Low) + - Aligned with risk level and business impact + - Clear guidance for security team response + +**Scoring Guidelines:** +- **5**: Excellent - Highly accurate, well-justified, actionable +- **4**: Good - Mostly accurate with minor issues +- **3**: Adequate - Correct direction but lacks depth or has some inaccuracies +- **2**: Poor - Significant errors or missing key elements +- **1**: Unacceptable - Fundamentally incorrect or irrelevant + +### Cost Estimates + +**Evaluation Costs (GPT-4o judge):** +- 50 incidents × 4 models = 200 evaluations +- ~2,000 tokens per evaluation (input + output) +- Total: ~400,000 tokens ≈ $5 USD +- Time: ~30-45 minutes + +**Model Comparison:** +- GPT-4o: $5.00 / 1M input tokens, $15.00 / 1M output tokens +- GPT-4o-mini: $0.15 / 1M input, $0.60 / 1M output (cheaper but less reliable as judge) + +### Example Output + +**Risk Evaluation Results (`results/risk_results.json`):** +```json +{ + "incident_id": "abc123...", + "category": "Malware", + "model_evaluations": { + "cause_risk_gpt4o_mini": { + "cause_identification": {"score": 4, "justification": "..."}, + "evidence_reasoning": {"score": 5, "justification": "..."}, + "risk_level": {"score": 4, "justification": "..."}, + "business_impact": {"score": 
3, "justification": "..."}, + "investigation_priority": {"score": 4, "justification": "..."}, + "average_score": 4.0 + } + } +} +``` + +**Summary Report (`results/risk_summary.md`):** +``` +Model Performance Summary +========================= + +cause_risk_gpt4o: 4.2 ± 0.6 +cause_risk_gpt4o_mini: 3.8 ± 0.7 +cause_risk_qwen2_5: 3.5 ± 0.8 + +Criterion Breakdown: +- Cause Identification: 3.9 +- Evidence Reasoning: 4.1 +- Risk Level Accuracy: 3.7 +- Business Impact: 3.6 +- Investigation Priority: 3.8 +``` + +### Workflow Integration + +**Generate → Evaluate → Iterate:** + +```bash +# 1. Generate Cause & Risk dataset (see README_RISK_WORKFLOW.md) +./generate_cause_risk_analysis.sh datasets/my_dataset.jsonl --model gpt-4o-mini --group-events +python3 correlate_risks.py datasets/my_dataset.*.json -o final_dataset.json + +# 2. Sample for evaluation +python3 datasets/create_risk_sample.py + +# 3. Evaluate +python3 evaluate_risk.py datasets/risk_sample.json -o results/risk_results.json + +# 4. Analyze results +python3 analyze_results.py -r results/risk_results.json -s results/risk_summary.md + +# 5. 
View dashboard +python3 generate_dashboard.py -r results/risk_results.json -s datasets/risk_sample.json -o results/risk_dashboard.html +open results/risk_dashboard.html +``` + +### Tips & Best Practices + +**Sampling Strategy:** +- Include diverse incident types (Normal + Malware) +- Sample across complexity levels (event count variation) +- Use stratified sampling to ensure representative coverage + +**Judge Model Selection:** +- **GPT-4o**: Best evaluation quality, use for final results +- **GPT-4o-mini**: Faster/cheaper for development iteration +- Avoid using same model as judge and generator (evaluation bias) + +**Interpreting Results:** +- Scores < 3.0: Significant improvement needed +- Scores 3.0-4.0: Acceptable, room for refinement +- Scores > 4.0: High quality analysis +- High variance: Inconsistent performance across incidents + +**Common Issues:** +- Low "Evidence Reasoning": Model hallucinating causes not in DAG +- Low "Risk Level": Overestimating or underestimating severity +- Low "Business Impact": Generic impacts instead of specific consequences diff --git a/docs/immune/README_alert_dag.md b/docs/immune/README_alert_dag.md new file mode 100644 index 000000000..00d5c432d --- /dev/null +++ b/docs/immune/README_alert_dag.md @@ -0,0 +1,484 @@ +# Alert DAG Parser + +## Overview + +`alert_dag_parser.py` is a Python tool that parses JSONL (JSON Lines) files containing Slips security incidents and events in IDEA format. Unlike traditional log parsers that rely on regex pattern matching of text descriptions, this tool uses **structured JSON field-based classification** to ensure compatibility with current and future unknown alert types. + +## Design Philosophy + +### Future-Proof Architecture + +The tool is designed to handle new alert types without code modifications by: + +1. **Field-based classification** - Uses standardized JSON fields (`Severity`, `Source`, `Target`) instead of parsing description text +2. 
**Graceful degradation** - Unknown patterns are grouped and displayed automatically +3. **No regex maintenance** - New alert types work immediately without updating pattern definitions + +### Why Not Text Parsing? + +Traditional approaches like `slips_dag_generator.py` use regex patterns on description text: +```python +# Brittle approach - breaks when text changes +r'horizontal port scan to port\s+(\d+/\w+)' +r'C&C channel.*?destination IP: ([\d.]+)' +``` + +**Problems:** +- Breaks when description text changes +- Requires code updates for new alert types +- Fragile maintenance burden + +**Solution:** +```python +# Robust approach - uses structured fields +severity = event['Severity'] +target_ip = event['Target'][0]['IP'] +target_port = event['Target'][0]['Port'][0] +``` + +## File Format + +### JSONL Structure + +The input file contains one JSON object per line with two entry types: + +#### Incidents (Alerts) +```json +{ + "Status": "Incident", + "ID": "96b2b890-8e6d-458a-9217-71cfff0ef1c5", + "Source": [{"IP": "192.168.1.122"}], + "StartTime": "1970-01-01T00:00:13.676697+00:00", + "CreateTime": "2025-03-06T13:53:53.687361+00:00", + "CorrelID": ["event-uuid-1", "event-uuid-2", ...], + "Note": "{\"accumulated_threat_level\": 15.36, \"timewindow\": 1, \"EndTime\": \"...\"}" +} +``` + +#### Events (Evidence) +```json +{ + "Status": "Event", + "ID": "9180df3e-449d-412b-b8c9-45fb76831e12", + "Severity": "Info", + "StartTime": "1970-01-01T00:00:13.676697+00:00", + "Confidence": 1.0, + "Description": "Connecting to private IP: fd2d:ab8c:225::1 on destination port: 53 threat level: info.", + "Source": [{"IP": "fd2d:ab8c:225:0:f575:44d7:5a0b:2224", "Port": [49885]}], + "Target": [{"IP": "fd2d:ab8c:225::1", "Port": [53]}], + "Note": "{\"uids\": [...], \"threat_level\": \"info\", \"timewindow\": 1}" +} +``` + +### Correlation Model + +- Incidents contain `CorrelID` array with Event UUIDs +- Events are linked to Incidents via their `ID` field +- One Incident can have 
multiple Events +- Events can theoretically belong to multiple Incidents + +## Usage + +### Basic Usage + +```bash +# Analyze all incidents in the file +python3 alert_dag_parser.py alerts.json + +# Analyze specific incident by UUID +python3 alert_dag_parser.py alerts.json --incident-id 96b2b890-8e6d-458a-9217-71cfff0ef1c5 + +# Save output to file +python3 alert_dag_parser.py alerts.json -o incident_report.txt + +# Verbose mode (shows parsing progress) +python3 alert_dag_parser.py alerts.json --verbose +``` + +### Command-Line Options + +| Option | Short | Description | +|--------|-------|-------------| +| `--incident-id` | `-i` | Analyze specific incident by UUID | +| `--output` | `-o` | Write output to file instead of stdout | +| `--verbose` | `-v` | Show parsing progress and statistics | + +### Example Workflow + +```bash +# 1. Quick analysis of all incidents +./alert_dag_parser.py sample_logs/alya_datasets/Malware/.../alerts.json + +# 2. Identify interesting incident from summary +# 3. Deep dive into specific incident +./alert_dag_parser.py alerts.json -i -o incident_analysis.txt + +# 4. Review detailed report +less incident_analysis.txt +``` + +## Output Format + +### Comprehensive Analysis + +The tool generates a comprehensive per-incident analysis showing ALL associated events: + +``` +============================================================ +Incident: 96b2b890-8e6d-458a-9217-71cfff0ef1c5 +Source IP: 192.168.1.122 | Timewindow: 1 +Timeline: 1970-01-01 00:00:13 to 1970-01-01 01:00:13 +Threat Level: 15.36 | Events: 24 + +• 00:00-00:20 - 6 events to 224.0.0.1 [HIGH] + - Connection on port 0 from 0.0.0.0:0 to 224.0.0.1:0. threat level: high. (x6) + +• 00:05-00:15 - 8 events to port 53 [INFO] + - Connecting to private IP: fd2d:ab8c:225::1 on destination port: 53 threat level: info. (x4) + - Connecting to private IP: 192.168.1.1 on destination port: 53 threat level: info. 
(x4) + +• 00:10 - 3 events to 81.169.128.232:4743 [MEDIUM] + - Connection to unknown destination port 4743/TCP destination IP 81.169.128.232. threat level: medium. (x3) + +• 00:12 - 1 events to 176.9.116.3:3889 [HIGH] + - Connection to unknown destination port 3889/TCP destination IP 176.9.116.3. threat level: high. + +• 00:07-00:13 - 6 events to 4 IPs [INFO] + - A connection without DNS resolution to IP: 81.169.128.232 threat level: info. (x3) + - A connection without DNS resolution to IP: 176.9.116.3 threat level: info. + - A connection without DNS resolution to IP: 107.170.231.118 threat level: info. + - A connection without DNS resolution to IP: 37.187.54.76 threat level: info. + +Total Evidence: 24 events +Severity breakdown: High: 7, Medium: 3, Info: 14 +``` + +### Output Structure + +Each incident analysis includes: + +1. **Header** - Incident UUID and metadata +2. **Timeline** - Start and end times from timewindow +3. **Threat metrics** - Accumulated threat level and event count +4. **Grouped events** - Events grouped by: + - Severity level (Critical → High → Medium → Low → Info) + - Target characteristics (IP, port, or pattern) + - Time range (earliest to latest in group) +5. **Event details** - Up to 3 example descriptions per group with counts +6. **Summary statistics** - Total events and severity breakdown + +### Grouping Logic + +Events are grouped using structured fields: + +```python +group_key = (event.severity, target_summary) + +# target_summary examples: +# - "192.168.1.1:53" (specific IP and port) +# - "224.0.0.1" (IP only) +# - "port 53" (port only) +# - "4 IPs" (multiple targets) +# - "Unknown" (no target info) +``` + +This ensures consistent grouping regardless of description text variations. + +## Technical Architecture + +### Core Classes + +#### `JSONEvent` +Dataclass representing individual security events (evidence). 
+ +**Key Fields:** +- `id` - Unique event identifier (UUID) +- `severity` - Info, Low, Medium, High, Critical +- `source_ips` - List of source IP addresses +- `source_ports` - List of source ports +- `target_ips` - List of destination IP addresses +- `target_ports` - List of destination ports +- `description` - Human-readable text (display only) +- `confidence` - Numeric confidence score +- `note` - Parsed metadata dictionary + +**Design Note:** Uses lists for IPs/ports to handle multi-target events gracefully. + +#### `JSONIncident` +Dataclass representing security incidents (alerts). + +**Key Fields:** +- `id` - Unique incident identifier (UUID) +- `source_ips` - List of source IPs involved in incident +- `correl_ids` - List of Event UUIDs associated with this incident +- `note` - Metadata including `accumulated_threat_level`, `timewindow`, `EndTime` + +#### `AlertJSONParser` +Parses JSONL files and builds incident-event correlation. + +**Responsibilities:** +- Line-by-line JSONL parsing +- Separation of Incidents from Events +- Event lookup index creation (`{event_id: event_object}`) +- Error handling and validation + +#### `AlertDAGGenerator` +Generates comprehensive analysis output. 
+ +**Responsibilities:** +- Field-based event grouping (not text parsing) +- Severity-based prioritization +- Timeline formatting +- Summary statistics generation + +### Data Flow + +``` +JSONL File + ↓ +AlertJSONParser.parse_file() + ├─→ List[JSONIncident] + └─→ Dict[event_id: JSONEvent] + ↓ +For each Incident: + AlertJSONParser.get_incident_events() + ↓ + List[JSONEvent] (correlated events) + ↓ + AlertDAGGenerator.generate_comprehensive_analysis() + ├─→ Group by (severity, target_summary) + ├─→ Sort by severity priority + ├─→ Format timeline and descriptions + └─→ Generate statistics + ↓ +Comprehensive Analysis Output +``` + +### Field-Based Classification + +Unlike regex-based parsers, this tool classifies events using structured fields: + +```python +def _create_target_summary(self, event: JSONEvent) -> str: + """Create target summary using structured fields.""" + if event.target_ips and event.target_ports: + # Both IP and port available + ip_summary = event.target_ips[0] if len(event.target_ips) == 1 else f"{len(event.target_ips)} IPs" + port_summary = str(event.target_ports[0]) if len(event.target_ports) == 1 else f"{len(event.target_ports)} ports" + return f"{ip_summary}:{port_summary}" + elif event.target_ips: + # Only IP available + return event.target_ips[0] if len(event.target_ips) == 1 else f"{len(event.target_ips)} IPs" + elif event.target_ports: + # Only port available + return f"port {event.target_ports[0]}" if len(event.target_ports) == 1 else f"{len(event.target_ports)} ports" + else: + # No structured target info - use description prefix as fallback + desc_prefix = event.description.split()[0] if event.description else "Unknown" + return desc_prefix +``` + +**Benefits:** +- Works with any event type (current or future) +- No regex pattern maintenance +- Consistent grouping logic +- Graceful fallback for edge cases + +## Example Datasets + +### Test Dataset Structure + +``` +sample_logs/alya_datasets/Malware/ +├── CTU-Malware-Capture-Botnet-219-2/ 
+├── CTU-Malware-Capture-Botnet-327-2/ +└── CTU-Malware-Capture-Botnet-346-1/ + └── 2018-04-03_win12-fixed/ + └── 9/ + ├── alerts.json (3,226 entries: 47 incidents, 3,179 events) + └── slips.log (Original Slips log output) +``` + +### Dataset Characteristics + +**CTU-Malware-Capture-Botnet-346-1 (9):** +- 47 Incidents +- 3,179 Events +- Event types: + - Private IP connections + - Port 0 connections (multicast) + - Unknown destination ports + - DNS resolution issues + - Reconnection attempts + - Long connections + +### Sample Analysis + +```bash +# Quick stats +python3 alert_dag_parser.py sample_logs/alya_datasets/Malware/CTU-Malware-Capture-Botnet-346-1/2018-04-03_win12-fixed/9/alerts.json --verbose 2>&1 | head -3 + +# Output: +# Parsing file: sample_logs/alya_datasets/Malware/... +# Found 47 incidents and 3179 events +``` + +## Error Handling + +### Graceful Error Recovery + +The parser handles common issues without crashing: + +1. **Malformed JSON lines** - Skipped with warning +2. **Missing Event IDs** - Warning logged, analysis continues +3. **Missing fields** - Defaults to "Unknown" or empty lists +4. **Invalid timestamps** - Falls back to raw ISO string +5. **Unparseable Note fields** - Stored as raw string + +### Warning Messages + +``` +Warning: JSON parse error at line 42: Expecting ',' delimiter +Warning: Event abc123-... not found for Incident xyz789-... 
+Warning: Unknown status 'Test' at line 156 +``` + +### Exit Codes + +- `0` - Success +- `1` - File not found, write error, or no incidents found + +## Performance Considerations + +### Memory Usage + +- **In-memory**: All events and incidents are loaded into memory +- **Typical**: ~50 incidents + ~3,000 events = ~5-10 MB RAM +- **Large datasets**: May need streaming for >100,000 events + +### Processing Speed + +- ~3,000 events parsed in <1 second +- JSON parsing is the bottleneck (not analysis logic) +- Linear time complexity: O(incidents + events) + +### Scalability Tips + +For very large datasets (>100K events): +1. Filter by timewindow or IP before parsing +2. Use `--incident-id` to analyze specific incidents +3. Split JSONL files by timewindow + +## Comparison with slips_dag_generator.py + +| Feature | alert_dag_parser.py | slips_dag_generator.py | +|---------|---------------------|------------------------| +| **Input format** | JSONL (IDEA format) | Plain text logs | +| **Classification** | Structured fields | Regex on descriptions | +| **Future-proof** | ✅ Yes | ❌ Requires updates | +| **Analysis mode** | Per-incident only | Per-IP or per-analysis | +| **Output formats** | Comprehensive only | 5 formats (compact, minimal, etc.) | +| **New alert types** | Work automatically | Need code updates | +| **Maintenance** | Low | High (regex patterns) | + +### When to Use Each Tool + +**Use `alert_dag_parser.py` when:** +- Working with JSONL/IDEA format files +- Need future-proof classification +- Want per-incident comprehensive analysis +- Analyzing structured alert exports + +**Use `slips_dag_generator.py` when:** +- Working with plain text Slips logs +- Need multiple output formats +- Want IP-based timeline analysis +- Analyzing real-time log streams + +## Limitations + +1. **Format dependency** - Only works with JSONL/IDEA format +2. **Memory bound** - All data loaded into memory (not streaming) +3. 
**Single output format** - Comprehensive analysis only (no minimal/compact modes) +4. **No IP grouping** - Per-incident analysis only, not per-IP +5. **Description fallback** - Unknown patterns use description prefix (not ideal but graceful) + +## Future Enhancements + +Potential improvements: + +1. **Streaming parser** - For very large files +2. **Multiple output formats** - Add compact, minimal, pattern modes +3. **Filtering options** - By severity, timewindow, IP range +4. **Statistical analysis** - Incident trends, severity distribution +5. **Export formats** - JSON, CSV, HTML reports +6. **IP-based grouping** - Optional IP-centric analysis mode +7. **Custom grouping** - User-defined grouping criteria + +## Troubleshooting + +### Common Issues + +**"File not found"** +```bash +# Check path is correct +ls -l alerts.json + +# Use absolute path +python3 alert_dag_parser.py /full/path/to/alerts.json +``` + +**"No incidents found"** +```bash +# Check file format +head -1 alerts.json | python3 -m json.tool + +# Verify Status field +grep -o '"Status": "[^"]*"' alerts.json | sort | uniq -c +``` + +**"Event XYZ not found for Incident ABC"** +- Event referenced in CorrelID but not in file +- Possible file truncation or corruption +- Analysis continues with warning + +### Debug Mode + +Enable verbose output to see parsing details: +```bash +python3 alert_dag_parser.py alerts.json --verbose 2>&1 | tee debug.log +``` + +## Contributing + +When modifying the tool: + +1. **Maintain field-based classification** - Don't add regex on descriptions +2. **Graceful fallbacks** - Unknown patterns should work, not crash +3. **Test with sample datasets** - Use CTU malware capture data +4. 
**Update this documentation** - Keep examples current + +### Testing Checklist + +- [ ] Parse all sample datasets without errors +- [ ] Verify incident-event correlation +- [ ] Check output formatting +- [ ] Test all CLI options +- [ ] Handle malformed JSON gracefully +- [ ] Validate with new/unknown alert types + +## License + +Part of the slips-tools repository. See main repository for license information. + +## Related Tools + +- `slips_dag_generator.py` - DAG generator for plain text Slips logs +- `analyze_slips_with_llm.sh` - LLM-enhanced analysis wrapper +- Slips IDS - https://github.com/stratosphereips/StratosphereLinuxIPS + +## References + +- IDEA format specification: https://idea.cesnet.cz/en/index +- Slips documentation: https://stratospherelinuxips.readthedocs.io/ +- CTU malware captures: https://www.stratosphereips.org/datasets-overview diff --git a/docs/immune/README_dataset_risk_workflow.md b/docs/immune/README_dataset_risk_workflow.md new file mode 100644 index 000000000..fcc802e99 --- /dev/null +++ b/docs/immune/README_dataset_risk_workflow.md @@ -0,0 +1,235 @@ +# Cause & Risk Analysis Workflow + +## Overview + +The **Cause & Risk Analysis workflow** generates structured security analysis for Slips incidents using LLM-powered assessment. For each incident, it produces: + +1. **Cause Analysis** - Categorized possible reasons (Malicious Activity / Legitimate Activity / Misconfigurations) +2. **Risk Assessment** - Structured evaluation (Risk Level / Business Impact / Investigation Priority) + +This workflow is complementary to the [Summarization workflow](README_dataset_summary_workflow.md), which focuses on event summarization and behavior analysis. 
+ +--- + +## Workflow Comparison + +| Aspect | Summarization Workflow | Risk Analysis Workflow | +|--------|------------------------|------------------------| +| **Script** | `generate_llm_analysis.sh` | `generate_cause_risk_analysis.sh` | +| **Correlation** | `correlate_incidents.py` | `correlate_risks.py` | +| **Output Fields** | `summary` + `behavior_analysis` | `cause_analysis` + `risk_assessment` | +| **LLM Calls** | 2 per incident (summary + behavior) | 2 per incident (cause + risk) | +| **Steps 1-2** | Identical (sampling + DAG) | Identical (sampling + DAG) | +| **Use Case** | Event summarization, behavior patterns | Root cause analysis, risk prioritization | + +--- + +## Prerequisites + +- Python 3.6+ with dependencies: `openai`, `python-dotenv` +- `OPENAI_API_KEY` environment variable set +- Access to OpenAI-compatible API (OpenAI, Ollama, etc.) + +For initial setup and shared steps, see [README_dataset_summary_workflow.md](README_dataset_summary_workflow.md). + +--- + +## Workflow Steps + +### Steps 1-2: Sampling & DAG Generation (Shared) + +These steps are **identical** to the Summarization workflow. See: +- [Step 1: Sample Incidents](README_dataset_summary_workflow.md#321-step-1-sample-representative-incidents) +- [Step 2: Generate DAG Analysis](README_dataset_summary_workflow.md#322-step-2-generate-dag-structural-analysis) + +**Quick commands:** +```bash +# Step 1: Sample 100 incidents +./sample_dataset.sh 100 my_dataset --seed 42 + +# Step 2: Generate DAG analysis +./generate_dag_analysis.sh datasets/my_dataset.jsonl +``` + +--- + +### Step 3: Generate Cause & Risk Analysis (Multiple Models) + +Use `generate_cause_risk_analysis.sh` to generate both cause and risk assessments for each incident. 
+ +**Basic usage:** +```bash +./generate_cause_risk_analysis.sh datasets/my_dataset.jsonl \ + --model gpt-4o-mini \ + --group-events +``` + +**Multi-model analysis (recommended):** +```bash +# GPT-4o-mini (fast, cost-effective) +./generate_cause_risk_analysis.sh datasets/my_dataset.jsonl \ + --model gpt-4o-mini \ + --group-events + +# GPT-4o (higher quality) +./generate_cause_risk_analysis.sh datasets/my_dataset.jsonl \ + --model gpt-4o \ + --group-events + +# Qwen 2.5 3B via Ollama (local, free) +./generate_cause_risk_analysis.sh datasets/my_dataset.jsonl \ + --model qwen2.5:3b \ + --base-url http://10.147.20.102:11434/v1 \ + --group-events + +# Qwen 2.5 1.5B via Ollama (faster local alternative) +./generate_cause_risk_analysis.sh datasets/my_dataset.jsonl \ + --model qwen2.5:1.5b \ + --base-url http://10.147.20.102:11434/v1 \ + --group-events +``` + +**Output files:** +- `datasets/my_dataset.cause_risk.gpt-4o-mini.json` +- `datasets/my_dataset.cause_risk.gpt-4o.json` +- `datasets/my_dataset.cause_risk.qwen2_5.json` + +**Key options:** +- `--group-events`: Groups similar events to reduce token usage (recommended for large incidents) +- `--verbose`: Show detailed progress and token counts +- `--incident-id `: Analyze specific incident only + +--- + +### Step 4: Correlate All Analyses + +Use `correlate_risks.py` to merge DAG, LLM, and Cause & Risk analyses into a unified dataset. + +```bash +python3 correlate_risks.py \ + datasets/my_dataset.*.json \ + --jsonl datasets/my_dataset.jsonl \ + -o datasets/final_dataset_risk.json +``` + +This creates a consolidated JSON file with all analyses merged by incident ID. 
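
The merge-by-incident-ID idea in Step 4 can be sketched as follows. This is a minimal illustration, not the actual logic of `correlate_risks.py`: the helper name `merge_by_incident_id` and the per-record `analysis` key are assumptions made for the example, while the output keys (`total_incidents`, `incidents`, `incident_id`) follow the dataset structure documented in this guide.

```python
def merge_by_incident_id(analyses):
    """Merge several analysis lists into one record per incident_id.

    `analyses` maps an output field name (e.g. "dag_analysis") to a list of
    dicts shaped like {"incident_id": ..., "analysis": ...}. The "analysis"
    key is a hypothetical input layout used only for this sketch.
    """
    merged = {}
    for field_name, incidents in analyses.items():
        for incident in incidents:
            incident_id = incident["incident_id"]
            # Create the record on first sight, then attach this analysis.
            record = merged.setdefault(incident_id, {"incident_id": incident_id})
            record[field_name] = incident["analysis"]
    return {"total_incidents": len(merged), "incidents": list(merged.values())}


dag = [
    {"incident_id": "a", "analysis": "dag text for a"},
    {"incident_id": "b", "analysis": "dag text for b"},
]
risk = [
    {"incident_id": "a", "analysis": {"cause_analysis": "...", "risk_assessment": "..."}},
]
result = merge_by_incident_id({"dag_analysis": dag, "cause_risk_gpt4o_mini": risk})
```

In this sketch an incident that is missing from one analysis file simply lacks that field in the merged record, so a partial run does not drop the incident.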
+ +--- + +## Complete Pipeline Example + +```bash +#!/bin/bash +# Full Cause & Risk Analysis Pipeline + +DATASET_NAME="my_dataset_risk" +NUM_SAMPLES=100 + +# Step 1: Sample incidents +./sample_dataset.sh $NUM_SAMPLES $DATASET_NAME --seed 42 + +# Step 2: Generate DAG analysis +./generate_dag_analysis.sh datasets/${DATASET_NAME}.jsonl + +# Step 3: Generate Cause & Risk analysis (multiple models) +./generate_cause_risk_analysis.sh datasets/${DATASET_NAME}.jsonl \ + --model gpt-4o-mini --group-events + +./generate_cause_risk_analysis.sh datasets/${DATASET_NAME}.jsonl \ + --model gpt-4o --group-events + +./generate_cause_risk_analysis.sh datasets/${DATASET_NAME}.jsonl \ + --model qwen2.5:3b \ + --base-url http://10.147.20.102:11434/v1 \ + --group-events + +# Step 4: Correlate all analyses +python3 correlate_risks.py \ + datasets/${DATASET_NAME}.*.json \ + --jsonl datasets/${DATASET_NAME}.jsonl \ + -o datasets/final_${DATASET_NAME}.json + +echo "Pipeline complete! Output: datasets/final_${DATASET_NAME}.json" +``` + +--- + +## Output Dataset Structure + +The final dataset contains merged analyses with the following structure: + +```json +{ + "total_incidents": 100, + "incidents": [ + { + "incident_id": "abc123-def456-...", + "category": "Malware", + "source_ip": "10.0.2.15", + "timewindow": "12", + "timeline": "2024-04-05 16:53:07 to 16:53:50", + "threat_level": 15.36, + "event_count": 4604, + "dag_analysis": "...", + "cause_risk_gpt4o_mini": { + "cause_analysis": "**Possible Causes:**\n\n**1. Malicious Activity:**\n• Reconnaissance scanning...\n\n**2. Legitimate Activity:**\n• Network monitoring...\n\n**3. 
Misconfigurations:**\n• Firewall misconfiguration...\n\n**Conclusion:** Most likely malicious reconnaissance...", + "risk_assessment": "**Risk Level:** High\n\n**Justification:** Active port scanning indicates potential attack preparation...\n\n**Business Impact:** Could lead to service disruption or data breach...\n\n**Likelihood of Malicious Activity:** High - Systematic scanning pattern...\n\n**Investigation Priority:** High - Investigate source and block if confirmed malicious" + }, + "cause_risk_gpt4o": { ... }, + "cause_risk_qwen2_5": { ... } + } + ] +} +``` + +**Field descriptions:** +- `cause_analysis`: Structured analysis with 3 categories (Malicious/Legitimate/Misconfigurations) + Conclusion +- `risk_assessment`: 5-field assessment (Risk Level, Justification, Business Impact, Likelihood, Investigation Priority) + +--- + +## Evaluation Workflow + +After generating the dataset, evaluate LLM performance using LLM-as-judge: + +```bash +# Evaluate risk assessments +python3 evaluate_risk.py datasets/final_dataset_risk.json \ + --judge-model gpt-4o \ + -o risk_evaluation_results.json +``` + +For detailed evaluation instructions, see [LLM_EVALUATION_GUIDE.md](LLM_EVALUATION_GUIDE.md#risk-analysis-evaluation). 
+ +--- + +## Performance Considerations + +**Token Optimization:** +- Use `--group-events` to reduce token usage by 96-99% for large incidents +- Without grouping: 4604 events → ~200K tokens +- With grouping: 4604 events → ~5K tokens + +**Model Selection:** +- **GPT-4o-mini**: Best balance of cost/quality for production +- **GPT-4o**: Highest quality, ~10x cost of mini +- **Qwen 2.5**: Free local alternative via Ollama + +**Parallel Processing:** +Run multiple models concurrently to reduce total pipeline time: +```bash +# Run in parallel (requires multiple terminals or background jobs) +./generate_cause_risk_analysis.sh datasets/my_dataset.jsonl --model gpt-4o-mini --group-events & +./generate_cause_risk_analysis.sh datasets/my_dataset.jsonl --model qwen2.5:3b --base-url http://localhost:11434/v1 --group-events & +wait +``` + +--- + +## Next Steps + +- **Evaluation**: [LLM_EVALUATION_GUIDE.md](LLM_EVALUATION_GUIDE.md) - Evaluate analysis quality +- **Summarization Workflow**: [README_dataset_summary_workflow.md](README_dataset_summary_workflow.md) - Alternative workflow +- **Comparison**: [WORKFLOWS_OVERVIEW.md](WORKFLOWS_OVERVIEW.md) - Choose the right workflow + +For questions or issues, see the main [README.md](../README.md). diff --git a/docs/immune/README_dataset_summary_workflow.md b/docs/immune/README_dataset_summary_workflow.md new file mode 100644 index 000000000..0c258268b --- /dev/null +++ b/docs/immune/README_dataset_summary_workflow.md @@ -0,0 +1,315 @@ +# Dataset Generation Pipeline for Slips Alert Analysis + +> **Note**: This guide covers the **Summarization workflow** (summary + behavior analysis). +> For **Cause & Risk analysis** workflow, see [README_dataset_risk_workflow.md](README_dataset_risk_workflow.md). +> For a comparison of both workflows, see [WORKFLOWS_OVERVIEW.md](WORKFLOWS_OVERVIEW.md). + +## 1. Overview + +This pipeline transforms raw Slips security logs into structured multi-model analysis datasets. 
The workflow consists of four stages: (1) sampling incidents from raw logs into JSONL format, (2) generating DAG-based structural analysis, (3) producing LLM-enhanced summaries with behavior analysis from multiple models, and (4) correlating all analyses into a unified JSON dataset. The output provides comprehensive incident analysis from different analytical perspectives, enabling comparative evaluation of model performance on security analysis tasks. + +## 2. Pipeline Components + +### 2.1 Python Scripts + +**`sample_dataset.py`** +Samples INCIDENT alerts and their associated EVENT alerts from Slips `alerts.json` files. Preserves the complete event context for each incident by following CorrelID references. Supports filtering by category (normal/malware), severity (low/medium/high), and reproducible sampling via random seeds. Outputs JSONL format compatible with downstream analysis tools. + +**`alert_dag_parser.py`** +Parses JSONL incident files and generates Directed Acyclic Graph (DAG) analysis showing the chronological structure of security events. Extracts incident metadata (source IPs, timewindows, threat levels, timelines) and produces comprehensive event summaries. Outputs structured JSON with incident-level analysis. + +**`alert_dag_parser_llm.py`** +Generates LLM-enhanced analysis by querying language models with structured incident data. Implements two key optimizations: (1) event grouping by pattern normalization (replaces IPs, ports, numbers with placeholders to identify identical patterns), reducing token counts by 96-99% for large incidents, and (2) dual-prompt analysis generating both severity-assessed summaries and structured behavior explanations. Supports multiple LLM backends via OpenAI-compatible APIs. Outputs JSON with both `summary` and `behavior_analysis` fields. + +**`correlate_incidents.py`** +Merges multiple JSON analysis files by matching `incident_id` fields. 
Combines DAG analysis with multiple LLM analyses (from different models) into a single unified dataset. Automatically detects analysis types from filenames (e.g., `.dag.json`, `.llm.gpt-4o-mini.json`, `.llm.qwen2.5.json`) and creates appropriately named fields in the output. Produces consolidated JSON suitable for model comparison and evaluation. + +**`merge_datasets.py`** +Merges multiple correlated dataset JSON files into a single unified dataset. Removes duplicates based on `incident_id` while preserving all analysis fields from each incident. Useful for extending existing datasets by combining separately generated correlated datasets. Supports multiple input files, automatic deduplication, and optional compact output format. + +### 2.2 Shell Wrappers + +**`sample_dataset.sh`** +Wrapper for `sample_dataset.py` providing simplified command-line interface. Handles argument parsing, validation, and automatic file naming (appends `.jsonl` extension). Supports filtering options, random seed configuration, and optional statistics generation. + +**`generate_dag_analysis.sh`** +Wrapper for `alert_dag_parser.py` with automatic output filename generation based on input JSONL file. Converts `input.jsonl` to `input.dag.json` by default. Provides colored status logging and error handling. + +**`generate_llm_analysis.sh`** +Wrapper for `alert_dag_parser_llm.py` supporting multiple model configurations. Auto-generates output filenames incorporating model names (e.g., `input.llm.gpt-4o-mini.json`, `input.llm.qwen2.5.json`). Handles model endpoint configuration for both cloud APIs (OpenAI) and local servers (Ollama). Passes through optimization flags for event grouping and behavior analysis. + +## 3. 
Dataset Generation Workflow + +### 3.1 Prerequisites + +**Input Requirements:** +- Raw Slips logs: `alerts.json` files from Slips network security analysis +- Directory structure: `sample_logs/alya_datasets/{Normal,Malware}/...` + +**Model Configuration:** +- **GPT-4o-mini**: OpenAI API key in environment variable `OPENAI_API_KEY` +- **Qwen2.5:3b**: Ollama server running at `http://10.147.20.102:11434/v1` (adjust as needed) +- **Qwen2.5:1.5b**: Ollama server with model installed + +**Software Dependencies:** +- Python 3.6+ (standard library only, apart from the OpenAI package listed below) +- `bash`, `jq` for shell scripts +- OpenAI Python package for LLM analysis: `pip install openai` + +### 3.2 Step-by-Step Process + +**Step 1: Sample Incidents from Raw Logs** + +Generate a JSONL file containing sampled incidents with all associated events: + +```bash +./sample_dataset.sh 20 my_dataset --category malware --seed 42 --include-stats +``` + +This creates: +- `my_dataset.jsonl` - Sampled incidents and events in JSONL format +- `my_dataset.stats.json` - Statistics about the sample (optional) + +**Step 2: Generate DAG Analysis** + +Parse the JSONL file and generate structural DAG analysis: + +```bash +./generate_dag_analysis.sh my_dataset.jsonl +``` + +Output: `my_dataset.dag.json` - JSON array of incidents with DAG-based analysis + +**Step 3: Generate LLM Analysis (GPT-4o-mini)** + +Query GPT-4o-mini for enhanced analysis with event grouping and behavior analysis: + +```bash +./generate_llm_analysis.sh my_dataset.jsonl \ + --model gpt-4o-mini \ + --base-url https://api.openai.com/v1 \ + --group-events \ + --behavior-analysis +``` + +Output: `my_dataset.llm.gpt-4o-mini.json` - JSON array with `summary` and `behavior_analysis` fields + +**Step 4: Generate LLM Analysis (Qwen2.5:3b)** + +Query the Qwen2.5:3b model via Ollama with the same optimization flags: + +```bash +./generate_llm_analysis.sh my_dataset.jsonl \ + --model qwen2.5:3b \ + --base-url http://10.147.20.102:11434/v1 \ + 
--group-events \ + --behavior-analysis +``` + +Output: `my_dataset.llm.qwen2.5.json` - JSON array with model-specific analysis + +**Step 5: Generate LLM Analysis (Qwen2.5:1.5b)** + +Query Qwen2.5:1.5b model for comparison with smaller model: + +```bash +./generate_llm_analysis.sh my_dataset.jsonl \ + --model qwen2.5:1.5b \ + --base-url http://10.147.20.102:11434/v1 \ + --group-events \ + --behavior-analysis +``` + +Output: `my_dataset.llm.qwen2.5.1.5b.json` - JSON array from smaller model + +**Step 6: Correlate All Analyses** + +Merge all analysis files into a unified dataset by incident_id, including category information from the original JSONL: + +```bash +python3 correlate_incidents.py my_dataset.*.json --jsonl my_dataset.jsonl -o final_dataset.json +``` + +Output: `final_dataset.json` - Consolidated dataset with all analyses per incident + +**Note:** The `--jsonl` parameter is used to extract the category field (Malware/Normal) from the original sampled data, ensuring proper ground truth labeling in the final dataset. 
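
Step 6 passes the glob `my_dataset.*.json`, and the correlation script is described as detecting analysis types from filenames. The sketch below shows one way such detection could work. It is illustrative only, not the code in `correlate_incidents.py`: the function name `classify_analysis_file` and the `KNOWN_KINDS` set are assumptions for the example. It also shows why filtering by kind can matter: a `my_dataset.stats.json` file produced with `--include-stats` would match the same glob.

```python
from pathlib import Path

# Analysis kinds this sketch recognizes (assumed from the filenames in this guide).
KNOWN_KINDS = {"dag", "llm"}


def classify_analysis_file(path, dataset):
    """Parse '<dataset>.<kind>[.<model>].json' into (kind, model), or None."""
    name = Path(path).name
    prefix, suffix = f"{dataset}.", ".json"
    if not (name.startswith(prefix) and name.endswith(suffix)):
        return None
    middle = name[len(prefix):-len(suffix)]
    # The first dot-separated token is the kind; the rest is the model label.
    kind, _, model = middle.partition(".")
    if kind not in KNOWN_KINDS:
        return None
    return kind, (model or None)


dag_kind = classify_analysis_file("my_dataset.dag.json", "my_dataset")
gpt_kind = classify_analysis_file("my_dataset.llm.gpt-4o-mini.json", "my_dataset")
small_qwen = classify_analysis_file("datasets/my_dataset.llm.qwen2.5.1.5b.json", "my_dataset")
stats_kind = classify_analysis_file("my_dataset.stats.json", "my_dataset")
```

How the model label is turned into an output field name (for example `llm_gpt4o_mini_analysis`) is left out on purpose, since the exact naming rule is not specified here.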
+ +### 3.3 Complete Workflow Example + +```bash +# Full pipeline execution +./sample_dataset.sh 20 my_dataset --category malware --seed 42 +./generate_dag_analysis.sh my_dataset.jsonl +./generate_llm_analysis.sh my_dataset.jsonl --model gpt-4o-mini --group-events --behavior-analysis +./generate_llm_analysis.sh my_dataset.jsonl --model qwen2.5:3b --base-url http://10.147.20.102:11434/v1 --group-events --behavior-analysis +./generate_llm_analysis.sh my_dataset.jsonl --model qwen2.5:1.5b --base-url http://10.147.20.102:11434/v1 --group-events --behavior-analysis +python3 correlate_incidents.py my_dataset.*.json --jsonl my_dataset.jsonl -o final_dataset.json +``` + +Files generated: +- `my_dataset.jsonl` - Sampled incidents (JSONL) +- `my_dataset.dag.json` - DAG analysis +- `my_dataset.llm.gpt-4o-mini.json` - GPT-4o-mini analysis +- `my_dataset.llm.qwen2.5.json` - Qwen2.5:3b analysis +- `my_dataset.llm.qwen2.5.1.5b.json` - Qwen2.5:1.5b analysis +- `final_dataset.json` - Unified correlated dataset + +### 3.4 Extending Existing Datasets + +To add more incidents to an existing correlated dataset without regenerating from scratch: + +**Step 1: Sample Additional Incidents** + +Use a different random seed to ensure new samples don't duplicate existing ones: + +```bash +./sample_dataset.sh 20 extension --category malware --seed 99 +``` + +**Step 2: Generate All Analyses for Extension** + +Run the full analysis pipeline on the new samples: + +```bash +./generate_dag_analysis.sh extension.jsonl +./generate_llm_analysis.sh extension.jsonl --model gpt-4o-mini --group-events --behavior-analysis +./generate_llm_analysis.sh extension.jsonl --model qwen2.5:3b --base-url http://10.147.20.102:11434/v1 --group-events --behavior-analysis +./generate_llm_analysis.sh extension.jsonl --model qwen2.5:1.5b --base-url http://10.147.20.102:11434/v1 --group-events --behavior-analysis +``` + +**Step 3: Correlate Extension Data** + +```bash +python3 correlate_incidents.py extension.*.json --jsonl 
extension.jsonl -o extension_dataset.json +``` + +**Step 4: Merge with Existing Dataset** + +Combine the original and extension datasets, automatically removing any duplicates: + +```bash +python3 merge_datasets.py final_dataset.json extension_dataset.json -o final_dataset_v2.json +``` + +**Alternative: Merge Multiple Extensions** + +If you have multiple extension datasets: + +```bash +python3 merge_datasets.py final_dataset.json extension1_dataset.json extension2_dataset.json -o combined_dataset.json +``` + +**Note on Deduplication:** The `merge_datasets.py` script automatically detects and removes duplicate incidents based on `incident_id`. If the same incident appears in multiple input files, only the first occurrence is kept. + +**Verification:** After merging, verify the operation completed successfully: + +```bash +python3 verify_merge.py --verbose +``` + +This validates file integrity, count accuracy, deduplication correctness, completeness, and data integrity. Use `--inputs` and `--output` flags to verify custom merge operations. + +## 4. Output Dataset Structure + +The final correlated dataset is a JSON array where each object represents one incident with all analyses: + +```json +[ + { + "incident_id": "bd47e95b-a211-41b1-9644-40d6a2e77a07", + "category": "Malware", + "source_ip": "10.0.2.15", + "timewindow": "12", + "timeline": "2024-04-05 16:53:07 to 16:53:50", + "threat_level": 15.36, + "event_count": 4604, + "dag_analysis": "Comprehensive analysis:\n- Source IP: 10.0.2.15\n- Timewindow: 12...", + "llm_gpt4o_mini_analysis": { + "summary": "Incident bd47e95b-a211-41b1-9644-40d6a2e77a07 involves...", + "behavior_analysis": "**Source:** 10.0.2.15\n**Activity:** Port scanning...\n**Detected Flows:**\n• 10.0.2.15 → 185.29.135.234:443/TCP (HTTPS)\n..." + }, + "llm_qwen2_5_3b_analysis": { + "summary": "This incident represents a sophisticated attack...", + "behavior_analysis": "**Source:** 10.0.2.15\n**Activity:** Multi-stage attack...\n..." 
+ }, + "llm_qwen2_5_1_5b_analysis": { + "summary": "The incident shows malicious behavior with...", + "behavior_analysis": "**Source:** 10.0.2.15\n**Activity:** Network reconnaissance...\n..." + } + } +] +``` + +**Key Fields:** +- `incident_id`: UUID identifying the unique security incident +- `category`: Classification of the capture origin ("Malware" or "Normal") +- `source_ip`: Primary source IP address for the incident +- `timewindow`: Slips timewindow number for temporal context +- `timeline`: Human-readable time range (start to end) +- `threat_level`: Accumulated threat score from Slips +- `event_count`: Number of security events in this incident +- `dag_analysis`: Structural DAG-based analysis (string) +- `llm_{model}_analysis`: One object per model (for example `llm_gpt4o_mini_analysis`) with `summary` and `behavior_analysis` strings + +**Analysis Field Contents:** + +*DAG Analysis:* Chronological event summary with threat levels, detection types, and temporal patterns. + +*LLM Summary:* Severity-assessed event descriptions prioritizing high-confidence and high-threat-level evidence. Groups similar events by pattern to reduce verbosity. + +*LLM Behavior Analysis:* Structured technical explanation formatted as: +``` +**Source:** +**Activity:** +**Detected Flows:** +• (service) +• [additional flows] + +**Summary:** [1-2 sentence technical summary] +``` + +## 5. Performance Considerations + +### Event Grouping (--group-events) + +**Purpose:** Reduce token count for large incidents to enable processing on low-specification devices. + +**Mechanism:** Normalizes event descriptions by replacing variable components (IP addresses, ports, and other numbers) with generic placeholder tokens, so that events differing only in those values map to the same pattern. Groups events with matching normalized patterns while preserving threat level and timing information. 
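The grouping mechanism described above can be sketched roughly as follows. This is a minimal illustration and not the actual `--group-events` implementation: the placeholder tokens (`[IP]`, `[PORT]`, `[NUM]`), the regular expressions, and the event fields (`description`, `threat_level`, `timestamp`) are all assumptions made for the example.

```python
import re

# Hypothetical patterns and placeholder tokens; the real script may differ.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
PORT_RE = re.compile(r"(?<=:)\d{1,5}\b|(?<=port )\d{1,5}\b")
NUM_RE = re.compile(r"\b\d+\b")


def normalize(description: str) -> str:
    """Replace variable parts of an event description with placeholders."""
    text = IP_RE.sub("[IP]", description)
    text = PORT_RE.sub("[PORT]", text)
    return NUM_RE.sub("[NUM]", text)


def group_events(events: list[dict]) -> list[dict]:
    """Group events by normalized pattern, keeping threat and timing info."""
    groups: dict[str, dict] = {}
    for event in events:
        key = normalize(event["description"])
        group = groups.setdefault(key, {
            "pattern": key,
            "count": 0,
            "max_threat_level": 0.0,
            "first_seen": event["timestamp"],
            "last_seen": event["timestamp"],
            "sample": event["description"],  # one concrete example per group
        })
        group["count"] += 1
        group["max_threat_level"] = max(group["max_threat_level"],
                                        event["threat_level"])
        # ISO-style timestamp strings compare correctly as plain strings.
        group["first_seen"] = min(group["first_seen"], event["timestamp"])
        group["last_seen"] = max(group["last_seen"], event["timestamp"])
    return list(groups.values())
```

Keeping one `sample` description per group matches the trade-off noted for this flag: individual IPs and ports survive only as samples, while the count, threat level, and time range are kept for every pattern.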
+ +**Impact:** +- Small incident (103 events): 3,522 tokens → 976 tokens (72% reduction) +- Large incident (4,604 events): ~50,000 tokens → 1,897 tokens (96% reduction) + +**Trade-off:** Slight reduction in granularity (individual IPs/ports shown as samples) for massive token savings. Recommended for all production use. + +### Behavior Analysis (--behavior-analysis) + +**Purpose:** Generate structured technical explanations of network behavior alongside severity-assessed summaries. + +**Mechanism:** Issues two separate LLM queries per incident: +1. Summary prompt: Assesses severity and filters high-priority evidence +2. Behavior prompt: Produces structured flow analysis and technical summary + +**Impact:** +- Adds ~1,500 tokens per incident (behavior prompt) +- Doubles API calls and processing time per incident +- Provides richer analytical context for security analysts + +**Trade-off:** Enhanced analysis quality and readability at cost of increased processing time and API usage. Recommended for datasets under 100 incidents or when quality is prioritized over speed. 
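The reduction percentages quoted for `--group-events` follow directly from the before and after token counts. The short check below reproduces them; the function name is illustrative, and the large-incident baseline is the approximate ~50,000 figure from the text, so that result is approximate as well.

```python
def reduction_percent(tokens_before: int, tokens_after: int) -> float:
    """Percentage of tokens removed by event grouping."""
    return 100.0 * (tokens_before - tokens_after) / tokens_before


# Token counts taken from the Event Grouping impact figures.
small = reduction_percent(3_522, 976)     # 103-event incident
large = reduction_percent(50_000, 1_897)  # 4,604-event incident (approximate baseline)

print(f"small incident: {small:.0f}% reduction")
print(f"large incident: {large:.0f}% reduction")
```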
+ +### Combined Usage + +Using both flags together (`--group-events --behavior-analysis`) achieves optimal balance: +- Event grouping minimizes prompt size (token reduction) +- Behavior analysis maximizes output quality (richer insights) +- Large incidents become processable while maintaining analytical depth + +**Example token counts with both flags:** +- 4,604 events: 1,897 tokens (summary) + 1,527 tokens (behavior) = 3,424 total tokens +- Processing time: ~10-15 seconds per incident on low-spec devices (Ollama on Raspberry Pi) + +--- + +**Pipeline Maintained By:** Security Analysis Team +**Last Updated:** 2025-10-13 +**Version:** 2.0 (JSON-based workflow with event grouping and behavior analysis) diff --git a/docs/immune/WORKFLOWS_OVERVIEW.md b/docs/immune/WORKFLOWS_OVERVIEW.md new file mode 100644 index 000000000..67532bc69 --- /dev/null +++ b/docs/immune/WORKFLOWS_OVERVIEW.md @@ -0,0 +1,135 @@ +# Dataset Generation Workflows: Quick Comparison + +## Overview + +The alert_summary toolkit provides **two complementary workflows** for generating LLM-enhanced security datasets from Slips alerts: + +1. **Summarization Workflow** - Event summarization and behavior pattern analysis +2. **Cause & Risk Analysis Workflow** - Root cause analysis and risk assessment + +Both workflows share the same initial steps (sampling + DAG generation) but produce different analytical outputs. + +--- + +## Which Workflow Should I Use? + +| If you need... | Use This Workflow | +|----------------|-------------------| +| **Event summaries** in plain language | Summarization | +| **Behavior pattern** analysis (network flows, activities) | Summarization | +| **Root cause** analysis (malicious/legitimate/misconfigured) | Cause & Risk | +| **Risk assessment** with business impact and priority | Cause & Risk | +| **Both types of analysis** | Run both workflows! They're compatible. 
| + +--- + +## Workflow Comparison + +### Summarization Workflow + +**Purpose:** Transform technical security events into human-readable summaries and structured behavior analyses. + +**Output:** `summary` (event description) + `behavior_analysis` (network flows, activity patterns) + +**Use Cases:** +- Security analyst briefings +- Incident timeline reconstruction +- Behavior pattern recognition +- Training data for summarization models + +**Documentation:** [README_dataset_summary_workflow.md](README_dataset_summary_workflow.md) + +**Quick Start:** +```bash +./sample_dataset.sh 100 my_dataset --seed 42 +./generate_dag_analysis.sh datasets/my_dataset.jsonl +./generate_llm_analysis.sh datasets/my_dataset.jsonl --model gpt-4o-mini --group-events --behavior-analysis +python3 correlate_incidents.py datasets/my_dataset.*.json --jsonl datasets/my_dataset.jsonl -o final_dataset.json +``` + +--- + +### Cause & Risk Analysis Workflow + +**Purpose:** Analyze root causes and assess security risks with structured business impact evaluation. 
+ +**Output:** `cause_analysis` (3-category causes) + `risk_assessment` (risk level, business impact, priority) + +**Use Cases:** +- Incident response prioritization +- Root cause investigation +- Risk-based decision making +- Training data for risk assessment models + +**Documentation:** [README_dataset_risk_workflow.md](README_dataset_risk_workflow.md) + +**Quick Start:** +```bash +./sample_dataset.sh 100 my_dataset --seed 42 +./generate_dag_analysis.sh datasets/my_dataset.jsonl +./generate_cause_risk_analysis.sh datasets/my_dataset.jsonl --model gpt-4o-mini --group-events +python3 correlate_risks.py datasets/my_dataset.*.json --jsonl datasets/my_dataset.jsonl -o final_dataset_risk.json +``` + +--- + +## Running Both Workflows + +Both workflows can be run on the same dataset to produce comprehensive multi-perspective analysis: + +```bash +# Step 1-2: Shared (run once) +./sample_dataset.sh 100 my_dataset --seed 42 +./generate_dag_analysis.sh datasets/my_dataset.jsonl + +# Step 3a: Summarization analysis +./generate_llm_analysis.sh datasets/my_dataset.jsonl \ + --model gpt-4o-mini --group-events --behavior-analysis + +# Step 3b: Cause & Risk analysis +./generate_cause_risk_analysis.sh datasets/my_dataset.jsonl \ + --model gpt-4o-mini --group-events + +# Step 4: Correlate ALL analyses (use correlate_risks.py for broader support) +python3 correlate_risks.py datasets/my_dataset.*.json \ + --jsonl datasets/my_dataset.jsonl \ + -o final_dataset_complete.json +``` + +**Result:** Dataset with DAG + Summarization + Cause & Risk analyses per incident. 
+ +--- + +## Key Differences + +| Aspect | Summarization | Cause & Risk | +|--------|---------------|--------------| +| **Generation Script** | `generate_llm_analysis.sh` | `generate_cause_risk_analysis.sh` | +| **Correlation Script** | `correlate_incidents.py` | `correlate_risks.py` | +| **Prompt Focus** | Event clarity and behavior patterns | Root causes and risk evaluation | +| **Output Structure** | `summary` + `behavior_analysis` | `cause_analysis` + `risk_assessment` | +| **Evaluation Method** | Summarization quality metrics | Risk assessment accuracy | +| **Best For** | Understanding WHAT happened | Understanding WHY and RISK level | + +--- + +## Shared Components + +Both workflows use the same: +- ✅ Sampling methodology (`sample_dataset.sh`) +- ✅ DAG structural analysis (`generate_dag_analysis.sh`) +- ✅ Event grouping optimization (`--group-events`) +- ✅ Multi-model support (GPT-4o, Qwen, etc.) +- ✅ JSONL/IDEA format parsing +- ✅ Category labeling system + +--- + +## Next Steps + +Choose your workflow: +- **Summarization**: [README_dataset_summary_workflow.md](README_dataset_summary_workflow.md) +- **Cause & Risk**: [README_dataset_risk_workflow.md](README_dataset_risk_workflow.md) +- **Evaluation**: [LLM_EVALUATION_GUIDE.md](LLM_EVALUATION_GUIDE.md) + +Or run both for comprehensive analysis! 
diff --git a/docs/immune/risk_summary.md b/docs/immune/risk_summary.md new file mode 100644 index 000000000..57164f794 --- /dev/null +++ b/docs/immune/risk_summary.md @@ -0,0 +1,84 @@ +# LLM Evaluation Summary Report + +**Judge:** GPT-4o +**Total Evaluations:** 46 + +## Overall Performance Rankings + +| Rank | Model | Avg Position | Avg Score | Win Rate | Wins | +|------|-------|--------------|-----------|----------|------| +| 1 | GPT-4o | 1.70 | 7.83/10 | 56.5% | 26 | +| 2 | GPT-4o-mini | 2.09 | 7.37/10 | 19.6% | 9 | +| 3 | Qwen2.5 | 3.09 | 5.67/10 | 19.6% | 9 | +| 4 | Qwen2.5 3B | 3.13 | 5.85/10 | 4.3% | 2 | + +## Position Distribution + +| Model | 1st | 2nd | 3rd | 4th | +|-------|-----|-----|-----|-----| +| GPT-4o | 26 | 11 | 6 | 3 | +| GPT-4o-mini | 9 | 25 | 11 | 1 | +| Qwen2.5 | 9 | 1 | 13 | 23 | +| Qwen2.5 3B | 2 | 9 | 16 | 19 | + +## Performance by Incident Category + +### Malware Incidents + +| Model | Count | Avg Position | Avg Score | Wins | +|-------|-------|--------------|-----------|------| +| GPT-4o | 28 | 1.50 | 8.43/10 | 19 | +| GPT-4o-mini | 28 | 1.96 | 7.96/10 | 7 | +| Qwen2.5 3B | 28 | 2.96 | 6.61/10 | 1 | +| Qwen2.5 | 28 | 3.57 | 5.46/10 | 1 | + +### Normal Incidents + +| Model | Count | Avg Position | Avg Score | Wins | +|-------|-------|--------------|-----------|------| +| GPT-4o | 18 | 2.00 | 6.89/10 | 7 | +| GPT-4o-mini | 18 | 2.28 | 6.44/10 | 2 | +| Qwen2.5 | 18 | 2.33 | 6.00/10 | 8 | +| Qwen2.5 3B | 18 | 3.39 | 4.67/10 | 1 | + +## Performance by Incident Complexity + +### Simple Incidents + +| Model | Count | Avg Position | Avg Score | Wins | +|-------|-------|--------------|-----------|------| +| GPT-4o | 22 | 2.14 | 7.18/10 | 9 | +| GPT-4o-mini | 22 | 2.36 | 6.91/10 | 3 | +| Qwen2.5 | 22 | 2.64 | 6.27/10 | 9 | +| Qwen2.5 3B | 22 | 2.86 | 6.18/10 | 1 | + +### Medium Incidents + +| Model | Count | Avg Position | Avg Score | Wins | +|-------|-------|--------------|-----------|------| +| GPT-4o | 15 | 1.33 | 8.20/10 | 10 | +| GPT-4o-mini | 
15 | 1.87 | 7.53/10 | 4 | +| Qwen2.5 3B | 15 | 3.33 | 5.40/10 | 1 | +| Qwen2.5 | 15 | 3.47 | 5.00/10 | 0 | + +### Complex Incidents + +| Model | Count | Avg Position | Avg Score | Wins | +|-------|-------|--------------|-----------|------| +| GPT-4o | 9 | 1.22 | 8.78/10 | 7 | +| GPT-4o-mini | 9 | 1.78 | 8.22/10 | 2 | +| Qwen2.5 3B | 9 | 3.44 | 5.78/10 | 0 | +| Qwen2.5 | 9 | 3.56 | 5.33/10 | 0 | + +## Key Insights + +**Best Overall Model:** GPT-4o +- Average Position: 1.70 +- Average Score: 7.83/10 +- Win Rate: 56.5% + +**Best for Malware Incidents:** GPT-4o +- Avg Position: 1.50 + +**Best for Normal Incidents:** GPT-4o +- Avg Position: 2.00 \ No newline at end of file diff --git a/docs/immune/summary_report.md b/docs/immune/summary_report.md new file mode 100644 index 000000000..8a1a0ec3c --- /dev/null +++ b/docs/immune/summary_report.md @@ -0,0 +1,84 @@ +# LLM Evaluation Summary Report + +**Judge:** GPT-4o +**Total Evaluations:** 47 + +## Overall Performance Rankings + +| Rank | Model | Avg Position | Avg Score | Win Rate | Wins | +|------|-------|--------------|-----------|----------|------| +| 1 | GPT-4o | 1.64 | 7.40/10 | 59.6% | 28 | +| 2 | GPT-4o-mini | 1.70 | 7.64/10 | 31.9% | 15 | +| 3 | Qwen2.5 3b | 2.94 | 5.51/10 | 6.4% | 3 | +| 4 | Qwen2.5 | 3.72 | 3.83/10 | 2.1% | 1 | + +## Position Distribution + +| Model | 1st | 2nd | 3rd | 4th | +|-------|-----|-----|-----|-----| +| GPT-4o | 28 | 12 | 3 | 4 | +| GPT-4o-mini | 15 | 31 | 1 | 0 | +| Qwen2.5 3b | 3 | 4 | 33 | 7 | +| Qwen2.5 | 1 | 0 | 10 | 36 | + +## Performance by Incident Category + +### Malware Incidents + +| Model | Count | Avg Position | Avg Score | Wins | +|-------|-------|--------------|-----------|------| +| GPT-4o | 29 | 1.55 | 7.76/10 | 18 | +| GPT-4o-mini | 29 | 1.76 | 7.76/10 | 8 | +| Qwen2.5 3b | 29 | 2.90 | 5.62/10 | 3 | +| Qwen2.5 | 29 | 3.79 | 3.69/10 | 0 | + +### Normal Incidents + +| Model | Count | Avg Position | Avg Score | Wins | +|-------|-------|--------------|-----------|------| +| 
GPT-4o-mini | 18 | 1.61 | 7.44/10 | 7 | +| GPT-4o | 18 | 1.78 | 6.83/10 | 10 | +| Qwen2.5 3b | 18 | 3.00 | 5.33/10 | 0 | +| Qwen2.5 | 18 | 3.61 | 4.06/10 | 1 | + +## Performance by Incident Complexity + +### Simple Incidents + +| Model | Count | Avg Position | Avg Score | Wins | +|-------|-------|--------------|-----------|------| +| GPT-4o-mini | 22 | 1.59 | 7.73/10 | 9 | +| GPT-4o | 22 | 1.86 | 6.91/10 | 11 | +| Qwen2.5 3b | 22 | 2.91 | 5.73/10 | 1 | +| Qwen2.5 | 22 | 3.64 | 4.18/10 | 1 | + +### Medium Incidents + +| Model | Count | Avg Position | Avg Score | Wins | +|-------|-------|--------------|-----------|------| +| GPT-4o | 15 | 1.27 | 8.13/10 | 11 | +| GPT-4o-mini | 15 | 1.73 | 7.67/10 | 4 | +| Qwen2.5 3b | 15 | 3.13 | 5.20/10 | 0 | +| Qwen2.5 | 15 | 3.87 | 3.73/10 | 0 | + +### Complex Incidents + +| Model | Count | Avg Position | Avg Score | Wins | +|-------|-------|--------------|-----------|------| +| GPT-4o | 10 | 1.70 | 7.40/10 | 6 | +| GPT-4o-mini | 10 | 1.90 | 7.40/10 | 2 | +| Qwen2.5 3b | 10 | 2.70 | 5.50/10 | 2 | +| Qwen2.5 | 10 | 3.70 | 3.20/10 | 0 | + +## Key Insights + +**Best Overall Model:** GPT-4o +- Average Position: 1.64 +- Average Score: 7.40/10 +- Win Rate: 59.6% + +**Best for Malware Incidents:** GPT-4o +- Avg Position: 1.55 + +**Best for Normal Incidents:** GPT-4o-mini +- Avg Position: 1.61 \ No newline at end of file
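The derived columns in the rankings table appear to be consistent with the position distribution: the average position is the count-weighted mean of the ranks, and the win rate is the share of first places. The sketch below recomputes both from the position counts in this report. It assumes a "win" means a first-place finish; the evaluation script itself is not shown here, so treat this as a consistency check and not as its actual logic.

```python
# Position counts (1st, 2nd, 3rd, 4th) copied from the Position Distribution table.
POSITIONS = {
    "GPT-4o": [28, 12, 3, 4],
    "GPT-4o-mini": [15, 31, 1, 0],
    "Qwen2.5 3b": [3, 4, 33, 7],
    "Qwen2.5": [1, 0, 10, 36],
}


def avg_position(counts: list[int]) -> float:
    """Count-weighted mean rank, where index 0 is 1st place."""
    total = sum(counts)
    return sum(rank * n for rank, n in enumerate(counts, start=1)) / total


def win_rate(counts: list[int]) -> float:
    """Share of evaluations finished in 1st place, as a percentage."""
    return 100.0 * counts[0] / sum(counts)


for model, counts in POSITIONS.items():
    print(f"{model}: avg position {avg_position(counts):.2f}, "
          f"win rate {win_rate(counts):.1f}%")
```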