Azerbaijan-Cybersecurity-Center · pierringshot · Sep 3, 2025 · Sep 4, 2025 · Sep 4, 2025 · Sep 4, 2025
diff --git a/.gitignore b/.gitignore
@@ -1,41 +1,45 @@
-# Environments
-.venv/
-env/
-.env
-.env.*
-
-# Python
-__pycache__/
-*.pyc
-
-# Data
-data/raw/*
-!data/raw/.gitkeep
-data/processed/*
-!data/processed/.gitkeep
-
 # Cache
+.coverage
+coverage.xml
+# Data
+# Ignore cache contents only; ignoring the directory itself would void the .gitkeep exception
 data/cache/*
 !data/cache/.gitkeep
-
-# Tool caches / reports
-.pytest_cache/
-.mypy_cache/
-.ruff_cache/
+# Data caches and raw logs
+data/processed/*
+!data/processed/.gitkeep
+# Ignore raw log contents only, so the .gitkeep exception below still applies
+data/raw/*
+!data/raw/.gitkeep
+.debug/*
+docs/explanation_of_project.mp4
+.DS_Store
+*.env
+*.env.*
+.env
+.env.*
+env/
+# Environments
+.env.local
 htmlcov/
-.coverage
-coverage.xml
-
-# Notebooks
 *.ipynb_checkpoints/
-
+# Large binaries
+# Local env
+*.mov
+*.mp4
+.mypy_cache/
+# Notebooks
+notebooks/**/*.ipynb_checkpoints/
+# Notebooks outputs
 # OS
-.DS_Store
-Thumbs.db
-
-.coverage
+*.pyc
+__pycache__/
 .pytest_cache/
-.mypy_cache/
+# Python
 .ruff_cache/
-htmlcov/
-coverage.xml
+*.tar
+*.tar.gz
+Thumbs.db
+# Tool caches / reports
+.venv/
+*.zip
diff --git a/.vscode/settings.json b/.vscode/settings.json
@@ -0,0 +1,7 @@
+{
+    "workbench.colorCustomizations": {
+        "terminal.background": "#00000000",
+        "minimap.background": "#00000000",
+        "scrollbar.shadow": "#00000000"
+    }
+}
diff --git a/AGENTS.md b/AGENTS.md
@@ -35,3 +35,51 @@
 - Anonymize or truncate sensitive log data before committing.
 - Large files: store raw datasets outside git or via LFS; keep only small, representative fixtures.
 
+---
+
+## Scalable, Budget‑Aware Processing (Project‑Specific)
+
+Principles
+
+- Offline‑first: parse, score, and report without network by default; add CTI/LLM only on the smallest, most informative subset.
+- Aggregate, then sample: enrich clusters (IP/time/window/signature), not individual lines.
+- Cache and dedupe: never ask the network twice for the same thing.
+- Budget‑aware: throttle LLM and CTI to a daily budget and degrade gracefully.
+
+LLM Strategy
+
+- Grouping: enrich per group, not per line. Default `--llm-group-by ip`; for more precision use `signature` (`ip+path+status+ua`). Optional `--group-window` adds a time bucket.
+- Sampling: cap calls with `--llm-sample N` (default 200). Non‑sampled groups are marked `severity=unknown` with a clear rationale.
+- Gate before LLM: filter on 4xx count (`--llm-gate-4xx N`) and suspicious user‑agents (`--llm-gate-ua`) so only interesting groups reach the LLM.
+- Map–reduce summaries: optionally ask the LLM only for the top‑K groups (via sampling/gates) instead of all events.
+- Budget throttle: set `GROQ_TOKENS_BUDGET`; enrichment stops before the cap and continues offline.
+
+CTI Strategy
+
+- Suspicious‑first: use `--cti-scope suspicious` (the default) and cap lookups with `--cti-max` at roughly 100 to 200.
+- Strong cache: `data/cache/cti_cache.json` stores results; a TTL may be added in the future.
+- Defer VT/API: query VirusTotal only for the final shortlist; continue gracefully if rate‑limited.
+- Batch/Resilience: lookups are capped and cached incrementally; re‑runs reuse cache to resume.
+- Offline lists: set `OFFLINE_IP_BLOCKLIST` to escalate known‑bad IPs without CTI calls.
+
+Pipeline Shape
+
+- Stage 1 (Parse): JSONL output with stable fields; chunk by time window for massive files.
+- Stage 2 (Score): per‑IP stats, 4xx ratios, UA flags; produce candidate groups.
+- Stage 3 (CTI): shortlist only (top K by 4xx/requests/UA), cached.
+- Stage 4 (LLM): grouped + sampled enrichment, budget‑throttled.
+- Stage 5 (Reports): deterministic, reproducible, works even with no LLM/CTI.
+
+Recommended Commands
+
+- Huge logs, minimal requests:
+  - `python -m src.cli data/raw/big.log --out data/processed --llm-group-by ip --llm-sample 200 --cti-scope suspicious --cti-max 200 --color never`
+- Strictly offline (fastest):
+  - `python -m src.cli data/raw/big.log --out data/processed --no-llm --no-cti --no-reports`
+- Budgeted runs:
+  - `export GROQ_TOKENS_BUDGET=150000` then run the first command.
+
+Next Enhancements
+
+- Time‑window grouping (`--group-window`) is implemented; consider adaptive windows per IP for very bursty traffic.
+- Add token budget accounting by model/tokenizer if needed; current approach is length‑based and conservative.
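The grouping, gating, and sampling steps described under "LLM Strategy" can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual implementation: `Event`, `group_key`, and `select_groups` are hypothetical names, and ranking groups by 4xx count before applying the cap is an assumed selection policy (the real tool may sample differently).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Minimal parsed log event (hypothetical shape, not the project's schema)."""
    ip: str
    path: str
    status: int
    ua: str
    ts: int  # Unix timestamp in seconds


def group_key(ev, group_by="ip", window=0):
    """Build a grouping key: per IP, or per ip+path+status+ua signature."""
    key = (ev.ip, ev.path, ev.status, ev.ua) if group_by == "signature" else (ev.ip,)
    if window > 0:
        key += (ev.ts // window,)  # optional time bucket
    return key


def select_groups(events, group_by="ip", window=0, gate_4xx=0, sample=200):
    """Group events, gate on 4xx count, then keep at most `sample` groups."""
    groups = defaultdict(list)
    for ev in events:
        groups[group_key(ev, group_by, window)].append(ev)

    def count_4xx(evs):
        return sum(1 for e in evs if 400 <= e.status < 500)

    gated = [k for k, evs in groups.items() if count_4xx(evs) >= gate_4xx]
    # Assumed policy: rank by 4xx count, then by key, so selection is deterministic.
    gated.sort(key=lambda k: (-count_4xx(groups[k]), k))
    chosen = set(gated[:sample])
    skipped = set(groups) - chosen  # these would be marked severity=unknown
    return chosen, skipped
```

Only the groups in `chosen` would be sent to the LLM; everything in `skipped` stays offline, which is what keeps the number of requests bounded regardless of log size.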
diff --git a/README.md b/README.md
@@ -0,0 +1,114 @@
+# LogCTIAI — Offline‑First Log Analysis + CTI (LLM‑Optional)
+
+In brief, this project processes high‑volume server/web logs, extracts threat signals using optional grouped LLM commentary and CTI enrichment, and produces compact, reproducible reports. It is optimized for minimal network usage and budget control.
+
+This project ingests large web/server logs, enriches events with optional LLM analysis, performs CTI lookups against external sources, and generates concise human‑readable reports. It is designed to run reliably on very large datasets with minimal network usage:
+
+- Auto‑detects `.txt` vs `.log` inputs; parses recognized log lines in `.txt` files.
+- Minimizes LLM calls via grouping, sampling, and gates; enforces an optional token budget.
+- Minimizes CTI calls via suspicious‑first scoping, caps, batching, and strong caching.
+- Works fully offline and degrades gracefully when network or budgets are unavailable.
+
+See `docs/USAGE.md` for practical commands and tips. See `AGENTS.md` for project conventions and the scalable processing strategy.
+
+![Mindmap](docs/ProjectMindmapv0.5.png)
+
+## Quickstart
+
+- Create env: `python -m venv .venv && source .venv/bin/activate`
+- Install deps: `pip install -r requirements.txt`
+- Run on a log (auto‑detects `.txt` files that look like logs):
+  - `python -m src.cli data/raw/access_log.txt --out data/processed --summary --preview 3`
+  - Writes `data/processed/access_log.jsonl` plus `.txt` and `.md` reports under `data/processed/reports/`.
+
+If LLM keys are not configured, enrichment runs offline with `severity=unknown` placeholders and continues to produce reports.
+
+## CLI Overview
+
+`python -m src.cli <input_path> --out <out_dir> [options]`
+
+Common options:
+
+- `--no-llm`: disable LLM enrichment (the default when no API keys are set).
+- `--no-cti`: skip CTI lookups; run fully offline.
+- `--no-reports`: skip generating text/markdown reports.
+- `--limit N`: process only the first N lines.
+- `--format jsonl|csv`: output for enriched events (default: `jsonl`).
+- `--color auto|always|never`: terminal color policy.
+- `--ai-malicious-report`: after CTI summarization, ask the LLM for a detailed malicious-activity report (saved under `reports/`).
+
+LLM request control:
+
+- `--llm-group-by none|ip|signature`: group before LLM calls (default: `ip`); `signature` groups by `ip+path+status+ua`.
+- `--group-window SECONDS`: add a time bucket to grouping (e.g., `60`).
+- `--llm-sample N`: send only N groups to LLM; the rest are annotated as sampled/gated out (default: `200`).
+- `--llm-gate-4xx N`: only send groups with ≥N 4xx responses.
+- `--llm-gate-ua`: only send groups with suspicious user‑agents.
+
+CTI request control:
+
+- `--cti-scope suspicious|all`: lookup only suspicious IPs (default) or all IPs.
+- `--cti-max N`: cap number of IPs to query for CTI (0=unlimited; default: `100`).
+- `--cti-batch-size N`, `--cti-batch-pause S`: batch CTI queries and pause between batches; cache flushes periodically.
+
+Examples (large logs):
+
+- Minimal network usage:
+  - `python -m src.cli data/raw/big.log --out data/processed --llm-group-by ip --group-window 60 --llm-gate-4xx 5 --llm-sample 200 --cti-scope suspicious --cti-max 200`
+- Strictly offline (fastest):
+  - `python -m src.cli data/raw/big.log --out data/processed --no-llm --no-cti --no-reports`
+
+## Environment
+
+Create a `.env` (see variables below). Keys are optional; the tool runs offline without them.
+
+- `GROQ_API_KEYS`: comma‑separated LLM keys for rotation.
+- `GROQ_MODEL`: Groq model name (default `llama3-8b-8192`).
+- `GROQ_TOKENS_BUDGET`: approximate token budget per run/day; enrichment stops before the cap and continues offline.
+- `RISK_4XX_THRESHOLD`: per‑IP 4xx threshold to consider suspicious in reports (default `5`).
+- `SUSPICIOUS_UA_REGEX`: comma‑separated regex patterns to flag suspicious UAs.
+- `VT_API_KEY`: VirusTotal API key (optional; CTI works in a degraded mode without it).
+- `OFFLINE_IP_BLOCKLIST`: path to a newline‑separated list of known‑bad IPs to escalate risk without CTI calls.
+
+## Outputs
+
+- Enriched events: `data/processed/<name>.jsonl` (or `.csv` with `--format csv`).
+- Reports: `data/processed/reports/report.txt` and `report.md` summarizing activity and suspicious IPs; may include a brief AI note if LLM is enabled.
+- Malicious AI report (optional): `data/processed/reports/malicious_ai_report.txt|md` if `--ai-malicious-report` is used and malicious CTI signals are present.
+- CTI cache: `data/cache/cti_cache.json` (auto‑created and reused to minimize network calls).
+
+## Testing
+
+- Run tests: `pytest -q`
+- Optional coverage: `pytest --cov=src -q` (requires the `pytest-cov` plugin).
+
+Notes:
+- If you used the local venv above, run tests via `.venv/bin/pytest -q`.
+- A PyPDF2 deprecation warning may appear; it’s harmless and can be ignored.
+
+## UI Dashboard
+
+An optional Streamlit dashboard is included for exploration and client-friendly viewing.
+
+- UI dependencies are already included in `requirements.txt`; no separate install is needed.
+- Run the UI: `scripts/run_ui.sh` (or `streamlit run ui/app.py`).
+- Select an enriched `.jsonl` file from `data/processed/` or upload one.
+- View status distribution, sample enriched events, and CTI attributes.
+
+## Troubleshooting
+
+- `.txt` auto‑detection: the CLI reads a small sample and parses with `parse_line`. If none match, the file is copied as plain text rather than parsed as logs.
+- LLM budget exceeded: you’ll see `LLM budget exhausted` in logs; records are still produced with `severity=unknown` and a rationale explaining sampling/gating.
+- CTI failures: the pipeline continues with cached/partial data; use `--no-cti` for fully offline runs. Consider `--cti-max` and batching to avoid rate limits.
+- CI or terminals without color support: pass `--color never` for consistent, plain output.
+
+## Docs
+
+- Usage guide with more examples: `docs/USAGE.md`
+- Principles, strategy, and repo conventions: `AGENTS.md`
+- Mindmap/diagram: `docs/ProjectMindmapv0.5.png`
+- Project write‑up: `docs/Final Project - Log Analysis + CTI.pdf`
+
+---
+
+Made with a focus on reliability, scalability, and cost‑awareness.
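The CTI caching and capping behaviour described in the README (a reusable `cti_cache.json`, a `--cti-max` cap where 0 means unlimited, and graceful continuation on failures) might look roughly like the sketch below. It is an assumption-laden illustration rather than the project's code: `cached_lookups` and the caller-supplied `fetch` function are hypothetical, and the cap here counts network attempts, which may differ from how the real tool counts.

```python
import json
from pathlib import Path


def cached_lookups(ips, fetch, cache_path, cti_max=100):
    """Look up IPs via `fetch`, reusing a JSON cache and capping new queries.

    `fetch` is a caller-supplied function (hypothetical) mapping ip -> dict.
    `cti_max=0` means unlimited, mirroring the documented flag semantics.
    """
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    attempts = 0
    for ip in dict.fromkeys(ips):  # dedupe while preserving order
        if ip in cache:
            continue  # never ask the network twice for the same IP
        if cti_max and attempts >= cti_max:
            break
        attempts += 1
        try:
            cache[ip] = fetch(ip)
        except Exception:
            continue  # degrade gracefully: skip the failure and keep going
        # Persist after each new result so an interrupted run can resume.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(cache))
    return {ip: cache[ip] for ip in ips if ip in cache}, attempts
```

A second run over the same input reads the cache first, so only IPs that were skipped by the cap (or failed earlier) trigger new requests.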