Azerbaijan-Cybersecurity-Center · pierringshot · Sep 3, 2025 · Sep 4, 2025 · Sep 4, 2025 · Sep 4, 2025
diff --git a/.env.example b/.env.example
@@ -0,0 +1,15 @@
+# Copy to .env and fill in secrets
+VT_API_KEY=
+# Optional AbuseIPDB (https://www.abuseipdb.com/) API key
+ABUSEIPDB_API_KEY=
+# Or multiple keys (comma-separated) to distribute load respectfully
+ABUSEIPDB_API_KEYS=
+# Optional outbound proxies (comma-separated). Examples:
+# PROXY_LIST=http://user:pass@1.2.3.4:8080,https://5.6.7.8:8443,socks5://9.9.9.9:1080
+PROXY_LIST=
+# GROQ (LLM) optional keys (comma-separated)
+GROQ_API_KEYS=
+# Optional path to a newline-separated list of known-bad IPs (offline escalation)
+OFFLINE_IP_BLOCKLIST=
+# Optional token budget (not required for this tool)
+GROQ_TOKENS_BUDGET=
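
Note on the key variables in `.env.example`: the file offers both a single-key and a comma-separated form for AbuseIPDB. The following is a minimal sketch of how a loader might merge the two. `load_api_keys` is a hypothetical helper, not taken from this repository, and the precedence rule (the plural variable wins when non-empty) is an assumption.

```python
import os
from typing import Mapping, Optional


def load_api_keys(single_var: str, multi_var: str,
                  env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Hypothetical helper: read keys from e.g. ABUSEIPDB_API_KEYS, else ABUSEIPDB_API_KEY."""
    env = os.environ if env is None else env
    # Assumption: the comma-separated variable takes precedence when it is non-empty.
    raw = env.get(multi_var, "") or env.get(single_var, "")
    keys: list[str] = []
    for part in raw.split(","):
        key = part.strip()
        if key and key not in keys:  # drop blanks and duplicates, keep first-seen order
            keys.append(key)
    return keys
```

Passing `env` explicitly keeps the helper testable without touching the real process environment.
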
diff --git a/.gitignore b/.gitignore
@@ -1,41 +1,15 @@
-# Environments
-.venv/
-env/
-.env
-.env.*
-
-# Python
 __pycache__/
-*.pyc
-
-# Data
-data/raw/*
-!data/raw/.gitkeep
-data/processed/*
-!data/processed/.gitkeep
-
-# Cache
-data/cache/*
-!data/cache/.gitkeep
-
-# Tool caches / reports
 .pytest_cache/
 .mypy_cache/
-.ruff_cache/
-htmlcov/
-.coverage
-coverage.xml
-
-# Notebooks
-*.ipynb_checkpoints/
-
-# OS
-.DS_Store
-Thumbs.db
-
+.venv/
+.env
+data/cache/*.json
+data/processed/*
+data/raw/*
+docs/*.mp4
 .coverage
-.pytest_cache/
-.mypy_cache/
-.ruff_cache/
-htmlcov/
-coverage.xml
+.debug/
+*Zone.Identifier
+*.pyc
+*.pyo
+*.DS_Store
diff --git a/.vscode/settings.json b/.vscode/settings.json
@@ -0,0 +1,7 @@
+{
+    "workbench.colorCustomizations": {
+        "terminal.background": "#00000000",
+        "minimap.background": "#00000000",
+        "scrollbar.shadow": "#00000000"
+    }
+}
diff --git a/AGENTS.md b/AGENTS.md
@@ -1,37 +1,48 @@
 # Repository Guidelines
 
+This AGENTS.md guides both human and agent contributors. Its scope is the entire repository.
+
 ## Project Structure & Module Organization
-- Root: currently contains `access_log.txt` and project PDFs. Move long‑form write‑ups to `docs/` and raw logs to `data/raw/` as the repo evolves.
-- `src/`: Python modules for log parsing, enrichment, and CTI lookups (e.g., `src/parsers/`, `src/enrichers/`).
-- `notebooks/`: exploratory analysis; keep outputs cleared before commit.
-- `tests/`: unit tests mirroring `src/` layout (e.g., `tests/parsers/test_nginx.py`).
-- `docs/`: reports, diagrams, and usage notes.
+- Root: currently `access_log.txt` and project PDFs; move long‑form write‑ups to `docs/` and raw logs to `data/raw/` as the repo evolves.
+- `src/`: Python modules for parsing, scoring, enrichment, and CTI (e.g., `src/parsers/`, `src/enrichers/`).
+- `notebooks/`: exploratory analysis; clear outputs before commit.
+- `tests/`: unit tests mirroring `src/` (e.g., `tests/parsers/test_nginx.py`); fixtures in `tests/fixtures/`.
+- `docs/`: reports, diagrams, usage notes. `data/`: `raw/` inputs; `cache/` (e.g., `data/cache/cti_cache.json`).
 
 ## Build, Test, and Development Commands
-- Create env: `python -m venv .venv && source .venv/bin/activate`.
-- Install deps: `pip install -r requirements.txt` (add one if code is introduced).
-- Run tests: `pytest -q`.
-- Lint/format: `ruff check . && ruff format .` (or `black . && isort .` if preferred).
-- Type check: `mypy src`.
+- Create env: `python -m venv .venv && source .venv/bin/activate`
+- Install deps: `pip install -r requirements.txt`
+- Lint/format: `ruff check . && ruff format .` (or `black . && isort .`)
+- Type check: `mypy src`
+- Run tests: `pytest -q` (coverage: `pytest --cov=src`)
+- Run UI: `streamlit run src/ui/streamlit_app.py`
+- One‑click: `./run.sh setup`, `./run.sh scan <file>`, or `./run.sh ui`
 
 ## Coding Style & Naming Conventions
-- Python 3.10+; 4‑space indentation; UTF‑8.
+- Python 3.10+, UTF‑8, 4‑space indentation.
 - Names: modules/functions `lower_snake_case`, classes `PascalCase`, constants `UPPER_SNAKE_CASE`.
 - Files: logs `data/raw/YYYYMMDD_source.log`; notebooks `notebooks/<topic>_<yyyymmdd>.ipynb`.
 - Keep functions <50 lines where practical; document public functions with docstrings.
 
 ## Testing Guidelines
-- Framework: `pytest`; minimum 80% coverage measured via `pytest --cov=src`.
-- Layout: mirror `src/` with `test_*.py`; use fixtures for sample logs under `tests/fixtures/`.
-- Determinism: do not read network in tests; mock CTI APIs.
+- Framework: `pytest`; mirror `src/` layout; fixtures under `tests/fixtures/`.
+- Determinism: no network in tests; mock CTI/LLM calls; target ≥80% coverage.
 
 ## Commit & Pull Request Guidelines
 - Commits: Conventional Commits (e.g., `feat(parser): add nginx status extraction`).
-- PRs: concise summary, linked issue, before/after notes, and if UI/data changes, include a small sample input and expected output.
-- Size: prefer ≤300 lines diff; split larger changes.
+- PRs: concise summary, linked issue, before/after notes; if UI/data changes, include small sample input and expected output. Prefer ≤300 lines diff.
 
 ## Security & Data Handling
-- Do not commit secrets or tokens; use `.env` and provide `.env.example`.
-- Anonymize or truncate sensitive log data before committing.
-- Large files: store raw datasets outside git or via LFS; keep only small, representative fixtures.
+- Never commit secrets; use `.env` and provide `.env.example`.
+- Anonymize/truncate sensitive logs; store large datasets outside git or via LFS.
+- API keys: `VT_API_KEY`, `ABUSEIPDB_API_KEY` (optional). Respect rate limits.
 
+## Scalable, Budget‑Aware Processing (Project‑Specific)
+- Offline‑first; aggregate then sample; cache and dedupe. Defaults: `--llm-group-by ip`, `--llm-sample 200`, `--cti-scope suspicious` with `--cti-max 200` (use `--cti-max -1` to scan all IPs).
+- Budget throttle: `export GROQ_TOKENS_BUDGET=150000`.
+- Examples:
+  - Huge logs: `python -m src.cli data/raw/big.log --out data/processed --llm-group-by ip --llm-sample 200 --cti-scope suspicious --cti-max 200 --color never`
+  - Strictly offline: `python -m src.cli data/raw/big.log --out data/processed --no-llm --no-cti --no-reports`
+  - IP scan to PDF (CLI): `python -m src.cli scan-ips data/sample_ips.txt --out data/processed --cti-max -1`
+  - IP scan (UI): `streamlit run src/ui/streamlit_app.py`
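
Note on the "aggregate then sample" strategy in AGENTS.md: a minimal sketch of grouping events by IP and keeping at most `sample` groups, mirroring the `--llm-group-by ip` and `--llm-sample 200` defaults. The function name, the `ip` and `status` event fields, and the choice to rank groups by 4xx count are all assumptions for illustration. The diff only states that N groups are sent to the LLM, not how they are chosen.

```python
from collections import defaultdict
from typing import Iterable


def group_and_sample(events: Iterable[dict], sample: int = 200) -> list[tuple[str, list[dict]]]:
    """Group events by IP, then keep the `sample` groups with the most 4xx responses."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for event in events:
        groups[event["ip"]].append(event)

    def count_4xx(item: tuple[str, list[dict]]) -> int:
        return sum(1 for e in item[1] if 400 <= e["status"] < 500)

    # Assumption: rank by 4xx count; sorted() is stable, so ties keep first-seen order.
    ranked = sorted(groups.items(), key=count_4xx, reverse=True)
    return ranked[:sample]
```

Aggregating first means the number of LLM requests is bounded by `sample` regardless of how many raw lines the log contains.
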
diff --git a/Makefile b/Makefile
@@ -0,0 +1,52 @@
+VENV?=.venv
+PY?=$(VENV)/bin/python
+PIP?=$(VENV)/bin/pip
+
+.PHONY: setup install lint fmt test scan scan-all ui help doctor scan-file
+
+setup:
+	python -m venv $(VENV)
+	. $(VENV)/bin/activate; $(PIP) install -r requirements.txt
+
+install:
+	. $(VENV)/bin/activate; $(PIP) install -r requirements.txt
+
+lint:
+	. $(VENV)/bin/activate; $(PY) -m ruff check . || true
+
+fmt:
+	. $(VENV)/bin/activate; $(PY) -m ruff format . || true
+
+test:
+	. $(VENV)/bin/activate; pytest -q || true
+
+scan:
+	. $(VENV)/bin/activate; $(PY) -m src.cli scan-ips data/sample_ips.txt --out data/processed --no-cti
+
+scan-all:
+	. $(VENV)/bin/activate; $(PY) -m src.cli scan-ips data/sample_ips.txt --out data/processed --cti-max -1
+
+ui:
+	. $(VENV)/bin/activate; streamlit run src/ui/streamlit_app.py
+
+help:
+	@echo "Targets:" && \
+	echo "  make setup       # Create venv and install deps" && \
+	echo "  make scan        # Offline demo scan -> PDF" && \
+	echo "  make scan-all    # Demo scan with CTI (uses VT_API_KEY)" && \
+	echo "  make scan-file FILE=path CTI_MAX=-1 RATE=0.8 BURST=1 SAVE=50 # Full control" && \
+	echo "  make ui          # Launch Streamlit UI" && \
+	echo "  make lint|fmt|test" && \
+	echo "  make doctor      # Quick environment check"
+
+doctor:
+	@python3 -c 'import sys; print("Python:", sys.version.split()[0]); assert sys.version_info[:2] >= (3,10)' || (echo "Python 3.10+ required" && exit 1)
+	@[ -f .env ] || echo "Note: .env not found (optional). Copy .env.example -> .env"
+	@[ -n "$$VT_API_KEY" ] || echo "Note: VT_API_KEY not set; CTI calls will be disabled."
+
+# Example: make scan-file FILE=data/sample_ips.txt CTI_MAX=-1 RATE=0.8 BURST=1 SAVE=25
+scan-file:
+	@[ -n "$(FILE)" ] || (echo "Usage: make scan-file FILE=path [CTI_MAX=-1] [RATE=0.8] [BURST=1] [SAVE=50]" && exit 1)
+	. $(VENV)/bin/activate; \
+		$(PY) -m src.cli scan-ips $(FILE) --out data/processed --cti-max $${CTI_MAX:--1} \
+		--cti-rate $${RATE:-0.8} --cti-burst $${BURST:-1} --save-every $${SAVE:-50}
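
Note on `RATE` and `BURST` in `scan-file`: the Makefile forwards these as `--cti-rate` and `--cti-burst`, but the diff does not show how the CLI interprets them. One common reading is a token bucket, where the rate is requests per second and the burst is the bucket capacity. The sketch below illustrates that reading only; `TokenBucket` is a hypothetical class, not the project's implementation.

```python
import time


class TokenBucket:
    """Illustrative rate limiter: `rate` tokens per second, at most `burst` stored."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so the first request goes through
        self.clock = clock          # injectable clock makes the class testable
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller should wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Under this reading, `RATE=0.8 BURST=1` would allow one request immediately and then roughly one every 1.25 seconds.
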
diff --git a/README.md b/README.md
@@ -0,0 +1,139 @@
+# LogCTIAI — Offline‑First Log Analysis + CTI (LLM‑Optional)
+
+This project processes large volumes of server and web logs, extracts threat signals through grouped LLM commentary (optional) and CTI enrichment, and produces compact, reproducible reports. It is optimized for minimal network usage and budget control.
+
+This project ingests large web/server logs, enriches events with optional LLM analysis, performs CTI lookups against external sources, and generates concise human‑readable reports. It is designed to run reliably on very large datasets with minimal network usage:
+
+- Auto‑detects `.txt` vs `.log` inputs; parses recognized log lines in `.txt` files.
+- Minimizes LLM calls via grouping, sampling, and gates; enforces an optional token budget.
+- Minimizes CTI calls via suspicious‑first scoping, caps, batching, and strong caching.
+- Works fully offline and degrades gracefully when network or budgets are unavailable.
+
+See `docs/USAGE.md` for practical commands and tips. See `AGENTS.md` for project conventions and the scalable processing strategy.
+
+![Mindmap](docs/ProjectMindmapv0.5.png)
+
+## Quickstart
+
+- Create env: `python -m venv .venv && source .venv/bin/activate`
+- Install deps: `pip install -r requirements.txt`
+- Run on a log (auto‑detects `.txt` files that look like logs):
+  - `python -m src.cli data/raw/access_log.txt --out data/processed --summary --preview 3`
+  - Writes `data/processed/access_log.jsonl` and a `data/processed/reports/` directory containing `.txt` and `.md` reports.
+
+### IP Threat Scanner (CLI & UI)
+
+This repo also includes a fast, offline‑first IP CTI scanner with caching, PDF/JSON/CSV outputs, and a Streamlit UI.
+
+- CLI (offline demo): `python -m src.cli scan-ips data/sample_ips.txt --out data/processed --no-cti`
+- CLI (with CTI): `VT_API_KEYS=vt_key1,vt_key2 ABUSEIPDB_API_KEYS=ab1,ab2 python -m src.cli scan-ips data/sample_ips.txt --out data/processed --cti-max 200 --cti-rate 1 --cti-burst 1 --workers 2`
+- UI: `streamlit run src/ui/streamlit_app.py` (clean UI with optional AI executive summary embedded in the exported PDF)
+
+Environment (see `.env.example`):
+- VirusTotal: `VT_API_KEY` or `VT_API_KEYS` (comma‑separated)
+- AbuseIPDB: `ABUSEIPDB_API_KEY` or `ABUSEIPDB_API_KEYS` (comma‑separated)
+- Optional proxies (resiliency, not for evading quotas): `PROXY_LIST="http://1.2.3.4:8080,socks5://5.6.7.8:1080"`
+- Offline blocklist: `OFFLINE_IP_BLOCKLIST=/path/to/bad_ips.txt`
+
+Notes:
+- The scanner respects provider rate limits and `Retry-After`; it rotates your keys and proxies on 429/403 and caches results.
+- VirusTotal has no API‑less access; provide an API key to query VT.
+
+If LLM keys are not configured, enrichment runs offline with `severity=unknown` placeholders and continues to produce reports.
+
+## CLI Overview
+
+`python -m src.cli <input_path> --out <out_dir> [options]`
+
+Common options:
+
+- `--verbose quiet|normal|max`: control console verbosity (default: `max`).
+- `--no-llm`: disable LLM enrichment (default if no keys set).
+- `--no-cti`: skip CTI lookups; run fully offline.
+- `--no-reports`: skip generating text/markdown reports.
+- `--limit N`: process only the first N lines.
+- `--format jsonl|csv`: output for enriched events (default: `jsonl`).
+- `--color auto|always|never`: terminal color policy.
+- `--ai-malicious-report`: after CTI summarization, ask the LLM for a detailed malicious-activity report (saved under `reports/`).
+
+LLM request control:
+
+- `--llm-group-by none|ip|signature`: group before LLM calls (default: `ip`); `signature` groups by `ip+path+status+ua`.
+- `--group-window SECONDS`: add a time bucket to grouping (e.g., `60`).
+- `--llm-sample N`: send only N groups to LLM; the rest are annotated as sampled/gated out (default: `200`).
+- `--llm-gate-4xx N`: only send groups with ≥N 4xx responses.
+- `--llm-gate-ua`: only send groups with suspicious user‑agents.
+
+CTI request control:
+
+- `--cti-scope suspicious|all`: lookup only suspicious IPs (default) or all IPs.
+- `--cti-max N`: cap number of IPs to query for CTI (0=unlimited; default: `100`).
+- `--cti-batch-size N`, `--cti-batch-pause S`: batch CTI queries and pause between batches; cache flushes periodically.
+
+Examples (large logs):
+
+- Minimal network usage:
+  - `python -m src.cli data/raw/big.log --out data/processed --llm-group-by ip --group-window 60 --llm-gate-4xx 5 --llm-sample 200 --cti-scope suspicious --cti-max 200`
+- Strictly offline (fastest):
+  - `python -m src.cli data/raw/big.log --out data/processed --no-llm --no-cti --no-reports`
+
+## Environment
+
+Create a `.env` (see variables below). Keys are optional; the tool runs offline without them.
+
+- `GROQ_API_KEYS`: comma‑separated LLM keys for rotation.
+- `GROQ_MODEL`: Groq model name (default `llama3-8b-8192`).
+- `GROQ_TOKENS_BUDGET`: approximate token budget per run/day; enrichment stops before the cap and continues offline.
+- `RISK_4XX_THRESHOLD`: per‑IP 4xx threshold to consider suspicious in reports (default `5`).
+- `SUSPICIOUS_UA_REGEX`: comma‑separated regex patterns to flag suspicious UAs.
+- VirusTotal: `VT_API_KEY` (single) or `VT_API_KEYS` (comma‑separated); optional, and CTI works in a degraded mode without it.
+- AbuseIPDB: `ABUSEIPDB_API_KEY` (single) or `ABUSEIPDB_API_KEYS` (comma‑separated).
+- Proxies: `PROXY_LIST`, a comma‑separated list of `http://`, `https://`, or `socks5://` URLs.
+- `OFFLINE_IP_BLOCKLIST`: path to a newline‑separated list of known‑bad IPs to escalate risk without CTI calls.
+
+Budget notes:
+- When available, the client uses model‑reported token usage; otherwise it falls back to a conservative character‑based estimate.
+
+## Outputs
+
+- Enriched events: `data/processed/<name>.jsonl` (or `.csv` with `--format csv`).
+- Reports: `data/processed/reports/report.txt` and `report.md` summarizing activity and suspicious IPs; may include a brief AI note if LLM is enabled.
+- Malicious AI report (optional): `data/processed/reports/malicious_ai_report.txt|md` if `--ai-malicious-report` is used and malicious CTI signals are present.
+- CTI cache: `data/cache/cti_cache.json` (auto‑created and reused to minimize network calls).
+
+## Testing
+
+- Run tests: `pytest -q`
+- Optional coverage: `pytest --cov=src -q` (if coverage plugin installed).
+
+Notes:
+- If you used the local venv above, run tests via `.venv/bin/pytest -q`.
+- A PyPDF2 deprecation warning may appear; it’s harmless and can be ignored.
+
+## UI Dashboard
+
+An optional Streamlit dashboard is included for exploration and client-friendly viewing.
+
+- UI dependencies are already included in `requirements.txt`; no separate install step is needed.
+- Run the UI: `scripts/run_ui.sh` (or `streamlit run ui/app.py`).
+- Select an enriched `.jsonl` file from `data/processed/` or upload one.
+- View status distribution, sample enriched events, and CTI attributes.
+
+## Troubleshooting
+
+- `.txt` auto‑detection: the CLI reads a small sample and parses with `parse_line`. If none match, the file is copied as plain text rather than parsed as logs.
+- LLM budget exceeded: you’ll see `LLM budget exhausted` in logs; records are still produced with `severity=unknown` and a rationale explaining sampling/gating.
+- CTI failures: the pipeline continues with cached/partial data; use `--no-cti` for fully offline runs. Consider `--cti-max` and batching to avoid rate limits.
+- No colors or CI: pass `--color never` for consistent, plain output.
+
+## Docs
+
+- Usage guide with more examples: `docs/USAGE.md`
+- Principles, strategy, and repo conventions: `AGENTS.md`
+- Mindmap/diagram: `docs/ProjectMindmapv0.5.png`
+- Project write‑ups: `docs/Final Project - Log Analysis + CTI.pdf`
+
+---
+
+Made with a focus on reliability, scalability, and cost‑awareness.
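
Note on the README's budget behaviour: it says the client prefers model-reported token usage and otherwise falls back to a conservative character-based estimate, stopping enrichment before the `GROQ_TOKENS_BUDGET` cap. The sketch below shows one way such a tracker could look. `TokenBudget`, its method names, and the 3-characters-per-token ratio are assumptions for illustration, not the project's code.

```python
from typing import Optional


class TokenBudget:
    """Illustrative budget tracker; the limit would come from GROQ_TOKENS_BUDGET."""

    def __init__(self, limit: Optional[int], chars_per_token: int = 3):
        self.limit = limit  # None means no budget is enforced
        self.used = 0
        # Assumption: a low chars-per-token ratio over-counts, which errs on the safe side.
        self.chars_per_token = chars_per_token

    def estimate(self, text: str) -> int:
        """Character-based estimate, rounded up."""
        return -(-len(text) // self.chars_per_token)

    def can_spend(self, prompt: str) -> bool:
        """True if sending `prompt` would keep usage within the limit."""
        return self.limit is None or self.used + self.estimate(prompt) <= self.limit

    def record(self, prompt: str, reply: str, reported: Optional[int] = None) -> None:
        """Prefer the model-reported total; otherwise fall back to the estimate."""
        if reported is not None:
            self.used += reported
        else:
            self.used += self.estimate(prompt) + self.estimate(reply)
```

A caller would check `can_spend` before each LLM request and, when it returns `False`, emit the `severity=unknown` placeholder instead of calling the API.
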
diff --git a/data/assets/flags/FR.png b/data/assets/flags/FR.png
diff --git a/data/assets/flags/US.png b/data/assets/flags/US.png