19 commits
f1af99b
Merge pull request #1 from pierringshot/feat/pipeline-and-tests
pierringshot Sep 3, 2025
69147f6
feat(pipeline): add budget-aware LLM + CTI controls\n\n- LLM: groupin…
pierringshot Sep 4, 2025
d90b990
docs: add README with usage, flags, and offline/budget guidance
pierringshot Sep 4, 2025
fde0d71
docs(readme): add testing notes (venv pytest path, PyPDF2 warning)
pierringshot Sep 4, 2025
83bf1d6
feat(cli): add --ai-malicious-report to generate detailed AI report f…
pierringshot Sep 4, 2025
85d96e0
feat(cli): add CTI batching flags; chore: sanitize .env.example and d…
pierringshot Sep 5, 2025
4fb8e3e
chore(repo): snapshot current updates (2025-09-05)
pierringshot Sep 5, 2025
cbe9a70
docs(readme): polish README with mindmap + AZ intro
pierringshot Sep 5, 2025
b8a4011
docs(wiki): seed wiki content under docs/wiki (Home, Quickstart, CLI,…
pierringshot Sep 5, 2025
e5040ec
chore(scripts): add publish_wiki.sh to sync docs/wiki to GitHub Wiki
pierringshot Sep 5, 2025
0656513
feat(cli): add --verbose (default=max) and improve severity highlight…
Sep 8, 2025
496e2cd
feat(groq): prefer API token usage for budget accounting\n\n- Use mod…
pierringshot Sep 8, 2025
f53dec1
feat(cli): add rate limiting, resumable cache saves, JSON/CSV export,…
pierringshot Sep 10, 2025
2f9e923
chore(git): remove large/untracked assets and update .gitignore
pierringshot Sep 10, 2025
df3b8f9
feat(cti): add AbuseIPDB client and integrate into scanner; add flags…
pierringshot Sep 10, 2025
26cd950
feat(cli,core): add --workers parallelism and limit-safe executor
pierringshot Sep 10, 2025
faed822
feat(cti): add proxy rotation + multi-key support; PDF flags placehol…
pierringshot Sep 10, 2025
3923b1e
test: export legacy CLI helpers via package; deps: add groq to requir…
pierringshot Sep 10, 2025
4091132
feat(ui): simplify Streamlit UI and add AI executive summary to PDF e…
Sep 11, 2025
15 changes: 15 additions & 0 deletions .env.example
@@ -0,0 +1,15 @@
# Copy to .env and fill in secrets
VT_API_KEY=
# Optional AbuseIPDB (https://www.abuseipdb.com/) API key
ABUSEIPDB_API_KEY=
# Or multiple keys (comma-separated) to distribute load respectfully
ABUSEIPDB_API_KEYS=
# Optional outbound proxies (comma-separated). Examples:
# PROXY_LIST=http://user:password@1.2.3.4:8080,https://5.6.7.8:8443,socks5://9.9.9.9:1080
PROXY_LIST=
# GROQ (LLM) optional keys (comma-separated)
GROQ_API_KEYS=
# Optional path to a newline-separated list of known-bad IPs (offline escalation)
OFFLINE_IP_BLOCKLIST=
# Optional token budget (not required for this tool)
GROQ_TOKENS_BUDGET=
48 changes: 11 additions & 37 deletions .gitignore
@@ -1,41 +1,15 @@
# Environments
.venv/
env/
.env
.env.*

# Python
__pycache__/
*.pyc

# Data
data/raw/*
!data/raw/.gitkeep
data/processed/*
!data/processed/.gitkeep

# Cache
data/cache/*
!data/cache/.gitkeep

# Tool caches / reports
.pytest_cache/
.mypy_cache/
.ruff_cache/
htmlcov/
.coverage
coverage.xml

# Notebooks
*.ipynb_checkpoints/

# OS
.DS_Store
Thumbs.db

.venv/
.env
data/cache/*.json
data/processed/*
data/raw/*
docs/*.mp4
.coverage
.pytest_cache/
.mypy_cache/
.ruff_cache/
htmlcov/
coverage.xml
.debug/
*Zone.Identifier
*.pyc
*.pyo
*.DS_Store
7 changes: 7 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,7 @@
{
  "workbench.colorCustomizations": {
    "terminal.background": "#00000000",
    "minimap.background": "#00000000",
    "scrollbar.shadow": "#00000000"
  }
}
49 changes: 30 additions & 19 deletions AGENTS.md
@@ -1,37 +1,48 @@
# Repository Guidelines

This AGENTS.md guides both human and agent contributors. Its scope is the entire repository.

## Project Structure & Module Organization
- Root: currently contains `access_log.txt` and project PDFs. Move long‑form write‑ups to `docs/` and raw logs to `data/raw/` as the repo evolves.
- `src/`: Python modules for log parsing, enrichment, and CTI lookups (e.g., `src/parsers/`, `src/enrichers/`).
- `notebooks/`: exploratory analysis; keep outputs cleared before commit.
- `tests/`: unit tests mirroring `src/` layout (e.g., `tests/parsers/test_nginx.py`).
- `docs/`: reports, diagrams, and usage notes.
- Root: currently `access_log.txt` and project PDFs; move long‑form write‑ups to `docs/` and raw logs to `data/raw/` as the repo evolves.
- `src/`: Python modules for parsing, scoring, enrichment, and CTI (e.g., `src/parsers/`, `src/enrichers/`).
- `notebooks/`: exploratory analysis; clear outputs before commit.
- `tests/`: unit tests mirroring `src/` (e.g., `tests/parsers/test_nginx.py`); fixtures in `tests/fixtures/`.
- `docs/`: reports, diagrams, usage notes. `data/`: `raw/` inputs; `cache/` (e.g., `data/cache/cti_cache.json`).

## Build, Test, and Development Commands
- Create env: `python -m venv .venv && source .venv/bin/activate`.
- Install deps: `pip install -r requirements.txt` (add one if code is introduced).
- Run tests: `pytest -q`.
- Lint/format: `ruff check . && ruff format .` (or `black . && isort .` if preferred).
- Type check: `mypy src`.
- Create env: `python -m venv .venv && source .venv/bin/activate`
- Install deps: `pip install -r requirements.txt`
- Lint/format: `ruff check . && ruff format .` (or `black . && isort .`)
- Type check: `mypy src`
- Run tests: `pytest -q` (coverage: `pytest --cov=src`)
- Run UI: `streamlit run src/ui/streamlit_app.py`
- One‑click: `./run.sh setup`, `./run.sh scan <file>`, or `./run.sh ui`

## Coding Style & Naming Conventions
- Python 3.10+; 4‑space indentation; UTF‑8.
- Python 3.10+, UTF‑8, 4‑space indentation.
- Names: modules/functions `lower_snake_case`, classes `PascalCase`, constants `UPPER_SNAKE_CASE`.
- Files: logs `data/raw/YYYYMMDD_source.log`; notebooks `notebooks/<topic>_<yyyymmdd>.ipynb`.
- Keep functions <50 lines where practical; document public functions with docstrings.

## Testing Guidelines
- Framework: `pytest`; minimum 80% coverage measured via `pytest --cov=src`.
- Layout: mirror `src/` with `test_*.py`; use fixtures for sample logs under `tests/fixtures/`.
- Determinism: do not read network in tests; mock CTI APIs.
- Framework: `pytest`; mirror `src/` layout; fixtures under `tests/fixtures/`.
- Determinism: no network in tests; mock CTI/LLM calls; target ≥80% coverage.

## Commit & Pull Request Guidelines
- Commits: Conventional Commits (e.g., `feat(parser): add nginx status extraction`).
- PRs: concise summary, linked issue, before/after notes, and if UI/data changes, include a small sample input and expected output.
- Size: prefer ≤300 lines diff; split larger changes.
- PRs: concise summary, linked issue, before/after notes; if UI/data changes, include small sample input and expected output. Prefer ≤300 lines diff.

## Security & Data Handling
- Do not commit secrets or tokens; use `.env` and provide `.env.example`.
- Anonymize or truncate sensitive log data before committing.
- Large files: store raw datasets outside git or via LFS; keep only small, representative fixtures.
- Never commit secrets; use `.env` and provide `.env.example`.
- Anonymize/truncate sensitive logs; store large datasets outside git or via LFS.
- API keys: `VT_API_KEY`, `ABUSEIPDB_API_KEY` (optional). Respect rate limits.

## Scalable, Budget‑Aware Processing (Project‑Specific)
- Offline‑first; aggregate then sample; cache and dedupe. Defaults: `--llm-group-by ip`, `--llm-sample 200`, `--cti-scope suspicious` with `--cti-max 200`.
- Offline‑first; aggregate then sample; cache and dedupe. Defaults: `--llm-group-by ip`, `--llm-sample 200`, `--cti-scope suspicious` with `--cti-max 200` (use `--cti-max -1` to scan all IPs).
- Budget throttle: `export GROQ_TOKENS_BUDGET=150000`.
- Examples:
- Huge logs: `python -m src.cli data/raw/big.log --out data/processed --llm-group-by ip --llm-sample 200 --cti-scope suspicious --cti-max 200 --color never`
- Strictly offline: `python -m src.cli data/raw/big.log --out data/processed --no-llm --no-cti --no-reports`
- IP scan to PDF (CLI): `python -m src.cli scan-ips data/sample_ips.txt --out data/processed --cti-max -1`
- IP scan (UI): `streamlit run src/ui/streamlit_app.py`
52 changes: 52 additions & 0 deletions Makefile
@@ -0,0 +1,52 @@
VENV?=.venv
PY?=$(VENV)/bin/python
PIP?=$(VENV)/bin/pip

.PHONY: setup install lint fmt test scan scan-all ui help doctor scan-file

setup:
	python -m venv $(VENV)
	. $(VENV)/bin/activate; $(PIP) install -r requirements.txt

install:
	. $(VENV)/bin/activate; $(PIP) install -r requirements.txt

lint:
	. $(VENV)/bin/activate; $(PY) -m ruff check . || true

fmt:
	. $(VENV)/bin/activate; $(PY) -m ruff format . || true

test:
	. $(VENV)/bin/activate; pytest -q || true

scan:
	. $(VENV)/bin/activate; $(PY) -m src.cli scan-ips data/sample_ips.txt --out data/processed --no-cti

scan-all:
	. $(VENV)/bin/activate; $(PY) -m src.cli scan-ips data/sample_ips.txt --out data/processed --cti-max -1

ui:
	. $(VENV)/bin/activate; streamlit run src/ui/streamlit_app.py

help:
	@echo "Targets:" && \
	echo " make setup # Create venv and install deps" && \
	echo " make scan # Offline demo scan -> PDF" && \
	echo " make scan-all # Demo scan with CTI (uses VT_API_KEY)" && \
	echo " make scan-file FILE=path CTI_MAX=-1 RATE=0.8 BURST=1 SAVE=50 # Full control" && \
	echo " make ui # Launch Streamlit UI" && \
	echo " make lint|fmt|test" && \
	echo " make doctor # Quick environment check"

doctor:
	@python3 -c 'import sys; print("Python:", sys.version.split()[0]); assert sys.version_info[:2] >= (3,10)' || (echo "Python 3.10+ required" && exit 1)
	@[ -f .env ] || echo "Note: .env not found (optional). Copy .env.example -> .env"
	@[ -n "$$VT_API_KEY" ] || echo "Note: VT_API_KEY not set; CTI calls will be disabled."

# Example: make scan-file FILE=data/sample_ips.txt CTI_MAX=-1 RATE=0.8 BURST=1 SAVE=25
scan-file:
	@[ -n "$(FILE)" ] || (echo "Usage: make scan-file FILE=path [CTI_MAX=-1] [RATE=0.8] [BURST=1] [SAVE=50]" && exit 1)
	. $(VENV)/bin/activate; \
	$(PY) -m src.cli scan-ips $(FILE) --out data/processed --cti-max $${CTI_MAX:--1} \
	--cti-rate $${RATE:-0.8} --cti-burst $${BURST:-1} --save-every $${SAVE:-50}
139 changes: 139 additions & 0 deletions README.md
@@ -0,0 +1,139 @@
# LogCTIAI — Offline‑First Log Analysis + CTI (LLM‑Optional)

This project processes large volumes of server/web logs, extracts threat signals using optional grouped LLM commentary and CTI enrichment, and produces compact, reproducible reports. It is optimized for minimal network usage and budget control.

This project ingests large web/server logs, enriches events with optional LLM analysis, performs CTI lookups against external sources, and generates concise human‑readable reports. It is designed to run reliably on very large datasets with minimal network usage:

- Auto‑detects `.txt` vs `.log` inputs; parses recognized log lines in `.txt` files.
- Minimizes LLM calls via grouping, sampling, and gates; enforces an optional token budget.
- Minimizes CTI calls via suspicious‑first scoping, caps, batching, and strong caching.
- Works fully offline and degrades gracefully when network or budgets are unavailable.

See `docs/USAGE.md` for practical commands and tips. See `AGENTS.md` for project conventions and the scalable processing strategy.

![Mindmap](docs/ProjectMindmapv0.5.png)

## Quickstart

- Create env: `python -m venv .venv && source .venv/bin/activate`
- Install deps: `pip install -r requirements.txt`
- Run on a log (auto‑detects `.txt` files that look like logs):
  - `python -m src.cli data/raw/access_log.txt --out data/processed --summary --preview 3`
  - Writes `data/processed/access_log.jsonl` plus `.txt` and `.md` reports under `data/processed/reports/`.

### IP Threat Scanner (CLI & UI)

This repo also includes a fast, offline‑first IP CTI scanner with caching, PDF/JSON/CSV outputs, and a Streamlit UI.

- CLI (offline demo): `python -m src.cli scan-ips data/sample_ips.txt --out data/processed --no-cti`
- CLI (with CTI): `VT_API_KEYS=vt_key1,vt_key2 ABUSEIPDB_API_KEYS=ab1,ab2 python -m src.cli scan-ips data/sample_ips.txt --out data/processed --cti-max 200 --cti-rate 1 --cti-burst 1 --workers 2`
- UI: `streamlit run src/ui/streamlit_app.py` (clean UI with optional AI executive summary embedded in the exported PDF)

Environment (see `.env.example`):
- VirusTotal: `VT_API_KEY` or `VT_API_KEYS` (comma‑separated)
- AbuseIPDB: `ABUSEIPDB_API_KEY` or `ABUSEIPDB_API_KEYS` (comma‑separated)
- Optional proxies (resiliency, not for evading quotas): `PROXY_LIST="http://1.2.3.4:8080,socks5://5.6.7.8:1080"`
- Offline blocklist: `OFFLINE_IP_BLOCKLIST=/path/to/bad_ips.txt`

Notes:
- The scanner respects provider rate limits and `Retry-After`; it rotates your keys and proxies on 429/403 and caches results.
- VirusTotal has no API‑less access; provide an API key to query VT.
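
The rotation behaviour described in the notes above can be illustrated with a minimal sketch. It is not the scanner's actual code: `fetch_with_rotation` and the injected `send(key)` callable are hypothetical names, and the sketch only handles a `Retry-After` value given in seconds (the header may also carry an HTTP date).

```python
import itertools
import time


def fetch_with_rotation(send, keys, max_attempts=4, sleep=time.sleep):
    """Try API keys in round-robin order until one request succeeds.

    `send(key)` is a hypothetical callable returning (status, headers, body).
    Returns the body on HTTP 200, or None once max_attempts is exhausted.
    """
    key_cycle = itertools.cycle(keys)
    for _ in range(max_attempts):
        key = next(key_cycle)
        status, headers, body = send(key)
        if status == 200:
            return body
        if status in (429, 403):
            # Honour Retry-After when it is a plain number of seconds,
            # then move on to the next key.
            retry_after = headers.get("Retry-After")
            if retry_after and retry_after.isdigit():
                sleep(int(retry_after))
            continue
        raise RuntimeError(f"unexpected status {status}")
    return None
```

Injecting `send` and `sleep` keeps the sketch testable without network access, in line with the "no network in tests" rule in `AGENTS.md`.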

If LLM keys are not configured, enrichment runs offline with `severity=unknown` placeholders and continues to produce reports.

## CLI Overview

`python -m src.cli <input_path> --out <out_dir> [options]`

Common options:

- `--verbose quiet|normal|max`: control console verbosity (default: `max`).
- `--no-llm`: disable LLM enrichment (default if no keys set).
- `--no-cti`: skip CTI lookups; run fully offline.
- `--no-reports`: skip generating text/markdown reports.
- `--limit N`: process only the first N lines.
- `--format jsonl|csv`: output for enriched events (default: `jsonl`).
- `--color auto|always|never`: terminal color policy.
- `--ai-malicious-report`: after CTI summarization, ask the LLM for a detailed malicious-activity report (saved under `reports/`).

LLM request control:

- `--llm-group-by none|ip|signature`: group before LLM calls (default: `ip`); `signature` groups by `ip+path+status+ua`.
- `--group-window SECONDS`: add a time bucket to grouping (e.g., `60`).
- `--llm-sample N`: send only N groups to LLM; the rest are annotated as sampled/gated out (default: `200`).
- `--llm-gate-4xx N`: only send groups with ≥N 4xx responses.
- `--llm-gate-ua`: only send groups with suspicious user‑agents.
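
As a rough illustration of how grouping, gating, and sampling could combine, here is a minimal sketch. The function names, the event dictionary fields (`ip`, `path`, `status`, `ua`, `ts` as epoch seconds), and the "noisiest groups first" ordering are assumptions for this example, not the pipeline's actual implementation.

```python
from collections import defaultdict


def group_events(events, group_by="ip", window=None):
    """Group parsed events by IP, by ip+path+status+ua signature, or not at all."""
    groups = defaultdict(list)
    for index, ev in enumerate(events):
        if group_by == "ip":
            key = (ev["ip"],)
        elif group_by == "signature":
            key = (ev["ip"], ev["path"], ev["status"], ev["ua"])
        else:  # "none": every event forms its own group
            key = (index,)
        if window:
            # Add a time bucket so bursts in different windows stay separate.
            key += (int(ev["ts"]) // window,)
        groups[key].append(ev)
    return dict(groups)


def select_for_llm(groups, gate_4xx=0, sample=200):
    """Keep groups with at least gate_4xx 4xx responses, capped at `sample`."""
    def count_4xx(evs):
        return sum(1 for ev in evs if 400 <= ev["status"] < 500)

    eligible = [key for key, evs in groups.items() if count_4xx(evs) >= gate_4xx]
    # Assumption: prefer the groups with the most 4xx responses.
    eligible.sort(key=lambda key: count_4xx(groups[key]), reverse=True)
    return eligible[:sample]
```

Groups that are not returned by `select_for_llm` would be the ones annotated as sampled or gated out.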

CTI request control:

- `--cti-scope suspicious|all`: lookup only suspicious IPs (default) or all IPs.
- `--cti-max N`: cap number of IPs to query for CTI (0=unlimited; default: `100`).
- `--cti-batch-size N`, `--cti-batch-pause S`: batch CTI queries and pause between batches; cache flushes periodically.
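
The cap and batching flags can be pictured with the following sketch. It is an assumption-laden illustration: `plan_cti_lookups`, `run_batches`, and the injected `lookup` callable are hypothetical, and because this README documents `0` as unlimited while `AGENTS.md` uses `-1`, the sketch treats any non-positive cap as unlimited.

```python
import time


def plan_cti_lookups(ips, cache, cti_max):
    """Dedupe IPs, skip those already cached, then apply the cap."""
    seen = set()
    pending = []
    for ip in ips:
        if ip in seen or ip in cache:
            continue
        seen.add(ip)
        pending.append(ip)
    # Assumption: a non-positive cap means "no limit".
    return pending if cti_max <= 0 else pending[:cti_max]


def run_batches(pending, lookup, cache, batch_size=10, batch_pause=0.0,
                sleep=time.sleep):
    """Query IPs in batches, storing results in the cache between pauses."""
    for start in range(0, len(pending), batch_size):
        for ip in pending[start:start + batch_size]:
            cache[ip] = lookup(ip)
        # Pause only when another batch follows.
        if start + batch_size < len(pending):
            sleep(batch_pause)
    return cache
```

Checking the cache before applying the cap means repeated runs spend the cap on IPs that have not been seen yet.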

Examples (large logs):

- Minimal network usage:
  - `python -m src.cli data/raw/big.log --out data/processed --llm-group-by ip --group-window 60 --llm-gate-4xx 5 --llm-sample 200 --cti-scope suspicious --cti-max 200`
- Strictly offline (fastest):
  - `python -m src.cli data/raw/big.log --out data/processed --no-llm --no-cti --no-reports`

## Environment

Create a `.env` (see variables below). Keys are optional; the tool runs offline without them.

- `GROQ_API_KEYS`: comma‑separated LLM keys for rotation.
- `GROQ_MODEL`: Groq model name (default `llama3-8b-8192`).
- `GROQ_TOKENS_BUDGET`: approximate token budget per run/day; enrichment stops before the cap and continues offline.
- `RISK_4XX_THRESHOLD`: per‑IP 4xx threshold to consider suspicious in reports (default `5`).
- `SUSPICIOUS_UA_REGEX`: comma‑separated regex patterns to flag suspicious UAs.
- VirusTotal: `VT_API_KEY` (single) or `VT_API_KEYS` (comma‑separated); optional, CTI works in a degraded mode without it.
- AbuseIPDB: `ABUSEIPDB_API_KEY` (single) or `ABUSEIPDB_API_KEYS` (comma‑separated).
- Proxies: `PROXY_LIST` comma‑separated list of `http://`, `https://`, or `socks5://` URLs.
- `OFFLINE_IP_BLOCKLIST`: path to a newline‑separated list of known‑bad IPs to escalate risk without CTI calls.
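
One way the single and comma‑separated key variables could be read is sketched below. `load_keys` is a hypothetical helper, and the precedence (the plural variable wins when both are set) is an assumption rather than documented behaviour.

```python
import os


def load_keys(single_var, multi_var, env=None):
    """Return a de-duplicated list of API keys from the environment.

    Assumption: the comma-separated variable takes precedence over the
    single-key variable when both are set.
    """
    env = os.environ if env is None else env
    raw = env.get(multi_var) or env.get(single_var) or ""
    keys = []
    for part in raw.split(","):
        key = part.strip()
        if key and key not in keys:
            keys.append(key)
    return keys
```

For example, `load_keys("VT_API_KEY", "VT_API_KEYS")` returns an empty list when neither variable is set, which a caller could treat as the signal to stay offline.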

Budget notes:
- When available, the client uses model‑reported token usage; otherwise it falls back to a conservative character‑based estimate.
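
The accounting rule above might look like the following sketch. `TokenBudget` is a hypothetical class, and the 3‑characters‑per‑token ratio is only an assumed conservative value; the real client may estimate differently.

```python
class TokenBudget:
    """Track LLM token spend against an optional cap (illustrative only)."""

    def __init__(self, limit=None):
        self.limit = limit  # None means no budget is enforced
        self.used = 0

    @staticmethod
    def estimate(text, chars_per_token=3):
        # Ceiling division: a low chars-per-token ratio overestimates usage,
        # which keeps the fallback on the conservative side.
        return -(-len(text) // chars_per_token)

    def can_afford(self, prompt):
        """Check before sending, so enrichment stops ahead of the cap."""
        if self.limit is None:
            return True
        return self.used + self.estimate(prompt) <= self.limit

    def charge(self, prompt, reply, reported_tokens=None):
        """Prefer model-reported usage; fall back to the character estimate."""
        if reported_tokens is not None:
            cost = reported_tokens
        else:
            cost = self.estimate(prompt) + self.estimate(reply)
        self.used += cost
        return cost
```

When `can_afford` returns false, a caller would skip the LLM request and emit the offline placeholder instead.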

## Outputs

- Enriched events: `data/processed/<name>.jsonl` (or `.csv` with `--format csv`).
- Reports: `data/processed/reports/report.txt` and `report.md` summarizing activity and suspicious IPs; may include a brief AI note if LLM is enabled.
- Malicious AI report (optional): `data/processed/reports/malicious_ai_report.txt|md` if `--ai-malicious-report` is used and malicious CTI signals are present.
- CTI cache: `data/cache/cti_cache.json` (auto‑created and reused to minimize network calls).

## Testing

- Run tests: `pytest -q`
- Optional coverage: `pytest --cov=src -q` (requires the `pytest-cov` plugin).

Notes:
- If you used the local venv above, run tests via `.venv/bin/pytest -q`.
- A PyPDF2 deprecation warning may appear; it’s harmless and can be ignored.

## UI Dashboard

An optional Streamlit dashboard is included for exploration and client-friendly viewing.

- UI deps are already part of `requirements.txt`; install them with `pip install -r requirements.txt`.
- Run the UI: `scripts/run_ui.sh` (or `streamlit run ui/app.py`).
- Select an enriched `.jsonl` file from `data/processed/` or upload one.
- View status distribution, sample enriched events, and CTI attributes.

## Troubleshooting

- `.txt` auto‑detection: the CLI reads a small sample and parses with `parse_line`. If none match, the file is copied as plain text rather than parsed as logs.
- LLM budget exceeded: you’ll see `LLM budget exhausted` in logs; records are still produced with `severity=unknown` and a rationale explaining sampling/gating.
- CTI failures: the pipeline continues with cached/partial data; use `--no-cti` for fully offline runs. Consider `--cti-max` and batching to avoid rate limits.
- No colors or CI: pass `--color never` for consistent, plain output.
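
The `.txt` auto‑detection described in the first item above can be sketched as follows. This `parse_line` is a stand‑in written for the example (a simple Common Log Format regex), not the project's real parser, and `looks_like_log` and its sample size are hypothetical.

```python
import re
from itertools import islice

# Assumption: a simplified Common Log Format pattern for illustration.
_CLF = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)'
)


def parse_line(line):
    """Return a dict for a recognized log line, or None if it does not match."""
    match = _CLF.match(line)
    if not match:
        return None
    ip, ts, method, path, status, _size = match.groups()
    return {"ip": ip, "ts": ts, "method": method, "path": path,
            "status": int(status)}


def looks_like_log(lines, sample_size=50):
    """Treat the input as a log if any of the first non-blank lines parses."""
    sample = islice((ln for ln in lines if ln.strip()), sample_size)
    return any(parse_line(ln) is not None for ln in sample)
```

A file for which `looks_like_log` returns false would then be handled as plain text, matching the fallback described above.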

## Docs

- Usage guide with more examples: `docs/USAGE.md`
- Principles, strategy, and repo conventions: `AGENTS.md`
- Mindmap/diagram: `docs/ProjectMindmapv0.5.png`
- Project write‑ups: `docs/Final Project - Log Analysis + CTI.pdf`

---

Made with a focus on reliability, scalability, and cost‑awareness.
Binary file added data/assets/flags/FR.png
Binary file added data/assets/flags/US.png