Commit 2b5de4b

Refactor and update README.md
1 parent 2783a52 commit 2b5de4b

README.md (1 file changed: 100 additions, 88 deletions)
![Pipeline overview](https://raw.githubusercontent.com/stefanDeveloper/heiDGAF/main/docs/media/heidgaf_overview_detailed.drawio.png?raw=true)

## Getting Started

##### Run **heiDGAF** using Docker Compose:

```sh
HOST_IP=127.0.0.1 docker compose -f docker/docker-compose.yml up
```
<p align="center">
<img src="https://raw.githubusercontent.com/stefanDeveloper/heiDGAF/main/assets/terminal_example.gif?raw=true" alt="Terminal example"/>
</p>

##### Or run the modules locally on your machine:

```sh
python -m venv .venv
source .venv/bin/activate

sh install_requirements.sh
```

Alternatively, you can use `pip install` and install all needed requirements individually with `-r requirements.*.txt`.
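For example, assuming the per-module requirement files follow the `requirements.*.txt` naming pattern mentioned above, the individual installs can be scripted in one loop:

```sh
# Install every per-module requirements file matching the
# requirements.*.txt pattern from the project root.
for f in requirements.*.txt; do
    pip install -r "$f"
done
```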

Now, you can start each stage, e.g. the inspector:

```sh
python src/inspector/inspector.py
```
<p align="right">(<a href="#readme-top">back to top</a>)</p>

## Usage

### Configuration

To configure **heiDGAF** according to your needs, use the provided `config.yaml`.

The most relevant settings concern your specific log line format, the model you want to use, and possibly your infrastructure.

The section `pipeline.log_collection.collector.logline_format` has to be adjusted to reflect your specific input log line format. Using our adjustable and flexible log line configuration, you can rename, reorder, and fully configure each field of a valid log line. Freely define timestamps, RegEx patterns, lists, and IP addresses. For example, your configuration might look as follows:
```yml
- [ "timestamp", Timestamp, "%Y-%m-%dT%H:%M:%S.%fZ" ]
- [ "status_code", ListItem, [ "NOERROR", "NXDOMAIN" ], [ "NXDOMAIN" ] ]
- [ "client_ip", IpAddress ]
- [ "dns_server_ip", IpAddress ]
- [ "domain_name", RegEx, '^(?=.{1,253}$)((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.)+[A-Za-z]{2,63}$' ]
- [ "record_type", ListItem, [ "A", "AAAA" ] ]
- [ "response_ip", IpAddress ]
- [ "size", RegEx, '^\d+b$' ]
```
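As an illustration of how such a field list is meant to be read, the sketch below validates one log line against a spec of this shape. This is a hypothetical helper, not heiDGAF's actual collector code; the real parsing logic may differ:

```python
import ipaddress
import re
from datetime import datetime

# Hypothetical, simplified version of a logline_format field spec
# like the YAML above: (field_name, field_type, type_argument).
FIELDS = [
    ("timestamp", "Timestamp", "%Y-%m-%dT%H:%M:%S.%fZ"),
    ("status_code", "ListItem", ["NOERROR", "NXDOMAIN"]),
    ("client_ip", "IpAddress", None),
    ("record_type", "ListItem", ["A", "AAAA"]),
    ("size", "RegEx", r"^\d+b$"),
]


def validate(values):
    """Return True if each value matches its configured field type."""
    for (name, ftype, arg), value in zip(FIELDS, values):
        if ftype == "Timestamp":
            datetime.strptime(value, arg)      # raises ValueError on mismatch
        elif ftype == "ListItem" and value not in arg:
            return False
        elif ftype == "IpAddress":
            ipaddress.ip_address(value)        # raises ValueError on mismatch
        elif ftype == "RegEx" and not re.fullmatch(arg, value):
            return False
    return True


print(validate(["2024-01-01T00:00:00.000Z", "NXDOMAIN", "10.0.0.1", "A", "150b"]))
# → True
```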

The options `pipeline.data_inspection` and `pipeline.data_analysis` are relevant for configuring the model. The section `environment` can be fine-tuned to prevent naming collisions for Kafka topics and to adjust addressing in your environment.

For more in-depth information on your options, have a look at our [official documentation](https://heidgaf.readthedocs.io/en/latest/usage.html), where we provide tables explaining all values in detail.

### Monitoring

To monitor the system and observe its real-time behavior, multiple Grafana dashboards have been set up.

Have a look at the following pictures showing examples of how these dashboards might look at runtime.
<details>
<summary><strong>Overview</strong> dashboard</summary>

Contains the most relevant information on the system's runtime behavior, its efficiency, and its effectiveness.

<p align="center">
<a href="./assets/readme_assets/overview.png">

</details>

<details>
<summary><strong>Latencies</strong> dashboard</summary>

Presents information on latencies, including comparisons between the modules and more detailed, stand-alone metrics.

<p align="center">
<a href="./assets/readme_assets/latencies.jpeg">

</details>

<details>
<summary><strong>Log Volumes</strong> dashboard</summary>

Presents information on the fill levels of each module, i.e. the number of entries currently in the module for processing. Includes comparisons between the modules, more detailed stand-alone metrics, as well as total numbers of logs entering the pipeline or being marked as fully processed.

<p align="center">
<a href="./assets/readme_assets/log_volumes.jpeg">

</details>

<details>
<summary><strong>Alerts</strong> dashboard</summary>

Presents details on the number of logs detected as malicious, including the IP addresses responsible for those alerts.

<p align="center">
<a href="./assets/readme_assets/alerts.png">

</details>

<details>
<summary><strong>Dataset</strong> dashboard</summary>

This dashboard is only active in **_datatest_** mode. Users who want to test their own models can use this mode to inspect confusion matrices on testing data.

> [!CAUTION]
> This feature is in a very early development stage.

<p align="center">
<a href="./assets/readme_assets/datatests.png">

</details>

<p align="right">(<a href="#readme-top">back to top</a>)</p>

## Models and Training

To train and test our models, and possibly your own, we currently rely on the following datasets:

- [CICBellDNS2021](https://www.unb.ca/cic/datasets/dns-2021.html)
- [DGTA Benchmark](https://data.mendeley.com/datasets/2wzf9bz7xr/1)
- [DNS Tunneling Queries for Binary Classification](https://data.mendeley.com/datasets/mzn9hvdcxg/1)
- [UMUDGA - University of Murcia Domain Generation Algorithm Dataset](https://data.mendeley.com/datasets/y8ph45msv8/1)
- [DGArchive](https://dgarchive.caad.fkie.fraunhofer.de/)

We compute all features separately and only rely on the `domain` and `class` for binary classification.

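For intuition, features of this kind can be derived from the domain string alone. The sketch below computes length, digit ratio, and Shannon entropy, which are common DGA-detection features; they are illustrative examples and not necessarily the exact feature set heiDGAF computes:

```python
import math
from collections import Counter


def domain_features(domain: str) -> dict:
    """Illustrative character-level features for a domain name.

    These are common DGA-detection features; the actual heiDGAF
    feature set may differ.
    """
    counts = Counter(domain)
    total = len(domain)
    # Shannon entropy over the character distribution of the domain.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "length": total,
        "digit_ratio": sum(ch.isdigit() for ch in domain) / total,
        "entropy": round(entropy, 3),
    }


print(domain_features("example.com"))
```

Algorithmically generated domains tend to score higher on entropy and digit ratio than human-chosen ones, which is why such features are useful inputs for a binary classifier.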
### Inserting Data for Testing

For testing purposes, we provide multiple scripts in the `scripts` directory. Use `real_logs.dev.py` to send data from the datasets into the pipeline. After downloading the dataset and storing it under `<project-root>/data`, run

```sh
python scripts/real_logs.dev.py
```

to start continuously inserting dataset traffic.

### Training Your Own Models

> [!IMPORTANT]
> This is only a brief wrap-up of a custom training process.
> We highly encourage you to have a look at the [documentation](https://heidgaf.readthedocs.io/en/latest/training.html)
> for a full description and explanation of the configuration parameters.

We feature two trained models:

1. XGBoost (`src/train/model.py#XGBoostModel`) and
2. RandomForest (`src/train/model.py#RandomForestModel`).

After installing the requirements, use `src/train/train.py`:

```sh
> python -m venv .venv
...
Commands:
train
```

Setting up the [dataset directories](#inserting-data-for-testing) (and adding the code for your model class if applicable) lets you start the training process by running the following commands:

##### Model Training

```sh
> python src/train/train.py train --dataset <dataset_type> --dataset_path <path/to/your/datasets> --model <model_name>
```
The results are saved to `./results` by default, unless configured otherwise.

##### Model Tests

```sh
> python src/train/train.py test --dataset <dataset_type> --dataset_path <path/to/your/datasets> --model <model_name> --model_path <path_to_model_version>
```

##### Model Explain

```sh
> python src/train/train.py explain --dataset <dataset_type> --dataset_path <path/to/your/datasets> --model <model_name> --model_path <path_to_model_version>
```
This will create a `rules.txt` file containing the model's internals, explaining the rules it learned.

<p align="right">(<a href="#readme-top">back to top</a>)</p>

<!-- CONTRIBUTING -->
## Contributing
