Commit 8c2253d

docs: upd api ref + 0.4.0 changelog

1 parent 10dea92, commit 8c2253d

3 files changed: +227 −134 lines
docs/api.md

Lines changed: 114 additions & 92 deletions

This page provides a detailed guide to the main functions, classes, and extensibility points.

### `parse`

```python
from hario_core.parse import parse
```

Parses a HAR file from a path, bytes, or file-like object and returns a validated `HarLog` model. Automatically selects the correct Pydantic model for each entry (including extensions).

**Signature:**
```python
def parse(src: str | Path | bytes | bytearray | IO[Any]) -> HarLog
```
- `src`: Path, bytes, or file-like object containing HAR JSON.

**Returns:**
- `HarLog` — a validated Pydantic model with `.entries` (list of `Entry` or extension models).

**Example:**
```python
har_log = parse("example.har")
for entry in har_log.entries:
    print(entry.request.url)
```

---
### `validate`

Validates a HAR dict (already loaded from JSON) and returns a `HarLog` model.

**Signature:**
```python
def validate(har_dict: dict) -> HarLog
```

---

### `register_entry_model`

Register a custom Pydantic model and detector function for new HAR entry formats (e.g., Safari, proprietary extensions).

**Signature:**
```python
def register_entry_model(detector: Callable[[dict], bool], model: type[Entry]) -> None
```
- `detector`: Function that takes an entry dict and returns `True` if the model should be used.
- `model`: Pydantic model class to use for matching entries.

**Example:**
```python
from hario_core.models import Entry

class CustomEntry(Entry):
    x_custom: str

def is_custom_entry(entry):
    return "x-custom" in entry

register_entry_model(is_custom_entry, CustomEntry)
```

---

### `entry_selector`

Selects the appropriate `Entry` model for a given entry dict, based on the registered detectors.

**Signature:**
```python
def entry_selector(entry_dict: dict) -> type[Entry]
```

---

## Data Models

All core data structures are implemented as Pydantic models in `hario_core.models`.

- `Entry`: Pydantic model for a HAR entry (fields: request, response, timings, cache, etc.).
- `HarLog`: Pydantic model for the HAR log (fields: version, creator, entries, etc.).
- `DevToolsEntry`: Chrome DevTools extension entry model.

**Example:**
```python
from hario_core.models import HarLog, Entry

har_log = HarLog.model_validate(har_json["log"])
for entry in har_log.entries:
    assert isinstance(entry, Entry)
    print(entry.request.url)
```

91-
### `Transformer`
92-
A transformer is a function that takes a dict (parsed HAR entry) and returns a dict (possibly mutated/transformed).
93-
94-
```python
95-
def my_transformer(data: dict[str, Any]) -> dict[str, Any]:
96-
# mutate data
97-
return data
98-
```
99-
100-
### `EntryIdFn`
101-
A function that takes an `Entry` and returns a string ID.
102-
103101
---
104102

105-
## ID Generation
103+
## Transformers & ID Generators
106104

107-
### `by_field`
105+
### `Transformer`
106+
A transformer is a callable that takes a dict (parsed HAR entry) and returns a dict (possibly mutated/transformed).
108107

109-
Returns a deterministic ID function based on specified fields of a HAR entry.
108+
### `set_id`
109+
Sets an ID field in each entry using a provided function.
110110

111111
**Signature:**
112112
```python
113-
def by_field(fields: list[str]) -> EntryIdFn
113+
def set_id(id_fn: Callable[[dict], str], id_field: str = "id") -> Transformer
114114
```
115115

116-
**Example:**
116+
### `by_field`
117+
Returns a deterministic ID function based on specified fields of a HAR entry.
118+
119+
**Signature:**
117120
```python
118-
from hario_core.utils import by_field
119-
id_fn = by_field(["request.url", "startedDateTime"])
121+
def by_field(fields: list[str]) -> Callable[[dict], str]
120122
```
121123

122124
### `uuid`
123-
124125
Returns a function that generates a random UUID for each entry.
125126

126127
**Signature:**
127128
```python
128-
def uuid() -> EntryIdFn
129+
def uuid() -> Callable[[dict], str]
129130
```
130131

131-
**Example:**
132-
```python
133-
from hario_core.utils import uuid
134-
id_fn = uuid()
135-
```
136-
137-
---
138-
139-
## Transformers
140-
141-
Transformers are functions that mutate or normalize HAR entry data for storage or analysis.
142-
143-
144132
### `flatten`
Flattens nested structures in a HAR entry into a flat dict, with keys joined by a separator. When a list is encountered, `array_handler` is called (default: `str`). Useful for exporting to CSV, analytics, or custom DB schemas.

**Signature:**
```python
def flatten(separator: str = ".", array_handler: Callable[[list, str], Any] = None) -> Transformer
```
- `separator`: Separator for keys (default: `"."`).
- `array_handler`: Function `(arr, path) -> value`; the default is `str(arr)`.

**Example:**
```python
def header_handler(arr, path):
    # each header becomes a separate key by name
    return {
        f"{path}.{item['name']}": item["value"]
        for item in arr
        if isinstance(item, dict) and "name" in item and "value" in item
    }

flat_entry = flatten(array_handler=header_handler)(entry)
# e.g. flat_entry["request.headers.user-agent"] == "Mozilla/5.0 ..."
```

### `normalize_sizes`
Normalizes negative size fields in request/response to zero.

**Signature:**
```python
def normalize_sizes() -> Transformer
```
### `normalize_timings`
Normalizes negative timing fields in `entry.timings` to zero.

**Signature:**
```python
def normalize_timings() -> Transformer
```

## Pipeline

### `PipelineConfig`
Configuration for the `Pipeline` processor.

**Example:**
```python
from hario_core.transform import PipelineConfig

config = PipelineConfig(
    batch_size=1000,                # entries per batch
    processing_strategy="process",  # "sequential", "thread", "process", "async"
    max_workers=4                   # number of parallel workers (if applicable)
)
```

- `batch_size`: `int`, default 20000.
- `processing_strategy`: `str`, one of `"sequential"`, `"thread"`, `"process"`, `"async"`.
- `max_workers`: `int | None`, number of parallel workers (for the thread/process strategies).

---

### `Pipeline`
A high-level class for processing HAR entry dicts: transforming and assigning IDs.

**Signature:**
```python
from hario_core.transform import Pipeline, PipelineConfig

pipeline = Pipeline(
    transformers=[...],
    config=PipelineConfig(...)
)

results = pipeline.process(entries)  # entries: list[dict]
```
- `transformers`: List of transformer functions to apply to each entry.
- `config`: `PipelineConfig` instance (optional; default: sequential, `batch_size=20000`).
- `process(entries)`: `entries` must be a list of dicts (e.g., from `HarLog.model_dump()["entries"]`).

---
### Example: Full Pipeline

```python
from hario_core.parse import parse
from hario_core.transform import Pipeline, by_field, flatten, normalize_sizes, set_id

har_log = parse("example.har")
entries = har_log.model_dump()["entries"]

pipeline = Pipeline([
    set_id(by_field(["request.url", "startedDateTime"])),
    flatten(),
    normalize_sizes(),
])
results = pipeline.process(entries)
```

---

### Example: Parallel Processing with a Custom Batch Size and Workers

```python
from hario_core.transform import Pipeline, PipelineConfig, flatten

config = PipelineConfig(
    processing_strategy="process",  # or "thread"
    batch_size=20,                  # process 20 entries per batch
    max_workers=6                   # use 6 parallel workers
)

pipeline = Pipeline([
    flatten(),
], config=config)
results = pipeline.process(entries)
```

#### Available Processing Strategies
- `sequential` (default): Process entries one by one in a single thread. Best for small datasets or debugging.
- `thread`: Parallel processing using threads. Useful for I/O-bound tasks or when the GIL is not a bottleneck.
- `process`: Parallel processing using multiple processes. Recommended for CPU-bound tasks and large datasets.
- `async`: Asynchronous processing (if your transformers support async). For advanced use cases with async I/O.
---

## Chrome DevTools Extension Example

You can use the Chrome DevTools HAR extension models to validate and work with HAR files exported from Chrome DevTools.

**Example:**
```python
from hario_core.models import DevToolsEntry, HarLog

# Suppose har_json is a dict loaded from a Chrome DevTools HAR file
har_log = HarLog.model_validate(har_json["log"])
```

docs/changelog.md

Lines changed: 10 additions & 0 deletions

# Changelog

### v0.4.0
- BREAKING: `Pipeline` now takes a list of transformers and a `PipelineConfig` instance (no more `id_fn`/`id_field` in the constructor).
- BREAKING: `Pipeline.process` now expects a list of dicts (e.g., from `HarLog.model_dump()["entries"]`).
- New: `PipelineConfig` class for configuring batch size, processing strategy (sequential/thread/process/async), and `max_workers`.
- New: Parallel and batch processing strategies for large HAR files (process, thread, async).
- New: Benchmarks and benchmarking scripts for pipeline performance (see `benchmarks/`).
- New: All transformers (`flatten`, `normalize_sizes`, `normalize_timings`, `set_id`) are now implemented as picklable callable classes, fully compatible with multiprocessing.
- New: `set_id` transformer for assigning IDs to entries using any ID function (e.g., `by_field`, `uuid`).
- Internal: Test suite and samples updated for the new API and real-world HAR compatibility.

### v0.3.1
- Fix real-world HAR compatibility: nested fields like `postData.params` are now optional in the models, so parsing DevTools and other real HAR files is more robust.
- All test samples are now based on real HAR data with valid `pages` and `pageref` links.
