Commit a7322f4

📝 Update documentation
1 parent 79240fc commit a7322f4

6 files changed, +208 -44 lines changed

Makefile

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@ clean:

 documentation:
 mkdocs build
-aws --endpoint-url https://s3.investigativedata.org s3 sync ./site s3://openaleph.org/docs/lib/openaleph-search
+aws --profile nbg1 --endpoint-url https://s3.investigativedata.org s3 sync ./site s3://openaleph.org/docs/lib/openaleph-search

 elastic-build:
 docker build -t ghcr.io/openaleph/elasticsearch:$(ELASTIC_TAG) .

docs/aggregations.md

Lines changed: 43 additions & 0 deletions
@@ -97,6 +97,36 @@ Aggregation type for special fields.
 --args "facet=properties.entity&facet_type:properties.entity=entity"
 ```

+### `metric:TYPE`
+
+Numeric metric aggregation. `TYPE` is one of `sum`, `avg`, `min`, `max`. Value is a FtM property name (the `numeric.` ES field prefix is resolved internally).
+
+```bash
+# Single metric
+--args "metric:sum=amount"
+
+# Multiple metrics on same or different fields
+--args "metric:sum=amount&metric:avg=amount&metric:min=registrationArea"
+```
+
+Response key format: `{field}.{type}` (e.g. `amount.sum`)
+
+### Metric aggregations
+
+Compute numeric metrics (sum, average, min, max) on numeric fields. Uses the `metric:` prefix with FtM property names — the `numeric.` ES field prefix is resolved internally.
+
+```bash
+# Sum of payment amounts
+openaleph-search search query-string "*" \
+--args "filter:schema=Payment&metric:sum=amount"
+
+# Multiple metrics
+openaleph-search search query-string "*" \
+--args "filter:schema=Payment&metric:sum=amount&metric:avg=amount&metric:min=amount&metric:max=amount"
+```
+
+Supported types: `sum`, `avg`, `min`, `max`
+
 ## Response format

 Aggregations appear in the `aggregations` section:
@@ -132,6 +162,12 @@ Aggregations appear in the `aggregations` section:
 "bg_count": 100
 }
 ]
+},
+"amount.sum": {
+"value": 125000.0
+},
+"amount.avg": {
+"value": 2500.0
 }
 }
 }
@@ -261,6 +297,13 @@ openaleph-search search query-string "company" \
 --args "facet=schema&facet=created_at&facet_interval:created_at=year&facet_size:schema=50"
 ```

+### Payment totals
+
+```bash
+openaleph-search search query-string "*" \
+--args "filter:schema=Payment&filter:beneficiary=entity-id&metric:sum=amount&metric:avg=amount"
+```
+
 ## Error handling

 ### Invalid fields
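
The `{field}.{type}` response keys added in `docs/aggregations.md` can be read back with a few lines of client code. The following is a minimal sketch, not part of the commit: the helper name `extract_metrics` and the shape of the `schema` facet entry in the sample are assumptions for illustration; only the `amount.sum` and `amount.avg` entries mirror the documented response.

```python
from typing import Any

# Metric types listed in the documentation above.
METRIC_TYPES = {"sum", "avg", "min", "max"}


def extract_metrics(aggregations: dict[str, Any]) -> dict[str, dict[str, float]]:
    """Group `{field}.{type}` entries by field (hypothetical helper)."""
    metrics: dict[str, dict[str, float]] = {}
    for key, body in aggregations.items():
        # Split on the last dot so dotted names keep their prefix intact.
        field, _, metric_type = key.rpartition(".")
        if not field or metric_type not in METRIC_TYPES:
            continue  # not a metric key, e.g. a facet such as "schema"
        if not isinstance(body, dict) or "value" not in body:
            continue
        metrics.setdefault(field, {})[metric_type] = body["value"]
    return metrics


# Sample response fragment; the facet entry shape is assumed for illustration.
sample = {
    "schema": {"buckets": [{"key": "Payment", "doc_count": 50}]},
    "amount.sum": {"value": 125000.0},
    "amount.avg": {"value": 2500.0},
}
print(extract_metrics(sample))
```

Splitting on the last dot means a facet key such as `properties.entity` is skipped, because `entity` is not one of the four metric types.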

docs/highlighting.md

Lines changed: 24 additions & 23 deletions
@@ -50,30 +50,26 @@ Reduce this value for better performance on large documents:

 The system uses different Elasticsearch highlighters optimized for each field type:

-### Fast Vector Highlighter (FVH)
-
-Used for: `content` field (full-text of source documents)
-
-Best for long text with accurate phrase highlighting. Requires term vectors to be stored in the index.
+### Unified Highlighter

-Configuration (via environment):
+Used for: `content` (document text), `text` (secondary text), `translation` (translated text), `name` (entity names)

-- `OPENALEPH_SEARCH_HIGHLIGHTER_FVH_ENABLED=true` (default)
-- Requires `OPENALEPH_SEARCH_CONTENT_TERM_VECTORS=true` (default)
+The default highlighter for all fields. Balanced performance with good support for mixed content.

-If fast vector highlighting is disabled, the unified highlighter is used for the `content` field.
+### Fast Vector Highlighter (FVH)

-### Unified Highlighter
+Optionally used for the `content` field. Provides more accurate phrase highlighting (wraps entire phrases in a single `<em>` tag) but requires term vectors. Disabled by default because it is incompatible with `copy_to` fields excluded from `_source` — for entities where multiple properties copy into `content` (e.g. HyperText with both `bodyHtml` and `indexText`), FVH causes term vector offset mismatches that drop hits from results.

-Used for: `name` field
+Configuration (via environment):

-Balanced performance for entity names and titles.
+- `OPENALEPH_SEARCH_HIGHLIGHTER_FVH_ENABLED=false` (default)
+- Requires `OPENALEPH_SEARCH_CONTENT_TERM_VECTORS=true` (default) when enabled

 ### Plain Highlighter

-Used for: `names` (keywords), `text`, and other fields
+Used for: `names` (keywords)

-Fast highlighting for simple matches.
+Fast highlighting for simple keyword matches.

 ## Configuration

@@ -84,10 +80,10 @@ Control highlighting behavior via environment variables:
 Use Fast Vector Highlighter for content field.

 ```bash
-export OPENALEPH_SEARCH_HIGHLIGHTER_FVH_ENABLED=true
+export OPENALEPH_SEARCH_HIGHLIGHTER_FVH_ENABLED=false
 ```

-When false, uses Unified Highlighter instead.
+Default: `false`. When false, uses Unified Highlighter instead. See [Highlighter types](#highlighter-types) for trade-offs.

 ### `highlighter_fragment_size`

@@ -146,10 +142,10 @@ Default: `300`
 Maximum characters to analyze.

 ```bash
-export OPENALEPH_SEARCH_HIGHLIGHTER_MAX_ANALYZED_OFFSET=999999
+export OPENALEPH_SEARCH_HIGHLIGHTER_MAX_ANALYZED_OFFSET=100000
 ```

-Default: `999999`
+Default: `100000`

 ## Response format

@@ -183,10 +179,15 @@ Matched terms are wrapped in `<em>` tags.

 Multiple fields are highlighted automatically:

-- `content` - Main document text
-- `name` - Entity names
-- `names` - Name keywords
-- `text` - Secondary text content
+- `content` - Main document text (primary highlight field for entities)
+- `names` - Normalized name keywords
+- `text` - Secondary text content (catch-all `copy_to` target)
+- `translation` - Translated text content
+
+The `text` and `translation` highlight fields can be disabled via settings:
+
+- `OPENALEPH_SEARCH_HIGHLIGHTER_TEXT_FIELD=false`
+- `OPENALEPH_SEARCH_HIGHLIGHTER_TRANSLATION_FIELD=false`

 ## Examples

@@ -280,4 +281,4 @@ Check that:
 - Reduce `highlight_count`
 - Lower `max_highlight_analyzed_offset`
 - Decrease `phrase_limit`
-- (Other than that the name suggests, the FVH seems to be _slower_ than the unified highlighter): Consider disabling FVH: `OPENALEPH_SEARCH_HIGHLIGHTER_FVH_ENABLED=false`
+- The unified highlighter (default) generally performs well; FVH is not recommended due to compatibility issues with `copy_to` fields
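
The revised highlighter rules in `docs/highlighting.md` amount to a small decision table. The sketch below restates it in Python under stated assumptions: `env_flag` and `pick_highlighter` are hypothetical names, the boolean parsing is a guess at how the settings are read, and the fallback to the unified highlighter when FVH is enabled without term vectors is an assumption the diff does not confirm. The return values `unified`, `fvh` and `plain` are the Elasticsearch highlighter type names.

```python
import os


def env_flag(name: str, default: bool) -> bool:
    """Read a boolean environment variable (hypothetical parsing rules)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def pick_highlighter(field: str, fvh_enabled: bool = False, term_vectors: bool = True) -> str:
    """Return the Elasticsearch highlighter type for a field, per the updated docs."""
    if field == "names":
        return "plain"  # keyword field
    if field == "content" and fvh_enabled and term_vectors:
        return "fvh"  # opt-in only; needs term vectors
    # Assumption: everything else, including FVH without term vectors,
    # falls back to the default unified highlighter.
    return "unified"


fvh = env_flag("OPENALEPH_SEARCH_HIGHLIGHTER_FVH_ENABLED", default=False)
vectors = env_flag("OPENALEPH_SEARCH_CONTENT_TERM_VECTORS", default=True)
for field in ("content", "text", "translation", "name", "names"):
    print(field, pick_highlighter(field, fvh, vectors))
```

With both variables unset, every field except `names` resolves to `unified`, which matches the new default described in the diff.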

docs/more_like_this.md

Lines changed: 28 additions & 4 deletions
@@ -35,7 +35,7 @@ Other entity types (Person, Company, etc.) are excluded.
 Minimum document frequency corpus-wide.

 - Type: `int`
-- Default: `0`
+- Default: `1`

 ```bash
 --args "mlt_min_doc_freq=2"
@@ -57,7 +57,7 @@ Minimum term frequency within source document.
 Maximum terms to use in query.

 - Type: `int`
-- Default: `25`
+- Default: `200`

 ```bash
 --args "mlt_max_query_terms=50"
@@ -74,6 +74,28 @@ Percentage of query terms that must match.
 --args "mlt_minimum_should_match=25%"
 ```

+### `mlt_min_word_length`
+
+Minimum word length for query terms.
+
+- Type: `int`
+- Default: `5`
+
+```bash
+--args "mlt_min_word_length=3"
+```
+
+### `mlt_max_doc_freq`
+
+Maximum document frequency for query terms. Terms appearing in more documents than this are ignored.
+
+- Type: `int`
+- Default: `500`
+
+```bash
+--args "mlt_max_doc_freq=1000"
+```
+
 ## Parameter effects

 ### `min_term_freq`
@@ -112,8 +134,10 @@ Affects query comprehensiveness:
 "fields": ["content", "text", "name"],
 "like": [{"_id": "doc-123"}],
 "min_term_freq": 1,
-"max_query_terms": 25,
-"min_doc_freq": 0,
+"max_query_terms": 200,
+"min_doc_freq": 1,
+"min_word_length": 5,
+"max_doc_freq": 500,
 "minimum_should_match": "10%"
 }
 }

docs/reference/parser.md

Lines changed: 41 additions & 5 deletions
@@ -205,6 +205,28 @@ Specify facet aggregation type.
 --args "facet=properties.entity&facet_type:properties.entity=entity"
 ```

+## Metric aggregations
+
+Compute numeric metrics on numeric fields. [Read more](../aggregations.md#metric-aggregations)
+
+### `metric:TYPE`
+
+Format: `metric:TYPE=FIELD`
+
+Types: `sum`, `avg`, `min`, `max`
+
+Field is a FtM property name (the `numeric.` ES field prefix is resolved internally).
+
+```bash
+# Sum of amounts
+--args "metric:sum=amount"
+
+# Multiple metrics
+--args "metric:sum=amount&metric:avg=amount&metric:min=registrationArea"
+```
+
+Response keys follow `{field}.{type}` pattern (e.g. `amount.sum`).
+
 ## Significant terms

 Find unusual or interesting terms in search results. [Read more](../significant_terms.md)
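
To make the `metric:TYPE=FIELD` format concrete, here is a rough sketch of how such arguments could be turned into Elasticsearch metric aggregations. It is not the library's parser: `metric_aggregations` is a hypothetical function, and mapping a property to `numeric.<property>` is an assumption drawn from the note that the `numeric.` prefix is resolved internally. The `{"sum": {"field": ...}}` shape is the standard Elasticsearch metric aggregation body.

```python
from urllib.parse import parse_qsl

# Metric types listed in the parser reference above.
METRIC_TYPES = ("sum", "avg", "min", "max")


def metric_aggregations(args: str) -> dict[str, dict]:
    """Map `metric:TYPE=FIELD` pairs to aggregation bodies keyed `{field}.{type}`."""
    aggs: dict[str, dict] = {}
    for key, field in parse_qsl(args):
        prefix, _, metric_type = key.partition(":")
        if prefix != "metric":
            continue  # filters, facets and other parameters are ignored here
        if metric_type not in METRIC_TYPES:
            raise ValueError(f"unsupported metric type: {metric_type!r}")
        # Assumption: the numeric ES field is the property name with a `numeric.` prefix.
        aggs[f"{field}.{metric_type}"] = {metric_type: {"field": f"numeric.{field}"}}
    return aggs


print(metric_aggregations("filter:schema=Payment&metric:sum=amount&metric:min=registrationArea"))
```

Keying the result by `{field}.{type}` keeps the request and the documented response keys aligned, so a caller can look up `amount.sum` in both.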
@@ -304,31 +326,45 @@ Parameters for similarity search. [Read more](../more_like_this.md)
 Minimum document frequency for query terms.

 - Type: `int`
-- Default: `5`
+- Default: `1`

 ### `mlt_min_term_freq`

 Minimum term frequency within document.

 - Type: `int`
-- Default: `5`
+- Default: `1`

 ### `mlt_max_query_terms`

 Maximum number of query terms to use.

 - Type: `int`
-- Default: `50`
+- Default: `200`

 ### `mlt_minimum_should_match`

 Percentage of terms that must match.

 - Type: `str`
-- Default: `60%`
+- Default: `10%`
+
+### `mlt_min_word_length`
+
+Minimum word length for query terms.
+
+- Type: `int`
+- Default: `5`
+
+### `mlt_max_doc_freq`
+
+Maximum document frequency for query terms.
+
+- Type: `int`
+- Default: `500`

 ```bash
-"--args "mlt_min_doc_freq=3&mlt_max_query_terms=100&mlt_minimum_should_match=70%"
+"--args "mlt_min_doc_freq=3&mlt_max_query_terms=100&mlt_minimum_should_match=25%"
 ```

 ## Performance
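
The corrected `mlt_*` defaults map directly onto the parameters of an Elasticsearch `more_like_this` query. The sketch below shows one way to combine the documented defaults with overrides from `--args`; `build_mlt_query` and `MLT_DEFAULTS` are hypothetical names, and the field list is copied from the query example in `docs/more_like_this.md`, not from the library's code.

```python
from typing import Any, Optional

# Defaults as documented after this commit.
MLT_DEFAULTS: dict[str, Any] = {
    "min_term_freq": 1,
    "max_query_terms": 200,
    "min_doc_freq": 1,
    "min_word_length": 5,
    "max_doc_freq": 500,
    "minimum_should_match": "10%",
}


def build_mlt_query(doc_id: str, args: Optional[dict[str, str]] = None) -> dict[str, Any]:
    """Build a `more_like_this` clause, applying any `mlt_*` overrides (hypothetical helper)."""
    params = dict(MLT_DEFAULTS)
    for name, default in MLT_DEFAULTS.items():
        raw = (args or {}).get(f"mlt_{name}")
        if raw is not None:
            # Integer parameters arrive as strings on the command line.
            params[name] = int(raw) if isinstance(default, int) else raw
    return {
        "more_like_this": {
            "fields": ["content", "text", "name"],
            "like": [{"_id": doc_id}],
            **params,
        }
    }


overrides = {
    "mlt_min_doc_freq": "3",
    "mlt_max_query_terms": "100",
    "mlt_minimum_should_match": "25%",
}
print(build_mlt_query("doc-123", overrides))
```

Parameters that are not overridden keep the documented defaults, so the example changes three values and leaves `min_word_length` and `max_doc_freq` at `5` and `500`.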
