Commit b0d5e37

[DOCS] Adds further details and an example to how transform checkpointing works (#71615) (#71817)
1 parent f627628 commit b0d5e37

1 file changed

+35
-11
lines changed

docs/reference/transform/checkpoints.asciidoc

@@ -10,7 +10,8 @@ destination index, it generates a _checkpoint_.
 
 If your {transform} runs only once, there is logically only one checkpoint. If
 your {transform} runs continuously, however, it creates checkpoints as it
-ingests and transforms new source data.
+ingests and transforms new source data. The `sync` property of the {transform}
+configures checkpointing by specifying a time field.
 
 To create a checkpoint, the {ctransform}:
 
@@ -22,21 +23,25 @@ indices. This check is done based on the interval defined in the transform's
 +
 If the source indices remain unchanged or if a checkpoint is already in progress
 then it waits for the next timer.
++
+If changes are found, a checkpoint is created.
 
-. Identifies which entities have changed.
+. Identifies which entities and/or time buckets have changed.
 +
-The {transform} searches to see which entities have changed since the last time
-it checked. The `sync` configuration object in the {transform} identifies a time
-field in the source indices. The {transform} uses the values in that field to
-synchronize the source and destination indices.
+The {transform} searches to see which entities or time buckets have changed
+between the last and the new checkpoint. The {transform} uses the values to
+synchronize the source and destination indices with fewer operations than a
+full re-run.
 
-. Updates the destination index (the {dataframe}) with the changed entities.
+. Updates the destination index (the {dataframe}) with the changes.
 +
 --
-The {transform} applies changes related to either new or changed entities to the
-destination index. The set of changed entities is paginated. For each page, the
-{transform} performs a composite aggregation using a `terms` query. After all
-the pages of changes have been applied, the checkpoint is complete.
+The {transform} applies changes related to either new or changed entities or
+time buckets to the destination index. The set of changes can be paginated. The
+{transform} performs a composite aggregation similarly to the batch {transform}
+operation; however, it also injects query filters based on the previous step to
+reduce the amount of work. After all changes have been applied, the checkpoint is
+complete.
 --
 
 This checkpoint process involves both search and indexing activity on the
@@ -49,6 +54,25 @@ support both the composite aggregation search and the indexing of its results.
 TIP: If the cluster experiences unsuitable performance degradation due to the
 {transform}, stop the {transform} and refer to <<transform-performance>>.
 
+
+[discrete]
+[[ml-transform-checkpoint-heuristics]]
+== Change detection heuristics
+
+When the {transform} runs in continuous mode, it updates the documents in the
+destination index as new data comes in. The {transform} uses a set of heuristics
+called change detection to update the destination index with fewer operations.
+
+In this example, the data is grouped by host names. Change detection detects
+which host names have changed, for example, hosts `A`, `C`, and `G`, and only
+updates documents with those hosts but does not update documents that store
+information about hosts `B`, `D`, or any other hosts that have not changed.
+
+Another heuristic can be applied for time buckets when a `date_histogram` is
+used to group by time buckets. Change detection detects which time buckets have
+changed and only updates those.
+
+
 [discrete]
 [[ml-transform-checkpoint-errors]]
 == Error handling
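
The change detection heuristics described in the added section can be illustrated with a small Python sketch. This is not the Elasticsearch implementation: the function names, the in-memory `docs` list, the integer timestamps, and the exclusive lower and inclusive upper checkpoint bounds are all assumptions made for illustration. It only shows the idea of finding which hosts or time buckets received new data between two checkpoints and recomputing just those.

```python
def changed_entities(docs, last_checkpoint, new_checkpoint, key="host"):
    """Group-by keys that received documents between the two checkpoints."""
    return {
        doc[key]
        for doc in docs
        if last_checkpoint < doc["timestamp"] <= new_checkpoint
    }


def changed_time_buckets(docs, last_checkpoint, new_checkpoint, interval):
    """Start times of fixed-width buckets that received documents
    between the two checkpoints (a stand-in for a date_histogram group-by)."""
    return {
        doc["timestamp"] // interval * interval
        for doc in docs
        if last_checkpoint < doc["timestamp"] <= new_checkpoint
    }


def update_destination(destination, docs, hosts):
    """Recompute the per-host document count only for the given hosts."""
    for host in hosts:
        destination[host] = sum(1 for doc in docs if doc["host"] == host)
    return destination


# Hypothetical source data: timestamps are plain integers (seconds).
docs = [
    {"host": "A", "timestamp": 50},
    {"host": "B", "timestamp": 60},
    {"host": "D", "timestamp": 70},
    {"host": "A", "timestamp": 150},
    {"host": "C", "timestamp": 180},
    {"host": "G", "timestamp": 190},
]

# State of the destination after the previous checkpoint (at time 100).
destination = {"A": 1, "B": 1, "D": 1}

hosts = changed_entities(docs, last_checkpoint=100, new_checkpoint=200)
buckets = changed_time_buckets(docs, 100, 200, interval=60)
update_destination(destination, docs, hosts)
```

With this sample data, only hosts `A`, `C`, and `G` are recomputed, while the entries for `B` and `D` are left untouched, mirroring the host example in the added documentation.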
