
Commit 255f232

docs(log-ingestor): Add user docs about fault tolerance guarantees. (y-scope#2057)
Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
1 parent bde6bb3 commit 255f232

File tree: 1 file changed (+19, −5 lines)

docs/src/user-docs/guides-using-log-ingestor.md

Lines changed: 19 additions & 5 deletions
```diff
@@ -51,12 +51,26 @@ will be routed through CLP's [API server](./guides-using-the-api-server.md) in a
 
 ### Fault tolerance
 
-:::{warning}
-**The current version of `log-ingestor` does not provide fault tolerance.**
+`log-ingestor` is designed to tolerate unexpected crashes or restarts without losing information
+about ingestion jobs or the files that have been submitted for compression. Note that this does not
+include fault tolerance of components external to `log-ingestor`. Specifically, `log-ingestor`
+guarantees the following, even in the presence of crashes or restarts of `log-ingestor`:
 
-If `log-ingestor` crashes or is restarted, all in-progress ingestion jobs and their associated state
-will be lost, and must be restored manually. Robust fault tolerance for the ingestion pipeline is
-planned for a future release.
+* Any ingestion job successfully submitted to `log-ingestor` will run continuously.
+* Within an ingestion job, any files that have been found on S3 or received as messages from an SQS
+  queue will eventually be submitted for compression.
+
+:::{note}
+`log-ingestor` **DOES NOT** guarantee the following after a crash or restart:
+
+* Any file submitted for compression (that can be compressed successfully) will eventually be
+  compressed successfully.
+  * This is because failures of the compression cluster are external to `log-ingestor`. Future
+    versions of CLP will address this limitation.
+* Any file submitted for compression will *only* be compressed once.
+  * This is because for [SQS listener](#sqs-listener) ingestion jobs, the processes for deleting
+    messages from the SQS queue and recording the files for ingestion are not synchronized. As a
+    result, a failure during this process may cause the same file to be ingested multiple times.
 :::
 
 ---
```
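The note in the diff says duplicate ingestion can occur because recording a file for ingestion and deleting its SQS message are not synchronized. Below is a minimal, hypothetical Python sketch of that race (not `log-ingestor`'s actual code; the names `handle_message`, `ingested_files`, and `queue` are stand-ins): if the process crashes after recording the file but before deleting the message, SQS re-delivers the message after restart and the same file is recorded twice — i.e., at-least-once rather than exactly-once semantics.

```python
# Hypothetical sketch of the unsynchronized SQS-listener steps described in
# the diff. A crash between "record" and "delete" leads to duplicate ingestion.

ingested_files: list[str] = []               # stand-in for recorded ingestion state
queue: list[str] = ["s3://bucket/app.log"]   # stand-in for the SQS queue


def handle_message(msg: str, crash_before_delete: bool) -> None:
    ingested_files.append(msg)  # step 1: record the file for ingestion
    if crash_before_delete:
        return                  # simulated crash: the SQS message is never deleted
    queue.remove(msg)           # step 2: delete the message from the queue


handle_message(queue[0], crash_before_delete=True)
# After a restart, SQS re-delivers the message that was never deleted:
handle_message(queue[0], crash_before_delete=False)

print(ingested_files)  # the same file appears twice
```

Making the two steps atomic (e.g., with a transactional outbox or idempotent dedup keyed on the message) is one common way systems close this gap, which matches the diff's statement that future versions will address the limitation.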

0 commit comments
