Skip to content

Commit cb8700c

Browse files
committed
Add new proposal
1 parent fc678c8 commit cb8700c

File tree

1 file changed

+109
-0
lines changed

1 file changed

+109
-0
lines changed
Lines changed: 109 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,109 @@
1+
# Proposal: Reliable Loki pipelines
2+
3+
* Author: Karl Persson (@kalleep)
4+
* Last updated: 2025-11-26
5+
* Discussion link: TODO
6+
7+
## Abstract
8+
9+
Alloy's Loki pipelines currently use channels, which limits throughput due to head-of-line blocking and can causes silent log drops during config reloads or shutdowns.
10+
11+
This proposal introduces a function-based pipeline using a `Consumer` interface, replacing the channel-based design. Source components will call `Consume()` directly on downstream components, enabling parallel processing and returning errors that sources can use for retry logic or proper HTTP error responses.
12+
13+
## Problem
14+
15+
Loki pipelines in Alloy are built using (unbuffered) channels, a design inherited from promtail.
16+
17+
This comes with two big limitations:
18+
1. Throughput of each component is limited due to head-of-line blocking, where pushing to the next channel may not be possible in the presence of a slow component. An example of this is usage of [secretfilter](https://github.com/grafana/alloy/issues/3694).
19+
2. Because there is no way to signal back to the source, components can silently drop logs during config reload or shutdown and there is no to detect that.
20+
21+
Consider the following simplified config:
22+
```
23+
loki.source.file "logs" {
24+
targets = targets
25+
forward_to = [loki.process.logs.receiver]
26+
}
27+
28+
loki.process "logs" {
29+
forward_to = [loki.write.loki.receiver]
30+
}
31+
32+
loki.write "loki" {}
33+
```
34+
35+
`loki.source.file` will tail all files from targets and compete to send on the channel exposed by `loki.process`. Only one entry will be processed by each stage configured in `loki.process`. If a reload happens or if alloy is shutting down logs could be silently dropped.
36+
37+
## Proposal 0: Do nothing
38+
This architecture works in most cases, it will be hard to use slow components such as `secretfilter` because a lot of the time it's too slow.
39+
It's also hard to use Alloy as a gateway for loki pipelines with e.g. `loki.source.api` due to the limitations listed above.
40+
41+
## Proposal 1: Chain function calls
42+
43+
Loki pipelines are the only one using channels for passing data between components. Prometheus, Pyroscope and otelcol are all using this pattern where each component just calls functions on the next.
44+
45+
They all have slightly different interfaces but basically work the same. Each component exports its own interface like Appender for Prometheus or Consumer for Otel.
46+
47+
We could adopt the same pattern for loki pipelines as well with the following interface:
48+
49+
```go
50+
type Consumer interface {
51+
Consume(ctx context.Context, entries []Entry) error
52+
}
53+
```
54+
55+
Adopting this pattern for loki pipelines would change it from a channel based pipeline to a function based pipeline. This would give us two things:
56+
1. Increased throughput because several sources such as many files or http requests can now call the next component in the pipeline at the same time.
57+
2. A way to return signals back to the source so we can handle things like giving a proper error response or determine if the position file should be updated.
58+
59+
Solving the issues listed above.
60+
61+
A batch of entries should be considered successfully consumed when they are queued up for sending. We could try to extend this to when it was successfully sent over the wire, but that could be considered an improvement at a later stage.
62+
63+
### Handling fan-out failures
64+
65+
Because a pipeline can "fan-out" to multiple paths, it can also partially succeed. We need to determine how to handle this.
66+
67+
Two options to handle this:
68+
* Always retry if one or more failed - This could lead to duplicated logs but is easy and safe to implement. This is also how otelcol works.
69+
* When using `loki.source.api`, we would return a 5xx error so the caller can retry.
70+
* When using `loki.source.file`, we would retry the same batch again.
71+
* Configuration option `min_success` - Only retry if we don't succeed on at least the configured number of destinations.
72+
73+
### Affected components
74+
75+
The following components need to be updated with this new interface and we need to make sure they are concurrency safe:
76+
77+
**Source components** (need to call `Consume()` and handle errors):
78+
- `loki.source.file`
79+
- `loki.source.api`
80+
- `loki.source.kafka`
81+
- `loki.source.journal`
82+
- `loki.source.docker`
83+
- `loki.source.kubernetes`
84+
- `loki.source.kubernetes_events`
85+
- `loki.source.podlogs`
86+
- `loki.source.syslog`
87+
- `loki.source.gelf`
88+
- `loki.source.cloudflare`
89+
- `loki.source.gcplog`
90+
- `loki.source.heroku`
91+
- `loki.source.azure_event_hubs`
92+
- `loki.source.aws_firehose`
93+
- `loki.source.windowsevent`
94+
95+
**Processing components** (need to implement `Consumer` and forward to next):
96+
- `loki.process`
97+
- `loki.relabel`
98+
- `loki.secretfilter`
99+
- `loki.enrich`
100+
101+
**Sink components** (need to implement `Consumer`):
102+
- `loki.write`
103+
- `loki.echo`
104+
105+
Pros:
106+
* Increase throughput of log pipelines.
107+
* A way to signal back to the source
108+
Cons:
109+
* We need to rewrite all loki components with this new pattern and make them safe to call in parallel.

0 commit comments

Comments
 (0)