Hello all,

I see that moving log data around is the primary use case for Vector, but I've been wondering if it could also be used to ingest clickstream events into Kafka, and then from Kafka into ClickHouse. The pipeline would roughly be:

clients -> load balancer -> Vector (`http_server` source -> `kafka` sink) -> Kafka -> Vector (`kafka` source -> `clickhouse` sink) -> ClickHouse
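A minimal sketch of what the two legs could look like in Vector's TOML config. The component names, broker address, topic, and ClickHouse database/table below are illustrative placeholders, not settings from an actual deployment:

```toml
# Leg 1: accept clickstream events over HTTP and publish them to Kafka.
[sources.clickstream_http]
type = "http_server"
address = "0.0.0.0:3010"            # same port as the stress test below
decoding.codec = "json"             # parse each POST body as JSON

[sinks.to_kafka]
type = "kafka"
inputs = ["clickstream_http"]
bootstrap_servers = "kafka-1:9092"  # illustrative broker address
topic = "clickstream"               # illustrative topic name
encoding.codec = "json"

# Leg 2 (typically a separate Vector deployment): consume from Kafka
# and write batches to ClickHouse.
[sources.from_kafka]
type = "kafka"
bootstrap_servers = "kafka-1:9092"
group_id = "vector-clickhouse"      # illustrative consumer group
topics = ["clickstream"]
decoding.codec = "json"

[sinks.to_clickhouse]
type = "clickhouse"
inputs = ["from_kafka"]
endpoint = "http://clickhouse:8123" # illustrative ClickHouse HTTP endpoint
database = "analytics"
table = "clickstream_events"
```

Splitting the two legs across separate Vector deployments would let the HTTP ingestion tier scale independently of the ClickHouse writer, with Kafka buffering between them.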
The `http_server` endpoint wouldn't be publicly exposed directly. In a production setup we'd have a load balancer in front of the Vector processes and some sort of WAF in front of everything for security.
We're trying to avoid building and maintaining our own HTTP-to-Kafka solution, which is why we're interested in Vector. My main questions:
1. Has anyone experimented with or deployed something like this with Vector and could share their experience?
2. Is this a "bad" use of the `http_server` source? I can't seem to find a reason why it would be.
I also ran some quick stress tests with `hey` using the following settings:

```sh
# 100K requests, 250 connections, HTTP/2 enabled, posting a single log entry as JSON.
$ hey -n 100000 -c 250 -h2 -m POST -D fake_log.txt -T application/json http://127.0.0.1:3010/
```
Performance was great, but a few hundred requests came back non-200 with the error below, and increasing concurrency produces more of them. Could Vector be hitting an open-connection limit? I couldn't find a way to inspect or tune this in the source settings.
```
[1] Post "http://127.0.0.1:3010/": read tcp 127.0.0.1:61797->127.0.0.1:3010: read: connection reset by peer
```
Edit 1: I tested with only the `blackhole` sink and got the same results.
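For reference, that isolation test amounts to pointing the source at a `blackhole` sink, something like the snippet below (reusing the hypothetical `clickstream_http` source from the sketch above; the sink name and interval are illustrative):

```toml
# Isolation test: discard events instead of sending them to Kafka,
# to rule the kafka sink out as the cause of the connection resets.
[sinks.devnull]
type = "blackhole"
inputs = ["clickstream_http"]
print_interval_secs = 10  # periodically log how many events were discarded
```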
Edit 2: I ran `hey` from a container on the same Docker network and saw no issues even with 500 connections. The resets were probably an artifact of pushing that many connections from the host into the container (likely the host-to-container port-forwarding path) rather than anything in Vector.