Can someone help me grasp pipelines? #6336
-
So, now that I can tell Vector to use a pipeline, I thought I'd do so. I've been working from the docs, but I'm obviously missing something. My goal is to take postfix messages from the journald output and parse them into fields like the postfix queueid, to, from, etc. I thought that the way to do that was a dispatcher pipeline:

```yaml
dispatcher:
  field: SYSLOG_IDENTIFIER
  rules:
    - value: 'postfix/smtp'
      table_suffix: postfix
      pipeline: process_postfix

transform:
  - fields:
      - SYSLOG_IDENTIFIER
    type: string
    index: fulltext
```

A bunch of attempts later, and I currently have a manually created journald_postfix table that is empty, and in the journald table, the SYSLOG_IDENTIFIER=postfix/smtp rows are all completely empty except for the SYSLOG_IDENTIFIER and greptime_timestamp columns. This is after thinking the _postfix table would be created automatically, creating the table anyway, and trying to manually tell the pipelines to include all the non-"message" columns from the original journald data. Unfortunately, I'm not seeing anything in the greptime logs that gives me any clues, and I haven't found anything in the docs either. I'm not even really sure what to search on at this point... So here are some questions:

- Is a transform actually required? The dispatcher example in the docs does not have a transform, but the pipeline editor requires one. Which is correct?
- How does dispatching work? I thought my config above would look at the SYSLOG_IDENTIFIER column and, if it had a value of 'postfix/smtp', pass the entire log entry on to the process_postfix pipeline. The process_postfix pipeline would then split the message field out into the columns I want, and the data would be stored in a new table with the suffix from the dispatcher config. I also thought that the new table would be created automatically based on the transform config in the process_postfix pipeline, plus the original journald data. (See the sketch at the end of this post for what I imagined process_postfix would contain.)
- Does the dispatcher value field accept wildcards? I originally wanted to capture all logs that come from a postfix SYSLOG_IDENTIFIER, since postfix has several.
- Is there a "drop row" processor/transform? There are some log entries that get repeated all the time. I'd like to detect them and just drop them in my pipelines. Is that possible? I haven't found a decent way yet...
- Is there an easy way to get example logs into the pipeline dashboard? With as many columns as the journald table has, I'd love to be able to just click somewhere and get a copy of a row in the proper format for pipeline testing. As it is, I have only been creating test data out of the message column, since manually turning all 80+ columns into a JSON field is not exactly fun. I did try exporting as CSV, but the file had no header row.
- How can I create a dispatcher based on multiple fields? Is this just not possible? I'm not seeing any way the current syntax could support that.

Ok, that's plenty for one post. Any tips would be appreciated!
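For concreteness, here is roughly what I imagined the process_postfix pipeline would need to contain. This is only a sketch: the dissect pattern is a guess based on a typical postfix/smtp delivery line, and the field names (queueid, to, relay, status, etc.) are my own, not anything I have working.

```yaml
# Sketch of process_postfix. The dissect pattern is a guess for a typical
# "queueid: to=<...>, relay=..., delay=..., delays=..., dsn=..., status=sent (...)"
# line; adjust it to the actual postfix log format.
processors:
  - dissect:
      fields:
        - message
      patterns:
        - '%{queueid}: to=<%{to}>, relay=%{relay}, delay=%{delay}, delays=%{delays}, dsn=%{dsn}, status=%{status} %{status_detail}'
      ignore_missing: true

transform:
  - fields:
      - queueid
      - to
      - relay
      - status
    type: string
  - fields:
      - message
    type: string
    index: fulltext
```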
-
Hi, thanks for reaching out!
To reproduce the case, can you upload the vector config file along with all the pipeline config files? You can pack them into a compressed file.
Also, are you using v0.14 or the main branch?
I suspect the output of Vector may not be a straightforward JSON document; you can use the `console` sink of Vector to view the output data locally. We use the sample data from here, and it works under the minimal config.
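For example, a minimal Vector config for inspecting journald output locally might look like the sketch below. The source and sink names are placeholders; adjust the journald source options for your environment.

```yaml
# Minimal sketch: dump journald events to stdout as JSON so you can see
# exactly what Vector sends to the pipeline. Names are illustrative.
sources:
  journald_in:
    type: journald

sinks:
  debug_out:
    type: console
    inputs: ["journald_in"]
    encoding:
      codec: json
```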
Is a transform actually required?
In v0.14, yes, a transform is still required. In the main branch, we've added support for auto-transform, which allows the pipeline engine to infer the data type of the input data, much like the behaviour …
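For reference, a minimal v0.14 transform section can be quite small, along these lines (the field names here are just the ones from your journald example, not required names):

```yaml
# Minimal transform sketch for v0.14: each column you want stored in the
# table must be declared explicitly. Field names are illustrative.
transform:
  - fields:
      - SYSLOG_IDENTIFIER
      - message
    type: string
```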