@bingquanzhao bingquanzhao commented May 28, 2025

Summary

This PR introduces a new Apache Doris sink for Vector, enabling users to send log data directly to Apache Doris databases using the Stream Load API. The implementation includes:

  • Complete Doris sink implementation with Stream Load API integration
  • Comprehensive configuration options (endpoints, authentication, batching, custom headers)
  • Full documentation generation using CUE
  • Health check functionality with proper error handling
  • Support for Doris-specific Stream Load parameters via custom HTTP headers

Apache Doris is a modern MPP analytical database that provides sub-second query response times on large datasets, making it ideal for real-time data warehouses and log analysis scenarios.

Change Type

  • New feature
  • Bug fix
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

How did you test this PR?

Local Testing

  1. Unit Tests: All unit tests pass with cargo test
  2. Configuration Validation: Verified config parsing with vector validate
  3. Documentation Generation: Successfully generated docs with make generate-component-docs
  4. CUE Validation: All CUE files pass format and validation checks
  5. Changelog Validation: Changelog fragment passes validation with ./scripts/check_changelog_fragments.sh

Test Configuration Used

sources:
  demo:
    type: demo_logs
    format: json
    interval: 1

sinks:
  doris:
    type: doris
    inputs: ["demo"]
    
    # Target configuration
    endpoints: 
      - "http://doris-fe1:8030"
      - "http://doris-fe2:8030"
    database: "analytics_db"
    table: "user_events"
    
    # Authentication configuration
    auth:
      strategy: basic
      user: "admin"
      password: "admin123"
    
    # Batch configuration
    batch:
      max_events: 100000        # Maximum events per batch
      timeout_secs: 30          # Batch timeout in seconds
      max_bytes: 1073741824     # Maximum bytes per batch (1 GiB)
    
    # Custom HTTP headers for Doris Stream Load
    headers:
      format: "json"
      strip_outer_array: "false"
      read_json_by_line: "true"
    
    # Additional configuration
    label_prefix: "vector"
    log_request: true
    log_progress_interval: 10
    buffer_bound: 1
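
For readers unfamiliar with Stream Load, the sketch below shows roughly what one batch from the configuration above would look like as an HTTP request. It is written in Python for brevity (the sink itself is Rust) and only builds the request without sending it. The `PUT /api/{db}/{table}/_stream_load` path, basic auth, and `Expect: 100-continue` follow the Doris Stream Load protocol; the function name and the label layout are assumptions made for illustration and may not match what the sink actually generates.

```python
import base64
import json
import uuid


def build_stream_load_request(endpoint, database, table, user, password,
                              events, headers=None, label_prefix="vector"):
    """Build (but do not send) a Stream Load request for one batch of events."""
    # Stream Load is an HTTP PUT against the FE at /api/{db}/{table}/_stream_load.
    url = f"{endpoint.rstrip('/')}/api/{database}/{table}/_stream_load"
    # With read_json_by_line=true the body is one JSON object per line.
    body = "\n".join(json.dumps(event) for event in events).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    all_headers = {
        "Authorization": f"Basic {token}",
        "Expect": "100-continue",
        # Hypothetical label layout; the sink's real label format may differ.
        "label": f"{label_prefix}_{database}_{table}_{uuid.uuid4().hex}",
    }
    # Custom headers from the `headers` option are passed through as-is.
    all_headers.update(headers or {})
    return {"method": "PUT", "url": url, "headers": all_headers, "body": body}


request = build_stream_load_request(
    "http://doris-fe1:8030", "analytics_db", "user_events",
    "admin", "admin123",
    events=[{"message": "hello"}, {"message": "world"}],
    headers={"format": "json", "read_json_by_line": "true"},
)
print(request["method"], request["url"])
```

Keeping the custom headers as a plain pass-through is what lets Doris-specific parameters such as `columns` be set without the sink knowing about each one.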

Environment Setup

  • Tested configuration validation against Vector's validation system
  • Verified health check functionality (attempts connection to configured endpoints)
  • All documentation generation and validation checks pass
  • CUE v0.7.0 used for documentation generation

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the "no-changelog" label to this PR.

Notes

Implementation Details

  • Stream Load API: Uses Doris's native Stream Load API for optimal performance and compatibility
  • Authentication: Supports basic authentication with username/password
  • Batching: Configurable batching with event count, byte size, and timeout limits
  • Custom Headers: Support for Doris-specific Stream Load parameters via HTTP headers including:
    • format: Data format specification (json, csv, etc.)
    • read_json_by_line: JSON line-by-line reading mode
    • strip_outer_array: Array handling configuration
    • columns: Column mapping specification
  • Error Handling: Comprehensive error handling with configurable retry logic
  • Health Checks: Validates connectivity and basic authentication
  • Rate Limiting: Built-in rate limiting and adaptive concurrency control

Documentation

  • Added complete CUE documentation for the sink configuration
  • Generated reference documentation automatically using Vector's documentation system
  • Updated service definitions and URL references
  • All documentation validation checks pass (CI=true make check-docs)

Dependencies

  • No new external dependencies added
  • Uses existing Vector HTTP client infrastructure
  • Leverages standard Vector authentication, batching, and request frameworks
  • Follows Vector's established patterns for sink implementation

Code Quality

  • All code formatted with cargo fmt
  • Follows Vector's coding standards and patterns
  • Proper error handling and logging throughout
  • Comprehensive configuration validation

Testing Strategy

  • Configuration validation ensures all options are properly parsed
  • Health check functionality verified through connection attempts
  • Documentation generation confirms all metadata is correctly defined
  • Follows Vector's established testing patterns for sinks

References

@bingquanzhao bingquanzhao requested review from a team as code owners May 28, 2025 16:18
@bits-bot

bits-bot commented May 28, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the domain: sinks, domain: ci, and domain: external docs labels May 28, 2025
@drichards-87
Contributor

Created Jira card for Docs Team review.

Contributor

@maycmlee maycmlee left a comment

Some small suggestions

Contributor

@maycmlee maycmlee left a comment

👍 for docs

@pront
Member

pront commented Jun 25, 2025

Hi @bingquanzhao, thank you for this PR. Please rebase on master and fix merge conflicts. There are 12k affected lines right now.

@pront pront added the meta: awaiting author Pull requests that are awaiting their author. label Jun 25, 2025
@github-actions github-actions bot removed the meta: awaiting author Pull requests that are awaiting their author. label Jul 9, 2025
@mozhu1024

Is this PR process still ongoing?

@pront pront mentioned this pull request Dec 3, 2025
@freejool

freejool commented Dec 5, 2025

@bingquanzhao Hi, thanks for your work!

However, I got some warnings when consuming messages from Kafka and loading them into Doris with your version (bingquanzhao@be278bb), although the messages were written successfully (integrity not verified).

2025-12-05T05:52:57.535661Z WARN sink{component_kind="sink" component_id=doris_sink component_type=doris}: vector_buffers::buffer_usage_data: Buffer counter underflowed. Clamping value to 0. current=2467 delta=2476

And here is my config:

data_dir: /tmp/vector/data
sources:
  kafka_source:
    type: kafka
    bootstrap_servers: 100.88.1.4:9092
    group_id: vector_consumer111111111
    topics:
      - test_topic_1
    auto_offset_reset: earliest
    # Parse JSON messages
    decoding:
      codec: bytes
sinks:
  doris_sink:
    type: doris
    inputs:
      - kafka_source
    endpoints:
      - http://100.88.1.4:8030
    database: test
    table: "{{ message_key }}"
    auth:
      strategy: basic
      user: root
      password: "root@1234"
    encoding:
      codec: "text"
      only_fields: ["message"]
    # Enable batching to improve performance
    batch:
      max_events: 1000
      timeout_secs: 1
    headers:
      format: "json"
      strip_outer_array: "false"
      read_json_by_line: "true"
    # Configure requests
    request:
      concurrency: 1
      rate_limit_duration_secs: 1
      rate_limit_num: 100
    # Configure retries
    acknowledgements:
      enabled: false
    log_request: true

Please ask me for any other information you need!

@bingquanzhao
Author

This warning is benign and does not affect data integrity.
It is caused by interior mutability in Vector's LogEvent structure, where the internal size cache updates during processing. This results in a slight mismatch between the event size recorded when entering versus leaving the buffer, triggering the underflow.
This is a generic Vector accounting artifact and is unrelated to the Doris sink. It is safe to ignore.
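
The arithmetic behind the warning can be illustrated with a small sketch. This is not Vector's code, only the clamping behaviour the log line reports: the counter holds the size recorded on entry (`current=2467`), the size measured on exit is slightly larger (`delta=2476`), and subtracting would take an unsigned counter below zero. The function name is hypothetical.

```python
def decrement_clamped(current, delta):
    """Subtract delta from an unsigned counter, clamping the result at zero.

    Returns the new counter value and whether an underflow was detected.
    """
    underflowed = delta > current
    new_value = 0 if underflowed else current - delta
    return new_value, underflowed


# Values taken from the warning in this thread.
value, underflowed = decrement_clamped(current=2467, delta=2476)
print(value, underflowed)
```

Only the buffer usage metric is affected by the clamp; the events themselves are passed through unchanged.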

Contributor

@thomasqueirozb thomasqueirozb left a comment

Sorry for the delay! It was a lot of code 😁

@thomasqueirozb thomasqueirozb added the meta: awaiting author Pull requests that are awaiting their author. label Dec 10, 2025
@bingquanzhao
Author

Thank you so much for your review. I will adjust the code according to your review suggestions as quickly as possible.

@github-actions github-actions bot removed the meta: awaiting author Pull requests that are awaiting their author. label Dec 11, 2025
@bingquanzhao
Author

bingquanzhao commented Dec 22, 2025

Hi @thomasqueirozb, thanks for the review! I've addressed your comments. Let me know if there's anything else I should update.

@thomasqueirozb thomasqueirozb added the sink: new Request or implementation of a new sink label Dec 22, 2025
@github-actions github-actions bot added the domain: ci Anything related to Vector's CI environment label Dec 22, 2025
Contributor

@thomasqueirozb thomasqueirozb left a comment

Thanks! Should be good to go realistically. Would only like to talk about the string uri and I'll commit the other required changes myself

@thomasqueirozb
Contributor

It actually looks like integration tests are failing when I run ./scripts/run-integration-tests int doris due to a DB auth issue. I partially fixed the integration tests but didn't debug much further after I hit this issue

@thomasqueirozb thomasqueirozb added the meta: awaiting author Pull requests that are awaiting their author. label Dec 22, 2025
@github-actions github-actions bot removed the meta: awaiting author Pull requests that are awaiting their author. label Dec 25, 2025
@bingquanzhao
Author

Hi @thomasqueirozb, I've addressed the type issue with base_url and fixed the integration test issues.

@freejool

@bingquanzhao Hi, I added the header group_commit: async_mode to my config and got this error from Doris: "label and group_commit can't be set at the same time". Could you provide a switch to disable label generation? (This would trade some integrity guarantees for better efficiency.)

@bingquanzhao
Author

I will add a check: when the group_commit header is set, the sink will not set the label.
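
A minimal sketch of such a check, assuming the label is attached as a `label` header alongside the user-supplied headers, could look like the following. The function name is hypothetical, and treating any `group_commit` value as a reason to omit the label is an assumption; the actual change in the sink may be more selective.

```python
def stream_load_headers(custom_headers, label):
    """Merge user headers with the generated label, unless group_commit is set."""
    headers = dict(custom_headers)  # copy so the caller's dict is not modified
    # HTTP header names are case-insensitive, so compare in lower case.
    has_group_commit = any(name.lower() == "group_commit" for name in headers)
    if not has_group_commit:
        headers["label"] = label
    return headers


with_group_commit = stream_load_headers(
    {"format": "json", "group_commit": "async_mode"}, label="vector_example"
)
print("label" in with_group_commit)
```

Deriving the behaviour from the existing `headers` option avoids adding a separate switch to the sink configuration.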


Labels

domain: ci, domain: external docs, domain: sinks, editorial review, sink: new

9 participants