
JVM/Cassandra metrics stopped flowing after some time #861

@junhuangli

Description

  • Everything works fine at first, but after some time (10 hours to 10 days, depending on the collection_interval) the JVM/Cassandra metrics (for example cassandra.client.request.range_slice.latency.99p) stop, while all of the collector's internal metrics (for example otelcol_process_uptime) keep flowing.
  • Manually running a second collector while the first one is in this "error" state works: the second collector reports JVM/Cassandra metrics even though both collectors run in the same Docker container (see the command sketch below).
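
The second collector is started by hand with the same configuration file. A minimal sketch of that invocation, assuming an otelcol binary and config path under /refinery (the actual names in our deployment differ):

# illustrative only: binary name and config path depend on the deployment
/refinery/otelcol --config /refinery/config.yaml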

Steps to reproduce
Deploy and then wait

Expectation
JVM/Cassandra metrics continue flowing.

What applicable config did you use?

---
receivers:
  jmx:
    jar_path: "/refinery/opentelemetry-jmx-metrics.jar"
    endpoint: localhost:7199
    target_system: cassandra,jvm
    collection_interval: 3s
    log_level: debug

  prometheus/internal:
    config:
      scrape_configs:
        - job_name: 'refinery-internal-metrics'
          scrape_interval: 10s
          static_configs:
            - targets: [ 'localhost:8888' ]
          metric_relabel_configs:
            - source_labels: [ __name__ ]
              regex: '.*grpc_io.*'
              action: drop

exporters:
  myexporter:
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    myexporter:
      host: "myexporter.net"
      port: "9443"
      enable_mtls: true
      root_path: /etc/identity
      repo_dir_path: /etc/identity/client
      service_name: client
      gzip: true
 
processors:
  netmetadata:
    metrics:
      scopes: 
        service: refinery_tested
        subservice: "cassandra"
      
      tags:
        version: "1"
        k8s_pod_name: "test-cass-alrt-eap-c02-0"
        k8s_namespace: "dva-system"
        k8s_cluster: "collection-monitoring"
        device: "ip-10-11-11-11.us-west-2.compute.internal"
        substrate: "aws"
        account: "00000"
        region: "unknown"
        zone: "us-west-2b"
        falcon_instance: "dev1-uswest2"
        functional_domain: "monitoring"
        functional_domain_instance: "monitoring"
        environment: "dev1"
        environment_type: "dev"
        cell: "c02"
        service_name: "test-cass-alrt-eap"
        service_group: "test-shared"
        service_instance: "test-cass-alrt-eap-c02"
  memory_limiter/with-settings:
    check_interval: 1s
    limit_mib: 2000
    spike_limit_mib: 400
    limit_percentage: 0
    spike_limit_percentage: 0

  batch:
    timeout: 5s
    send_batch_size: 8192
    send_batch_max_size: 0

service:
  extensions: []
  telemetry:
    logs:
      development: false
      level: debug
    metrics:
      level: detailed
      address: localhost:8888
  pipelines:
    metrics:
      receivers: ["jmx"]
      processors: [memory_limiter/with-settings, batch, netmetadata]
      exporters: [myexporter]
    metrics/internal:
      receivers: ["prometheus/internal"]
      processors: [memory_limiter/with-settings, batch, netmetadata]
      exporters: [myexporter]
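
A stripped-down variant of the config above could be used as the second collector to confirm that only the jmx pipeline stalls. This is a sketch only: the exporter settings are abbreviated, and the telemetry address is moved off 8888 so both collectors can run in the same container.

receivers:
  jmx:
    jar_path: "/refinery/opentelemetry-jmx-metrics.jar"
    endpoint: localhost:7199
    target_system: cassandra,jvm
    collection_interval: 3s
    log_level: debug

exporters:
  myexporter:
    # same myexporter settings as in the full config above

service:
  telemetry:
    metrics:
      level: detailed
      address: localhost:8889   # 8888 is already used by the first collector
  pipelines:
    metrics:
      receivers: [jmx]
      processors: []
      exporters: [myexporter]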

Relevant Environment Information
NAME="CentOS Linux" VERSION="7 (Core)" ID="centos"

Additional context
