[RMP] Improve Systems observability w/ Open Telemetry integration (logging, metrics?, traces?) #642

@EvenOldridge

Problem:

Production recommender systems require logging. Without logs, metrics, and traces, it's hard to know what your recommender is doing or to troubleshoot it.

Goal:

Answer questions like:

  • Why did my recommender take x ms to provide recs? (Timing / SLA)

  • Why is the quality of recs lower? (Metric logging)

  • Why are the recs for a specific user 'weird'?

  • Be able to provide probabilities of prediction for bandits.

  • Aggregate logged values

  • Slicing on aggregates

New Functionality

  • Log a structured dictionary/JSON blob per request

  • Systems
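
As a starting point for discussion, here is a minimal sketch of what a per-request structured log record could look like, using only the Python standard library. The field names, the `recsys.requests` logger name, and the helpers `build_request_record` and `log_request` are all hypothetical; nothing here is an existing Merlin Systems API.

```python
import json
import logging
import time
import uuid

# Hypothetical logger name; any handler (file, stdout, OTLP bridge) could be attached.
logger = logging.getLogger("recsys.requests")


def build_request_record(user_id, item_ids, scores, model_version, latency_ms):
    """Collect everything about one request into a JSON-serializable dict."""
    return {
        "request_id": str(uuid.uuid4()),  # lets us join logs for one request
        "timestamp": time.time(),         # seconds since the epoch
        "user_id": user_id,
        "item_ids": list(item_ids),       # what was shown (exposure)
        "scores": list(scores),           # model outputs for those items
        "model_version": model_version,
        "latency_ms": latency_ms,
    }


def log_request(record):
    """Emit the record as a single JSON line and return the serialized string."""
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line


record = build_request_record(
    user_id=42,
    item_ids=[101, 102, 103],
    scores=[0.9, 0.5, 0.1],
    model_version="v1",
    latency_ms=12.5,
)
line = log_request(record)
```

One JSON line per request keeps the blob easy to aggregate and slice later, whichever backend ends up storing it.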

Constraints:

The NV standard is to use OpenTelemetry, which provides most of the infrastructure.

Ability to record:

  • exposure logging
  • feature values used in prediction
  • model metrics
  • propensities for batch learning and bandit feedback
  • model versioning
  • dataset distributions? deviation from distributions?
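
To make the exposure and propensity items concrete, here is a rough sketch of logging the probability with which an item was chosen. It assumes a softmax policy over model scores purely for illustration; the actual serving policy may differ, and `softmax` and `sample_with_propensity` are hypothetical helpers.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_with_propensity(item_ids, scores, rng):
    """Sample one item and return an exposure record with its propensity."""
    probs = softmax(scores)
    idx = rng.choices(range(len(item_ids)), weights=probs, k=1)[0]
    return {
        "item_id": item_ids[idx],     # the item that was actually exposed
        "propensity": probs[idx],     # probability the policy assigned to it
        "candidates": list(item_ids), # what it was chosen from
    }


rng = random.Random(0)  # seeded only so the sketch is repeatable
exposure = sample_with_propensity([101, 102, 103], [2.0, 1.0, 0.5], rng)
```

Logging the propensity next to the exposed item is what allows off-policy estimators to reweight bandit feedback later.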

We should limit our work to exposing information from inside the Merlin Systems DAG. For example, a user can currently measure the latency of requests to Triton, but the ensemble is a "black box": we only know how long the entire thing takes to execute. With this work, we should expose how long each component of the DAG takes, so that someone can see, for example, how long TransformWorkflow, PredictPytorch, QueryFeast, etc. take within a single request.
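
A minimal sketch of per-component timing, assuming each DAG node can be treated as a named callable. `timed_node`, `run_dag`, and the tracer name are hypothetical and are not the Merlin Systems graph executor. If the `opentelemetry-api` package is installed, each node is also wrapped in a span via `tracer.start_as_current_span`; without a configured SDK those spans are no-ops, and without the package the sketch falls back to wall-clock timing only.

```python
import time
from contextlib import contextmanager, nullcontext

try:
    from opentelemetry import trace

    # Hypothetical instrumentation name.
    _tracer = trace.get_tracer("merlin.systems.sketch")
except ImportError:
    _tracer = None  # fall back to plain timing


@contextmanager
def timed_node(name, timings):
    """Time one node and, if OpenTelemetry is available, wrap it in a span."""
    span_cm = _tracer.start_as_current_span(name) if _tracer else nullcontext()
    start = time.perf_counter()
    with span_cm:
        try:
            yield
        finally:
            # Record elapsed milliseconds even if the node raises.
            timings[name] = (time.perf_counter() - start) * 1000.0


def run_dag(nodes, request):
    """Run (name, callable) pairs in order, returning the output and timings."""
    timings = {}
    value = request
    for name, fn in nodes:
        with timed_node(name, timings):
            value = fn(value)
    return value, timings


# Stand-in callables; real nodes would be operators such as TransformWorkflow.
nodes = [
    ("TransformWorkflow", lambda r: {**r, "transformed": True}),
    ("PredictPytorch", lambda r: {**r, "score": 0.5}),
]
result, timings = run_dag(nodes, {"user_id": 1})
```

With an SDK and exporter configured, the same spans would appear as children of the request trace, which is the per-component breakdown described above.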

Starting Point:

  • Figure out the points where we need to capture information
  • Decide on what information will be captured
  • Hackathon project instruments the graph executor to capture timing information

Provide examples that demonstrate how to use OpenTelemetry to handle logs, metrics, and traces:

  • Timing / SLA monitoring
  • Aggregation of metrics (offline vs online, detection of problems in your pipeline)
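
For the aggregation example, something along these lines might be enough to show the idea. It is a standard-library sketch that groups logged records by one field and summarizes another; `aggregate_by_slice` and the record fields are hypothetical, and a real setup would more likely do this in the metrics backend.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_slice(records, value_key, slice_key):
    """Group records by slice_key and summarize value_key within each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[slice_key]].append(record[value_key])
    return {
        key: {"count": len(values), "mean": mean(values), "max": max(values)}
        for key, values in groups.items()
    }


# Hypothetical per-request records, e.g. parsed from structured logs.
records = [
    {"model_version": "v1", "latency_ms": 10.0},
    {"model_version": "v1", "latency_ms": 20.0},
    {"model_version": "v2", "latency_ms": 30.0},
]
summary = aggregate_by_slice(records, "latency_ms", "model_version")
```

Slicing by `model_version` here could just as well be slicing by user segment or by offline vs online runs, which is where pipeline problems tend to show up.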

Integration with Triton Tracing

Triton has supported OpenTelemetry tracing since 23.04, and the Triton team is adding the ability to trace BLS models. Coordinate with them to ensure that the functionality we are building is supported.
