-
Notifications
You must be signed in to change notification settings - Fork 183
DRAFT: Telemetry exporter plug-in infra #962
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
👋 Hi alexanderbilk! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀 |
|
|
||
| # NIXL Prometheus Telemetry Exporter Plugin | ||
|
|
||
| This telemetry exporter plugin exports NIXL telemetry events to Prometheus format via an HTTP endpoint that can be scraped by Prometheus servers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A typo and suggestion - ... in Prometheus format, by_exposing an HTTP endpoint ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed. What exactly here is a typo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not exactly a typo, it was about "to => in".
|
|
||
| ## Configuration | ||
|
|
||
| To enable the Prometheus exporter, set the following environment variable: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add the word "plug-in" and explain that this variable activates a selected plug-in, of a (future) plug-in choices
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be considered to put an explicit .so name or path, to allow for custom plug-ins (e.g., by a third party)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added more info here
|
|
||
| gauges_["agent_xfer_time"] = &xfer_time_gauge.Add({{"category", "NIXL_TELEMETRY_PERFORMANCE"}}); | ||
| gauges_["agent_xfer_post_time"] = &xfer_post_time_gauge.Add({{"category", "NIXL_TELEMETRY_PERFORMANCE"}}); | ||
| gauges_["agent_memory_registered"] = &memory_registered_gauge.Add({{"category", "NIXL_TELEMETRY_MEMORY"}}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe these two should be implemented as counters, such that their difference gives an amount of currently registered memory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It stated in HLD what this metric operates as Gauge. Also what I can see from memory event implementation confirms it
| @@ -0,0 +1,65 @@ | |||
| /* | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A general note, please add some .md file that explains how to write a custom telemetry plug-in. What is the API and what templates can be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
d0b114f to
280b472
Compare
| uint32_t eventLimit_; | ||
|
|
||
| public: | ||
| explicit nixlTelemetryExporter(const nixlTelemetryExporterInitParams *init_params) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest passing by const-reference; if the params are optional they can be wrapped with std::optional (or a default-c'tor can be added).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to const reference
b575c20 to
e412bd1
Compare
| */ | ||
| class nixlTelemetryExporter { | ||
| protected: | ||
| const uint32_t eventLimit_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a uint64_t in the init params, should be consistent (and perhaps size_t?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, changed to size_t
| const nixlTelemetryExporterInitParams *init_params); | ||
|
|
||
| // Constructor | ||
| nixlTelemetryPlugin( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and the two getters can be marked noexcept. The comment is superfluous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Applied
| NixlTelemetryPluginApiVersion api_version; | ||
|
|
||
| // Function pointer for creating a new exporter instance (returns smart pointer for RAII) | ||
| std::unique_ptr<nixlTelemetryExporter> (*create_exporter)( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A type alias for the function pointer could make this and the c'tor argument more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Applied
| } | ||
| } | ||
|
|
||
| dlclose(handle_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should handle_ not also be the unique_ptr with the appropriate deleter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| plugin_(plugin) {} | ||
|
|
||
| nixlTelemetryPluginHandle::~nixlTelemetryPluginHandle() { | ||
| if (handle_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace this with assert(handle) in the c'tor (or // assert(handle) just to make the intention clear)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
| nixl_status_t | ||
| nixlTelemetryPrometheusExporter::exportEvent(const nixlTelemetryEvent &event) { | ||
| try { | ||
| std::string event_name(event.eventName_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be a const std::string_view if the map uses std::less<void>.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or not, since it's only coming to unordered containers in C++20.
Could still be const :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added const
| // Try to load the plugin from all registered directories | ||
| for (const auto &dir : plugin_dirs_) { | ||
| if (dir.empty()) { | ||
| continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should better be caught at time of insertion into plugin_dirs_.
(Which would also make it easier to not hold the lock the whole time, but that's probably not important for this particular function.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Omitted for now then.
All this functional is more of a "just in case" placeholder and to make it consistent with backend plugin manager
| if (init_params && !init_params->outputPath.empty()) { | ||
| // Validate format: ip_addr:port_num | ||
| std::string path = init_params->outputPath; | ||
| size_t colon_pos = path.find(':'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variables should be declared const when they are not changed (mostly to let the reader know they won't change).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
src/core/telemetry_plugin_manager.h
Outdated
| * @param dirpath Directory path to search | ||
| */ | ||
| void | ||
| discoverPluginsFromDir(const std::string &dirpath); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use std::filesystem::path for paths and filenames?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, replaced
| enum class NixlTelemetryPluginApiVersion : unsigned int { V1 = 1 }; | ||
|
|
||
| inline constexpr NixlTelemetryPluginApiVersion NIXL_TELEMETRY_PLUGIN_API_VERSION = | ||
| NixlTelemetryPluginApiVersion::V1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to put v1 into the name of all types (and global functions, if any) that are under versioning to make it clearer which parts exactly are in fact versioned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another "just in case" placeholder actually for beta version. I'd consider removing API versioning at all
8419b52 to
5fd32d2
Compare
DRAFT: Prometheus exporter implemetation Signed-off-by: Aleksandr Bilkovskii <[email protected]>
5fd32d2 to
ade7dda
Compare
| inline constexpr char TELEMETRY_EXPORTER_VAR[] = "NIXL_TELEMETRY_EXPORTER"; | ||
| inline constexpr char TELEMETRY_EXPORTER_PLUGIN_DIR_VAR[] = "NIXL_TELEMETRY_EXPORTER_PLUGIN_DIR"; | ||
| inline constexpr char TELEMETRY_EXPORTER_OUTPUT_PATH_VAR[] = "NIXL_TELEMETRY_EXPORTER_OUTPUT_PATH"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| inline constexpr char TELEMETRY_EXPORTER_VAR[] = "NIXL_TELEMETRY_EXPORTER"; | |
| inline constexpr char TELEMETRY_EXPORTER_PLUGIN_DIR_VAR[] = "NIXL_TELEMETRY_EXPORTER_PLUGIN_DIR"; | |
| inline constexpr char TELEMETRY_EXPORTER_OUTPUT_PATH_VAR[] = "NIXL_TELEMETRY_EXPORTER_OUTPUT_PATH"; | |
| inline constexpr char telemetry_exporter_var[] = "NIXL_TELEMETRY_EXPORTER"; | |
| inline constexpr char telemetry_exporter_plugin_dir_var[] = "NIXL_TELEMETRY_EXPORTER_PLUGIN_DIR"; | |
| inline constexpr char telemetry_exporter_output_path_var[] = "NIXL_TELEMETRY_EXPORTER_OUTPUT_PATH"; |
https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#es9-avoid-all_caps-names
| * them to various destinations. | ||
| */ | ||
| class nixlTelemetryExporter { | ||
| protected: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use the public before protected order.
| #include "nixl_types.h" | ||
| #include "telemetry_event.h" | ||
| #include "common/cyclic_buffer.h" | ||
|
|
||
| #include <string> | ||
| #include <vector> | ||
| #include <fstream> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please delete the unused header files.
| #include <string_view> | ||
| #include <memory> | ||
|
|
||
| enum class NixlTelemetryPluginApiVersion : unsigned int { V1 = 1 }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enums should be in snake_case.
https://github.com/ai-dynamo/nixl/blob/main/docs/CodeStyle.md
|
|
||
| enum class NixlTelemetryPluginApiVersion : unsigned int { V1 = 1 }; | ||
|
|
||
| inline constexpr NixlTelemetryPluginApiVersion NIXL_TELEMETRY_PLUGIN_API_VERSION = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please avoid ALL_CAPS names.
| std::unique_ptr<nixlTelemetryExporter> (*)(const nixlTelemetryExporterInitParams &init_params); | ||
|
|
||
| class nixlTelemetryPlugin { | ||
| private: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use the public before private order.
| NixlTelemetryPluginApiVersion::V1; | ||
|
|
||
| // Type alias for exporter creation function | ||
| using ExporterCreatorFn = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| using ExporterCreatorFn = | |
| using exporter_creator_fn_t = |
https://github.com/ai-dynamo/nixl/blob/main/docs/CodeStyle.md
| std::unique_ptr<nixlTelemetryExporter> (*)(const nixlTelemetryExporterInitParams &init_params); | ||
|
|
||
| class nixlTelemetryPlugin { | ||
| private: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use the public before private order.
| std::lock_guard<std::mutex> lock(mutex_); | ||
| events_.swap(next_queue); | ||
| } | ||
| for (auto &event : next_queue) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const
| [](const nixlTelemetryEvent &a, const nixlTelemetryEvent &b) { | ||
| return a.timestampUs_ < b.timestampUs_; | ||
| }); | ||
| for (auto &event : all_events) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const
| std::unique_ptr<nixlTelemetryExporter> (*)(const nixlTelemetryExporterInitParams &init_params); | ||
|
|
||
| class nixlTelemetryPlugin { | ||
| private: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use the public before private order.
| NixlTelemetryPluginApiVersion::V1; | ||
|
|
||
| // Type alias for exporter creation function | ||
| using ExporterCreatorFn = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| using ExporterCreatorFn = | |
| using exporter_creator_fn_t = |
https://github.com/ai-dynamo/nixl/blob/main/docs/CodeStyle.md
| std::string telemetry_file = std::string(telemetry_env_dir) + "/" + name; | ||
| telemetry_ = std::make_unique<nixlTelemetry>(telemetry_file, backendEngines); | ||
| NIXL_DEBUG << "NIXL telemetry is enabled with output file: " << telemetry_file; | ||
| } else if (telemetry_env_exporter_plugin_dir != nullptr && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nixlAgentData ctor is overloaded. Refactoring is necessary. Please move telemetry initialization into a separate private function.
| } | ||
| }; | ||
|
|
||
| // Plugin must implement these functions for dynamic loading |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a lot of code duplication. telemetry_plugin.h looks like backend_plugin.h, and telemetry_plugin_manager.h looks like nixl_plugin_manager.h.
You can use the existing infrastructure by reorganizing the class structure (inheritance, composition, ...).
To better understand which path to choose, it is useful to have an example of using similar telemetry plugins. In other words, this PR should contain tests that use all the new functionality.
At the moment, I'm not entirely sure if telemetry needs to replicate the infrastructure that is used for NIXL plugins.
What?
Adds infrastructure for dynamic load of telemetry exporters
Beta implementation of prometheus exporter
Why?
Required capability to expose NIXL telemetry in common telemetry formats (e.g OTLP, Prometheus etc.)
How?
Solution works as dynamically loaded user defined plugins