feat: extensible log management#3762
Conversation
94512f1 to
2db7de2
Compare
| - The plugin directory would also have a `conf.d` directory where each plugin can store their configuration in a toml file. | ||
| These config files are used to turn type discovery on/off and to define inclusion/exclusion list when type discovery is turned on: | ||
| ```docker.toml | ||
| [docker] | ||
| auto_discover=true | ||
|
|
||
| [docker.include] | ||
| container_name_regex = "nginx" | ||
|
|
||
| [docker.include] | ||
| image_name_regex = "" | ||
|
|
||
| [docker.exclude] | ||
| container_name_regex = "kube-*" | ||
| ``` | ||
| - The main config file for a plugin must be named after the plugin itself (e,g: `docker.sh` -> `docker.toml`). | ||
| Key configs like `auto_discover` can only be defined in this main file. | ||
| The inclusion/exclusion list can be defined in extension files as well, | ||
| but the table names with the type prefix (e.g: `[docker.include]` and `[docker.exclude]`) must be used in those as well. | ||
| - When new software is installed, their corresponding `include/exclude` entries can be appended to the main plugin config itself, | ||
| or created in an independent extension file. |
There was a problem hiding this comment.
I don't think we need to have configuration as part of the interface. Each plugin could be free to have its own configuration or not, however we don't need to dictate the format/contents of it. Yes we could encourage people to place the plugins in a central location (e.g. /etc/tedge/plugins/logs, /etc/tedge/plugins/config/), but I wouldn't enforce it.
There was a problem hiding this comment.
Yeah, I was just trying to enforce a consistent structure to the plugin configs like the include and exclude sections. My plan was to allow the plugin devvs to define additional tables in the main plugin config to describe plugin specific settings as well. But you're right that it is not up to the agent to enforce keys like container_name_regex, unit_name etc. So, I'm wondering where to draw the line between what the agent enforces and what the plugin controls.
There was a problem hiding this comment.
So I guess these settings are actually tedge-agent settings which are used as post filtering on the plugins output, so that users can restrict the selection of types without having to modify the actual plugin...I'm ok with that idea, though it wasn't that clear from reading the docs at least (but maybe that was just me).
There was a problem hiding this comment.
Have detailed it out a bit more in 9182576
| - The `config-plugins` also has a `conf.d` sub-directory where extensions of the main config file can be created dynamically. | ||
| Each extension config can have entries as follows: | ||
| ``` | ||
| [[files]] | ||
| type = "collectd" | ||
| path = "/etc/collectd/collectd.conf" | ||
|
|
||
| [[files]] | ||
| type = "nginx" | ||
| path = "/etc/nginx/nginx.conf" | ||
| ``` |
There was a problem hiding this comment.
Though how do we tie the these configuration snippets to the specific configuration type handler? e.g. lets call the existing file based config handler that tedge-agent supports as the "file" handler. The configuration items need to be marked as to be handled specifically by the "file" handler and nothing else.
There was a problem hiding this comment.
Okay, that's not described clearly in the spec. I'll update that. When a request comes for a type, the agent looks up if a plugin with the same name exists. If yes, that plugin is used to handle this config type. Else, we fall back to the existing file handler which is the inbuilt default.
There was a problem hiding this comment.
Though we're probably going to have to namespace these types...as I'm sure there will be conflicts between different plugins, e.g. "mosquitto" (file) and "mosquitto" (container)
There was a problem hiding this comment.
I didn't add any additional namespace for these config plugins as there was a 1:1 mapping between types and plugins. If we make config plugins also support multiple types, then yeah, this would be required.
There was a problem hiding this comment.
It was decided to have config plugins support multiple config types and those config types would be suffixed with ::<plugin_type> when reported as supported config types, which provides that namespaced isolation.
| [executing] | ||
| action = "builtin:config_update:executing" | ||
| on_success = "download" | ||
|
|
||
| [download] | ||
| action = "builtin:config_update:executing" | ||
| on_success = "prepare" | ||
|
|
||
| [prepare] | ||
| action = "builtin:config_update:prepare" | ||
| on_success = "apply" | ||
|
|
||
| [apply] | ||
| action = "builtin:config_update:apply" | ||
| on_success = "validate" | ||
|
|
||
| [apply] | ||
| action = "builtin:config_update:apply" | ||
| on_success = "commit" | ||
| on_error = "rollback" | ||
|
|
||
| [apply] | ||
| action = "builtin:config_update:apply" | ||
| on_success = "successful" | ||
|
|
||
| [rollback] | ||
| action = "builtin:config_update:rollback" | ||
| on_success = "failed" |
There was a problem hiding this comment.
I'd be in favor of just having a "file" binary which does the handling of the current file-based configuration management (similar to what we do already for the tedge-apt-plugin). The only in-built functions that we could expose as builtin functions is the actually uploading and downloading of the files (though I suspect some of that work is already done by the tedge-mapper-c8y).
There was a problem hiding this comment.
Yes, that's exactly how it will be done. I'd just clarify one thing, which I realise isn't communicated that well in the spec after reading the comments. The same tedge-config-plugin.toml would have multiple file entries possibly handled by different plugins named after those types. For example, a collectd config file can still be listed in the tedge-config-plugin.toml file, but can be handled by a collectd plugin. The association is done purely by the type value specified in the config file and a similarly named plugin. If a named plugin does not exist for that type, then we just use the built-in plugin.
Unlike the default file plugin proposed for log management, I wasn't initially planning of of extracting this builtin behaviour into a file plugin for config as well. The plan was to keep it with the agent itself. But, for the sake of consistency, I guess we can extract that out.
There was a problem hiding this comment.
It was decided to have a dedicated file plugin for config management as well (as it was done for log management), as that enables that plugin to be individually executed with elevated rights while the tedge-agent still runs as tedge user.
There was a problem hiding this comment.
General question about when to publish the lists of config/log types which are reported by each plugin.
If we have a plugin concept, when do we re-evaluate changes in the types? e.g. if a docker log plugin allow users to fetch logs from existing containers, then the user needs to know which containers are currently running in order to fetch it.
Possible solutions
- After every "side-effect" operation (device profile, firmware, software, configuration), we execute the plugin list command? Then we rely on change detection to avoid publishing identical lists to the cloud each time.
There was a problem hiding this comment.
The planned contract was to get the thing that installs the new application to just update the main tedge-config-plugin.toml or tedge-config-plugin.toml files or add an entry into their corresponding conf.d directories. When there are no explicit entries to add (for plugins with auto-discovery turned on), they can even just touch those files or directories without making any real edits.
The reasoning behind this was to have a contract not just for our own managed operations like device_profile, firmware_update etc but an external process can also trigger the same, if an installation was done outside the purview of tedge.
There was a problem hiding this comment.
For configuration management this might work, but I'm not so sure about log management, as the log plugins generally have dynamic values (e.g. as containers are added and removed, and a container does not have any knowledge about how to collect its logs, or that it even exists, that responsibility would be for the container-aware plugin that can detect added/removed containers and knows how to fetch the logs from them).
There was a problem hiding this comment.
For containers, there would still be some process on the device that spawns the new containers, right? Couldn't that perform this action? My assumption was that this "container manager" process would even know the exact container id that it just spawned and can add that entry to the config file if it wants to. Or just touch the directory when auto-discovery is turned on. Wouldn't that work? Or my assumption about this "container manager" process is wrong?
There was a problem hiding this comment.
Ok, so the interface for telling the tedge-agent which log types are there is file based, so any other process will have to add/remove files as containers are created and destroyed...though this would rely on a process always running. so if we wanted to do support a journald log provider, then it would need to be another service which is run, or does every individual service just need to create its own log entry?
There was a problem hiding this comment.
Since new configs or logs are pretty much always added as a result of software installation, we can extend the agent's software update contract to include a refresh of the supported config list and log list as well, in addition to the software list update that happens already whenever a new software is installed. That way, even when a new config/log plugin is installed using some packages, its refresh will happen automatically.
Currently the updated software list request is triggered by the mapper. We can move that to the agent so that it. does this all-in-one refresh whenever a new software package is installed.
We can also expose a REST API from the agent that the users can invoke to trigger this all-in-one refresh explicitly. . This REST API can be used by the users or external systems installing things on the device outside the purview of thin-edge. If the REST API is heavy, we can start off by providing a file-system based API to trigger this refresh by just touching a file or directory.
A single all-in-one refresh API would be sufficient initially. Granular APIs to refresh config list alone or log list alone can be added later if there's a requirement
There was a problem hiding this comment.
If we have a plugin concept, when do we re-evaluate changes in the types? e.g. if a docker log plugin allow users to fetch logs from existing containers, then the user needs to know which containers are currently running in order to fetch it.
An option is to use the concept of signal introduced by #3761
| - `get <type> <temp-log-file-path> [--from <timestamp>] [--to <timestamp>] [--filter <filter-text>]`: | ||
| Used to fetch the log for the given type within the provided log range. | ||
| The log content must be written to the temporary file passed provided by the agent. | ||
| - The plugin directory would also have a `conf.d` directory where each plugin can store their configuration in a toml file. |
There was a problem hiding this comment.
Sorry, I'm lost here. Why do we need this config settings? What does mean auto_discover = false? Isn't it enough to add .ignore extension or whatever to the plugin like tedge diag collect does? Why useful to have include and exclude settings?
There was a problem hiding this comment.
The auto_discover flag is useful if a plugin written by a third-party is listing types on its own (e.g: a journald plugin listing all the service units installed on the device), but prefer not to list them all in the cloud. You still need to use that plugin to fetch the logs of a few services using its get command. In that case, since you may not be able to tweak the list command of the third-party plugin. it'd be convenient for you to tell the agent to ignore its list output with this auto_discover setting and explicitly define what you're interested in, with include values.
How the comination of auto_discover and include/exclude settings work are described in this comment thread. I'll update the doc as well with these details. May be the name auto_discover isn't good enough. Something like use_list is clearer?
There was a problem hiding this comment.
Thanks for explaining. I understood from your comment and the other thread. I prefer use_list as it sounds clearer.
There was a problem hiding this comment.
Though if "auto_discover" is disabled, we'd expect the user to explicitly define the log entry which is the mapping between the log name (e.g. "mosquitto") and the plugin type that is capable of fetching the log?
There was a problem hiding this comment.
Yes. When auto_discover is off, when a new log entry has to be added, its include entry must be added to the config, most-likely via a drop-in extension config at /etc/tedge/log-plugins.d.
| - The `auto_discover` and `include`/`exclude` configs are used by the `tedge-agent` | ||
| to process the supported types listed by that plugin. |
There was a problem hiding this comment.
What I'm missing is how these settings are used by tedge-agent.
Guessing:
- The
auto_discoverandinclude/excludeconfigs are under adockerrecord => this is about a plugin nameddocker. auto_discover=true=> run the plugin with the--listargument.include=> list of types to be used whenauto_discover=false.exclude=> list of types to be removed from the--list.
Am I correct? It would be good to clarify as @rina23q is also lost.
There was a problem hiding this comment.
What I'm missing is how these settings are used by
tedge-agent.Guessing:
- The
auto_discoverandinclude/excludeconfigs are under adockerrecord => this is about a plugin nameddocker.
Correct
auto_discover=true=> run the plugin with the--listargument.
Correct
include=> list of types to be used whenauto_discover=false.
Correct for the auto_discover=false case. But the same include pattern can be applied as an additional filter the list output when auto_discover=true as well .
exclude=> list of types to be removed from the--list.
Correct
Am I correct? It would be good to clarify as @rina23q is also lost.
Yes, I'll detail it out.
| [docker] | ||
| auto_discover=true | ||
|
|
||
| [[docker.include]] | ||
| pattern = "nginx" | ||
|
|
||
| [[docker.exclude]] | ||
| pattern = "kube-*" |
There was a problem hiding this comment.
| [docker] | |
| auto_discover=true | |
| [[docker.include]] | |
| pattern = "nginx" | |
| [[docker.exclude]] | |
| pattern = "kube-*" | |
| [plugin.docker] | |
| auto_discover=true | |
| [[plugin.docker.include]] | |
| pattern = "nginx" | |
| [[plugin.docker.exclude]] | |
| pattern = "kube-*" |
To ease the TOML parsing.
There was a problem hiding this comment.
Though I'm not sure if we really need all the nested structures here, so it could be simplified to
[plugin.docker]
auto_discover = true
include_pattern = ".*tedge.*"
exclude_pattern = "^systemd-.*"There was a problem hiding this comment.
What you've proposed has the limitation that include_pattern for a given plugin can only be defined in one file on a single line. Dynamically adding or removing an new include_pattern would be difficult with this structure without complex sed commands. I used the nested array structure to allow users to easily append multiple patterns to the same file as well as define include patterns across multiple extension configurations. The main motivation to support multiple file extensions was to make removals also seamless.
There was a problem hiding this comment.
This isn't a use-case since that individual plugins can control files via the conf.d folder. User's should be using sed (as it is anyway a flawed / naive logic) to modify these files...previously layout of the tedge-configuration-plugin.toml was changed to the equivalent table layout [[configuration]] was done to add a full item...but that was only because we didn't support config snippets.
| - When the supported types are published over mqtt, their source plugin information is also appended to that type | ||
| in the format <log_type>::<plugin_type>. |
There was a problem hiding this comment.
I'm wondering if we can address the problem that users face if they have a heterogeneous fleet, and only care about getting the "mosquitto" logs regardless if it is using the file-based plugin or the systemd. By requiring the "::" suffix when using any other plugin apart from the "file" based plugin, it would result in users having to use different command depending on the device.
Is there anyway where we can try to resolve the log type without the provider...and if there is only one match, then it is used, otherwise, the first match is used...if the user wants to be more specific, then can provide the ::<plugin> info...though this would complicate the publishing of the supported types.
There was a problem hiding this comment.
This is a tricky one. It sounds doable as you've proposed with some additional checks while log types from different plugins are merged. But I'm just wondering if this inconsistency where the types coming from the same plugin appear differently in the cloud, some with the suffix and the others without it, would be more confusing to the user as they wouldn't know which plugin is responsible for that type.
There was a problem hiding this comment.
Though couldn't we just trigger an update of the list again, and then check for a match (either with or without the type)? It would help with the management of a heterogeneous fleet (which is a real benefit). But we can do this in a follow up PR.
| Instead of the `exeucting` phase of `config_update` workflow doing everything in that single step, | ||
| break it down into multiple smaller stages that corresponds to each plugin command/phase as well as follows: |
There was a problem hiding this comment.
I think we still need to evaluate the phase 2 proposal, as I'm not convinced that these builtin internal functions will be easy to use/document, but I believe with the PoC we might get some clarity in the right direction once we have something semi-functional.
There was a problem hiding this comment.
As an initial approach, I would even be tempted to make a tedge (or tedge-agent) command that would call the appropriate function with information provided as arguments, as this would be more re-usable for other integrations rather than relying on this workflow specific api calls.
There was a problem hiding this comment.
Config plugin spec will be refined further in a separate PR, although there was a general consensus on not going ahead with this specific phase 2 proposal.
tests/RobotFramework/tests/cumulocity/log/log_operation_plugins.robot
Outdated
Show resolved
Hide resolved
reubenmiller
left a comment
There was a problem hiding this comment.
Approved. Thanks again for the nice feature that I'm sure many users will be able to benefit from.
On a box where sudo is installed and configured to requests a password, the log manager tests are failing when the file log plugin is launched with sudo. This fix disables sudo when running the unit tests. Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
ebcb0bb to
6425dcf
Compare
Proposed changes
filelog plugin for file based logsTBD in a follow-up PR:
tedge-log-plugin.tomlplugins.includeandplugins.excludepatterns in log plugins configsTypes of changes
Paste Link to the issue
Checklist
just prepare-devonce)just formatas mentioned in CODING_GUIDELINESjust checkas mentioned in CODING_GUIDELINESFurther comments