chore(telegraf-controller): add telegraf controller architectural overview #6699
Merged
+284
−0
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| --- | ||
| title: Telegraf Controller reference documentation | ||
| description: > | ||
| Reference documentation for Telegraf Controller, the application that | ||
| centralizes configuration management and provides information about the health | ||
| of Telegraf agent deployments. | ||
| menu: | ||
| telegraf_controller: | ||
| name: Reference | ||
| weight: 20 | ||
| --- | ||
|
|
||
| Use the reference docs to look up Telegraf Controller configuration options, | ||
| APIs, and operational details. | ||
|
|
||
| {{< children hlevel="h2" >}} |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,268 @@ | ||
| --- | ||
| title: Telegraf Controller architecture | ||
| description: > | ||
| Architectural overview of the {{% product-name %}} application. | ||
| menu: | ||
| telegraf_controller: | ||
| name: Architectural overview | ||
| parent: Reference | ||
| weight: 105 | ||
| --- | ||
|
|
||
| {{% product-name %}} is a standalone application that provides centralized | ||
| management for Telegraf agents. It runs as a single binary that starts two | ||
| separate servers: a web interface/API server and a dedicated high-performance | ||
| heartbeat server for agent monitoring. | ||
|
|
||
| ## Runtime Architecture | ||
|
|
||
| ### Application Components | ||
|
|
||
| When you run the Telegraf Controller binary, it starts four main subsystems: | ||
|
|
||
| - **Web Server**: Serves the management interface (default port: `8888`) | ||
| - **API Server**: Handles configuration management and administrative requests | ||
| (served on the same port as the web server) | ||
| - **Heartbeat Server**: Dedicated high-performance server for agent heartbeats | ||
| (default port: `8000`) | ||
| - **Background Scheduler**: Monitors agent health every 60 seconds | ||
|
|
||
| ### Process Model | ||
|
|
||
| - **telegraf_controller** _(single process, multiple servers)_ | ||
| - **Main HTTP Server** _(port `8888`)_ | ||
| - Web UI (`/`) | ||
| - API Endpoints (`/api/*`) | ||
| - **Heartbeat Server** _(port `8000`)_ | ||
| - POST /heartbeat _(high-performance endpoint)_ | ||
| - **Database Connection** | ||
| - SQLite or PostgreSQL | ||
| - **Background Tasks** | ||
| - Agent Status Monitor (60s interval) | ||
|
|
||
| The dual-server architecture separates high-frequency heartbeat traffic from | ||
| regular management operations, ensuring that the web interface remains | ||
| responsive even under heavy agent load. | ||
|
|
||
| ## Configuration | ||
|
|
||
| {{% product-name %}} configuration is controlled through command options and | ||
| environment variables. | ||
|
|
||
| | Command Option | Environment Variable | Description | | ||
| | :----------------- | :------------------- | :--------------------------------------------------------------------------------------------------------------- | | ||
| | `--port` | `PORT` | API server port (default is `8888`) | | ||
| | `--heartbeat-port` | `HEARTBEAT_PORT` | Heartbeat service port (default: `8000`) | | ||
| | `--database` | `DATABASE` | Database filepath or URL (default is [SQLite path](/telegraf/controller/install/#default-sqlite-data-locations)) | | ||
| | `--ssl-cert` | `SSL_CERT` | Path to SSL certificate | | ||
| | `--ssl-key` | `SSL_KEY` | Path to SSL private key | | ||
|
|
||
| To use environment variables, create a `.env` file in the same directory as the | ||
| binary or export these environment variables in your terminal session. | ||
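As a rough illustration of how the two port settings above could be resolved from the environment, here is a minimal Python sketch. The defaults (`8888` and `8000`) come from the table; the `resolve_ports` helper is hypothetical, and the precedence between command options and environment variables is not described in this section, so it is not modeled here.

```python
import os

# Defaults taken from the option table above.
DEFAULTS = {"PORT": "8888", "HEARTBEAT_PORT": "8000"}


def resolve_ports(env=None):
    """Return (port, heartbeat_port) from an environment mapping.

    Hypothetical helper: falls back to the documented defaults when a
    variable is not set.
    """
    env = os.environ if env is None else env
    port = int(env.get("PORT", DEFAULTS["PORT"]))
    heartbeat_port = int(env.get("HEARTBEAT_PORT", DEFAULTS["HEARTBEAT_PORT"]))
    return port, heartbeat_port
```

For example, `resolve_ports({"PORT": "9000"})` keeps the heartbeat port at its default while moving the web and API server to port `9000`.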
|
|
||
| ### Database Selection | ||
|
|
||
| {{% product-name %}} automatically selects the database type based on the | ||
| `DATABASE` string: | ||
|
|
||
| - **SQLite** (default): Best for development and small deployments with fewer | ||
| than 1,000 agents. The database file is created automatically. | ||
| - **PostgreSQL**: Required for large deployments. Must be provisioned separately. | ||
|
|
||
| Example PostgreSQL configuration: | ||
|
|
||
| ```bash | ||
| DATABASE="postgresql://user:password@localhost:5432/telegraf_controller" | ||
| ``` | ||
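One way to picture the selection logic is a check on the form of the `DATABASE` string. The sketch below assumes that a PostgreSQL database is recognized by its URL scheme and that anything else is treated as a SQLite file path; the actual detection rules are not spelled out here, and `database_kind` is a hypothetical name.

```python
def database_kind(database: str) -> str:
    """Guess the database type from the DATABASE string.

    Assumption: PostgreSQL is identified by its URL scheme; any other
    value is treated as a SQLite file path.
    """
    if database.startswith(("postgresql://", "postgres://")):
        return "postgresql"
    return "sqlite"
```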
|
|
||
| ## Data Flow | ||
|
|
||
| ### Agent registration and heartbeats | ||
|
|
||
| {{< diagram >}} | ||
| flowchart LR | ||
| T["Telegraf Agents<br/>(POST heartbeats)"] --> H["Port 8000<br/>Heartbeat Server"] | ||
| H --Direct Write--> D[("Database")] | ||
| W["Web UI/API<br/>"] --> A["Port 8888<br/>API Server"] --View Agents (Read-Only)--> D | ||
| R["Rust Scheduler<br/>(Agent status updates)"] --> D | ||
|
|
||
| {{< /diagram >}} | ||
|
|
||
| 1. **Agents send heartbeats**: | ||
|
|
||
| Telegraf agents with the heartbeat output plugin send `POST` requests to the | ||
| dedicated heartbeat server (port `8000` by default). | ||
|
|
||
| 2. **The heartbeat server processes the heartbeat**: | ||
|
|
||
| The heartbeat server is a high-performance Rust-based HTTP server that: | ||
|
|
||
| - Receives the `POST` request at `/agents/heartbeat` | ||
| - Validates the heartbeat payload | ||
| - Extracts agent information (ID, hostname, IP address, status, etc.) | ||
| - Uniquely identifies each agent using the `instance_id` in the heartbeat | ||
| payload. | ||
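The validation and extraction in this step might look roughly like the following Python sketch. It is not the Rust implementation: only `instance_id` is named in the text above, so the other field names (`hostname`, `ip_address`, `status`) and the JSON encoding are assumptions made for illustration.

```python
import json


def parse_heartbeat(body: bytes) -> dict:
    """Validate a heartbeat body and extract agent information.

    Hypothetical payload shape: a JSON object with a required
    `instance_id` and optional metadata fields.
    """
    payload = json.loads(body)
    if not isinstance(payload, dict):
        raise ValueError("heartbeat payload must be a JSON object")

    instance_id = payload.get("instance_id")
    if not isinstance(instance_id, str) or not instance_id:
        raise ValueError("heartbeat payload needs a non-empty instance_id")

    # Missing optional fields are returned as None.
    return {
        "instance_id": instance_id,
        "hostname": payload.get("hostname"),
        "ip_address": payload.get("ip_address"),
        "status": payload.get("status"),
    }
```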
|
|
||
| 3. **Heartbeat server writes directly to the database**: | ||
|
|
||
| The heartbeat server uses a Rust NAPI module that: | ||
|
|
||
| - Bypasses the application ORM (Object-Relational Mapping) layer entirely | ||
| - Uses `sqlx` (Rust SQL library) to write directly to the database | ||
| - Implements batch processing to efficiently process multiple heartbeats | ||
| - Provides much higher throughput than going through the API layer | ||
|
|
||
| The Rust module performs these operations: | ||
|
|
||
| - Creates a new agent if it does not already exist | ||
| - Adds or updates the `last_seen` timestamp | ||
| - Adds or updates the agent status to the status reported in the heartbeat | ||
| - Adds or updates other agent metadata (hostname, IP, etc.) | ||
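The create-or-update behavior listed above corresponds to an upsert. The following sketch uses Python's built-in `sqlite3` module in place of the Rust `sqlx` code, with a hypothetical `agents` table; the real schema and column names are not shown in this document. Writing a whole batch in one transaction mirrors the batch processing mentioned earlier.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS agents (
    instance_id TEXT PRIMARY KEY,
    hostname    TEXT,
    status      TEXT,
    last_seen   TEXT
)
"""

# Insert a new agent, or update the existing row with the same instance_id.
UPSERT = """
INSERT INTO agents (instance_id, hostname, status, last_seen)
VALUES (:instance_id, :hostname, :status, :last_seen)
ON CONFLICT(instance_id) DO UPDATE SET
    hostname  = excluded.hostname,
    status    = excluded.status,
    last_seen = excluded.last_seen
"""


def write_heartbeats(conn, heartbeats):
    """Write a batch of heartbeat dicts in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT, heartbeats)
```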
|
|
||
| 4. **API layer reads agent data**: | ||
|
|
||
| The API layer has read-only access to agent data and exposes the following | ||
| endpoints: | ||
|
|
||
| - `GET /api/agents` - List agents | ||
| - `GET /api/agents/summary` - Agent status summary | ||
|
|
||
| The API never writes to the agents table. Only the heartbeat server does. | ||
|
|
||
| 5. **The Web UI displays updated agent data**: | ||
|
|
||
| The web interface polls the API endpoints to display: | ||
|
|
||
| - Real-time agent status | ||
| - Last seen timestamps | ||
| - Agent health metrics | ||
|
|
||
| 6. **The background scheduler evaluates agent statuses**: | ||
|
|
||
| Every 60 seconds, a Rust-based scheduler (also part of the NAPI module): | ||
|
|
||
| - Scans all agents in the database | ||
| - Checks `last_seen` timestamps against the agent's assigned reporting rule | ||
| - Updates agent statuses: | ||
| - ok → not_reporting (if heartbeat missed beyond threshold) | ||
| - not_reporting → ok (if heartbeat resumes) | ||
| - Auto-deletes agents that have exceeded the auto-delete threshold | ||
| (if enabled for the reporting rule) | ||
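The status transitions the scheduler applies can be sketched as a comparison between the time since `last_seen` and the rule's threshold. This is a Python illustration of the logic, not the Rust scheduler itself; the function names and the use of `None` to mean "auto-delete disabled" are assumptions.

```python
from datetime import datetime, timedelta, timezone


def evaluate_status(last_seen, now, threshold):
    """Return "ok" or "not_reporting" for one agent.

    An agent whose last heartbeat is older than the threshold is marked
    not_reporting; it returns to ok once heartbeats resume.
    """
    return "not_reporting" if now - last_seen > threshold else "ok"


def should_auto_delete(last_seen, now, auto_delete_after):
    """Return True if the agent has exceeded the auto-delete threshold.

    Assumption: auto_delete_after is None when auto-delete is not
    enabled for the reporting rule.
    """
    if auto_delete_after is None:
        return False
    return now - last_seen > auto_delete_after
```

With a 180-second threshold, an agent last seen 60 seconds ago evaluates to `ok`, and one last seen 181 seconds ago evaluates to `not_reporting`.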
|
|
||
| ### Configuration distribution | ||
|
|
||
| 1. **An agent requests a configuration**: | ||
|
|
||
| Telegraf agents request their configuration from the main API server | ||
| (port `8888`): | ||
|
|
||
| ```bash | ||
| telegraf --config "http://localhost:8888/api/configs/{config-id}/toml?location=datacenter1&env=prod" | ||
| ``` | ||
|
|
||
| The agent makes a `GET` request with: | ||
|
|
||
| - **Config ID**: Unique identifier for the configuration template | ||
| - **Query Parameters**: Variables for parameter substitution | ||
| - **Accept Header**: Can specify `text/x-toml` or `application/octet-stream` | ||
| for download | ||
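To make the shape of the request URL concrete, the sketch below assembles it from a config ID and a set of query parameters using Python's standard library. The `config_url` helper is hypothetical; the path and the example parameters follow the `telegraf --config` command above.

```python
from urllib.parse import urlencode


def config_url(base_url, config_id, params):
    """Build the URL an agent uses to request a rendered configuration."""
    url = f"{base_url.rstrip('/')}/api/configs/{config_id}/toml"
    query = urlencode(params)  # parameters used for substitution
    return f"{url}?{query}" if query else url
```

Calling `config_url("http://localhost:8888", "xxxxxx", {"location": "datacenter1", "env": "prod"})` yields a URL of the same form as the one passed to `--config`.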
|
|
||
| 2. **The API server receives the request**: | ||
|
|
||
| The API server on port `8888` handles the request at | ||
| `/api/configs/{id}/toml` and does the following: | ||
|
|
||
| - Validates the configuration ID | ||
| - Extracts all query parameters for substitution | ||
| - Checks the `Accept` header to determine response format | ||
|
|
||
| 3. **The application retrieves the configuration from the database**: | ||
|
|
||
| {{% product-name %}} fetches configuration data from the database: | ||
|
|
||
| - **Configuration TOML**: The raw configuration with parameter placeholders | ||
| - **Configuration name**: Used for filename if downloading | ||
| - **Updated timestamp**: For the `Last-Modified` header | ||
|
|
||
| 4. **{{% product-name %}} substitutes parameters**: | ||
|
|
||
| {{% product-name %}} processes the TOML template and replaces parameters | ||
| with parameter values specified in the `GET` request. | ||
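The substitution step can be thought of as a search-and-replace over the stored template. This section does not define the placeholder syntax, so the sketch below assumes a `&{name}` form purely for illustration; the pattern, the `substitute` helper, and the choice to leave unknown parameters untouched are all assumptions.

```python
import re

# Hypothetical placeholder syntax: &{name}
PLACEHOLDER = re.compile(r"&\{(\w+)\}")


def substitute(template, params):
    """Replace each placeholder with the matching query parameter value.

    Placeholders without a matching parameter are left unchanged.
    """
    return PLACEHOLDER.sub(
        lambda match: params.get(match.group(1), match.group(0)),
        template,
    )
```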
|
|
||
| 5. **{{% product-name %}} sets response headers**: | ||
|
|
||
| - Content-Type | ||
| - Last-Modified | ||
|
|
||
| Telegraf uses the `Last-Modified` header to determine if a configuration | ||
| has been updated and, if so, download and use the updated configuration. | ||
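A `Last-Modified` comparison of this kind can be sketched with Python's standard HTTP date parser. How Telegraf performs the check internally is not shown here; `config_changed` is a hypothetical helper that only illustrates the idea of comparing the current header value with the one seen on the previous fetch.

```python
from email.utils import parsedate_to_datetime


def config_changed(last_modified, previously_seen):
    """Return True if the Last-Modified value is newer than the one seen before.

    Both arguments are HTTP-date strings; previously_seen is None on the
    first fetch.
    """
    if previously_seen is None:
        return True
    return parsedate_to_datetime(last_modified) > parsedate_to_datetime(previously_seen)
```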
|
|
||
| 6. **{{% product-name %}} delivers the response**: | ||
|
|
||
| Based on the `Accept` header: | ||
|
|
||
| {{< tabs-wrapper >}} | ||
| {{% tabs "medium" %}} | ||
| [text/x-toml (TOML)](#) | ||
| [application/octet-stream (Download)](#) | ||
| {{% /tabs %}} | ||
| {{% tab-content %}} | ||
| <!------------------------------- BEGIN TOML ------------------------------> | ||
|
|
||
| ``` | ||
| HTTP/1.1 200 OK | ||
| Content-Type: text/x-toml; charset=utf-8 | ||
| Last-Modified: Sun, 05 Jan 2025 07:28:00 GMT | ||
|
|
||
| [agent] | ||
| hostname = "server-01" | ||
| environment = "prod" | ||
| ... | ||
| ``` | ||
|
|
||
| <!-------------------------------- END TOML -------------------------------> | ||
| {{% /tab-content %}} | ||
| {{% tab-content %}} | ||
| <!----------------------------- BEGIN DOWNLOAD ----------------------------> | ||
|
|
||
| ``` | ||
| HTTP/1.1 200 OK | ||
| Content-Type: application/octet-stream | ||
| Content-Disposition: attachment; filename="config_name.toml" | ||
| Last-Modified: Sun, 05 Jan 2025 07:28:00 GMT | ||
|
|
||
| [agent] | ||
| hostname = "server-01" | ||
| ... | ||
| ``` | ||
|
|
||
| <!------------------------------ END DOWNLOAD -----------------------------> | ||
| {{% /tab-content %}} | ||
| {{< /tabs-wrapper >}} | ||
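The two response variants differ only in their headers, which the following sketch derives from the `Accept` header. It is an illustration rather than the server's actual code: the `response_headers` helper is hypothetical, and falling back to TOML when the client does not ask for `application/octet-stream` is an assumption.

```python
def response_headers(accept, config_name, last_modified):
    """Choose response headers based on the request's Accept header."""
    headers = {"Last-Modified": last_modified}
    if "application/octet-stream" in accept:
        # Download variant: the configuration name becomes the filename.
        headers["Content-Type"] = "application/octet-stream"
        headers["Content-Disposition"] = f'attachment; filename="{config_name}.toml"'
    else:
        # Assumed default: return the TOML inline.
        headers["Content-Type"] = "text/x-toml; charset=utf-8"
    return headers
```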
|
|
||
| 7. _(Optional)_ **Telegraf regularly checks the configuration for updates**: | ||
|
|
||
| Telegraf agents can regularly check {{% product-name %}} for configuration | ||
| updates and automatically load updates when detected. When starting a | ||
| Telegraf agent, include the `--config-url-watch-interval` option with the | ||
| interval that you want the agent to use to check for updates—for example: | ||
|
|
||
| ```bash | ||
| telegraf \ | ||
| --config http://localhost:8888/api/configs/xxxxxx/toml \ | ||
| --config-url-watch-interval 1h | ||
| ``` | ||
|
|
||
| ## Reporting Rules | ||
|
|
||
| {{% product-name %}} uses reporting rules to determine when agents should be | ||
| marked as not reporting: | ||
|
|
||
| - **Default Rule**: Created automatically on first run | ||
| - **Heartbeat Interval**: Expected frequency of agent heartbeats (default: 60s) | ||
| - **Threshold Multiplier**: How many intervals to wait before marking not_reporting (default: 3x) | ||
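As a worked example of the defaults above, the time an agent can stay silent before being marked `not_reporting` is the heartbeat interval multiplied by the threshold multiplier. The helper name below is hypothetical, and the sketch assumes the threshold is exactly that product.

```python
def not_reporting_after(heartbeat_interval_s=60, threshold_multiplier=3):
    """Seconds without a heartbeat before an agent is marked not_reporting."""
    return heartbeat_interval_s * threshold_multiplier
```

With the default rule this gives 60 × 3 = 180 seconds.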
|
|
||
| Access reporting rules via: | ||
|
|
||
| - **Web UI**: Reporting Rules | ||
| - **API**: `GET /api/reporting-rules` | ||