
Commit 0d9f2f1

document cluster deployment mode for k8s and improve telemetry docs
1 parent 12ffafb · commit 0d9f2f1

6 files changed: +126 -52 lines changed


docs/admin.rst

Lines changed: 1 addition & 1 deletion
@@ -12,8 +12,8 @@
    planning/index
    install/cloud/index
    install/on-prem/index
+   install/cluster/index
    install/testing-an-install
-   install/cluster
    app-config/index
    debugging/index
    security/index

docs/install/cluster.md renamed to docs/install/cluster/docker-compose-mode.md

Lines changed: 1 addition & 14 deletions
@@ -1,19 +1,6 @@
 # Multinode Deployment with Docker Compose
 
-**Note**: *This deployment configuration is currently **experimental** and subject to future updates.*
-
-This document provides step-by-step instructions for deploying **Graphistry** in a multinode environment using Docker Compose. In this architecture, both the **Leader** and **Follower** nodes can ingest datasets and files, with all nodes accessing the same **PostgreSQL** instance on the **Leader** node. As a result, **Follower** nodes can also perform data uploads, ensuring that both **Leader** and **Follower** nodes have equal access to dataset ingestion and visualization.
-
-The leader and followers will share datasets using a **Distributed File System**, for example, using the **Network File System (NFS)** protocol. This setup allows all nodes to access the same dataset directory. This configuration ensures that **Graphistry** can be deployed across multiple machines, each with different **GPU** configuration profiles (some with more powerful GPUs, enabling **multi-GPU** on multinode setups), while keeping the dataset storage centralized and synchronized.
-
-This deployment mode is flexible and can be used both in **on-premises** clusters or in the **cloud**. For example, it should be possible to use **Amazon Machine Images (AMIs)** from the [Graphistry AWS Marketplace](https://aws.amazon.com/marketplace/pp/prodview-ppbjy2nny7xzk?sr=0-1&ref_=beagle&applicationId=AWSMPContessa), assigning Amazon VMs created from those images to the **leader** and **follower** roles. This allows for scalable and customizable cloud-based deployments with the same multinode architecture.
-
-## Cluster Configuration Overview
-
-1. **Leader Node**: Handles the ingestion of datasets, PostgreSQL write operations, and exposes the required PostgreSQL ports.
-2. **Follower Nodes**: Connect to the PostgreSQL instance on the leader and can visualize graphs using the shared datasets. However, they do not have their own attached PostgreSQL instance.
-3. **Shared Dataset**: All nodes will access the dataset directory using a **Distributed File System**. This ensures that the leader and followers use the same dataset, maintaining consistency across all nodes.
-4. **PostgreSQL**: The PostgreSQL instance on the **Leader** node is used by all nodes for querying. The **Nexus** service on the **Leader** manages access to the database, while **Follower** nodes also use the **Leader's** PostgreSQL instance. Both **Leader** and **Follower** nodes can perform actions like user sign-ups and settings modifications through their own **Nexus** dashboards, with changes applied system-wide for consistency across all nodes.
+This document provides step-by-step instructions for deploying **Graphistry** in a multinode environment using Docker Compose.
 
 ## Configuration File: `cluster.env`
 
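The diff stops at the `cluster.env` heading, so the file's actual settings are not shown here. For orientation only, below is a minimal sketch of how a shared `cluster.env` could be wired into a node's Compose services; the variable name, IP address, and paths are illustrative assumptions, not documented Graphistry settings.

```yaml
# Hypothetical sketch: consuming a shared cluster.env and an NFS-backed datasets volume.
# CLUSTER_ROLE, the leader IP, and all paths are assumptions for illustration.
services:
  nexus:
    env_file:
      - cluster.env            # shared cluster settings distributed to every node
    environment:
      CLUSTER_ROLE: follower   # assumed variable; consult the real cluster.env reference
    volumes:
      - datasets:/data         # same dataset directory on leader and followers

volumes:
  datasets:
    driver: local
    driver_opts:
      type: nfs
      o: "addr=10.0.0.10,rw"               # assumed address of the leader's NFS export
      device: ":/srv/graphistry/datasets"  # assumed exported directory
```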
docs/install/cluster/index.rst

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+Cluster Installation
+====================
+
+.. toctree::
+   :maxdepth: 1
+
+   docker-compose-mode
+   kubernetes-mode
+
+
+Multinode Deployment Overview
+-----------------------------
+
+**Note**: This deployment configuration is currently **experimental** and subject to future updates.
+
+In this installation, both the **Leader** and **Follower** nodes can ingest datasets and files, with all nodes accessing the same **PostgreSQL** instance on the **Leader** node. As a result, **Follower** nodes can also perform data uploads, giving **Leader** and **Follower** nodes equal access to dataset ingestion and visualization.
+
+The leader and followers share datasets through a **Distributed File System**, for example the **Network File System (NFS)** protocol, so all nodes access the same dataset directory. This lets **Graphistry** run across multiple machines, each with a different **GPU** configuration profile (some with more powerful GPUs, enabling **multi-GPU** on multinode setups), while keeping dataset storage centralized and synchronized.
+
+Cluster Configuration Overview
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+1. **Leader Node**:
+   Handles the ingestion of datasets and PostgreSQL write operations, and exposes the required PostgreSQL ports.
+
+2. **Follower Nodes**:
+   Connect to the PostgreSQL instance on the leader and can visualize graphs using the shared datasets; they do not have their own attached PostgreSQL instance.
+
+3. **Shared Data**:
+   All nodes access the same **datasets directory** through a **Distributed File System**, so the leader and followers see the same data, keeping all nodes consistent.
+
+4. **PostgreSQL**:
+   The PostgreSQL instance on the **Leader** node is used by all nodes for querying. The **Nexus** service on the **Leader** manages access to the database, and **Follower** nodes use the **Leader's** PostgreSQL instance as well. Both **Leader** and **Follower** nodes can perform actions like user sign-ups and settings changes through their own **Nexus** dashboards, with changes applied system-wide.
+
+5. **Redis**:
+   The Redis instance on the **Leader** is used by all **Nexus** and **forge-etl-python** services on the **Follower** nodes. For **StreamGL** visualizations, however, each **Graphistry** instance keeps its own Redis instance (see the sketch after this file's changes).
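To make items 4 and 5 concrete, a follower's services would point at the leader's PostgreSQL and Redis rather than at local instances. A minimal sketch, with assumed hostnames and variable names (not the actual Graphistry configuration keys):

```yaml
# Hypothetical follower-side overrides: PostgreSQL and Redis resolve to the leader.
# The hostname leader.internal and the variable names are assumptions for illustration.
services:
  nexus:
    environment:
      POSTGRES_HOST: leader.internal   # leader's PostgreSQL, shared by all nodes
      REDIS_HOST: leader.internal      # leader's Redis, used by Nexus on followers
  forge-etl-python:
    environment:
      REDIS_HOST: leader.internal      # forge-etl-python also uses the leader's Redis
  # StreamGL keeps a per-node Redis, so no override is sketched for it here.
```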
docs/install/cluster/kubernetes-mode.md

Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
+# Multinode Deployment with Kubernetes
+
+A **Graphistry** cluster can be deployed on any **Kubernetes (K8s)** distribution, whether a managed cloud service such as **Google Kubernetes Engine (GKE)**, **Amazon Elastic Kubernetes Service (EKS)**, or **Azure Kubernetes Service (AKS)**, or a local setup such as **K3s** or **MicroK8s**.
+
+As an example, you can follow the steps for deploying Graphistry on **K3s** or **Google Kubernetes Engine (GKE)** in the [Graphistry Cluster setup guide](https://github.com/graphistry/graphistry-helm/tree/main/charts/values-overrides/examples/cluster). These steps are also a useful reference for configuring a cluster on other Kubernetes distributions, including setting up a **Distributed File System** for shared directories, such as **NFS**.
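As a sketch of the shared-directory piece on Kubernetes, an NFS-backed `PersistentVolume` plus a claim could look like the following; the server address, path, and capacity are placeholders rather than values from the linked guide.

```yaml
# Hypothetical NFS-backed shared datasets volume; server, path, and size are placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: graphistry-datasets
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany              # leader and followers mount the same data read-write
  nfs:
    server: 10.0.0.10            # placeholder NFS server address
    path: /srv/graphistry/datasets
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: graphistry-datasets
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""           # bind statically to the PV above
  resources:
    requests:
      storage: 100Gi
```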

docs/telemetry/index.rst

Lines changed: 39 additions & 0 deletions
@@ -6,3 +6,42 @@ Telemetry
 
    docker-compose
    kubernetes
+
+
+Deploying Telemetry Services for Graphistry
+-------------------------------------------
+
+Graphistry leverages **OpenTelemetry** to collect and export telemetry data, such as **metrics** and **traces**, from its services. Whether you deploy on the **Docker Compose** platform or the **Kubernetes** platform, the telemetry stack can be deployed in **three key modes**, offering flexibility in how data is collected and routed to observability tools.
+
+Both platforms share **common specifications** for deploying OpenTelemetry, ensuring seamless integration and consistent behavior across environments. Whether you forward telemetry data to external services, use the packaged observability tools, or combine both methods in hybrid mode, the platforms offer a unified and scalable approach to telemetry data collection and export.
+
+Common Deployment Modes
+^^^^^^^^^^^^^^^^^^^^^^^
+
+1. **Forwarding to External Services**
+
+   * Telemetry data is forwarded to external observability services (e.g., **Grafana Cloud**, **Datadog**).
+   * Both platforms support sending telemetry data to **OTLP-compatible** endpoints.
+
+2. **Using Packaged Observability Tools**
+
+   * A local stack of observability tools (e.g., **Prometheus**, **Grafana**, **Jaeger**, **NVIDIA/dcgm-exporter**) is deployed to collect and visualize telemetry data, including GPU metrics.
+   * Both platforms can deploy this self-contained stack for on-premises monitoring.
+
+3. **Hybrid Mode**
+
+   * Combines local observability tools and external services for telemetry data routing (see the sketch after this list).
+   * Data can be sent both to internal tools (e.g., **Prometheus**) and to external observability platforms for comprehensive monitoring.
+   * Provides more flexibility for custom deployments, such as skipping the packaged OpenTelemetry Collector in favor of a custom vendor-based OTLP-compatible collector, or adding rules to the telemetry processing pipeline.
+
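To make the hybrid mode concrete, here is a minimal OpenTelemetry Collector configuration sketch that fans metrics out to both a local Prometheus endpoint and an external OTLP backend; the endpoints and credentials are placeholders, not Graphistry defaults.

```yaml
# Hypothetical collector config: hybrid routing to local Prometheus + an external OTLP backend.
# Endpoints and the header value are placeholders for illustration.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889                       # scraped by the packaged Prometheus
  otlphttp:
    endpoint: https://otlp.example.com           # external OTLP-compatible service
    headers:
      Authorization: "Bearer ${env:OTLP_TOKEN}"  # credential injected via environment

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus, otlphttp]          # hybrid: internal and external
    traces:
      receivers: [otlp]
      exporters: [otlphttp]                      # e.g., traces only to the external service
```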
+Common Technical Specifications
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+- **Telemetry Collection**: Both platforms use the OpenTelemetry Collector or compatible OTLP collectors to gather telemetry data (metrics, traces).
+- **Endpoint Configuration**: Both platforms allow specifying the endpoints data is sent to, including local tools and external services like Grafana Cloud.
+- **Authentication**: External services require credentials (e.g., API tokens, access keys) for secure data transmission.
+- **Secure Communication**: Both platforms support encrypted communication channels for telemetry data transfer.
+- **Credential Management**: Both platforms enable secure management of credentials for safe data forwarding to external services (see the Secret sketch below).
+- **Scalability**: Both platforms support scaling the telemetry stack, including the OpenTelemetry Collector and other observability components.
+- **Telemetry Configuration**: Both platforms allow fine-tuning data collection settings, export formats, and the types of telemetry data collected.
+- **Flexible Deployment Options**: Both platforms support configurations ranging from a self-contained observability stack to combinations of internal and external tools.
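For the credential-management point, one common pattern, shown here as an assumed sketch rather than the documented Graphistry mechanism, is to keep the external backend's token in a Kubernetes Secret and surface it to the collector as the environment variable referenced in the sketch above.

```yaml
# Hypothetical Secret holding the external backend token; the name and key are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: otlp-credentials
type: Opaque
stringData:
  OTLP_TOKEN: "replace-with-real-token"
---
# A collector Deployment would then reference it, e.g.:
#   env:
#     - name: OTLP_TOKEN
#       valueFrom:
#         secretKeyRef:
#           name: otlp-credentials
#           key: OTLP_TOKEN
```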
