DataDog
diff --git a/SCHEMA_UPDATE_ANALYSIS.md b/SCHEMA_UPDATE_ANALYSIS.md
Lines changed: 78 additions & 0 deletions
diff --git a/datadog/fwprovider/data_source_datadog_reference_table_rows.go b/datadog/fwprovider/data_source_datadog_reference_table_rows.go
Lines changed: 33 additions & 4 deletions
@@ -0,0 +1,78 @@
+# Schema Evolution & Async Processing Analysis
+
+## Overview
+
+"Schema evolution" fails (that is, `GetTable` still returns the old schema after a file update) for two reasons: the file processing pipeline is asynchronous, and the **Create** and **Update** (Patch/Replace) flows handle schema updates differently.
+
+## API Logic: Create vs. Patch
+
+### 1. Create Table (`POST /v2/reference-tables`)
+*   **Endpoint**: `CreateTable` in `reference-tables-api/http/v2_endpoints.go`.
+*   **Flow**:
+    1.  Calls `ImportFileForCreate` in `reference-tables-edge`.
+    2.  `reference-tables-edge` calls `CreateResources`.
+    3.  `CreateResources` calls `upsert.Create`.
+    4.  `upsert.Create` uses `convertHeadersForCreate` which **creates a new schema** from the file headers.
+    5.  It then calls `produceUpsertRawFile` with `RawFile_CREATE`.
+
+### 2. Patch Table (`PATCH /v2/reference-tables/{id}`)
+*   **Endpoint**: `PatchTable` in `reference-tables-api/http/v2_endpoints.go`.
+*   **Flow**:
+    *   Validates access details and schema (if provided).
+    *   Calls `ImportFileForReplace` (via `handleNonLocalFileSyncDetails` or direct local flow).
+    *   `reference-tables-edge` calls `ReplaceResources`.
+    *   `ReplaceResources` calls `upsert.Replace`.
+    *   **Code:** [`Replace` in `upsert_replace.go`](https://github.com/DataDog/dd-go/blob/prod/resources/reference-tables/pkg/usecase/referencetables/upsert_replace.go#L72)
+
+## The Core Issue: Additive-Only Schema in Replace
+
+In the **Patch/Replace** flow (`upsert.Replace`), the schema update logic is **additive-only** and conservative.
+
+**Logic in `convertHeadersForReplace` (`upsert.go`):**
+1.  It iterates over the **existing** schema fields and adds them to the new schema definition.
+2.  It checks if `primaryKey` fields are present in the new file headers.
+3.  It then adds any **new** headers found in the file.
+4.  **CRITICAL**: It does **NOT** remove fields that are missing from the file (unless they are primary keys, which causes an error).
+5.  `schemaChanged` is set to true **only if new fields are added** (or labels change).
+
+**Consequence**:
+*   If you **add** a column: `schemaChanged` is true, and `UpdateTableSchema` is called synchronously. The schema evolves.
+*   If you **remove** a column: the field is kept, so the removal alone never sets `schemaChanged`. The schema **does not evolve** to reflect the removal, and the old column remains in it.
+
+## The "Missing Link" & File Operator
+
+The `file-operator` path is the evidence that the exact file schema is computed but never persisted:
+
+1.  **File Operator Role**:
+    *   The `file-operator` processes the raw file and infers the schema *exactly as it is in the file*.
+    *   It sends this exact schema in a `TableDefinition` message to the `write-operator`.
+    *   **Code:** [`Process` in `blob/writer.go`](https://github.com/DataDog/dd-go/blob/prod/resources/reference-tables/pkg/usecase/blob/writer.go#L178)
+
+2.  **Write Operator & Aggregator (The Proof)**:
+    *   The `write-operator` consumes this message and calls `WriteSchema` on the backend.
+    *   For file-based tables (Postgres/Cassandra), the backend is `aggregator`.
+    *   **The Proof**: `Aggregator.WriteSchema` is a **NO-OP**.
+    *   **Code:** [`WriteSchema` (NO-OP) in `aggregator.go`](https://github.com/DataDog/dd-go/blob/prod/resources/reference-tables/pkg/repository/aggregator/aggregator.go#L131-L140)
+
+```go
+func (a *Aggregator) WriteSchema(...) error {
+    // ... tracing ...
+    span.Finish()
+    return nil // NO-OP: Does not write to Postgres
+}
+```
+
+**Why this matters**:
+*   Because `upsert.Replace` (Edge service) is **additive-only**, it cannot handle column removals.
+*   The `file-operator` -> `write-operator` pipeline *has* the correct, exact schema (from the file).
+*   If `Aggregator.WriteSchema` were implemented to update Postgres, it would **overwrite** the additive schema with the *exact* schema from the file, effectively supporting full schema evolution (including removals).
+*   Since it is a NO-OP, we are stuck with the additive-only behavior of the synchronous Edge service.
+
+## Conclusion
+
+The "bug" preventing schema evolution (specifically removals or full sync) is a combination of:
+1.  **Edge Service Design**: `upsert.Replace` is intentionally additive/safe.
+2.  **Missing Async Update**: The `write-operator` (via `aggregator`) ignores the schema inferred from the file processing, which is the only place where the "true" file schema exists.
+
+**To support full schema evolution (making the file the source of truth):**
+We must implement `WriteSchema` in `aggregator.go` to update `table_metadata` in Postgres. This will make the system eventually consistent with the file content.
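
The contrast drawn in the analysis above, between the additive-only merge in the Replace flow and the exact schema inferred from the file, can be illustrated with a small standalone sketch. Everything here is hypothetical: `Field`, `mergeAdditive`, and `exactFromFile` are simplified stand-ins written for illustration, not the actual `convertHeadersForReplace` or `WriteSchema` code, and labels and field types are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// Field is a hypothetical, simplified schema column (name only).
type Field struct {
	Name string
}

// mergeAdditive imitates the additive-only behaviour described for the
// Replace flow: every existing field is kept, headers not yet in the schema
// are appended, and a primary key missing from the file is an error.
// The returned bool plays the role of schemaChanged.
func mergeAdditive(existing []Field, primaryKeys, headers []string) ([]Field, bool, error) {
	inFile := make(map[string]bool, len(headers))
	for _, h := range headers {
		inFile[h] = true
	}
	for _, pk := range primaryKeys {
		if !inFile[pk] {
			return nil, false, errors.New("primary key missing from file: " + pk)
		}
	}

	known := make(map[string]bool, len(existing))
	merged := make([]Field, 0, len(existing)+len(headers))
	for _, f := range existing {
		known[f.Name] = true
		merged = append(merged, f) // kept even if absent from the file
	}

	changed := false
	for _, h := range headers {
		if !known[h] {
			known[h] = true
			merged = append(merged, Field{Name: h})
			changed = true // only additions flip the flag
		}
	}
	return merged, changed, nil
}

// exactFromFile imitates treating the file as the source of truth:
// the schema is exactly the file headers, so removed columns disappear.
func exactFromFile(headers []string) []Field {
	out := make([]Field, 0, len(headers))
	for _, h := range headers {
		out = append(out, Field{Name: h})
	}
	return out
}

func main() {
	existing := []Field{{Name: "id"}, {Name: "name"}, {Name: "region"}}
	headers := []string{"id", "name"} // "region" was removed from the file

	merged, changed, err := mergeAdditive(existing, []string{"id"}, headers)
	fmt.Println("additive:", merged, "changed:", changed, "err:", err)
	fmt.Println("exact:   ", exactFromFile(headers))
}
```

In this sketch the additive merge still reports three fields with `changed == false` after `region` is dropped from the file, while the exact variant reports two. That gap is what an implemented `WriteSchema` would be expected to close.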
@@ -3,6 +3,8 @@ package fwprovider
 import (
 	"context"
 	"fmt"
+	"net/http"
+	"time"
 
 	"github.com/DataDog/datadog-api-client-go/v2/api/datadogV2"
 	"github.com/hashicorp/terraform-plugin-framework/datasource"
@@ -100,10 +102,37 @@ func (d *datadogReferenceTableRowsDataSource) Read(ctx context.Context, request
 		return
 	}
 
-	// Call API to get rows by ID
-	ddResp, _, err := d.Api.GetRowsByID(d.Auth, tableId, rowIds)
-	if err != nil {
-		response.Diagnostics.Append(utils.FrameworkErrorDiag(err, "error getting reference table rows"))
+	// Call API to get rows by ID with retry logic
+	// Rows are written asynchronously, so we need to retry if the table hasn't synced yet
+	// Use a 5-second interval to avoid spamming the API while waiting for sync
+	var ddResp datadogV2.TableRowResourceArray
+	var httpResp *http.Response
+	var err error
+
+	retryErr := utils.Retry(5*time.Second, 10, func() error {
+		ddResp, httpResp, err = d.Api.GetRowsByID(d.Auth, tableId, rowIds)
+		if err != nil {
+			// If we get a 404, the table might not have synced yet - retry
+			if httpResp != nil && httpResp.StatusCode == http.StatusNotFound {
+				return &utils.RetryableError{Prob: fmt.Sprintf("rows not found (table may not have synced yet): %v", err)}
+			}
+			// For other errors, don't retry
+			return &utils.FatalError{Prob: fmt.Sprintf("error getting reference table rows: %v", err)}
+		}
+		// Success - check if we got the expected number of rows
+		if len(ddResp.Data) == len(rowIds) {
+			return nil
+		}
+		// If we got some rows but not all, the table might still be syncing - retry
+		if len(ddResp.Data) > 0 && len(ddResp.Data) < len(rowIds) {
+			return &utils.RetryableError{Prob: fmt.Sprintf("only %d of %d rows found (table may still be syncing)", len(ddResp.Data), len(rowIds))}
+		}
+		// If we got no rows, retry
+		return &utils.RetryableError{Prob: "no rows found (table may not have synced yet)"}
+	})
+
+	if retryErr != nil {
+		response.Diagnostics.Append(utils.FrameworkErrorDiag(retryErr, "error getting reference table rows"))
 		return
 	}