
Commit 4673c5d

add passing tests, full support for cloud file tables
1 parent 5edc871 commit 4673c5d

30 files changed: +3017 -412 lines changed

SCHEMA_EVOLUTION_DEBUG_SUMMARY.md

Lines changed: 277 additions & 0 deletions
@@ -0,0 +1,277 @@

# Reference Table Schema Evolution - Debug Summary

## Context

We are debugging why `TestAccReferenceTable_SchemaEvolution` fails in Terraform when updating a cloud file table's schema, even though the same operation works when calling the API directly.

### The Problem

- **Terraform test**: Fails when updating the schema from 3 fields (a, b, c) to 4 fields (a, b, c, d) for a cloud file table
- **Direct API test**: The same operation succeeds when performed manually
- **Root cause**: Schema updates for cloud file tables are asynchronous; the file sync must complete before the schema is updated in the backend

## What We've Done

### 1. Identified the Asynchronous Nature
- Schema updates for cloud files happen asynchronously after the file sync completes
- The API returns 200 OK immediately, but the schema update happens later
- In manual testing, we confirmed schema updates take ~5-10 seconds after the API call

### 2. Added Retry Logic to Terraform Provider
**File**: `datadog/fwprovider/resource_datadog_reference_table.go`

- Added retry logic in the `Update()` method to wait for schema updates (lines 415-470)
- Added a 3-second initial wait before the first check (line 424)
- Added a pre-update check to ensure the table is ready (status DONE/ERROR) before updating the schema (lines 363-401)
- Improved error messages to include status and file path for debugging
- Retry configuration: 10 attempts at 5-second intervals (50 seconds total)

### 3. Updated Test Configuration
**File**: `datadog/tests/resource_datadog_reference_table_test.go`

- Test creates the table with `test.csv` (3 fields: a, b, c)
- Test updates the table with `test2.csv` (4 fields: a, b, c, d)
- Added a wait step between create and update (currently commented out for debugging)
- Test expects 3 fields initially, then 4 fields after the update

### 4. Key Code Changes

#### Pre-Update Wait Logic
```go
// If we're updating schema for a cloud file table, ensure the table is ready (not processing)
// This prevents race conditions where we try to update while the initial sync is still running
isUpdatingSchema := planState.Schema != nil
if isUpdatingSchema && currentState.Source.ValueString() != "LOCAL_FILE" {
	// Check current status - if still processing, wait for it to complete
	currentResp, _, err := r.Api.GetTable(r.Auth, id)
	if err == nil && currentResp.Data != nil {
		attrs := currentResp.Data.GetAttributes()
		if status, ok := attrs.GetStatusOk(); ok && status != nil {
			statusStr := string(*status)
			if statusStr != "DONE" && statusStr != "ERROR" {
				// Wait for table to be ready before updating
				// ... retry logic ...
			}
		}
	}
}
```
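
The elided retry above can take the same shape as the post-update retry shown next. Below is a rough, non-authoritative sketch of that wait, reusing only identifiers that already appear in this document (`utils.Retry`, `r.Api.GetTable`, `attrs.GetStatusOk`); it assumes `utils.Retry` re-invokes its callback until it returns nil or attempts run out, and it omits the diagnostics plumbing:

```go
// Sketch only: wait until the table reports DONE or ERROR before sending the schema update.
waitErr := utils.Retry(5*time.Second, 10, func() error {
	resp, _, err := r.Api.GetTable(r.Auth, id)
	if err != nil || resp.Data == nil {
		return fmt.Errorf("error fetching reference table %s while waiting for initial sync: %v", id, err)
	}
	attrs := resp.Data.GetAttributes()
	if status, ok := attrs.GetStatusOk(); ok && status != nil {
		if s := string(*status); s == "DONE" || s == "ERROR" {
			return nil // initial sync finished; safe to update the schema
		}
	}
	return fmt.Errorf("reference table %s is still processing; retrying", id)
})
// In the real provider, waitErr should be surfaced through the framework diagnostics (omitted in this sketch).
_ = waitErr
```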

#### Post-Update Retry Logic
```go
// Wait 3 seconds before first check (matches manual API test timing)
if isUpdatingSchema && expectedFieldCount > 0 {
	time.Sleep(3 * time.Second)
}

// Retry until schema matches expected field count
retryErr := utils.Retry(5*time.Second, 10, func() error {
	resp, httpResp, err = r.Api.GetTable(r.Auth, id)
	// Check schema field count matches expected
	// Check for file processing errors
	// Return retryable error if schema doesn't match yet
})
```
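
The same polling idea can also be reproduced outside the provider as a small standalone program against the endpoint used in the commands below. This is a sketch rather than part of the commit: it assumes the response shape implied by the `jq` filters and provider code above (`data.attributes.status`, `data.attributes.schema.fields`), and the file name and argument handling are made up for illustration.

```go
// pollschema.go - standalone sketch: poll a reference table until its schema
// reaches the expected number of fields (or the retry budget is exhausted).
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

type tableResponse struct {
	Data struct {
		Attributes struct {
			Status string `json:"status"`
			Schema struct {
				Fields []struct {
					Name string `json:"name"`
					Type string `json:"type"`
				} `json:"fields"`
			} `json:"schema"`
		} `json:"attributes"`
	} `json:"data"`
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: pollschema <table-id>")
		os.Exit(1)
	}
	tableID := os.Args[1] // e.g. the ID returned by the create call
	want := 4             // expected field count after the update

	url := "https://dd.datad0g.com/api/v2/reference-tables/tables/" + tableID
	for attempt := 1; attempt <= 10; attempt++ {
		req, _ := http.NewRequest(http.MethodGet, url, nil)
		req.Header.Set("DD-API-KEY", os.Getenv("DD_API_KEY"))
		req.Header.Set("DD-APPLICATION-KEY", os.Getenv("DD_APP_KEY"))

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			fmt.Println("request failed:", err)
		} else {
			var table tableResponse
			decodeErr := json.NewDecoder(resp.Body).Decode(&table)
			resp.Body.Close()
			if decodeErr == nil {
				got := len(table.Data.Attributes.Schema.Fields)
				fmt.Printf("attempt %d: status=%s fields=%d\n", attempt, table.Data.Attributes.Status, got)
				if got == want {
					return // schema has evolved to the expected shape
				}
			}
		}
		time.Sleep(5 * time.Second) // same interval the provider retry uses
	}
	fmt.Println("schema did not reach the expected field count in time")
	os.Exit(1)
}
```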

## What Remains to Be Fixed

### 1. Test Still Failing
- The test currently fails; recent runs have hit API 500 errors that appear to be transient
- Need to verify the retry logic works correctly once the API is stable
- May need to adjust the retry count/interval based on actual API timing

### 2. Wait Step in Test
- The wait step between create and update is currently commented out
- It should be re-enabled once we confirm the pre-update wait logic works
- Location: `datadog/tests/resource_datadog_reference_table_test.go`, lines 88-95

### 3. Potential Issues to Investigate
- **Race condition**: Even with the pre-update wait, there might be a race between file sync completion and the schema update
- **Error handling**: Need to verify error messages are helpful when the schema doesn't update
- **Timing**: May need to increase the retry count if schema updates take longer than 50 seconds

## Exact Commands

### Terraform Test Command

```bash
cd /Users/guillaume.brizolier/go/src/github.com/DataDog/terraform-provider-datadog

dd-auth --domain dd.datad0g.com -- sh -c '
export DD_TEST_CLIENT_API_KEY=$DD_API_KEY
export DD_TEST_CLIENT_APP_KEY=$DD_APP_KEY
export DD_TEST_SITE_URL=https://dd.datad0g.com/
export DD_TEST_ORG=yB5yjZ
export TF_ACC=true
go test -v -run TestAccReferenceTable_SchemaEvolution ./datadog/tests/ -timeout 30m
'
```

### Manual API Test Commands

#### 1. Create Table with test.csv (3 fields: a, b, c)

```bash
cd /Users/guillaume.brizolier/go/src/github.com/DataDog/terraform-provider-datadog

dd-auth --domain dd.datad0g.com -- sh -c '
export DD_API_KEY
export DD_APP_KEY
TABLE_NAME="test_schema_evolution_$(date +%s)"

curl -X POST "https://dd.datad0g.com/api/v2/reference-tables/tables" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"data\": {
      \"type\": \"reference_table\",
      \"attributes\": {
        \"table_name\": \"$TABLE_NAME\",
        \"source\": \"S3\",
        \"file_metadata\": {
          \"sync_enabled\": true,
          \"access_details\": {
            \"aws_detail\": {
              \"aws_account_id\": \"924305315327\",
              \"aws_bucket_name\": \"dd-reference-tables-dev-staging\",
              \"file_path\": \"test.csv\"
            }
          }
        },
        \"schema\": {
          \"primary_keys\": [\"a\"],
          \"fields\": [
            {\"name\": \"a\", \"type\": \"STRING\"},
            {\"name\": \"b\", \"type\": \"STRING\"},
            {\"name\": \"c\", \"type\": \"STRING\"}
          ]
        }
      }
    }
  }"
'
```

**Save the table ID from the response** (e.g., `TABLE_ID="abc123..."`)

#### 2. Wait for Initial Sync (3 seconds)

```bash
sleep 3
```

#### 3. Update Table with test2.csv (4 fields: a, b, c, d)

```bash
dd-auth --domain dd.datad0g.com -- sh -c '
export DD_API_KEY
export DD_APP_KEY
TABLE_ID="<TABLE_ID_FROM_STEP_1>"

curl -X PATCH "https://dd.datad0g.com/api/v2/reference-tables/tables/$TABLE_ID" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"data\": {
      \"type\": \"reference_table\",
      \"attributes\": {
        \"file_metadata\": {
          \"sync_enabled\": true,
          \"access_details\": {
            \"aws_detail\": {
              \"aws_account_id\": \"924305315327\",
              \"aws_bucket_name\": \"dd-reference-tables-dev-staging\",
              \"file_path\": \"test2.csv\"
            }
          }
        },
        \"schema\": {
          \"primary_keys\": [\"a\"],
          \"fields\": [
            {\"name\": \"a\", \"type\": \"STRING\"},
            {\"name\": \"b\", \"type\": \"STRING\"},
            {\"name\": \"c\", \"type\": \"STRING\"},
            {\"name\": \"d\", \"type\": \"STRING\"}
          ]
        }
      }
    }
  }"
'
```

#### 4. Check Schema Immediately (should still show 3 fields)

```bash
dd-auth --domain dd.datad0g.com -- sh -c '
export DD_API_KEY
export DD_APP_KEY
TABLE_ID="<TABLE_ID_FROM_STEP_1>"

curl -X GET "https://dd.datad0g.com/api/v2/reference-tables/tables/$TABLE_ID" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" | jq ".data.attributes.schema.fields | length"
'
```

**Expected**: `3` (schema not updated yet)

#### 5. Wait and Check Again (should show 4 fields)

```bash
sleep 5

dd-auth --domain dd.datad0g.com -- sh -c '
export DD_API_KEY
export DD_APP_KEY
TABLE_ID="<TABLE_ID_FROM_STEP_1>"

curl -X GET "https://dd.datad0g.com/api/v2/reference-tables/tables/$TABLE_ID" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" | jq ".data.attributes.schema.fields | length"
'
```

**Expected**: `4` (schema updated after file sync completes)

## Test Files

### Test Configuration
- **File**: `datadog/tests/resource_datadog_reference_table_test.go`
- **Test function**: `TestAccReferenceTable_SchemaEvolution` (line 61)
- **Initial config**: `testAccCheckDatadogReferenceTableSchemaInitial` (uses `test.csv` with 3 fields)
- **Update config**: `testAccCheckDatadogReferenceTableSchemaAddFields` (uses `test2.csv` with 4 fields)

### Test Data Files (in S3 bucket `dd-reference-tables-dev-staging`)
- **test.csv**: Contains columns `a, b, c` with sample data
- **test2.csv**: Contains columns `a, b, c, d` with sample data

## Environment Variables Required

```bash
export DD_TEST_CLIENT_API_KEY=<api_key>
export DD_TEST_CLIENT_APP_KEY=<app_key>
export DD_TEST_SITE_URL=https://dd.datad0g.com/
export DD_TEST_ORG=yB5yjZ  # Public org ID for staging
export TF_ACC=true         # Required to run acceptance tests
```

## Key Insights

1. **Asynchronous Schema Updates**: Schema updates for cloud files are asynchronous and happen after the file sync completes
2. **Timing Matters**: We need to wait 3-5 seconds after the update API call before the schema reflects the changes
3. **Pre-Update Check**: The table must be in DONE/ERROR status before attempting a schema update
4. **Retry Logic**: The Terraform provider needs retry logic to wait for async schema updates
5. **Error Detection**: File processing errors (like "more columns than schema") indicate the schema hasn't updated yet

## Next Steps

1. Re-run the Terraform test once the API is stable (currently seeing 500 errors)
2. Verify the pre-update wait logic prevents race conditions
3. Re-enable the wait step in the test if needed
4. Adjust the retry timing if schema updates take longer than expected
5. Add more detailed logging if issues persist

SCHEMA_UPDATE_ANALYSIS.md

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@

# Schema Evolution & Async Processing Analysis

## Overview

The "Schema Evolution" failure (where `GetTable` returns the old schema after a file update) is caused by the asynchronous nature of the file processing pipeline and by how schema updates are handled differently in the **Create** versus **Update** (Patch/Replace) flows.

## API Logic: Create vs. Patch

### 1. Create Table (`POST /v2/reference-tables`)
* **Endpoint**: `CreateTable` in `reference-tables-api/http/v2_endpoints.go`.
* **Flow**:
    1. Calls `ImportFileForCreate` in `reference-tables-edge`.
    2. `reference-tables-edge` calls `CreateResources`.
    3. `CreateResources` calls `upsert.Create`.
    4. `upsert.Create` uses `convertHeadersForCreate`, which **creates a new schema** from the file headers.
    5. It then calls `produceUpsertRawFile` with `RawFile_CREATE`.

### 2. Patch Table (`PATCH /v2/reference-tables/{id}`)
* **Endpoint**: `PatchTable` in `reference-tables-api/http/v2_endpoints.go`.
* **Flow**:
    * Validates access details and schema (if provided).
    * Calls `ImportFileForReplace` (via `handleNonLocalFileSyncDetails` or the direct local flow).
    * `reference-tables-edge` calls `ReplaceResources`.
    * `ReplaceResources` calls `upsert.Replace`.
    * **Code:** [`Replace` in `upsert_replace.go`](https://github.com/DataDog/dd-go/blob/prod/resources/reference-tables/pkg/usecase/referencetables/upsert_replace.go#L72)

## The Core Issue: Additive-Only Schema in Replace

In the **Patch/Replace** flow (`upsert.Replace`), the schema update logic is **additive-only** and conservative.

**Logic in `convertHeadersForReplace` (`upsert.go`):**
1. It iterates over the **existing** schema fields and adds them to the new schema definition.
2. It checks if `primaryKey` fields are present in the new file headers.
3. It then adds any **new** headers found in the file.
4. **CRITICAL**: It does **NOT** remove fields that are missing from the file (unless they are primary keys, which causes an error).
5. `schemaChanged` is set to true **only if new fields are added** (or labels change).

**Consequence** (see the simplified sketch below):
* If you **add** a column: `schemaChanged` is true, and `UpdateTableSchema` is called synchronously. The schema evolves.
* If you **remove** a column: `schemaChanged` is false (or the field is kept). The schema **does not evolve** to reflect the removal. The old column remains in the schema.
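
A simplified, self-contained sketch of the additive-only merge described above (this is not the dd-go `convertHeadersForReplace` code; the types, the STRING defaulting, and the omission of primary-key and label handling are assumptions made for illustration):

```go
// additive_merge.go - simplified illustration of the additive-only schema merge:
// existing fields are kept, new headers are appended, and removals are never reflected.
package main

import "fmt"

type field struct {
	Name string
	Type string
}

// mergeHeaders mimics the behavior described for convertHeadersForReplace:
// start from the existing schema, append headers not seen before, and report
// schemaChanged only when something new was added.
func mergeHeaders(existing []field, headers []string) (merged []field, schemaChanged bool) {
	seen := map[string]bool{}
	for _, f := range existing {
		merged = append(merged, f) // existing fields are always kept
		seen[f.Name] = true
	}
	for _, h := range headers {
		if !seen[h] {
			merged = append(merged, field{Name: h, Type: "STRING"}) // type defaulting is assumed
			schemaChanged = true
		}
	}
	return merged, schemaChanged
}

func main() {
	existing := []field{{"a", "STRING"}, {"b", "STRING"}, {"c", "STRING"}}

	// Adding a column: schemaChanged is true and the schema evolves to a, b, c, d.
	added, changed := mergeHeaders(existing, []string{"a", "b", "c", "d"})
	fmt.Println(added, changed) // [{a STRING} {b STRING} {c STRING} {d STRING}] true

	// Removing a column: schemaChanged stays false and c is silently kept.
	removed, changed := mergeHeaders(existing, []string{"a", "b"})
	fmt.Println(removed, changed) // [{a STRING} {b STRING} {c STRING}] false
}
```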
## The "Missing Link" & File Operator
43+
44+
You asked about the `file-operator` proof. Here is the clarification:
45+
46+
1. **File Operator Role**:
47+
* The `file-operator` processes the raw file and infers the schema *exactly as it is in the file*.
48+
* It sends this exact schema in a `TableDefinition` message to the `write-operator`.
49+
* **Code:** [`Process` in `blob/writer.go`](https://github.com/DataDog/dd-go/blob/prod/resources/reference-tables/pkg/usecase/blob/writer.go#L178)
50+
51+
2. **Write Operator & Aggregator (The Proof)**:
52+
* The `write-operator` consumes this message and calls `WriteSchema` on the backend.
53+
* For file-based tables (Postgres/Cassandra), the backend is `aggregator`.
54+
* **The Proof**: `Aggregator.WriteSchema` is a **NO-OP**.
55+
* **Code:** [`WriteSchema` (NO-OP) in `aggregator.go`](https://github.com/DataDog/dd-go/blob/prod/resources/reference-tables/pkg/repository/aggregator/aggregator.go#L131-L140)
56+
57+
```go
58+
func (a *Aggregator) WriteSchema(...) error {
59+
// ... tracing ...
60+
span.Finish()
61+
return nil // NO-OP: Does not write to Postgres
62+
}
63+
```
64+
65+
**Why this matters**:
66+
* Because `upsert.Replace` (Edge service) is **additive-only**, it cannot handle column removals.
67+
* The `file-operator` -> `write-operator` pipeline *has* the correct, exact schema (from the file).
68+
* If `Aggregator.WriteSchema` were implemented to update Postgres, it would **overwrite** the additive schema with the *exact* schema from the file, effectively supporting full schema evolution (including removals).
69+
* Since it is a NO-OP, we are stuck with the additive-only behavior of the synchronous Edge service.
70+
71+
## Conclusion
72+
73+
The "bug" preventing schema evolution (specifically removals or full sync) is a combination of:
74+
1. **Edge Service Design**: `upsert.Replace` is intentionally additive/safe.
75+
2. **Missing Async Update**: The `write-operator` (via `aggregator`) ignores the schema inferred from the file processing, which is the only place where the "true" file schema exists.
76+
77+
**To support full schema evolution (making the file the source of truth):**
78+
We must implement `WriteSchema` in `aggregator.go` to update `table_metadata` in Postgres. This will make the system eventually consistent with the file content.
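
For illustration only, here is one hypothetical shape such an implementation could take; the method signature, receiver fields, SQL, and serialization below are all assumptions, not the actual `aggregator` code:

```go
// Hypothetical only: one possible shape for a non-NO-OP Aggregator.WriteSchema.
package aggregator

import (
	"context"
	"database/sql"
	"encoding/json"
)

type Field struct {
	Name string `json:"name"`
	Type string `json:"type"`
}

type Aggregator struct {
	db *sql.DB
}

// WriteSchema persists the exact schema inferred from the file, making the file
// the source of truth and allowing removals as well as additions.
func (a *Aggregator) WriteSchema(ctx context.Context, tableID string, fields []Field) error {
	encoded, err := json.Marshal(fields)
	if err != nil {
		return err
	}
	_, err = a.db.ExecContext(ctx,
		`UPDATE table_metadata SET schema = $1 WHERE table_id = $2`,
		encoded, tableID)
	return err
}
```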
