
Commit 4673c5d

add passing tests, full support for cloud file tables
1 parent 5edc871 commit 4673c5d

30 files changed: +3017 -412 lines changed

SCHEMA_EVOLUTION_DEBUG_SUMMARY.md

Lines changed: 277 additions & 0 deletions
@@ -0,0 +1,277 @@

# Reference Table Schema Evolution - Debug Summary

## Context

We are debugging why `TestAccReferenceTable_SchemaEvolution` fails in Terraform when updating a cloud file table's schema, even though the same operation works when calling the API directly.

### The Problem

- **Terraform test**: Fails when updating the schema from 3 fields (a, b, c) to 4 fields (a, b, c, d) for a cloud file table
- **Direct API test**: The same operation succeeds when performed manually
- **Root cause**: Schema updates for cloud file tables are asynchronous; the file sync must complete before the schema is updated in the backend

## What We've Done

### 1. Identified the Asynchronous Nature
- Schema updates for cloud files happen asynchronously after the file sync completes
- The API returns 200 OK immediately, but the schema update happens later
- In manual testing, we confirmed schema updates take ~5-10 seconds after the API call

### 2. Added Retry Logic to Terraform Provider
**File**: `datadog/fwprovider/resource_datadog_reference_table.go`

- Added retry logic in the `Update()` method to wait for schema updates (lines 415-470)
- Added a 3-second initial wait before the first check (line 424)
- Added a pre-update check to ensure the table is ready (status DONE/ERROR) before updating the schema (lines 363-401)
- Improved error messages to include status and file path for debugging
- Retry configuration: 10 attempts at 5-second intervals (50 seconds total)

### 3. Updated Test Configuration
**File**: `datadog/tests/resource_datadog_reference_table_test.go`

- Test creates the table with `test.csv` (3 fields: a, b, c)
- Test updates the table with `test2.csv` (4 fields: a, b, c, d)
- Added a wait step between create and update (currently commented out for debugging)
- Test expects 3 fields initially, then 4 fields after the update

### 4. Key Code Changes

#### Pre-Update Wait Logic
```go
// If we're updating schema for a cloud file table, ensure the table is ready (not processing)
// This prevents race conditions where we try to update while the initial sync is still running
isUpdatingSchema := planState.Schema != nil
if isUpdatingSchema && currentState.Source.ValueString() != "LOCAL_FILE" {
	// Check current status - if still processing, wait for it to complete
	currentResp, _, err := r.Api.GetTable(r.Auth, id)
	if err == nil && currentResp.Data != nil {
		attrs := currentResp.Data.GetAttributes()
		if status, ok := attrs.GetStatusOk(); ok && status != nil {
			statusStr := string(*status)
			if statusStr != "DONE" && statusStr != "ERROR" {
				// Wait for table to be ready before updating
				// ... retry logic ...
			}
		}
	}
}
```
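
The elided retry above can take the same shape as the post-update retry shown next. Below is a rough, non-authoritative sketch of that wait, reusing only identifiers that already appear in this document (`utils.Retry`, `r.Api.GetTable`, `attrs.GetStatusOk`); it assumes `utils.Retry` re-invokes its callback until it returns nil or attempts run out, and it omits the diagnostics plumbing:

```go
// Sketch only: wait until the table reports DONE or ERROR before sending the schema update.
waitErr := utils.Retry(5*time.Second, 10, func() error {
	resp, _, err := r.Api.GetTable(r.Auth, id)
	if err != nil || resp.Data == nil {
		return fmt.Errorf("error fetching reference table %s while waiting for initial sync: %v", id, err)
	}
	attrs := resp.Data.GetAttributes()
	if status, ok := attrs.GetStatusOk(); ok && status != nil {
		if s := string(*status); s == "DONE" || s == "ERROR" {
			return nil // initial sync finished; safe to update the schema
		}
	}
	return fmt.Errorf("reference table %s is still processing; retrying", id)
})
// In the real provider, waitErr should be surfaced through the framework diagnostics (omitted in this sketch).
_ = waitErr
```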

#### Post-Update Retry Logic
```go
// Wait 3 seconds before first check (matches manual API test timing)
if isUpdatingSchema && expectedFieldCount > 0 {
	time.Sleep(3 * time.Second)
}

// Retry until schema matches expected field count
retryErr := utils.Retry(5*time.Second, 10, func() error {
	resp, httpResp, err = r.Api.GetTable(r.Auth, id)
	// Check schema field count matches expected
	// Check for file processing errors
	// Return retryable error if schema doesn't match yet
})
```
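
The same polling idea can also be reproduced outside the provider as a small standalone program against the endpoint used in the commands below. This is a sketch rather than part of the commit: it assumes the response shape implied by the `jq` filters and provider code above (`data.attributes.status`, `data.attributes.schema.fields`), and the file name and argument handling are made up for illustration.

```go
// pollschema.go - standalone sketch: poll a reference table until its schema
// reaches the expected number of fields (or the retry budget is exhausted).
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

type tableResponse struct {
	Data struct {
		Attributes struct {
			Status string `json:"status"`
			Schema struct {
				Fields []struct {
					Name string `json:"name"`
					Type string `json:"type"`
				} `json:"fields"`
			} `json:"schema"`
		} `json:"attributes"`
	} `json:"data"`
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: pollschema <table-id>")
		os.Exit(1)
	}
	tableID := os.Args[1] // e.g. the ID returned by the create call
	want := 4             // expected field count after the update

	url := "https://dd.datad0g.com/api/v2/reference-tables/tables/" + tableID
	for attempt := 1; attempt <= 10; attempt++ {
		req, _ := http.NewRequest(http.MethodGet, url, nil)
		req.Header.Set("DD-API-KEY", os.Getenv("DD_API_KEY"))
		req.Header.Set("DD-APPLICATION-KEY", os.Getenv("DD_APP_KEY"))

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			fmt.Println("request failed:", err)
		} else {
			var table tableResponse
			decodeErr := json.NewDecoder(resp.Body).Decode(&table)
			resp.Body.Close()
			if decodeErr == nil {
				got := len(table.Data.Attributes.Schema.Fields)
				fmt.Printf("attempt %d: status=%s fields=%d\n", attempt, table.Data.Attributes.Status, got)
				if got == want {
					return // schema has evolved to the expected shape
				}
			}
		}
		time.Sleep(5 * time.Second) // same interval the provider retry uses
	}
	fmt.Println("schema did not reach the expected field count in time")
	os.Exit(1)
}
```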

## What Remains to Be Fixed

### 1. Test Still Failing
- The test currently fails; recent runs have hit API 500 errors that appear to be transient
- Need to verify the retry logic works correctly once the API is stable
- May need to adjust the retry count/interval based on actual API timing

### 2. Wait Step in Test
- The wait step between create and update is currently commented out
- It should be re-enabled once we confirm the pre-update wait logic works
- Location: `datadog/tests/resource_datadog_reference_table_test.go`, lines 88-95

### 3. Potential Issues to Investigate
- **Race condition**: Even with the pre-update wait, there might be a race between file sync completion and the schema update
- **Error handling**: Need to verify error messages are helpful when the schema doesn't update
- **Timing**: May need to increase the retry count if schema updates take longer than 50 seconds

## Exact Commands

### Terraform Test Command

```bash
cd /Users/guillaume.brizolier/go/src/github.com/DataDog/terraform-provider-datadog

dd-auth --domain dd.datad0g.com -- sh -c '
export DD_TEST_CLIENT_API_KEY=$DD_API_KEY
export DD_TEST_CLIENT_APP_KEY=$DD_APP_KEY
export DD_TEST_SITE_URL=https://dd.datad0g.com/
export DD_TEST_ORG=yB5yjZ
export TF_ACC=true
go test -v -run TestAccReferenceTable_SchemaEvolution ./datadog/tests/ -timeout 30m
'
```

### Manual API Test Commands

#### 1. Create Table with test.csv (3 fields: a, b, c)

```bash
cd /Users/guillaume.brizolier/go/src/github.com/DataDog/terraform-provider-datadog

dd-auth --domain dd.datad0g.com -- sh -c '
export DD_API_KEY
export DD_APP_KEY
TABLE_NAME="test_schema_evolution_$(date +%s)"

curl -X POST "https://dd.datad0g.com/api/v2/reference-tables/tables" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"data\": {
      \"type\": \"reference_table\",
      \"attributes\": {
        \"table_name\": \"$TABLE_NAME\",
        \"source\": \"S3\",
        \"file_metadata\": {
          \"sync_enabled\": true,
          \"access_details\": {
            \"aws_detail\": {
              \"aws_account_id\": \"924305315327\",
              \"aws_bucket_name\": \"dd-reference-tables-dev-staging\",
              \"file_path\": \"test.csv\"
            }
          }
        },
        \"schema\": {
          \"primary_keys\": [\"a\"],
          \"fields\": [
            {\"name\": \"a\", \"type\": \"STRING\"},
            {\"name\": \"b\", \"type\": \"STRING\"},
            {\"name\": \"c\", \"type\": \"STRING\"}
          ]
        }
      }
    }
  }"
'
```

**Save the table ID from the response** (e.g., `TABLE_ID="abc123..."`)

#### 2. Wait for Initial Sync (3 seconds)

```bash
sleep 3
```

#### 3. Update Table with test2.csv (4 fields: a, b, c, d)

```bash
dd-auth --domain dd.datad0g.com -- sh -c '
export DD_API_KEY
export DD_APP_KEY
TABLE_ID="<TABLE_ID_FROM_STEP_1>"

curl -X PATCH "https://dd.datad0g.com/api/v2/reference-tables/tables/$TABLE_ID" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"data\": {
      \"type\": \"reference_table\",
      \"attributes\": {
        \"file_metadata\": {
          \"sync_enabled\": true,
          \"access_details\": {
            \"aws_detail\": {
              \"aws_account_id\": \"924305315327\",
              \"aws_bucket_name\": \"dd-reference-tables-dev-staging\",
              \"file_path\": \"test2.csv\"
            }
          }
        },
        \"schema\": {
          \"primary_keys\": [\"a\"],
          \"fields\": [
            {\"name\": \"a\", \"type\": \"STRING\"},
            {\"name\": \"b\", \"type\": \"STRING\"},
            {\"name\": \"c\", \"type\": \"STRING\"},
            {\"name\": \"d\", \"type\": \"STRING\"}
          ]
        }
      }
    }
  }"
'
```

#### 4. Check Schema Immediately (should still show 3 fields)

```bash
dd-auth --domain dd.datad0g.com -- sh -c '
export DD_API_KEY
export DD_APP_KEY
TABLE_ID="<TABLE_ID_FROM_STEP_1>"

curl -X GET "https://dd.datad0g.com/api/v2/reference-tables/tables/$TABLE_ID" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" | jq ".data.attributes.schema.fields | length"
'
```

**Expected**: `3` (schema not updated yet)

#### 5. Wait and Check Again (should show 4 fields)

```bash
sleep 5

dd-auth --domain dd.datad0g.com -- sh -c '
export DD_API_KEY
export DD_APP_KEY
TABLE_ID="<TABLE_ID_FROM_STEP_1>"

curl -X GET "https://dd.datad0g.com/api/v2/reference-tables/tables/$TABLE_ID" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" | jq ".data.attributes.schema.fields | length"
'
```

**Expected**: `4` (schema updated after file sync completes)

## Test Files

### Test Configuration
- **File**: `datadog/tests/resource_datadog_reference_table_test.go`
- **Test function**: `TestAccReferenceTable_SchemaEvolution` (line 61)
- **Initial config**: `testAccCheckDatadogReferenceTableSchemaInitial` (uses `test.csv` with 3 fields)
- **Update config**: `testAccCheckDatadogReferenceTableSchemaAddFields` (uses `test2.csv` with 4 fields)

### Test Data Files (in S3 bucket `dd-reference-tables-dev-staging`)
- **test.csv**: Contains columns `a, b, c` with sample data
- **test2.csv**: Contains columns `a, b, c, d` with sample data

## Environment Variables Required

```bash
export DD_TEST_CLIENT_API_KEY=<api_key>
export DD_TEST_CLIENT_APP_KEY=<app_key>
export DD_TEST_SITE_URL=https://dd.datad0g.com/
export DD_TEST_ORG=yB5yjZ  # Public org ID for staging
export TF_ACC=true         # Required to run acceptance tests
```

## Key Insights

1. **Asynchronous Schema Updates**: Schema updates for cloud files are asynchronous and happen after the file sync completes
2. **Timing Matters**: We need to wait 3-5 seconds after the update API call before the schema reflects the changes
3. **Pre-Update Check**: The table must be in DONE/ERROR status before attempting a schema update
4. **Retry Logic**: The Terraform provider needs retry logic to wait for async schema updates
5. **Error Detection**: File processing errors (like "more columns than schema") indicate the schema hasn't updated yet

## Next Steps

1. Re-run the Terraform test once the API is stable (currently seeing 500 errors)
2. Verify the pre-update wait logic prevents race conditions
3. Re-enable the wait step in the test if needed
4. Adjust the retry timing if schema updates take longer than expected
5. Add more detailed logging if issues persist

SCHEMA_UPDATE_ANALYSIS.md

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@

# Schema Evolution & Async Processing Analysis

## Overview

The "Schema Evolution" failure (where `GetTable` returns the old schema after a file update) is caused by the asynchronous nature of the file processing pipeline and by how schema updates are handled differently in the **Create** versus **Update** (Patch/Replace) flows.

## API Logic: Create vs. Patch

### 1. Create Table (`POST /v2/reference-tables`)
* **Endpoint**: `CreateTable` in `reference-tables-api/http/v2_endpoints.go`.
* **Flow**:
    1. Calls `ImportFileForCreate` in `reference-tables-edge`.
    2. `reference-tables-edge` calls `CreateResources`.
    3. `CreateResources` calls `upsert.Create`.
    4. `upsert.Create` uses `convertHeadersForCreate`, which **creates a new schema** from the file headers.
    5. It then calls `produceUpsertRawFile` with `RawFile_CREATE`.

### 2. Patch Table (`PATCH /v2/reference-tables/{id}`)
* **Endpoint**: `PatchTable` in `reference-tables-api/http/v2_endpoints.go`.
* **Flow**:
    * Validates access details and schema (if provided).
    * Calls `ImportFileForReplace` (via `handleNonLocalFileSyncDetails` or the direct local flow).
    * `reference-tables-edge` calls `ReplaceResources`.
    * `ReplaceResources` calls `upsert.Replace`.
    * **Code:** [`Replace` in `upsert_replace.go`](https://github.com/DataDog/dd-go/blob/prod/resources/reference-tables/pkg/usecase/referencetables/upsert_replace.go#L72)

## The Core Issue: Additive-Only Schema in Replace

In the **Patch/Replace** flow (`upsert.Replace`), the schema update logic is **additive-only** and conservative.

**Logic in `convertHeadersForReplace` (`upsert.go`):**
1. It iterates over the **existing** schema fields and adds them to the new schema definition.
2. It checks if `primaryKey` fields are present in the new file headers.
3. It then adds any **new** headers found in the file.
4. **CRITICAL**: It does **NOT** remove fields that are missing from the file (unless they are primary keys, which causes an error).
5. `schemaChanged` is set to true **only if new fields are added** (or labels change).

**Consequence** (see the simplified sketch below):
* If you **add** a column: `schemaChanged` is true, and `UpdateTableSchema` is called synchronously. The schema evolves.
* If you **remove** a column: `schemaChanged` is false (or the field is kept). The schema **does not evolve** to reflect the removal. The old column remains in the schema.
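
A simplified, self-contained sketch of the additive-only merge described above (this is not the dd-go `convertHeadersForReplace` code; the types, the STRING defaulting, and the omission of primary-key and label handling are assumptions made for illustration):

```go
// additive_merge.go - simplified illustration of the additive-only schema merge:
// existing fields are kept, new headers are appended, and removals are never reflected.
package main

import "fmt"

type field struct {
	Name string
	Type string
}

// mergeHeaders mimics the behavior described for convertHeadersForReplace:
// start from the existing schema, append headers not seen before, and report
// schemaChanged only when something new was added.
func mergeHeaders(existing []field, headers []string) (merged []field, schemaChanged bool) {
	seen := map[string]bool{}
	for _, f := range existing {
		merged = append(merged, f) // existing fields are always kept
		seen[f.Name] = true
	}
	for _, h := range headers {
		if !seen[h] {
			merged = append(merged, field{Name: h, Type: "STRING"}) // type defaulting is assumed
			schemaChanged = true
		}
	}
	return merged, schemaChanged
}

func main() {
	existing := []field{{"a", "STRING"}, {"b", "STRING"}, {"c", "STRING"}}

	// Adding a column: schemaChanged is true and the schema evolves to a, b, c, d.
	added, changed := mergeHeaders(existing, []string{"a", "b", "c", "d"})
	fmt.Println(added, changed) // [{a STRING} {b STRING} {c STRING} {d STRING}] true

	// Removing a column: schemaChanged stays false and c is silently kept.
	removed, changed := mergeHeaders(existing, []string{"a", "b"})
	fmt.Println(removed, changed) // [{a STRING} {b STRING} {c STRING}] false
}
```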
## The "Missing Link" & File Operator
43+
44+
You asked about the `file-operator` proof. Here is the clarification:
45+
46+
1. **File Operator Role**:
47+
* The `file-operator` processes the raw file and infers the schema *exactly as it is in the file*.
48+
* It sends this exact schema in a `TableDefinition` message to the `write-operator`.
49+
* **Code:** [`Process` in `blob/writer.go`](https://github.com/DataDog/dd-go/blob/prod/resources/reference-tables/pkg/usecase/blob/writer.go#L178)
50+
51+
2. **Write Operator & Aggregator (The Proof)**:
52+
* The `write-operator` consumes this message and calls `WriteSchema` on the backend.
53+
* For file-based tables (Postgres/Cassandra), the backend is `aggregator`.
54+
* **The Proof**: `Aggregator.WriteSchema` is a **NO-OP**.
55+
* **Code:** [`WriteSchema` (NO-OP) in `aggregator.go`](https://github.com/DataDog/dd-go/blob/prod/resources/reference-tables/pkg/repository/aggregator/aggregator.go#L131-L140)
56+
57+
```go
58+
func (a *Aggregator) WriteSchema(...) error {
59+
// ... tracing ...
60+
span.Finish()
61+
return nil // NO-OP: Does not write to Postgres
62+
}
63+
```
64+
65+
**Why this matters**:
66+
* Because `upsert.Replace` (Edge service) is **additive-only**, it cannot handle column removals.
67+
* The `file-operator` -> `write-operator` pipeline *has* the correct, exact schema (from the file).
68+
* If `Aggregator.WriteSchema` were implemented to update Postgres, it would **overwrite** the additive schema with the *exact* schema from the file, effectively supporting full schema evolution (including removals).
69+
* Since it is a NO-OP, we are stuck with the additive-only behavior of the synchronous Edge service.
70+
71+
## Conclusion
72+
73+
The "bug" preventing schema evolution (specifically removals or full sync) is a combination of:
74+
1. **Edge Service Design**: `upsert.Replace` is intentionally additive/safe.
75+
2. **Missing Async Update**: The `write-operator` (via `aggregator`) ignores the schema inferred from the file processing, which is the only place where the "true" file schema exists.
76+
77+
**To support full schema evolution (making the file the source of truth):**
78+
We must implement `WriteSchema` in `aggregator.go` to update `table_metadata` in Postgres. This will make the system eventually consistent with the file content.
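
For illustration only, here is one hypothetical shape such an implementation could take; the method signature, receiver fields, SQL, and serialization below are all assumptions, not the actual `aggregator` code:

```go
// Hypothetical only: one possible shape for a non-NO-OP Aggregator.WriteSchema.
package aggregator

import (
	"context"
	"database/sql"
	"encoding/json"
)

type Field struct {
	Name string `json:"name"`
	Type string `json:"type"`
}

type Aggregator struct {
	db *sql.DB
}

// WriteSchema persists the exact schema inferred from the file, making the file
// the source of truth and allowing removals as well as additions.
func (a *Aggregator) WriteSchema(ctx context.Context, tableID string, fields []Field) error {
	encoded, err := json.Marshal(fields)
	if err != nil {
		return err
	}
	_, err = a.db.ExecContext(ctx,
		`UPDATE table_metadata SET schema = $1 WHERE table_id = $2`,
		encoded, tableID)
	return err
}
```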
