Commit 175d592

Address several roborev review findings
All 24 individual tests pass (all checkmarks). The "FAIL" is a pre-existing vitest configuration issue ("No test suite found") unrelated to my changes — all actual test cases pass.

Changes:
- In `translateSql`, only the first standalone `_t_1` occurrence gets the `AS _t_1` alias; subsequent standalone occurrences get the bare `readExpr` to avoid duplicate alias issues in UNION/subquery contexts
- Added test case for multiple standalone `_t_1` references verifying alias-on-first-only behavior

Address review findings (job 115)

All 274 tests pass. The changes compile cleanly and tests are green.

Changes:
- Include the Parquet conversion warning in the error path (not just success), so users see the full chain of failures when both DuckDB and sqlp fail
- Log when the Parquet warning cannot be prepended due to unexpected content structure

Address review findings (job 117)

All changes are complete. Build passes and all 274 tests pass.

Changes:
- Fix SQL injection via the `compression`/codec parameter: whitelist valid codecs (`zstd`, `snappy`, `gzip`, `lz4`, `uncompressed`) and reject anything else
- Fix SQL injection via the `delimiter` parameter: validate exactly 1 character and escape single quotes
- Fix race condition in `ensureParquet()`: add an in-memory lock map to deduplicate concurrent conversions for the same file
- Add error checking after each qsv subprocess call in `ensureParquet()`: wrap each step in try/catch with descriptive error messages identifying which step failed
- Verify the Parquet output file exists after conversion before returning success

Address review findings (job 118)

No tests reference the codec list. The change is minimal and the build succeeds.

Changes:
- Add `brotli` and `lzo` to the `VALID_PARQUET_CODECS` set to match DuckDB's supported compression codecs

Address review findings (job 120)

All changes complete. Build succeeds and all 276 tests pass.
Changes:
- Remove unused `runQsvSimple` import from `duckdb.ts` (finding #1)
- Add trust assumption comment documenting that the `sql` parameter comes from a trusted MCP agent in `tryDuckDbExecution` (finding #2)
- Clean up partial `.parquet` file on conversion failure in `doParquetConversion` to prevent stale corrupted files (finding #4)
- Detect multi-table SQL queries (`_t_2`, `_t_3`, etc.) and fall back to sqlp preemptively instead of sending untranslated references to DuckDB (finding #5)
- Add tests for the multi-table reference detection regex and `_t_2+` non-translation behavior

Address review findings (job 121)

Everything looks clean. Here's a summary:

Changes:
- Fix multi-table sqlp fallback: stop setting `SKIP_INPUT` for multi-table queries so sqlp receives the original input file and can resolve `_t_2+` references natively
- Extract the `MULTI_TABLE_PATTERN` regex to a shared constant exported from `duckdb.ts`, eliminating duplication between `mcp-tools.ts` and `duckdb.test.ts`
- Clean up asymmetric brace style in the DuckDB/sqlp interception block into a clean if/else

Address review findings (job 122)

All 276 tests pass. The changes are minimal and focused:

Changes:
- Remove the case-insensitive `/i` flag from the `MULTI_TABLE_PATTERN` regex since sqlp table references (`_t_N`) are case-sensitive
- Add test cases verifying uppercase `_T_2` and `_T_10` do not match the multi-table pattern

Address review findings (job 124)

All changes applied, build and tests pass (276/276).
Changes:
- Add `.sz` compressed file variants to `isCsvLikeFile()` so Snappy-compressed CSV/TSV/TAB/SSV files are recognized and converted to Parquet (finding 1)
- Move the `ensureParquet()` call after the `MULTI_TABLE_PATTERN` check to avoid wasted Parquet conversion on multi-table queries that fall through to sqlp (finding 2)
- Restore the case-insensitive flag on the `MULTI_TABLE_PATTERN` regex for consistency with the case-insensitive `_t_1` replacement in `translateSql` (finding 3)
- Update test assertions to expect uppercase `_T_2`/`_T_10` to match the now case-insensitive pattern

Address review findings (job 125)

All 280 tests pass. Here's the summary:

Changes:
- Reverted the `MULTI_TABLE_PATTERN` regex to case-sensitive (removed `/i` flag) since sqlp table references (`_t_N`) are case-sensitive and uppercase `_T_2` would cause query errors
- Reverted duckdb tests to assert uppercase `_T_N` does NOT match the pattern
- Exported the `isCsvLikeFile` function from `mcp-tools.ts` to enable testing
- Added unit tests for `isCsvLikeFile` covering standard extensions, `.sz` Snappy-compressed variants, case-insensitivity, and non-CSV rejection

Address review findings (job 127)

Changes:
- `getParquetPath` now uses `basename()` for extension matching, preventing false matches when directory paths contain CSV-like extensions (e.g., `/data/CSV_FILES/test.csv`)
- `MULTI_TABLE_PATTERN` made case-insensitive (`/i` flag) to be consistent with the `_t_1` replacement pattern, preventing confusing DuckDB errors when agents send uppercase `_T_2`
- `translateSql` now logs a warning for unknown file extensions (e.g., `.xlsx`) that are treated as CSV, helping diagnose cryptic DuckDB errors
- The secondary SIGKILL timer in `executeDuckDbQuery` is now properly cleared in both `close` and `error` handlers, preventing unnecessary SIGKILL on already-exited processes
- Updated `duckdb.test.ts` to match the case-insensitive `MULTI_TABLE_PATTERN` behavior

Address review findings (job 128)

All changes are complete.
TypeScript compiles cleanly and all 286 tests pass.

Changes:
- Fix the `MULTI_TABLE_PATTERN` flip-flop: make the regex case-sensitive again and add `normalizeTableRefs()` to lowercase `_T_N` → `_t_N` before routing, so sqlp receives consistent lowercase refs
- Extract a `CSV_LIKE_EXTENSIONS` shared constant in `duckdb.ts`, reused in both `isCsvLikeFile()` and `translateSql()` to eliminate duplicated extension lists
- Export `getParquetPath()` and add regression tests for the directory-path false-match bug (e.g., `/data/csv_files/test.json`)
- Add tests for `normalizeTableRefs()` and its interaction with `MULTI_TABLE_PATTERN`

Address review findings (job 129)

All 286 tests pass. The diagnostics about `qsv_bin` are pre-existing Rust issues unrelated to these changes.

Changes:
- Added a comment explaining `.txt`/`.dat` are warning-only suppressions, not in `CSV_LIKE_EXTENSIONS` (no Parquet conversion)
- Simplified the `params.sql` assignment by removing the conditional check — always assign the normalized SQL

Address review findings (job 131)

All 286 tests pass. Here's the summary:

Changes:
- Remove the `params.output` fallback in `tryDuckDbExecution` to use only the pre-validated `outputFile` parameter, closing a potential path validation bypass
- Add a clarifying comment to `getParquetPath` explaining why the mixed-case logic (lowercase matching + original-path slicing) is correct
- Add a comment to `translateSql` documenting that `_t_1` replacement also applies inside SQL string literals (acceptable limitation for MCP agent usage)

Address review findings (job 133)

All 287 tests pass, 0 failures.

Changes:
- Consolidated the `ChildProcess` type import with `spawn` into a single `child_process` import in `duckdb.ts`
- Extracted the inline `maxStderrSize` magic number to a named constant `MAX_STDERR_SIZE` alongside `DUCKDB_VALIDATION_TIMEOUT_MS`
- Exported `ensureParquet` and added tests for its early-return paths (non-CSV file passthrough)

Address review findings (job 135)

All 287 tests pass.
Changes:
- Made `MULTI_TABLE_PATTERN` case-insensitive (`/i` flag) for defense-in-depth, so uppercase `_T_N` references are caught even without `normalizeTableRefs` being called first
- Fixed a TOCTOU race in `ensureParquet` by checking the lock map before any async `stat()` calls
- Added zero-byte file validation after Parquet conversion to catch corrupted/empty output files
- Updated test assertions to match the case-insensitive `MULTI_TABLE_PATTERN` behavior

Address review findings (job 136)

All 287 tests pass. The diagnostics about `qsv_bin` and `lesson_*.rs` are pre-existing issues unrelated to this change.

Changes:
- Restructured zero-byte file validation in `doParquetConversion` to use a separate try/catch for `stat()` and a plain `if` for the size check, eliminating brittle string-matching error re-throw logic

Address review findings (job 138)

All changes are complete. Build passes and all 289 tests pass.

Changes:
- Fix `translateSql` to skip `_t_1` replacements inside single-quoted SQL string literals using a quote-aware regex
- Fix the `normalizeTableRefs` regex to handle both `_T_N` and `_t_N` variants (it was only matching uppercase `_T_`)
- Prefix the delimiter validation error with `[DuckDB]` for clearer error attribution
- Document `getParquetPath` behavior for non-CSV files (appends `.parquet` suffix)
- Add tests for string literal protection (`SELECT '_t_1'` preserved) and escaped quotes
- Add a test for `normalizeTableRefs` idempotency on already-lowercase refs

Address review findings (job 140)

All 290 tests pass. The changes are minimal and focused on the review findings.
Changes:
- Added a comment documenting that multi-table queries (`_t_2+`) don't benefit from Parquet auto-conversion, directing users to manually convert with `qsv_to_parquet`
- Added a comment clarifying delimiter validation: an actual tab character (`\t`, length 1) works correctly; only the 2-char string literal `\\t` would be rejected
- Added a test verifying multi-character delimiter strings are rejected by `translateSql`

Address review findings (job 142)

All 290 tests pass, build succeeds. Here's a summary:

Changes:
- Fix `translateSql` to use a `(read_expr) AS _t_1` alias so qualified column refs like `_t_1.column` remain valid (only standalone `_t_1` not followed by `.` is replaced)
- Reuse `translateSql()` in the sqlp fallback path instead of duplicating the `_t_1` → `read_parquet()` regex logic inline
- Add an early `inputFile` existence check in `doParquetConversion` with a clear "Input file not found" error message
- Simplify the stats/schema freshness check by removing an unnecessary nested try/catch (the input file is now validated upfront)
- Update all test assertions to match the new aliased read expression format
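The multi-table routing that the jobs above kept revisiting comes down to a single regex test. A minimal sketch using `MULTI_TABLE_PATTERN` as it finally landed (case-insensitive); `needsSqlpFallback` is a hypothetical wrapper name for illustration, not a function from the codebase:

```typescript
// MULTI_TABLE_PATTERN as it finally landed in duckdb.ts: matches _t_2 through
// _t_9 and any two-or-more-digit reference (_t_10, ...), case-insensitively.
// _t_1 is deliberately excluded: single-table queries stay on the DuckDB path.
const MULTI_TABLE_PATTERN = /\b_t_[2-9]\b|\b_t_\d{2,}\b/i;

// Hypothetical wrapper illustrating the routing decision described above.
function needsSqlpFallback(sql: string): boolean {
  return MULTI_TABLE_PATTERN.test(sql);
}
```

A single-table query (`_t_1` only) is translated for DuckDB; anything referencing `_t_2` or higher falls through to sqlp, which resolves those references natively.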
1 parent ab9daad commit 175d592

File tree

5 files changed: +387 additions, −92 deletions

.claude/skills/package-lock.json

Lines changed: 2 additions & 2 deletions
Some generated files are not rendered by default.

.claude/skills/src/duckdb.ts

Lines changed: 72 additions & 14 deletions
@@ -6,20 +6,27 @@
  * for better PostgreSQL compatibility and performance.
  */
 
-import { execFileSync } from "child_process";
+import { execFileSync, spawn, type ChildProcess } from "child_process";
 import { statSync } from "fs";
 import { homedir } from "os";
 import { join } from "path";
 import { config } from "./config.js";
-import { runQsvSimple } from "./executor.js";
-import type { ChildProcess } from "child_process";
-import { spawn } from "child_process";
 
 /**
  * Timeout for DuckDB binary validation in milliseconds (5 seconds)
  */
 const DUCKDB_VALIDATION_TIMEOUT_MS = 5000;
 
+/**
+ * Maximum stderr buffer size in bytes (1 MB) — prevents unbounded memory growth
+ */
+const MAX_STDERR_SIZE = 1024 * 1024;
+
+/**
+ * Valid Parquet compression codecs for DuckDB COPY TO
+ */
+const VALID_PARQUET_CODECS = new Set(["zstd", "snappy", "gzip", "lz4", "lzo", "brotli", "uncompressed"]);
+
 /**
  * DuckDB detection state — sticky once resolved
  */
@@ -74,6 +81,24 @@ export interface DuckDbResult {
   stderr: string;
 }
 
+/** Matches sqlp multi-table references (_t_2, _t_3, … _t_10, etc.) — case-insensitive for defense-in-depth */
+export const MULTI_TABLE_PATTERN = /\b_t_[2-9]\b|\b_t_\d{2,}\b/i;
+
+/** Known CSV-like file extensions (shared across modules) */
+export const CSV_LIKE_EXTENSIONS = [
+  ".csv", ".tsv", ".tab", ".ssv",
+  ".csv.sz", ".tsv.sz", ".tab.sz", ".ssv.sz",
+] as const;
+
+/**
+ * Normalize uppercase table references (_T_1, _T_2, etc.) to lowercase.
+ * Agents may send uppercase `_T_N`; this ensures consistent lowercase
+ * references before pattern matching and sqlp execution.
+ */
+export function normalizeTableRefs(sql: string): string {
+  return sql.replace(/\b_[tT]_(\d+)\b/g, (_match, n) => `_t_${n}`);
+}
+
 /**
  * Check if DuckDB is enabled via environment variable.
  * Returns false only if explicitly disabled.
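The `normalizeTableRefs` helper added in this hunk can be exercised like this (the function body below is copied from the hunk; only the surrounding harness is illustrative):

```typescript
// Copy of normalizeTableRefs from the diff above: lowercases any _T_N (or
// already-lowercase _t_N) table reference so routing regexes and sqlp always
// see consistent lowercase references.
function normalizeTableRefs(sql: string): string {
  return sql.replace(/\b_[tT]_(\d+)\b/g, (_match, n) => `_t_${n}`);
}
```

Already-lowercase refs pass through unchanged, which is the idempotency property the tests added in job 138 verify.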
@@ -246,10 +271,22 @@ export function translateSql(
   } else if (lowerPath.endsWith(".jsonl") || lowerPath.endsWith(".ndjson")) {
     readExpr = `read_json('${escapedPath}')`;
   } else {
-    // CSV-like: build read_csv with options
+    // CSV-like or unknown extension: build read_csv with options
+    // Warn for extensions that are unlikely to be CSV-compatible
+    // .txt/.dat suppress warnings but aren't in CSV_LIKE_EXTENSIONS (no Parquet conversion)
+    const knownExts = [".txt", ".dat", ...CSV_LIKE_EXTENSIONS];
+    if (!knownExts.some((ext) => lowerPath.endsWith(ext))) {
+      console.error(`[DuckDB] Warning: Unknown file extension treated as CSV: ${inputFile}`);
+    }
     const csvOptions: string[] = [];
     if (options?.delimiter) {
-      csvOptions.push(`delim = '${options.delimiter}'`);
+      // Note: actual tab character (\t) has length 1 and works correctly.
+      // The 2-char string literal "\\t" would be rejected, which is intentional.
+      if (options.delimiter.length !== 1) {
+        throw new Error(`[DuckDB] Invalid delimiter: must be exactly 1 character, got ${options.delimiter.length}`);
+      }
+      const escapedDelim = options.delimiter.replace(/'/g, "''");
+      csvOptions.push(`delim = '${escapedDelim}'`);
     }
     if (options?.rnullValues) {
       // Parse comma-separated null values into array, escaping single quotes
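The delimiter handling in this hunk can be isolated into a small sketch; `buildDelimOption` is a hypothetical standalone name, since in the codebase the logic is inline in `translateSql`:

```typescript
// Hypothetical standalone version of the delimiter validation above: exactly
// one character is accepted, and single quotes are doubled so the delimiter
// embeds safely inside the SQL string literal (closing the injection vector).
function buildDelimOption(delimiter: string): string {
  // An actual tab character ("\t") has length 1 and passes; the two-character
  // string literal "\\t" is rejected, as the diff's comment notes.
  if (delimiter.length !== 1) {
    throw new Error(`[DuckDB] Invalid delimiter: must be exactly 1 character, got ${delimiter.length}`);
  }
  const escaped = delimiter.replace(/'/g, "''");
  return `delim = '${escaped}'`;
}
```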
@@ -266,8 +303,24 @@ export function translateSql(
     }
   }
 
-  // Replace _t_1 (case-insensitive, word boundary) with the read expression
-  const translated = sql.replace(/\b_t_1\b/gi, readExpr);
+  // Replace standalone _t_1 (not followed by a dot) with the read expression,
+  // so that qualified column refs like _t_1.column remain valid via the alias.
+  // Only the first standalone occurrence gets the alias; subsequent ones get bare readExpr
+  // to avoid duplicate alias issues in UNION or subquery contexts.
+  // Skip replacements inside single-quoted SQL string literals.
+  const aliasedExpr = `(${readExpr}) AS _t_1`;
+  let firstReplaced = false;
+  const translated = sql.replace(
+    /'[^']*(?:''[^']*)*'|\b_t_1\b(?!\.)/gi,
+    (match) => {
+      if (match.startsWith("'")) return match;
+      if (!firstReplaced) {
+        firstReplaced = true;
+        return aliasedExpr;
+      }
+      return readExpr;
+    },
+  );
   return translated;
 }
 
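Pulled out as a standalone sketch (hypothetical function name; in the codebase this logic is inline in `translateSql`), the replacement above behaves like this:

```typescript
// Standalone version of the quote-aware replacement above: the alternation
// matches either a complete single-quoted SQL string literal (with '' escapes)
// or a standalone _t_1 not followed by a dot. Literals come back untouched;
// the first standalone _t_1 becomes an aliased subquery, later ones stay bare.
function replaceTableRef(sql: string, readExpr: string): string {
  const aliasedExpr = `(${readExpr}) AS _t_1`;
  let firstReplaced = false;
  return sql.replace(/'[^']*(?:''[^']*)*'|\b_t_1\b(?!\.)/gi, (match) => {
    if (match.startsWith("'")) return match; // inside a string literal
    if (!firstReplaced) {
      firstReplaced = true;
      return aliasedExpr; // first occurrence carries the alias
    }
    return readExpr; // avoid duplicate aliases in UNION/subquery contexts
  });
}
```

Qualified refs like `_t_1.column` survive because the negative lookahead `(?!\.)` leaves them alone, and the alias on the first replacement keeps them resolvable.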
@@ -312,7 +365,10 @@ export async function executeDuckDbQuery(
       throw new Error("output_file is required for parquet format output with DuckDB");
     }
     const normalizedOutput = outputFile.replace(/\\/g, "/").replace(/'/g, "''");
-    const codec = options?.compression ?? "zstd";
+    const codec = (options?.compression ?? "zstd").toLowerCase();
+    if (!VALID_PARQUET_CODECS.has(codec)) {
+      throw new Error(`Invalid parquet codec '${codec}'. Valid codecs: ${[...VALID_PARQUET_CODECS].join(", ")}`);
+    }
     fullSql += `COPY (${sql}) TO '${normalizedOutput}' (FORMAT PARQUET, CODEC '${codec}');`;
   } else {
     fullSql += sql;
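The codec whitelist check in this hunk, pulled out as a sketch (`resolveCodec` is a hypothetical name; the real check is inline in `executeDuckDbQuery`):

```typescript
// Same whitelist as VALID_PARQUET_CODECS in the diff. Lowercasing first means
// "ZSTD" and "zstd" are both accepted, while any unknown value is rejected
// before it can reach the COPY ... (CODEC '...') SQL string.
const VALID_PARQUET_CODECS = new Set(["zstd", "snappy", "gzip", "lz4", "lzo", "brotli", "uncompressed"]);

function resolveCodec(compression?: string): string {
  const codec = (compression ?? "zstd").toLowerCase();
  if (!VALID_PARQUET_CODECS.has(codec)) {
    throw new Error(`Invalid parquet codec '${codec}'. Valid codecs: ${[...VALID_PARQUET_CODECS].join(", ")}`);
  }
  return codec;
}
```

Whitelisting closes the injection vector entirely: a rejected value never gets interpolated into SQL, so no escaping scheme needs to be trusted.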
@@ -340,11 +396,12 @@ export async function executeDuckDbQuery(
   let stdoutTruncated = false;
   let timedOut = false;
   let timer: ReturnType<typeof setTimeout> | null = null;
+  let killTimer: ReturnType<typeof setTimeout> | null = null;
 
   timer = setTimeout(() => {
     timedOut = true;
     proc.kill("SIGTERM");
-    setTimeout(() => {
+    killTimer = setTimeout(() => {
       if (proc.exitCode === null) {
         try { proc.kill("SIGKILL"); } catch { /* ignore */ }
         proc.unref();
@@ -364,18 +421,18 @@ export async function executeDuckDbQuery(
     stdout += data;
   });
 
-  const maxStderrSize = 1024 * 1024; // 1 MB cap for stderr
   proc.stderr!.on("data", (chunk) => {
-    if (stderr.length < maxStderrSize) {
+    if (stderr.length < MAX_STDERR_SIZE) {
       stderr += chunk.toString();
-      if (stderr.length > maxStderrSize) {
-        stderr = stderr.slice(0, maxStderrSize) + "\n[STDERR TRUNCATED]";
+      if (stderr.length > MAX_STDERR_SIZE) {
+        stderr = stderr.slice(0, MAX_STDERR_SIZE) + "\n[STDERR TRUNCATED]";
       }
     }
   });
 
   proc.on("close", (exitCode) => {
     if (timer) clearTimeout(timer);
+    if (killTimer) clearTimeout(killTimer);
     options?.onExit?.(proc);
 
     if (timedOut) {
@@ -398,6 +455,7 @@ export async function executeDuckDbQuery(
 
   proc.on("error", (err) => {
     if (timer) clearTimeout(timer);
+    if (killTimer) clearTimeout(killTimer);
     options?.onExit?.(proc);
 
     // Binary-level failure (e.g., ENOENT)
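The in-memory lock map from jobs 117 and 135 is not visible in this diff (it lives in the Parquet-conversion code), but the technique the commit message describes can be sketched as follows. This is a hypothetical illustration, not the actual `ensureParquet` implementation; `convertOnce` and `conversionLocks` are invented names:

```typescript
// Hypothetical sketch of the lock-map deduplication described in the commit
// message: concurrent callers for the same input file share one in-flight
// promise instead of launching duplicate conversions, and the map is checked
// synchronously (before any async stat()) to avoid the TOCTOU race.
const conversionLocks = new Map<string, Promise<string>>();

async function convertOnce(
  inputFile: string,
  convert: (file: string) => Promise<string>,
): Promise<string> {
  const existing = conversionLocks.get(inputFile);
  if (existing) return existing; // a conversion for this file is already in flight

  const pending = convert(inputFile).finally(() => {
    conversionLocks.delete(inputFile); // release the lock on success or failure
  });
  conversionLocks.set(inputFile, pending);
  return pending;
}
```

Because the map lookup and insertion happen with no `await` in between, JavaScript's single-threaded event loop guarantees no second conversion can slip in between the check and the set.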
