This document describes the SQL injection prevention measures implemented in ssb-parquedit.
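The codebase has been refactored to use parameterized queries (bind variables) wherever DuckDB supports them, so that user-supplied values are never interpolated into SQL text. Where parameterization is not possible, input is validated before use.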
Status: ✅ Implemented
DuckDB does not support parameterizing SQL identifiers (table names, column names, schema names). Instead, these are validated using strict whitelist patterns:
- Table names: Must start with a letter or underscore, contain only alphanumeric characters and underscores
- Column names: Must follow the same pattern as table names
- Partition columns: Validated before use in ALTER TABLE statements
Location: utils.py - SchemaUtils.validate_table_name() and SQLSanitizer.validate_column_list()
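The whitelist rule described above can be sketched as a single regular expression. This is a minimal illustration, not the code in utils.py: the function names `validate_identifier` and `validate_column_list_sketch` and the exact pattern are assumptions derived from the rule stated in the list.

```python
from __future__ import annotations

import re

# Assumed pattern: a letter or underscore first, then letters, digits, or underscores.
_IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validate_identifier(name: str) -> str:
    """Return the name unchanged if it matches the whitelist, else raise ValueError."""
    if not isinstance(name, str) or not _IDENTIFIER_RE.fullmatch(name):
        raise ValueError(f"Invalid SQL identifier: {name!r}")
    return name


def validate_column_list_sketch(columns: list[str]) -> list[str]:
    """Validate every column name; hypothetical stand-in for the real helper."""
    return [validate_identifier(column) for column in columns]
```

Because the whole string must match, anything containing spaces, quotes, or semicolons is rejected before it can reach an f-string that builds SQL.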
Status: ✅ Partial (Defense-in-depth)
WHERE and ORDER BY clauses are passed as strings because they often contain complex expressions. These cannot be fully parameterized without a major architectural refactor. Instead, the implementation uses:
- Checks for dangerous SQL keywords that shouldn't appear in WHERE clauses
- Detects SQL comment sequences (--, /*, */)
- Allows legitimate keywords like CAST that are used in normal WHERE expressions
Location: utils.py - SQLSanitizer.validate_where_clause()
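A denylist check of this kind might look like the sketch below. The keyword list and the function name `check_where_clause` are hypothetical; the real list in utils.py may differ.

```python
import re

# Hypothetical denylist of statements that have no place in a WHERE expression.
_FORBIDDEN_KEYWORDS = ("DROP", "DELETE", "INSERT", "UPDATE", "ALTER", "CREATE", "ATTACH", "COPY")
_COMMENT_SEQUENCES = ("--", "/*", "*/")


def check_where_clause(where: str) -> str:
    """Return the clause unchanged if no denylisted pattern is found, else raise."""
    for sequence in _COMMENT_SEQUENCES:
        if sequence in where:
            raise ValueError(f"SQL comment sequence {sequence!r} is not allowed")
    for keyword in _FORBIDDEN_KEYWORDS:
        # Word boundaries keep column names such as updated_at from matching UPDATE.
        if re.search(rf"\b{keyword}\b", where, flags=re.IGNORECASE):
            raise ValueError(f"Keyword {keyword} is not allowed in a WHERE clause")
    return where
```

A check like this is defense-in-depth only. It can reject legitimate input (for example a string literal that contains `--`), and it cannot prove that a clause is harmless, which is why the status above is marked as partial.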
- Stricter validation than WHERE clauses
- Only allows column names, ASC/DESC keywords, and basic arithmetic operators
- Rejects any dangerous SQL patterns
- Uses regex pattern matching to ensure valid format
Location: utils.py - SQLSanitizer.validate_order_by_clause()
Called from: query.py - select() and count() methods
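The stricter ORDER BY rule lends itself to an allowlist pattern instead of a denylist. The following is a sketch under assumptions: the grammar (comma-separated terms, each made of column names joined by basic arithmetic operators, with an optional ASC or DESC) is inferred from the bullets above, and `check_order_by` is a hypothetical name.

```python
import re

_IDENT = r"[A-Za-z_][A-Za-z0-9_]*"
# One term: column, optionally combined with other columns by + - * /, then optional direction.
_TERM_RE = re.compile(
    rf"{_IDENT}(?:\s*[-+*/]\s*{_IDENT})*(?:\s+(?:ASC|DESC))?",
    re.IGNORECASE,
)


def check_order_by(clause: str) -> str:
    """Return a normalized ORDER BY clause, or raise ValueError on any unexpected term."""
    terms = [term.strip() for term in clause.split(",")]
    for term in terms:
        if not _TERM_RE.fullmatch(term):
            raise ValueError(f"Invalid ORDER BY term: {term!r}")
    return ", ".join(terms)
```

Since every operator must be followed by an identifier, comment sequences such as `--` and `/*` cannot match. This simplified grammar also rejects numeric literals and function calls, which the real validator may or may not allow.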
Status: ✅ Implemented
LIMIT and OFFSET values are now passed as parameterized values using DuckDB's ? placeholder syntax.
Before:
query += f" LIMIT {limit}"
query += f" OFFSET {offset}"
result = self.conn.execute(query)
After:
if limit is not None:
query += " LIMIT ?"
params.append(limit)
if offset > 0:
query += " OFFSET ?"
params.append(offset)
result = self.conn.execute(query, params)
Location: query.py - select() method
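The pattern can be isolated into a small helper that returns the SQL text and the bound values together. This is an illustrative sketch: `build_paged_query` is a hypothetical name, and the integer checks are an extra safeguard that the source does not describe.

```python
from __future__ import annotations

from typing import Any


def _require_non_negative_int(value: Any, label: str) -> int:
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value < 0:
        raise ValueError(f"{label} must be a non-negative integer, got {value!r}")
    return value


def build_paged_query(
    base_query: str, limit: int | None = None, offset: int = 0
) -> tuple[str, list[Any]]:
    """Append LIMIT/OFFSET placeholders and collect their values in order."""
    query = base_query
    params: list[Any] = []
    if limit is not None:
        query += " LIMIT ?"
        params.append(_require_non_negative_int(limit, "limit"))
    if _require_non_negative_int(offset, "offset") > 0:
        query += " OFFSET ?"
        params.append(offset)
    return query, params
```

The returned pair would then be passed to the connection in one call, as in `conn.execute(query, params)` above, so the values never become part of the SQL string.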
Status: ✅ Implemented
File paths passed to read_parquet() are now parameterized instead of string interpolated.
Before:
ddl = f"""
CREATE TABLE {table_name} AS
SELECT * FROM read_parquet('{parquet_path}')
"""
self.conn.execute(ddl)
After:
ddl = f"""
CREATE TABLE {table_name} AS
SELECT * FROM read_parquet(?)
"""
self.conn.execute(ddl, [parquet_path])
Locations:
- ddl.py - _create_from_parquet() method
- dml.py - _insert_from_parquet() method
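Note that the "after" statement still interpolates the table name, so it is only safe when the name has been validated first. A sketch combining both measures is shown here; `build_create_from_parquet` is a hypothetical helper, and the identifier pattern is the one assumed from the whitelist rule earlier in this document.

```python
from __future__ import annotations

import re

_IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_create_from_parquet(table_name: str, parquet_path: str) -> tuple[str, list[str]]:
    """Validate the identifier, then bind the file path as a parameter."""
    if not _IDENT_RE.fullmatch(table_name):
        raise ValueError(f"Invalid table name: {table_name!r}")
    # The table name is interpolated only after validation; the path is never interpolated.
    ddl = f"CREATE TABLE {table_name} AS SELECT * FROM read_parquet(?)"
    return ddl, [parquet_path]
```

A path containing quotes or SQL fragments then stays inside the parameter list and cannot alter the statement text.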
- Numeric LIMIT/OFFSET values
- File paths in read_parquet()
- Column names in SELECT clauses, partitioning
- Table names
- WHERE clause values: These are validated but not fully parameterized due to architectural constraints
- ORDER BY expressions: Validated with strict pattern matching
Recommendations for Users:
- Always validate and sanitize WHERE clause input at the application level
- Use prepared statements in your application when constructing WHERE clauses
- Consider using DuckDB's filter APIs or query builders that support parameterization
- Run ssb-parquedit with database users that have minimal required permissions
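For applications that build their own WHERE conditions, one way to follow these recommendations is to translate a structured filter into a placeholder expression plus a parameter list. The sketch below assumes a filter dictionary with `column`, `operator`, and `value` keys; the operator allowlist and the name `filter_to_sql` are hypothetical and do not describe how ssb-parquedit implements its filters.

```python
from __future__ import annotations

import re
from typing import Any

_IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Hypothetical allowlist of comparison operators.
_ALLOWED_OPERATORS = {"=", "!=", "<", "<=", ">", ">="}


def filter_to_sql(flt: dict[str, Any]) -> tuple[str, list[Any]]:
    """Turn one structured filter into a WHERE fragment and its bound values."""
    column = flt["column"]
    operator = flt["operator"]
    if not isinstance(column, str) or not _IDENT_RE.fullmatch(column):
        raise ValueError(f"Invalid column name: {column!r}")
    if operator not in _ALLOWED_OPERATORS:
        raise ValueError(f"Unsupported operator: {operator!r}")
    # Only the validated column and allowlisted operator reach the SQL text.
    return f"{column} {operator} ?", [flt["value"]]
```

The column and operator are checked against allowlists, while the value travels separately as a bound parameter, so a hostile value is compared as data instead of being parsed as SQL.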
# Parameterized LIMIT/OFFSET
df = editor.view("users", limit=10, offset=20)
# Validated table and column names
df = editor.view("users", columns=["id", "name", "email"])
# Structured filters (RECOMMENDED)
df = editor.view("users", filters={"column": "age", "operator": ">", "value": 25})
# Parameterized file paths
editor.insert_data("users", "/path/to/users.parquet")

# DON'T construct filter values without parameterization
# Instead, use structured filters which handle parameterization automatically:
user_age = user_input # Could be malicious
df = editor.view("users", filters={"column": "age", "operator": ">", "value": user_age})
# The value is automatically parameterized, preventing injection
# Even if user_age contains: "25); DROP TABLE users; --"
# It will be safely treated as a literal string value

To verify the SQL injection prevention measures are working:
# Run the test suite
pytest tests/
# Run type checking
mypy src/ssb_parquedit/
# Check for potential issues
ruff check src/

The refactoring is backward compatible: existing code will continue to work, but now with enhanced security:
- LIMIT/OFFSET now use parameterized queries internally (transparent to users)
- WHERE/ORDER BY clauses are validated before execution
- File paths are parameterized (transparent to users)
- Column and table name validation is more explicit (will raise errors on invalid names)