Skip to content

Commit 39da385

Browse files
vidstraurb346roarv2osn-bip
authored
Refactored repo for pre-release
* added duckdb, pandas and gcsfs * added view_table in ParquEdit * added config-files * add use_parquedit.py and config-files * added tests for view_table * Utkast refaktorisering * flytte filer * lagt inn uuid-kolonne * justere insert av df * Revert "justere insert av df" This reverts commit d78e3a2. * saving work * saving work * refacotred insert methods to accept parquet as input * added parameters to view_table * identified de facto duplicate code for select (or view table) * ryddet i navn og metoder * suggested changes to prevent sql-injection * further filtration and protection added by mister Claude * removed all usage of old where clauses * Changed method name from select to view and refactored all tests * cleaned according to some pre-commit tests * cleaned to conform to pydoclint tests * cleaned all pydoclint * fixed connection issues after refactoring * removed . from imports * generated refactored tests * fixed pytest test_connection.py * fixed pytest test_utils * pytest corrections * pre-commit and pytest * pre-commit fixes * nox cleanup * clean up in repo * Fixed all nox-issues * Remove Python 3.10 from tests workflow Removed Python 3.10 from the testing sessions in the workflow. * fixed end-of-line blank * fixed security issue with regexp in utils.py --------- Co-authored-by: Urban Olsson <63000943+urb346@users.noreply.github.com> Co-authored-by: urb346 <urb@ssb.no> Co-authored-by: Roar Vålen <84438439+roarv2@users.noreply.github.com> Co-authored-by: ssb-rav <rav@ssb.no> Co-authored-by: osn-bip <osn@ssb.no>
1 parent 0493fde commit 39da385

21 files changed

+5672
-556
lines changed

.github/workflows/tests.yml

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -22,7 +22,6 @@ jobs:
2222
- { python: "3.11", os: "ubuntu-latest", session: "mypy" }
2323
- { python: "3.12", os: "ubuntu-latest", session: "mypy" }
2424
- { python: "3.13", os: "ubuntu-latest", session: "mypy" }
25-
- { python: "3.10", os: "ubuntu-latest", session: "tests" }
2625
- { python: "3.11", os: "ubuntu-latest", session: "tests" }
2726
- { python: "3.12", os: "ubuntu-latest", session: "tests" }
2827
- { python: "3.13", os: "ubuntu-latest", session: "tests" }

SQL_INJECTION_PREVENTION.md

Lines changed: 179 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,179 @@
1+
# SQL Injection Prevention Implementation
2+
3+
This document describes the SQL injection prevention measures implemented in ssb-parquedit.
4+
5+
## Overview
6+
7+
The codebase has been refactored to use parameterized queries (bind variables) to prevent SQL injection attacks, where applicable.
8+
9+
## Implementation Approach
10+
11+
### 1. Identifier Validation (Table Names, Column Names)
12+
13+
**Status**: ✅ Implemented
14+
15+
DuckDB does not support parameterizing SQL identifiers (table names, column names, schema names). Instead, these are validated using strict whitelist patterns:
16+
17+
- **Table names**: Must start with a letter or underscore, contain only alphanumeric characters and underscores
18+
- **Column names**: Must follow the same pattern as table names
19+
- **Partition columns**: Validated before use in ALTER TABLE statements
20+
21+
**Location**: `utils.py` - `SchemaUtils.validate_table_name()` and `SQLSanitizer.validate_column_list()`
22+
23+
### 2. Parameterized WHERE and ORDER BY Clauses
24+
25+
**Status**: ✅ Partial (Defense-in-depth)
26+
27+
WHERE and ORDER BY clauses are passed as strings because they often contain complex expressions. These cannot be fully parameterized without a major architectural refactor. Instead, the implementation uses:
28+
29+
#### WHERE Clause Validation
30+
- Checks for dangerous SQL keywords that shouldn't appear in WHERE clauses
31+
- Detects SQL comment sequences (`--`, `/*`, `*/`)
32+
- Allows legitimate keywords like `CAST` that are used in normal WHERE expressions
33+
34+
**Location**: `utils.py` - `SQLSanitizer.validate_where_clause()`
35+
36+
#### ORDER BY Clause Validation
37+
- Stricter validation than WHERE clauses
38+
- Only allows column names, ASC/DESC keywords, and basic arithmetic operators
39+
- Rejects any dangerous SQL patterns
40+
- Uses regex pattern matching to ensure valid format
41+
42+
**Location**: `utils.py` - `SQLSanitizer.validate_order_by_clause()`
43+
44+
**Called from**: `query.py` - `select()` and `count()` methods
45+
46+
### 3. Parameterized Numeric Values
47+
48+
**Status**: ✅ Implemented
49+
50+
LIMIT and OFFSET values are now passed as parameterized values using DuckDB's `?` placeholder syntax.
51+
52+
**Before**:
53+
```python
54+
query += f" LIMIT {limit}"
55+
query += f" OFFSET {offset}"
56+
result = self.conn.execute(query)
57+
```
58+
59+
**After**:
60+
```python
61+
if limit is not None:
62+
query += " LIMIT ?"
63+
params.append(limit)
64+
if offset > 0:
65+
query += " OFFSET ?"
66+
params.append(offset)
67+
result = self.conn.execute(query, params)
68+
```
69+
70+
**Location**: `query.py` - `select()` method
71+
72+
### 4. Parameterized File Paths
73+
74+
**Status**: ✅ Implemented
75+
76+
File paths passed to `read_parquet()` are now parameterized instead of string interpolated.
77+
78+
**Before**:
79+
```python
80+
ddl = f"""
81+
CREATE TABLE {table_name} AS
82+
SELECT * FROM read_parquet('{parquet_path}')
83+
"""
84+
self.conn.execute(ddl)
85+
```
86+
87+
**After**:
88+
```python
89+
ddl = f"""
90+
CREATE TABLE {table_name} AS
91+
SELECT * FROM read_parquet(?)
92+
"""
93+
self.conn.execute(ddl, [parquet_path])
94+
```
95+
96+
**Locations**:
97+
- `ddl.py` - `_create_from_parquet()` method
98+
- `dml.py` - `_insert_from_parquet()` method
99+
100+
## Security Considerations
101+
102+
### What's Protected
103+
104+
- Numeric LIMIT/OFFSET values
105+
- File paths in read_parquet()
106+
- Column names in SELECT clauses, partitioning
107+
- Table names
108+
109+
### What's Not Fully Protected (Defense-in-Depth)
110+
111+
- WHERE clause values: These are validated but not fully parameterized due to architectural constraints
112+
- ORDER BY expressions: Validated with strict pattern matching
113+
114+
**Recommendations for Users**:
115+
1. Always validate and sanitize WHERE clause input at the application level
116+
2. Use prepared statements in your application when constructing WHERE clauses
117+
3. Consider using DuckDB's filter APIs or query builders that support parameterization
118+
4. Run ssb-parquedit with database users that have minimal required permissions
119+
120+
## Security Usage Examples
121+
122+
### Safe Usage
123+
124+
```python
125+
# Parameterized LIMIT/OFFSET
126+
df = editor.view("users", limit=10, offset=20)
127+
128+
# Validated table and column names
129+
df = editor.view("users", columns=["id", "name", "email"])
130+
131+
# Structured filters (RECOMMENDED)
132+
df = editor.view("users", filters={"column": "age", "operator": ">", "value": 25})
133+
134+
# Parameterized file paths
135+
editor.insert_data("users", "/path/to/users.parquet")
136+
```
137+
138+
### Things to Avoid
139+
140+
```python
141+
# DON'T construct filter values without parameterization
142+
# Instead, use structured filters which handle parameterization automatically:
143+
user_age = user_input # Could be malicious
144+
df = editor.view("users", filters={"column": "age", "operator": ">", "value": user_age})
145+
146+
# The value is automatically parameterized, preventing injection
147+
# Even if user_age contains: "25); DROP TABLE users; --"
148+
# It will be safely treated as a literal string value
149+
```
150+
151+
## Testing
152+
153+
To verify the SQL injection prevention measures are working:
154+
155+
```bash
156+
# Run the test suite
157+
pytest tests/
158+
159+
# Run type checking
160+
mypy src/ssb_parquedit/
161+
162+
# Check for potential issues
163+
ruff check src/
164+
```
165+
166+
## Migration Notes
167+
168+
The refactoring is backward compatible - existing code will continue to work, but now with enhanced security:
169+
170+
1. LIMIT/OFFSET now use parameterized queries internally (transparent to users)
171+
2. WHERE/ORDER BY clauses are validated before execution
172+
3. File paths are parameterized (transparent to users)
173+
4. Column and table name validation is more explicit (will raise errors on invalid names)
174+
175+
## References
176+
177+
- [DuckDB Documentation - Parameter Binding](https://duckdb.org/docs/api/overview)
178+
- [OWASP - SQL Injection](https://owasp.org/www-community/attacks/SQL_Injection)
179+
- [CWE-89: SQL Injection](https://cwe.mitre.org/data/definitions/89.html)

0 commit comments

Comments
 (0)