
Commit 252e39b

dipseth and Roo authored

feat: enhance prompts and job submission with ESLint fixes (#33)

* Enhanced dynamic prompt functions with improved filtering and validation
* Improved job submission workflow with better error handling
* Added comprehensive prompt generation capabilities
* Enhanced knowledge reindexer with better search capabilities
* Improved local file staging with better validation
* Fixed 16 critical ESLint errors blocking CI/CD pipeline
* Fixed documentation validation errors in Hive query examples
* Resolved unused variables, imports, and regex escape issues
* Enhanced code quality and maintainability
* All quality gates now passing (0 errors, 345 warnings)

Co-authored-by: Roo <roo@veterinary.inc>

1 parent d9a56b2 · commit 252e39b

24 files changed · +6788 additions, -143 deletions

.gitignore

Lines changed: 15 additions & 0 deletions
@@ -27,6 +27,7 @@ test-formatted-output.js
 old-tests/
 .nyc_output/
 coverage/
+tests/prompts/
 
 # IDE and editor files
 .vscode/
@@ -116,3 +117,17 @@ scripts/copy-config-to-desktop.sh
 src/services/qdrantpayloadexample.json
 tests/manual/state/*
 dataproc-tools-test-results*.json
+config/state/transformers-cache/Xenova/all-MiniLM-L6-v2/config.json
+config/state/transformers-cache/Xenova/all-MiniLM-L6-v2/tokenizer_config.json
+config/state/transformers-cache/Xenova/all-MiniLM-L6-v2/tokenizer.json
+config/state/transformers-cache/Xenova/all-MiniLM-L6-v2/onnx/model_quantized.onnx
+test-dynamic-functions.js
+.issue-analysis/*
+state/**
+examples/local-file-staging/test-basedirectory-resolution.js
+config/state/embedding-training-data.json
+sample-enhanced-prompts.js
+dataproc-ops-test-report.json
+enhanced-prompt-demo.js
+test-spark-job.py
+verification-report.json

README.md

Lines changed: 1 addition & 1 deletion
@@ -152,7 +152,7 @@ npx @dipseth/dataproc-mcp-server@latest
 | Tool | Description | Smart Defaults | Key Features |
 |------|-------------|----------------|--------------|
 | `submit_hive_query` | Submit Hive queries to clusters | ✅ 70% fewer params | Async support, timeouts |
-| `submit_dataproc_job` | Submit Spark/PySpark/Presto jobs | ✅ 75% fewer params | Multi-engine support |
+| `submit_dataproc_job` | Submit Spark/PySpark/Presto jobs | ✅ 75% fewer params | Multi-engine support, **Local file staging** |
 | `get_job_status` | Get job execution status | ✅ JobID only needed | Real-time monitoring |
 | `get_job_results` | Get job outputs and results | ✅ Auto-pagination | Result formatting |
 | `get_query_status` | Get Hive query status | ✅ Minimal params | Query tracking |

config/default-params.json

Lines changed: 13 additions & 0 deletions
@@ -20,6 +20,19 @@
       "description": "Default GCP zone",
       "required": false,
       "defaultValue": "us-central1-a"
+    },
+    {
+      "name": "stagingBucket",
+      "type": "string",
+      "description": "Override staging bucket (auto-discovered if not set)",
+      "required": false
+    },
+    {
+      "name": "baseDirectory",
+      "type": "string",
+      "description": "Base directory for relative file paths",
+      "required": false,
+      "defaultValue": "."
     }
   ],
   "environments": [

config/test-spark-job.py

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+#!/usr/bin/env python3
+"""
+Simple PySpark test job for local file staging demonstration.
+This script performs basic Spark operations to validate the staging functionality.
+"""
+
+from pyspark.sql import SparkSession
+import sys
+
+def main():
+    # Initialize Spark session
+    spark = SparkSession.builder \
+        .appName("LocalFileStagingTest") \
+        .getOrCreate()
+
+    print("=== PySpark Local File Staging Test ===")
+    print(f"Spark version: {spark.version}")
+    print(f"Arguments received: {sys.argv[1:] if len(sys.argv) > 1 else 'None'}")
+
+    # Create a simple DataFrame for testing
+    data = [
+        ("Alice", 25, "Engineer"),
+        ("Bob", 30, "Manager"),
+        ("Charlie", 35, "Analyst"),
+        ("Diana", 28, "Designer")
+    ]
+
+    columns = ["name", "age", "role"]
+    df = spark.createDataFrame(data, columns)
+
+    print("\n=== Sample Data ===")
+    df.show()
+
+    # Perform some basic operations
+    print("\n=== Data Analysis ===")
+    print(f"Total records: {df.count()}")
+    print(f"Average age: {df.agg({'age': 'avg'}).collect()[0][0]:.1f}")
+
+    # Group by role
+    print("\n=== Role Distribution ===")
+    df.groupBy("role").count().show()
+
+    # Filter and display
+    print("\n=== Engineers and Managers ===")
+    df.filter(df.role.isin(["Engineer", "Manager"])).show()
+
+    print("\n=== Test Completed Successfully! ===")
+    print("Local file staging is working correctly.")
+
+    spark.stop()
+    return 0
+
+if __name__ == "__main__":
+    exit(main())

docs/API_REFERENCE.md

Lines changed: 124 additions & 3 deletions
@@ -288,7 +288,7 @@ Submits a Hive query to a Dataproc cluster.
 
 ### 8. submit_dataproc_job
 
-Submits a generic Dataproc job (Hive, Spark, PySpark, etc.).
+Submits a generic Dataproc job (Hive, Spark, PySpark, etc.) with enhanced local file staging support.
 
 **Parameters:**
 - `projectId` (string, required): GCP project ID
@@ -298,6 +298,92 @@ Submits a generic Dataproc job (Hive, Spark, PySpark, etc.).
 - `jobConfig` (object, required): Job configuration object
 - `async` (boolean, optional): Whether to submit asynchronously
 
+**🔧 LOCAL FILE STAGING:**
+
+The `baseDirectory` parameter in the local file staging system controls how relative file paths are resolved when using the template syntax `{@./relative/path}` or direct relative paths in job configurations.
+
+**Configuration:**
+The `baseDirectory` parameter is configured in `config/default-params.json` with a default value of `"."`, which refers to the **current working directory** where the MCP server process is running (typically the project root directory).
+
+**Path Resolution Logic:**
+
+1. **Absolute Paths**: If a file path is already absolute (starts with `/`), it is used as-is
+2. **Relative Path Resolution**: For relative paths, the system:
+   - Gets the `baseDirectory` value from configuration (default: `"."`)
+   - Resolves `baseDirectory` if it is itself relative:
+     - First tries the directory of the `DATAPROC_CONFIG_PATH` environment variable
+     - Falls back to `process.cwd()` (the current working directory)
+   - Combines `baseDirectory` with the relative file path
+
+**Template Syntax Support:**
+```typescript
+// Template syntax - recommended approach
+{@./relative/path/to/file.py}
+{@../parent/directory/file.jar}
+{@subdirectory/file.sql}
+
+// Direct relative paths (also supported)
+"./relative/path/to/file.py"
+"../parent/directory/file.jar"
+"subdirectory/file.sql"
+```
+
+**Practical Examples:**
+
+*Example 1: Default Configuration (`baseDirectory: "."`)*
+- **Template**: `{@./test-spark-job.py}`
+- **Resolution**: `/Users/srivers/Documents/Cline/MCP/dataproc-server/test-spark-job.py`
+
+*Example 2: Config Directory Base*
+- **Configuration**: `baseDirectory: "config"`
+- **Template**: `{@./my-script.py}`
+- **Resolution**: `/Users/srivers/Documents/Cline/MCP/dataproc-server/config/my-script.py`
+
+*Example 3: Absolute Base Directory*
+- **Configuration**: `baseDirectory: "/absolute/path/to/files"`
+- **Template**: `{@./script.py}`
+- **Resolution**: `/absolute/path/to/files/script.py`
+
+**Environment Variable Influence:**
+The `DATAPROC_CONFIG_PATH` environment variable affects path resolution:
+- **If set**: The directory containing the config file becomes the reference point for relative `baseDirectory` values
+- **If not set**: The current working directory (`process.cwd()`) is used as the reference point
+
+**Best Practices:**
+1. **Use Template Syntax**: Prefer `{@./file.py}` over direct relative paths for clarity
+2. **Organize Files Relative to Project Root**: With the default `baseDirectory: "."`, organize your files relative to the project root
+3. **Consider Absolute Paths for External Files**: For files outside the project structure, use absolute paths
+
+**Supported File Extensions:**
+- `.py` - Python files for PySpark jobs
+- `.jar` - Java/Scala JAR files for Spark jobs
+- `.sql` - SQL files for various job types
+- `.R` - R script files for SparkR jobs
+
+**Troubleshooting:**
+- **File Not Found**: Check that the resolved absolute path exists
+- **Permission Denied**: Ensure the MCP server has read access to the file
+- **Unexpected Path Resolution**: Verify your `baseDirectory` setting and current working directory
+
+**Debug Path Resolution:**
+Enable debug logging to see the actual path resolution:
+```bash
+DEBUG=dataproc-mcp:* node build/index.js
+```
+
+**Configuration Override:**
+You can override the `baseDirectory` in your environment-specific configuration:
+```json
+{
+  "environment": "development",
+  "parameters": {
+    "baseDirectory": "./dev-scripts"
+  }
+}
+```
+
+Files are automatically staged to GCS and cleaned up after job completion.
+
 **Example - Spark Job:**
 ```json
 {
@@ -309,7 +395,7 @@ Submits a generic Dataproc job (Hive, Spark, PySpark, etc.).
     "jobType": "spark",
     "jobConfig": {
       "mainClass": "com.example.SparkApp",
-      "jarFileUris": ["gs://my-bucket/spark-app.jar"],
+      "jarFileUris": ["{@./spark-app.jar}"],
       "args": ["--input", "gs://my-bucket/input/", "--output", "gs://my-bucket/output/"],
       "properties": {
         "spark.executor.memory": "4g",
@@ -321,7 +407,29 @@ Submits a generic Dataproc job (Hive, Spark, PySpark, etc.).
 }
 ```
 
-**Example - PySpark Job:**
+**Example - PySpark Job with Local File Staging:**
+```json
+{
+  "tool": "submit_dataproc_job",
+  "arguments": {
+    "projectId": "my-project-123",
+    "region": "us-central1",
+    "clusterName": "pyspark-cluster",
+    "jobType": "pyspark",
+    "jobConfig": {
+      "mainPythonFileUri": "{@./test-spark-job.py}",
+      "pythonFileUris": ["{@./utils/helper.py}", "{@/absolute/path/library.py}"],
+      "args": ["--date", "2024-01-01"],
+      "properties": {
+        "spark.sql.adaptive.enabled": "true",
+        "spark.sql.adaptive.coalescePartitions.enabled": "true"
+      }
+    }
+  }
+}
+```
+
+**Example - Traditional PySpark Job (GCS URIs):**
 ```json
 {
   "tool": "submit_dataproc_job",
@@ -343,6 +451,19 @@ Submits a generic Dataproc job (Hive, Spark, PySpark, etc.).
   }
 }
 ```
 
+**Local File Staging Process:**
+1. **Detection**: Local file paths are automatically detected using template syntax
+2. **Staging**: Files are uploaded to the cluster's staging bucket with unique names
+3. **Transformation**: Job config is updated with GCS URIs
+4. **Execution**: Job runs with staged files
+5. **Cleanup**: Staged files are automatically cleaned up after job completion
+
+**Supported File Extensions:**
+- `.py` - Python files for PySpark jobs
+- `.jar` - Java/Scala JAR files for Spark jobs
+- `.sql` - SQL files for various job types
+- `.R` - R script files for SparkR jobs
+
 ### 9. get_job_status
 
 Gets the status of a Dataproc job.
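The resolution order documented under **Path Resolution Logic** above can be condensed into a short TypeScript sketch. This is a minimal illustration only: `resolveLocalPath` is a hypothetical name, not the server's actual API; the code simply mirrors the documented rules (absolute paths pass through, a relative `baseDirectory` is anchored at the `DATAPROC_CONFIG_PATH` directory when set, otherwise at `process.cwd()`).

```typescript
import * as path from "path";

// Minimal sketch of the documented resolution order; resolveLocalPath is a
// hypothetical name, not the server's actual implementation.
function resolveLocalPath(filePath: string, baseDirectory = "."): string {
  // 1. Absolute paths are used as-is.
  if (path.isAbsolute(filePath)) {
    return filePath;
  }

  // 2. A relative baseDirectory is anchored at the directory of
  //    DATAPROC_CONFIG_PATH when that variable is set, otherwise at process.cwd().
  let base = baseDirectory;
  if (!path.isAbsolute(base)) {
    const configPath = process.env.DATAPROC_CONFIG_PATH;
    const anchor = configPath ? path.dirname(configPath) : process.cwd();
    base = path.resolve(anchor, base);
  }

  // 3. Combine the resolved base directory with the relative file path.
  return path.resolve(base, filePath);
}

// With the default baseDirectory ".", "{@./test-spark-job.py}" resolves relative
// to the project root (or to the config file's directory when the variable is set).
console.log(resolveLocalPath("./test-spark-job.py"));
```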

docs/examples/queries/hive-query-examples.md

Lines changed: 104 additions & 1 deletion
@@ -137,4 +137,107 @@ To set Hive properties for the query:
     },
     "timeoutMs": 300000
   }
-}
+}
+```
+
+## PySpark Job Examples with Local File Staging
+
+The MCP server supports automatic local file staging for PySpark jobs, allowing you to reference local Python files using template syntax. Files are automatically uploaded to GCS and the job configuration is transformed to use the staged files.
+
+### Basic PySpark Job with Local File
+
+```json
+{
+  "projectId": "your-project-id",
+  "region": "us-central1",
+  "clusterName": "pyspark-cluster",
+  "tool": "submit_dataproc_job",
+  "arguments": {
+    "jobType": "pyspark",
+    "jobConfig": {
+      "mainPythonFileUri": "{@./test-spark-job.py}",
+      "args": ["--mode", "test"]
+    }
+  }
+}
+```
+
+### PySpark Job with Multiple Local Files
+
+```json
+{
+  "projectId": "your-project-id",
+  "region": "us-central1",
+  "clusterName": "analytics-cluster",
+  "tool": "submit_dataproc_job",
+  "arguments": {
+    "jobType": "pyspark",
+    "jobConfig": {
+      "mainPythonFileUri": "{@./scripts/data_processor.py}",
+      "pythonFileUris": [
+        "{@./utils/data_utils.py}",
+        "{@./utils/spark_helpers.py}",
+        "{@/absolute/path/to/shared_library.py}"
+      ],
+      "args": ["--input", "gs://data-bucket/raw/", "--output", "gs://data-bucket/processed/"],
+      "properties": {
+        "spark.sql.adaptive.enabled": "true",
+        "spark.executor.memory": "4g"
+      }
+    }
+  }
+}
+```
+
+### Local File Staging Features
+
+- **Template Syntax**: Use `{@./relative/path}` or `{@/absolute/path}` to reference local files
+- **Automatic Upload**: Files are automatically staged to the cluster's GCS staging bucket
+- **Unique Naming**: Staged files get unique names with timestamps to avoid conflicts
+- **Cleanup**: Staged files are automatically cleaned up after job completion
+- **Supported Extensions**: `.py`, `.jar`, `.sql`, `.R` files are supported
+
+### Example Test Job Output
+
+When using the test-spark-job.py file, you can expect output similar to:
+
+```
+=== PySpark Local File Staging Test ===
+Spark version: 3.1.3
+Arguments received: ['--mode', 'test']
+
+=== Sample Data ===
++---------+---+---------+
+|     name|age|     role|
++---------+---+---------+
+|    Alice| 25| Engineer|
+|      Bob| 30|  Manager|
+|  Charlie| 35|  Analyst|
+|    Diana| 28| Designer|
++---------+---+---------+
+
+=== Data Analysis ===
+Total records: 4
+Average age: 29.5
+
+=== Role Distribution ===
++---------+-----+
+|     role|count|
++---------+-----+
+| Engineer|    1|
+|  Manager|    1|
+|  Analyst|    1|
+| Designer|    1|
++---------+-----+
+
+=== Test Completed Successfully! ===
+Local file staging is working correctly.
+```
+
+### Successful Test Cases
+
+The following job IDs demonstrate successful local file staging:
+- Job ID: `db620480-135f-4de6-b9a6-4045b308fe97` - Basic PySpark job with local file
+- Job ID: `36ed88b2-acad-4cfb-8fbf-88ad1ba22ad7` - PySpark job with multiple local files
+
+These examples show that local file staging works seamlessly with the Dataproc MCP server, providing the same experience as using `gcloud dataproc jobs submit pyspark` with local files.
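For readers who want to see the detect → stage → transform flow from the staging documentation in code form, here is a rough TypeScript sketch. The `PySparkJobConfig`, `stagedUri`, and `stageJobConfig` names are hypothetical and the actual upload and cleanup steps are omitted; it only illustrates how `{@...}` template references could be rewritten to unique, timestamped GCS URIs before submission.

```typescript
// Rough sketch only: illustrates the documented detect → stage → transform flow.
// The names below (PySparkJobConfig, stagedUri, stageJobConfig) are hypothetical.
const TEMPLATE_RE = /^\{@(.+)\}$/;

interface PySparkJobConfig {
  mainPythonFileUri?: string;
  pythonFileUris?: string[];
}

// Build a unique GCS object name using a timestamp, as the docs describe.
function stagedUri(stagingBucket: string, localPath: string): string {
  const fileName = localPath.split("/").pop() ?? localPath;
  return `gs://${stagingBucket}/staged/${Date.now()}-${fileName}`;
}

// Rewrite "{@...}" template references to staged GCS URIs; a real implementation
// would also upload each file before submitting and delete it after completion.
function stageJobConfig(jobConfig: PySparkJobConfig, stagingBucket: string): PySparkJobConfig {
  const rewrite = (value: string): string => {
    const match = value.match(TEMPLATE_RE);
    return match ? stagedUri(stagingBucket, match[1]) : value;
  };
  return {
    ...jobConfig,
    mainPythonFileUri: jobConfig.mainPythonFileUri && rewrite(jobConfig.mainPythonFileUri),
    pythonFileUris: jobConfig.pythonFileUris?.map(rewrite),
  };
}

// Example: "{@./test-spark-job.py}" becomes "gs://<bucket>/staged/<timestamp>-test-spark-job.py".
console.log(stageJobConfig({ mainPythonFileUri: "{@./test-spark-job.py}" }, "my-staging-bucket"));
```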
