
Commit 252e39b

dipseth and Roo authored

feat: enhance prompts and job submission with ESLint fixes (#33)

* Enhanced dynamic prompt functions with improved filtering and validation
* Improved job submission workflow with better error handling
* Added comprehensive prompt generation capabilities
* Enhanced knowledge reindexer with better search capabilities
* Improved local file staging with better validation
* Fixed 16 critical ESLint errors blocking CI/CD pipeline
* Fixed documentation validation errors in Hive query examples
* Resolved unused variables, imports, and regex escape issues
* Enhanced code quality and maintainability
* All quality gates now passing (0 errors, 345 warnings)

Co-authored-by: Roo <roo@veterinary.inc>

1 parent d9a56b2 · commit 252e39b

24 files changed · +6788 additions, -143 deletions

.gitignore

Lines changed: 15 additions & 0 deletions
@@ -27,6 +27,7 @@ test-formatted-output.js
 old-tests/
 .nyc_output/
 coverage/
+tests/prompts/
 
 # IDE and editor files
 .vscode/
@@ -116,3 +117,17 @@ scripts/copy-config-to-desktop.sh
 src/services/qdrantpayloadexample.json
 tests/manual/state/*
 dataproc-tools-test-results*.json
+config/state/transformers-cache/Xenova/all-MiniLM-L6-v2/config.json
+config/state/transformers-cache/Xenova/all-MiniLM-L6-v2/tokenizer_config.json
+config/state/transformers-cache/Xenova/all-MiniLM-L6-v2/tokenizer.json
+config/state/transformers-cache/Xenova/all-MiniLM-L6-v2/onnx/model_quantized.onnx
+test-dynamic-functions.js
+.issue-analysis/*
+state/**
+examples/local-file-staging/test-basedirectory-resolution.js
+config/state/embedding-training-data.json
+sample-enhanced-prompts.js
+dataproc-ops-test-report.json
+enhanced-prompt-demo.js
+test-spark-job.py
+verification-report.json

README.md

Lines changed: 1 addition & 1 deletion
@@ -152,7 +152,7 @@ npx @dipseth/dataproc-mcp-server@latest
 | Tool | Description | Smart Defaults | Key Features |
 |------|-------------|----------------|--------------|
 | `submit_hive_query` | Submit Hive queries to clusters | ✅ 70% fewer params | Async support, timeouts |
-| `submit_dataproc_job` | Submit Spark/PySpark/Presto jobs | ✅ 75% fewer params | Multi-engine support |
+| `submit_dataproc_job` | Submit Spark/PySpark/Presto jobs | ✅ 75% fewer params | Multi-engine support, **Local file staging** |
 | `get_job_status` | Get job execution status | ✅ JobID only needed | Real-time monitoring |
 | `get_job_results` | Get job outputs and results | ✅ Auto-pagination | Result formatting |
 | `get_query_status` | Get Hive query status | ✅ Minimal params | Query tracking |

config/default-params.json

Lines changed: 13 additions & 0 deletions
@@ -20,6 +20,19 @@
       "description": "Default GCP zone",
       "required": false,
       "defaultValue": "us-central1-a"
+    },
+    {
+      "name": "stagingBucket",
+      "type": "string",
+      "description": "Override staging bucket (auto-discovered if not set)",
+      "required": false
+    },
+    {
+      "name": "baseDirectory",
+      "type": "string",
+      "description": "Base directory for relative file paths",
+      "required": false,
+      "defaultValue": "."
     }
   ],
   "environments": [

config/test-spark-job.py

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+#!/usr/bin/env python3
+"""
+Simple PySpark test job for local file staging demonstration.
+This script performs basic Spark operations to validate the staging functionality.
+"""
+
+from pyspark.sql import SparkSession
+import sys
+
+def main():
+    # Initialize Spark session
+    spark = SparkSession.builder \
+        .appName("LocalFileStagingTest") \
+        .getOrCreate()
+
+    print("=== PySpark Local File Staging Test ===")
+    print(f"Spark version: {spark.version}")
+    print(f"Arguments received: {sys.argv[1:] if len(sys.argv) > 1 else 'None'}")
+
+    # Create a simple DataFrame for testing
+    data = [
+        ("Alice", 25, "Engineer"),
+        ("Bob", 30, "Manager"),
+        ("Charlie", 35, "Analyst"),
+        ("Diana", 28, "Designer")
+    ]
+
+    columns = ["name", "age", "role"]
+    df = spark.createDataFrame(data, columns)
+
+    print("\n=== Sample Data ===")
+    df.show()
+
+    # Perform some basic operations
+    print("\n=== Data Analysis ===")
+    print(f"Total records: {df.count()}")
+    print(f"Average age: {df.agg({'age': 'avg'}).collect()[0][0]:.1f}")
+
+    # Group by role
+    print("\n=== Role Distribution ===")
+    df.groupBy("role").count().show()
+
+    # Filter and display
+    print("\n=== Engineers and Managers ===")
+    df.filter(df.role.isin(["Engineer", "Manager"])).show()
+
+    print("\n=== Test Completed Successfully! ===")
+    print("Local file staging is working correctly.")
+
+    spark.stop()
+    return 0
+
+if __name__ == "__main__":
+    exit(main())

docs/API_REFERENCE.md

Lines changed: 124 additions & 3 deletions
@@ -288,7 +288,7 @@ Submits a Hive query to a Dataproc cluster.
 
 ### 8. submit_dataproc_job
 
-Submits a generic Dataproc job (Hive, Spark, PySpark, etc.).
+Submits a generic Dataproc job (Hive, Spark, PySpark, etc.) with enhanced local file staging support.
 
 **Parameters:**
 - `projectId` (string, required): GCP project ID
@@ -298,6 +298,92 @@ Submits a generic Dataproc job (Hive, Spark, PySpark, etc.).
 - `jobConfig` (object, required): Job configuration object
 - `async` (boolean, optional): Whether to submit asynchronously
 
+**🔧 LOCAL FILE STAGING:**
+
+The `baseDirectory` parameter in the local file staging system controls how relative file paths are resolved when using the template syntax `{@./relative/path}` or direct relative paths in job configurations.
+
+**Configuration:**
+The `baseDirectory` parameter is configured in `config/default-params.json` with a default value of `"."`, which refers to the **current working directory** where the MCP server process is running (typically the project root directory).
+
+**Path Resolution Logic:**
+
+1. **Absolute Paths**: If a file path is already absolute (starts with `/`), it is used as-is
+2. **Relative Path Resolution**: For relative paths, the system:
+   - Gets the `baseDirectory` value from configuration (default: `"."`)
+   - Resolves `baseDirectory` if it is itself relative:
+     - First tries the directory of the `DATAPROC_CONFIG_PATH` environment variable
+     - Falls back to `process.cwd()` (the current working directory)
+   - Combines `baseDirectory` with the relative file path
+
+**Template Syntax Support:**
+```typescript
+// Template syntax - recommended approach
+{@./relative/path/to/file.py}
+{@../parent/directory/file.jar}
+{@subdirectory/file.sql}
+
+// Direct relative paths (also supported)
+"./relative/path/to/file.py"
+"../parent/directory/file.jar"
+"subdirectory/file.sql"
+```
+
+**Practical Examples:**
+
+*Example 1: Default Configuration (`baseDirectory: "."`)*
+- **Template**: `{@./test-spark-job.py}`
+- **Resolution**: `/Users/srivers/Documents/Cline/MCP/dataproc-server/test-spark-job.py`
+
+*Example 2: Config Directory Base*
+- **Configuration**: `baseDirectory: "config"`
+- **Template**: `{@./my-script.py}`
+- **Resolution**: `/Users/srivers/Documents/Cline/MCP/dataproc-server/config/my-script.py`
+
+*Example 3: Absolute Base Directory*
+- **Configuration**: `baseDirectory: "/absolute/path/to/files"`
+- **Template**: `{@./script.py}`
+- **Resolution**: `/absolute/path/to/files/script.py`
+
+**Environment Variable Influence:**
+The `DATAPROC_CONFIG_PATH` environment variable affects path resolution:
+- **If set**: The directory containing the config file becomes the reference point for relative `baseDirectory` values
+- **If not set**: The current working directory (`process.cwd()`) is used as the reference point
+
+**Best Practices:**
+1. **Use Template Syntax**: Prefer `{@./file.py}` over direct relative paths for clarity
+2. **Organize Files Relative to Project Root**: With the default `baseDirectory: "."`, organize your files relative to the project root
+3. **Consider Absolute Paths for External Files**: For files outside the project structure, use absolute paths
+
+**Supported File Extensions:**
+- `.py` - Python files for PySpark jobs
+- `.jar` - Java/Scala JAR files for Spark jobs
+- `.sql` - SQL files for various job types
+- `.R` - R script files for SparkR jobs
+
+**Troubleshooting:**
+- **File Not Found**: Check that the resolved absolute path exists
+- **Permission Denied**: Ensure the MCP server has read access to the file
+- **Unexpected Path Resolution**: Verify your `baseDirectory` setting and current working directory
+
+**Debug Path Resolution:**
+Enable debug logging to see the actual path resolution:
+```bash
+DEBUG=dataproc-mcp:* node build/index.js
+```
+
+**Configuration Override:**
+You can override the `baseDirectory` in your environment-specific configuration:
+```json
+{
+  "environment": "development",
+  "parameters": {
+    "baseDirectory": "./dev-scripts"
+  }
+}
+```
+
+Files are automatically staged to GCS and cleaned up after job completion.
+
 **Example - Spark Job:**
 ```json
 {
@@ -309,7 +395,7 @@ Submits a generic Dataproc job (Hive, Spark, PySpark, etc.).
     "jobType": "spark",
     "jobConfig": {
       "mainClass": "com.example.SparkApp",
-      "jarFileUris": ["gs://my-bucket/spark-app.jar"],
+      "jarFileUris": ["{@./spark-app.jar}"],
       "args": ["--input", "gs://my-bucket/input/", "--output", "gs://my-bucket/output/"],
       "properties": {
         "spark.executor.memory": "4g",
@@ -321,7 +407,29 @@ Submits a generic Dataproc job (Hive, Spark, PySpark, etc.).
 }
 ```
 
-**Example - PySpark Job:**
+**Example - PySpark Job with Local File Staging:**
+```json
+{
+  "tool": "submit_dataproc_job",
+  "arguments": {
+    "projectId": "my-project-123",
+    "region": "us-central1",
+    "clusterName": "pyspark-cluster",
+    "jobType": "pyspark",
+    "jobConfig": {
+      "mainPythonFileUri": "{@./test-spark-job.py}",
+      "pythonFileUris": ["{@./utils/helper.py}", "{@/absolute/path/library.py}"],
+      "args": ["--date", "2024-01-01"],
+      "properties": {
+        "spark.sql.adaptive.enabled": "true",
+        "spark.sql.adaptive.coalescePartitions.enabled": "true"
+      }
+    }
+  }
+}
+```
+
+**Example - Traditional PySpark Job (GCS URIs):**
 ```json
 {
   "tool": "submit_dataproc_job",
@@ -343,6 +451,19 @@ Submits a generic Dataproc job (Hive, Spark, PySpark, etc.).
   }
 }
 ```
 
+**Local File Staging Process:**
+1. **Detection**: Local file paths are automatically detected using template syntax
+2. **Staging**: Files are uploaded to the cluster's staging bucket with unique names
+3. **Transformation**: Job config is updated with GCS URIs
+4. **Execution**: Job runs with staged files
+5. **Cleanup**: Staged files are automatically cleaned up after job completion
+
+**Supported File Extensions:**
+- `.py` - Python files for PySpark jobs
+- `.jar` - Java/Scala JAR files for Spark jobs
+- `.sql` - SQL files for various job types
+- `.R` - R script files for SparkR jobs
+
 ### 9. get_job_status
 
 Gets the status of a Dataproc job.
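The resolution order documented under **Path Resolution Logic** above can be condensed into a short TypeScript sketch. This is a minimal illustration only: `resolveLocalPath` is a hypothetical name, not the server's actual API; the code simply mirrors the documented rules (absolute paths pass through, a relative `baseDirectory` is anchored at the `DATAPROC_CONFIG_PATH` directory when set, otherwise at `process.cwd()`).

```typescript
import * as path from "path";

// Minimal sketch of the documented resolution order; resolveLocalPath is a
// hypothetical name, not the server's actual implementation.
function resolveLocalPath(filePath: string, baseDirectory = "."): string {
  // 1. Absolute paths are used as-is.
  if (path.isAbsolute(filePath)) {
    return filePath;
  }

  // 2. A relative baseDirectory is anchored at the directory of
  //    DATAPROC_CONFIG_PATH when that variable is set, otherwise at process.cwd().
  let base = baseDirectory;
  if (!path.isAbsolute(base)) {
    const configPath = process.env.DATAPROC_CONFIG_PATH;
    const anchor = configPath ? path.dirname(configPath) : process.cwd();
    base = path.resolve(anchor, base);
  }

  // 3. Combine the resolved base directory with the relative file path.
  return path.resolve(base, filePath);
}

// With the default baseDirectory ".", "{@./test-spark-job.py}" resolves relative
// to the project root (or to the config file's directory when the variable is set).
console.log(resolveLocalPath("./test-spark-job.py"));
```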

docs/examples/queries/hive-query-examples.md

Lines changed: 104 additions & 1 deletion
@@ -137,4 +137,107 @@ To set Hive properties for the query:
     },
     "timeoutMs": 300000
   }
-}
+}
+```
+
+## PySpark Job Examples with Local File Staging
+
+The MCP server supports automatic local file staging for PySpark jobs, allowing you to reference local Python files using template syntax. Files are automatically uploaded to GCS and the job configuration is transformed to use the staged files.
+
+### Basic PySpark Job with Local File
+
+```json
+{
+  "projectId": "your-project-id",
+  "region": "us-central1",
+  "clusterName": "pyspark-cluster",
+  "tool": "submit_dataproc_job",
+  "arguments": {
+    "jobType": "pyspark",
+    "jobConfig": {
+      "mainPythonFileUri": "{@./test-spark-job.py}",
+      "args": ["--mode", "test"]
+    }
+  }
+}
+```
+
+### PySpark Job with Multiple Local Files
+
+```json
+{
+  "projectId": "your-project-id",
+  "region": "us-central1",
+  "clusterName": "analytics-cluster",
+  "tool": "submit_dataproc_job",
+  "arguments": {
+    "jobType": "pyspark",
+    "jobConfig": {
+      "mainPythonFileUri": "{@./scripts/data_processor.py}",
+      "pythonFileUris": [
+        "{@./utils/data_utils.py}",
+        "{@./utils/spark_helpers.py}",
+        "{@/absolute/path/to/shared_library.py}"
+      ],
+      "args": ["--input", "gs://data-bucket/raw/", "--output", "gs://data-bucket/processed/"],
+      "properties": {
+        "spark.sql.adaptive.enabled": "true",
+        "spark.executor.memory": "4g"
+      }
+    }
+  }
+}
+```
+
+### Local File Staging Features
+
+- **Template Syntax**: Use `{@./relative/path}` or `{@/absolute/path}` to reference local files
+- **Automatic Upload**: Files are automatically staged to the cluster's GCS staging bucket
+- **Unique Naming**: Staged files get unique names with timestamps to avoid conflicts
+- **Cleanup**: Staged files are automatically cleaned up after job completion
+- **Supported Extensions**: `.py`, `.jar`, `.sql`, `.R` files are supported
+
+### Example Test Job Output
+
+When using the test-spark-job.py file, you can expect output similar to:
+
+```
+=== PySpark Local File Staging Test ===
+Spark version: 3.1.3
+Arguments received: ['--mode', 'test']
+
+=== Sample Data ===
++---------+---+---------+
+|     name|age|     role|
++---------+---+---------+
+|    Alice| 25| Engineer|
+|      Bob| 30|  Manager|
+|  Charlie| 35|  Analyst|
+|    Diana| 28| Designer|
++---------+---+---------+
+
+=== Data Analysis ===
+Total records: 4
+Average age: 29.5
+
+=== Role Distribution ===
++---------+-----+
+|     role|count|
++---------+-----+
+| Engineer|    1|
+|  Manager|    1|
+|  Analyst|    1|
+| Designer|    1|
++---------+-----+
+
+=== Test Completed Successfully! ===
+Local file staging is working correctly.
+```
+
+### Successful Test Cases
+
+The following job IDs demonstrate successful local file staging:
+- Job ID: `db620480-135f-4de6-b9a6-4045b308fe97` - Basic PySpark job with local file
+- Job ID: `36ed88b2-acad-4cfb-8fbf-88ad1ba22ad7` - PySpark job with multiple local files
+
+These examples show that local file staging works seamlessly with the Dataproc MCP server, providing the same experience as using `gcloud dataproc jobs submit pyspark` with local files.
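For readers who want to see the detect → stage → transform flow from the staging documentation in code form, here is a rough TypeScript sketch. The `PySparkJobConfig`, `stagedUri`, and `stageJobConfig` names are hypothetical and the actual upload and cleanup steps are omitted; it only illustrates how `{@...}` template references could be rewritten to unique, timestamped GCS URIs before submission.

```typescript
// Rough sketch only: illustrates the documented detect → stage → transform flow.
// The names below (PySparkJobConfig, stagedUri, stageJobConfig) are hypothetical.
const TEMPLATE_RE = /^\{@(.+)\}$/;

interface PySparkJobConfig {
  mainPythonFileUri?: string;
  pythonFileUris?: string[];
}

// Build a unique GCS object name using a timestamp, as the docs describe.
function stagedUri(stagingBucket: string, localPath: string): string {
  const fileName = localPath.split("/").pop() ?? localPath;
  return `gs://${stagingBucket}/staged/${Date.now()}-${fileName}`;
}

// Rewrite "{@...}" template references to staged GCS URIs; a real implementation
// would also upload each file before submitting and delete it after completion.
function stageJobConfig(jobConfig: PySparkJobConfig, stagingBucket: string): PySparkJobConfig {
  const rewrite = (value: string): string => {
    const match = value.match(TEMPLATE_RE);
    return match ? stagedUri(stagingBucket, match[1]) : value;
  };
  return {
    ...jobConfig,
    mainPythonFileUri: jobConfig.mainPythonFileUri && rewrite(jobConfig.mainPythonFileUri),
    pythonFileUris: jobConfig.pythonFileUris?.map(rewrite),
  };
}

// Example: "{@./test-spark-job.py}" becomes "gs://<bucket>/staged/<timestamp>-test-spark-job.py".
console.log(stageJobConfig({ mainPythonFileUri: "{@./test-spark-job.py}" }, "my-staging-bucket"));
```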
