# SageMaker Unified Studio MCP for Spark Troubleshooting
A fully managed remote MCP server that provides specialized tools for troubleshooting Apache Spark applications on Amazon EMR, AWS Glue, and Amazon SageMaker Notebooks. This server simplifies the troubleshooting process through conversational AI capabilities, automated workload analysis, and intelligent code recommendations.

**Important Note**: Not all MCP clients today support remote servers. Please make sure that your client supports remote MCP servers or that you have a suitable proxy setup to use this server. The Amazon SageMaker Unified Studio MCP server is in preview and is subject to change.
## Key Features & Capabilities
- **Intelligent Failure Analysis**: Automatically analyzes Spark event logs, error messages, and resource usage to pinpoint exact issues, including memory problems, configuration errors, and code bugs
- **Multi-Platform Support**: Troubleshoot PySpark and Scala applications across Amazon EMR on EC2, EMR Serverless, AWS Glue, and Amazon SageMaker Notebooks
- **Automated Feature Extraction**: Connects to the platform-specific Spark History Server (EMR on EC2, EMR Serverless, AWS Glue) to extract comprehensive context
- **GenAI Root Cause Analysis**: Leverages AI models and a Spark knowledge base to correlate features and identify root causes of performance issues or failures
- **Code Recommendation Engine**: Provides actionable code modifications, configuration adjustments, and architectural improvements with concrete examples
- **Natural Language Interface**: Use conversational prompts to request troubleshooting analysis and code recommendations
## Architecture
The troubleshooting agent has three main components: an MCP-compatible AI Assistant in your development environment for interaction, the [MCP Proxy for AWS](https://github.com/aws/mcp-proxy-for-aws) that handles secure communication and authentication between your client and AWS services, and the Amazon SageMaker Unified Studio Remote MCP Server (preview) that provides specialized Spark troubleshooting tools for Amazon EMR, AWS Glue, and Amazon SageMaker Notebooks. This diagram illustrates how you interact with the Amazon SageMaker Unified Studio Remote MCP Server through your AI Assistant.

![img](https://docs.aws.amazon.com/images/emr/latest/ReleaseGuide/images/spark-troubleshooting-agent-architecture.png)

The AI assistant orchestrates the troubleshooting process using specialized tools provided by the MCP server, following these steps:
- **Feature Extraction and Context Building**: Automatically collects and analyzes telemetry data from your Spark application including Spark History Server logs, configuration settings, and error traces. Extracts key performance metrics, resource utilization patterns, and failure signatures.
- **GenAI Root Cause Analyzer and Recommendation Engine**: Leverages AI models and Spark knowledge base to correlate extracted features and identify root causes of performance issues or failures. Provides diagnostic insights and analysis of application execution problems.
- **GenAI Spark Code Recommendation**: Based on root cause analysis, analyzes existing code patterns and identifies inefficient operations that need fixes. Provides actionable recommendations including specific code modifications, configuration adjustments, and architectural improvements.
### Supported Platforms & Languages
- **Languages**: Python (PySpark) and Scala Spark applications
- **Target Platforms**:
  - Amazon EMR on EC2
  - Amazon EMR Serverless
  - AWS Glue
  - Amazon SageMaker Notebooks
### Data Source Integration
- **EMR on EC2**: Connects to [EMR Persistent UI](https://docs.aws.amazon.com/emr/latest/ManagementGuide/app-history-spark-UI.html) for cluster analysis
- **AWS Glue**: Builds context from Glue Studio's [Spark UI](https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-jobs.html) for job analysis
- **EMR Serverless**: Connects to EMR-Serverless [Spark History Server](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_GetDashboardForJobRun.html) for job run analysis
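
These are the same interfaces you can inspect manually. For EMR Serverless, for example, the linked `GetDashboardForJobRun` API returns a temporary URL for the job run's Spark History Server dashboard; a minimal sketch using the AWS CLI (application ID, job run ID, and region are placeholders):

```bash
# Fetch a temporary URL for the EMR Serverless Spark History Server dashboard
aws emr-serverless get-dashboard-for-job-run \
  --application-id 00xxxxxxxx \
  --job-run-id 00xxxxxxxx \
  --region us-east-1
```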
## Configuration
You can configure the Apache Spark Troubleshooting Agent MCP server for use with any MCP client.

**Example Configuration for Kiro CLI:**

For code troubleshooting, you can add:
```json
{
  "mcpServers": {
    "sagemaker-unified-studio-mcp-troubleshooting": {
      "type": "stdio",
      "command": "uvx",
      "args": [
        "mcp-proxy-for-aws@latest",
        "https://sagemaker-unified-studio-mcp.us-east-1.api.aws/spark-troubleshooting/mcp",
        "--service",
        "sagemaker-unified-studio-mcp",
        "--profile",
        "smus-mcp-profile",
        "--region",
        "us-east-1",
        "--read-timeout",
        "180"
      ],
      "timeout": 180000,
      "disabled": false
    }
  }
}
```
For code recommendations, you can also add the following entry under `mcpServers`:
```json
{
  "sagemaker-unified-studio-mcp-code-rec": {
    "type": "stdio",
    "command": "uvx",
    "args": [
      "mcp-proxy-for-aws@latest",
      "https://sagemaker-unified-studio-mcp.us-east-1.api.aws/spark-code-recommendation/mcp",
      "--service",
      "sagemaker-unified-studio-mcp",
      "--profile",
      "smus-mcp-profile",
      "--region",
      "us-east-1",
      "--read-timeout",
      "180"
    ],
    "timeout": 180000,
    "disabled": false
  }
}
```
## Setup & Installation
### Deploy CloudFormation Stack
Choose the appropriate **Launch Stack** button for your region to deploy the required resources. See the [Setup Documentation](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/spark-troubleshooting-agent-setup.html) for the complete list.
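
If you prefer to deploy from the command line instead of the console, a minimal sketch is shown below. The stack name is arbitrary and the template URL is a placeholder for the template referenced in the Setup Documentation for your region; because the stack creates an IAM role, IAM capabilities must be acknowledged.

```bash
# Deploy the troubleshooting stack from the CLI (template URL is a placeholder)
aws cloudformation create-stack \
  --stack-name spark-troubleshooting-mcp \
  --template-url https://example-bucket.s3.us-east-1.amazonaws.com/spark-troubleshooting-agent.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --region us-east-1
```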
### Setup Local Environment and AWS CLI Profile
Copy the one-line instruction from the CloudFormation stack output and run it locally:
```bash
export SMUS_MCP_REGION=us-east-1 && export IAM_ROLE=arn:aws:iam::111122223333:role/spark-troubleshooting-role-xxxxxx
```
Then create the `smus-mcp-profile` AWS CLI profile that the MCP server configuration references:

```bash
aws configure set profile.smus-mcp-profile.role_arn ${IAM_ROLE}
aws configure set profile.smus-mcp-profile.source_profile default
aws configure set profile.smus-mcp-profile.region ${SMUS_MCP_REGION}
```
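
Optionally, verify that the profile can assume the role before wiring it into an MCP client; the call should return the account ID and the ARN of the assumed troubleshooting role:

```bash
# Should print the assumed spark-troubleshooting role identity
aws sts get-caller-identity --profile smus-mcp-profile
```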
### Configure MCP Client (Kiro CLI Example)
```bash
# Add Spark Troubleshooting MCP Server
kiro-cli-chat mcp add \
  --name "sagemaker-unified-studio-mcp-troubleshooting" \
  --command "uvx" \
  --args "[\"mcp-proxy-for-aws@latest\",\"https://sagemaker-unified-studio-mcp.${SMUS_MCP_REGION}.api.aws/spark-troubleshooting/mcp\", \"--service\", \"sagemaker-unified-studio-mcp\", \"--profile\", \"smus-mcp-profile\", \"--region\", \"${SMUS_MCP_REGION}\", \"--read-timeout\", \"180\"]" \
  --timeout 180000 \
  --scope global

# Add Spark Code Recommendation MCP Server
kiro-cli-chat mcp add \
  --name "sagemaker-unified-studio-mcp-code-rec" \
  --command "uvx" \
  --args "[\"mcp-proxy-for-aws@latest\",\"https://sagemaker-unified-studio-mcp.${SMUS_MCP_REGION}.api.aws/spark-code-recommendation/mcp\", \"--service\", \"sagemaker-unified-studio-mcp\", \"--profile\", \"smus-mcp-profile\", \"--region\", \"${SMUS_MCP_REGION}\", \"--read-timeout\", \"180\"]" \
  --timeout 180000 \
  --scope global
```
## Usage Examples
### 1. Troubleshoot Spark Job Execution Failures
**EMR on EC2 Troubleshooting:**
```
Troubleshoot my EMR-EC2 step with id s-xxxxxxxxxxxx on cluster j-xxxxxxxxxxxxx
```
**Glue Job Troubleshooting:**
```
Troubleshoot my Glue job with job run id jr_xxxxxxxxxxxxxxxxxxxxxxxxxxxx and job name test_job
```
**EMR Serverless Troubleshooting:**
```
Troubleshoot my EMR-Serverless job run with application id 00xxxxxxxx and job run id 00xxxxxxxx
```
### 2. Request Code Fix Recommendations
**EMR on EC2 Code Recommendations:**
```
Recommend code fix for my EMR-EC2 step with id s-STEP_ID on cluster j-CLUSTER_ID
```
**Glue Job Code Recommendations:**
```
Recommend code fix for my Glue job with job run id jr_JOB_RUN_ID and job name test_job
```
## Limitations & Requirements
### Supported Workload States
- **Failed Workloads Only**: The tools only support analysis of failed Spark workloads; you can confirm a run's state before requesting analysis, as shown below
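
A minimal sketch for confirming that a run has actually failed, here for an AWS Glue job (the job name and run ID are placeholders):

```bash
# Prints the run state, e.g. "FAILED", "SUCCEEDED", or "RUNNING"
aws glue get-job-run \
  --job-name test_job \
  --run-id jr_xxxxxxxxxxxxxxxxxxxxxxxxxxxx \
  --query 'JobRun.JobRunState'
```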
### Platform-Specific Considerations
- **EMR Persistent UI**: When analyzing Amazon EMR-EC2 workloads, the tool connects to EMR Persistent UI. See [limitations](https://docs.aws.amazon.com/emr/latest/ManagementGuide/app-history-spark-UI.html#app-history-spark-UI-limitations)
- **Glue Studio Spark UI**: Retrieves information by parsing Spark event logs from Amazon S3. Maximum allowed event log size: 512 MB (2 GB for rolling logs). Spark UI event logging must be enabled on the job, as sketched after this list
- **Code Recommendations**: Only supported for PySpark applications on Amazon EMR-EC2 and AWS Glue workloads
- **Regional Resources**: The agent is regional and uses underlying EMR resources in that region. Cross-region troubleshooting is not supported
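
A minimal sketch of creating a Glue job with Spark UI event logging enabled, so the event logs the tool parses are available in S3. The job name, role ARN, script location, and log path are placeholders; `--enable-spark-ui` and `--spark-event-logs-path` are the Glue job parameters documented for the Spark UI:

```bash
# Create a Glue job that writes Spark event logs to S3 (all names and paths are placeholders)
aws glue create-job \
  --name test_job \
  --role arn:aws:iam::111122223333:role/GlueJobRole \
  --command '{"Name": "glueetl", "ScriptLocation": "s3://amzn-s3-demo-bucket/scripts/job.py", "PythonVersion": "3"}' \
  --default-arguments '{"--enable-spark-ui": "true", "--spark-event-logs-path": "s3://amzn-s3-demo-bucket/spark-event-logs/"}'
```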
## Troubleshooting Common Issues
### MCP Server Failed to Load
- Verify MCP configurations are properly set up
- Validate JSON syntax for missing commas, quotes, or brackets
- Verify local AWS credentials and IAM role policy configuration
- Run `/mcp` to verify server availability (Kiro CLI)
### Slow Tool Loading
- Tools may take a few seconds to load on first launch
- Try restarting the chat if tools don't appear
- Run `/tools` command to verify tool availability
### Tool Invocation Errors
- **Throttling Error**: Wait a few seconds before retrying
- **AccessDeniedException**: Check and fix permission issues
- **InvalidInputException**: Correct tool input parameters
- **ResourceNotFoundException**: Fix input parameters for resource reference
- **Internal Service Exception**: Document the analysis ID and contact AWS Support
## Data Usage
This server processes your Spark application logs and configuration files to provide troubleshooting recommendations. No sensitive data is stored permanently, and all processing follows AWS data protection standards.
## Security Best Practices
- **Trust Settings**: Do not enable the "trust" setting by default for all tool calls
- **Version Control**: Operate on git-versioned build environments when accepting code recommendations
- **Review Process**: Review each tool execution to understand what changes are being made
- **Code Changes**: Maintain full control over all code modifications and recommendations
## FAQs
### 1. What types of Spark applications are supported?
The agent supports both PySpark and Scala Spark applications running on Amazon EMR on EC2, EMR Serverless, AWS Glue, and Amazon SageMaker Notebooks.
### 2. What happens if my Spark job is still running?
The troubleshooting tools only support analysis of failed Spark workloads.
### 3. Can I get code recommendations for successful jobs?
Code recommendations are primarily focused on fixing issues in failed workloads, but you can request code-level suggestions for optimization even without a full failure analysis.
### 4. How does the agent access my Spark logs?
The agent connects to platform-specific interfaces: the EMR Persistent UI for EMR on EC2, the Glue Studio Spark UI for AWS Glue, and the Spark History Server for EMR Serverless, as well as Amazon S3 and CloudWatch logs, to extract the necessary telemetry data.
### 5. Is my data secure during the troubleshooting process?
Yes, all processing follows AWS data protection standards. The agent analyzes logs and configurations temporarily to provide recommendations without permanently storing sensitive data.
### 6. What should I do if the automated troubleshooting doesn't identify the issue?
The agent provides detailed error analysis and suggested fixes. If issues persist, you can escalate to AWS support with the analysis ID and tool responses for further assistance.

For more information, refer to the [AWS EMR Spark Troubleshooting Documentation](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/spark-troubleshoot.html).
