The Amazon Data Processing Agent is an intelligent conversational AI assistant that specializes in AWS data processing services. Built on top of the Model Context Protocol (MCP), this agent provides a natural language interface to complex data engineering tasks across AWS Glue, Amazon Athena, and Amazon EMR-EC2.
| Feature | Description |
|---|---|
| Agent Structure | Multi-agent architecture with specialized components |
| Custom Tools | send_email, manage_s3_table_buckets, manage_s3_namespaces, manage_s3_tables |
| MCP Servers | AWS Data Processing MCP Server |
| Model Provider | Amazon Bedrock (Claude 3.7 Sonnet, Claude 4.0 Sonnet) |
| UI Framework | Streamlit with real-time streaming |
This agent transforms how data engineers interact with AWS data processing services by:
- Natural Language Interface: Convert complex data processing requirements into executable AWS solutions through conversational AI
- Intelligent Tool Integration: Seamlessly connect to AWS services via the aws-dataprocessing-mcp-server for real-time operations
- End-to-End Data Pipeline Management: From data discovery and cataloging to ETL job orchestration and big data analytics
- Expert Guidance: Provide cost optimization, performance tuning, and best practice recommendations
- Rapid Prototyping: Generate, test, and deploy data processing solutions with minimal manual intervention
The agent leverages the Model Context Protocol (MCP) to connect with the aws-dataprocessing-mcp-server, which provides:
```
┌─────────────────────────────────────┐
│    Amazon Data Processing Agent     │
│     (Streamlit UI + Claude LLM)     │
└──────────────────┬──────────────────┘
                   │ MCP Protocol
                   ▼
┌─────────────────────────────────────┐
│    aws-dataprocessing-mcp-server    │
│  • AWS Glue Operations              │
│  • Amazon Athena Queries            │
│  • Amazon EMR Management            │
│  • S3 Data Operations               │
│  • Cost & Performance Analytics     │
└──────────────────┬──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│            AWS Services             │
│  • AWS Glue (ETL Jobs & Catalog)    │
│  • Amazon Athena (SQL Analytics)    │
│  • Amazon EMR-EC2 (Big Data)        │
│  • Amazon S3 (Data Storage)         │
└─────────────────────────────────────┘
```
- Data Catalog Management: Create, update, and manage databases, tables, and partitions
- ETL Job Development: Generate optimized PySpark scripts using Glue Version 5 (Spark 3.5.1)
- Crawler Operations: Automate schema discovery and metadata management
- Workflow Orchestration: Design and manage complex ETL workflows with triggers
- Natural Language to SQL: Convert business questions into optimized SQL queries
- Query Optimization: Analyze and improve query performance
- Cost Management: Monitor and optimize query costs
- Schema Discovery: Automatically understand table structures and relationships
- Cluster Management: Create, configure, and manage EMR clusters
- Big Data Processing: Handle large-scale data processing workloads
- Cost Optimization: Implement spot instances and auto-scaling strategies
- Performance Tuning: Optimize cluster configurations for specific workloads
- Real-time Streaming: Live responses with tool execution visibility
- Context Awareness: Maintains conversation history for complex multi-step operations
- Cost Analysis: Provides detailed cost breakdowns and optimization recommendations
- Error Handling: Intelligent retry logic and troubleshooting guidance
- Best Practices: Automated recommendations for security, performance, and cost optimization
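The context awareness noted above depends on keeping conversation history bounded so prompts fit the model's context window. A minimal sketch of a sliding-window history (the class name `SlidingWindowHistory` is hypothetical; the project's actual `chat_history_manager` may work differently):

```python
from collections import deque


class SlidingWindowHistory:
    """Keep only the most recent conversation turns.

    Illustrative sketch only; not the project's real implementation.
    """

    def __init__(self, max_turns: int = 10):
        # Each turn is a user message plus an assistant message, so the
        # deque holds max_turns * 2 entries and silently drops the oldest.
        self.messages = deque(maxlen=max_turns * 2)

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    def as_prompt(self) -> list[dict]:
        """Return the retained messages in order, oldest first."""
        return list(self.messages)


history = SlidingWindowHistory(max_turns=2)
for i in range(5):
    history.add("user", f"question {i}")
    history.add("assistant", f"answer {i}")

# Only the last 2 turns (4 messages) remain.
print(len(history.as_prompt()))  # prints 4
```

A deque with `maxlen` keeps eviction O(1) per message, which matters when long multi-step operations generate many tool-call exchanges.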
The project has been refactored into a modular, extensible architecture:
```
src/amazon_dataprocessing_agent/
├── __init__.py                  # Package initialization
├── main.py                      # Main application entry point
├── config/                      # Configuration management
│   ├── __init__.py
│   ├── constants.py             # Constants and styling
│   └── prompts.py               # System prompts
├── core/                        # Core functionality
│   ├── __init__.py
│   ├── agent_manager.py         # MCP agent management
│   ├── bedrock_agent.py         # Bedrock model interface
│   ├── chat_history_manager.py  # Chat history management
│   ├── session_state.py         # Session state management
│   └── streaming_handler.py     # Real-time streaming
├── tools/                       # Tool implementations
│   └── __init__.py
└── ui/                          # User interface components
    ├── __init__.py
    └── components.py            # UI rendering
```
- Python >= 3.12
- uv package manager
- AWS credentials configured
1. Install uv (if not already installed):

   ```bash
   curl -LsSf https://astral.sh/uv/install.sh | sh
   ```

   Or using Homebrew:

   ```bash
   brew install uv
   ```

2. Configure AWS credentials following the instructions here.

3. Install dependencies using uv:

   ```bash
   uv sync
   ```

4. Set up environment variables:

   ```bash
   cp .env.template .env  # Edit .env with your configuration
   ```
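The only variables this guide references in `.env` are the AWS profile and region, so a minimal file might look like the following (the values shown are placeholders; adjust them to your own profile and the region your resources live in):

```bash
# Placeholder values - set these to your own profile and region
AWS_PROFILE=aws-dp-mcp
AWS_REGION=us-east-1
```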
Many of the newer MCP offerings are primarily supported on macOS and specific Linux environments, including AppImage and Ubuntu. To run effectively on a Windows machine, install and run this MCP project inside the Windows Subsystem for Linux (WSL). For more information on WSL, refer to the Microsoft documentation here.
1. Install Windows Subsystem for Linux (WSL) on your Windows machine:

   ```bash
   wsl --install
   ```

2. Download and install a virtual Ubuntu instance in your WSL environment:

   ```bash
   wsl -d Ubuntu
   ```

3. Create a user account: after installing Ubuntu, you will be asked to create a default user account (which typically takes the name of the Windows user you are logged in as) and then enter a password.

4. Navigate to the home directory:

   ```bash
   cd
   ```

5. Install uv (if not already installed):

   ```bash
   curl -LsSf https://astral.sh/uv/install.sh | sh
   ```

6. Install unzip to support the AWS CLI installation in the next step. You may need to enter the password for the default user account created earlier to approve the install:

   ```bash
   sudo apt install unzip
   ```

7. Install the AWS CLI: because this is a fresh Ubuntu instance, you will likely need to reinstall the AWS CLI in your Linux environment:

   ```bash
   # Download the AWS CLI installer
   curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
   # Unzip the installer
   unzip awscliv2.zip
   # Run the install script
   sudo ./aws/install
   # Verify the installation
   aws --version
   ```

8. Configure AWS credentials. You will reference this profile (aws-dp-mcp) in your environment variables later:

   ```bash
   # Using AWS CLI
   aws configure --profile aws-dp-mcp
   ```

9. Clone the repo into the WSL environment and navigate to the project directory:

   ```bash
   # Clone the repo
   git clone https://github.com/strands-agents/samples.git
   # Navigate to the project directory
   cd samples/03-integrations/Amazon-DataProcessing-Agent
   ```

10. Install dependencies using uv:

    ```bash
    uv sync
    ```

11. Set up environment variables: copy the sample .env.template to a new .env file, update AWS_PROFILE to the profile you configured earlier (aws-dp-mcp), and update AWS_REGION to match your associated resources:

    ```bash
    # Copy the sample file to a new .env file
    cp .env.template .env
    # (OPTION A) Open the new .env file and update it with your AWS profile
    nano .env
    # (OPTION B) Open a file explorer to review the .env file and update it with an editor of your choice
    explorer.exe .
    ```

Option 1: Using uv (Recommended)
```bash
# Run the Streamlit application
uv run streamlit run app.py
```

Option 2: Using the installed package

```bash
# Install in development mode
uv pip install -e .

# Run the application
uv run dataprocessing-agent
```

Option 3: Direct execution

```bash
# Activate the virtual environment and run
uv run python -m streamlit run app.py
```

- Modularity: Each component has a single responsibility
- Extensibility: Easy to add new tools, UI components, or core functionality
- Reusability: Components can be imported and used independently
- Maintainability: Clear separation of concerns makes debugging easier
- Testability: Individual components can be unit tested
- New Tools: Add to `src/amazon_dataprocessing_agent/tools/`
- UI Components: Add to `src/amazon_dataprocessing_agent/ui/`
- Core Logic: Add to `src/amazon_dataprocessing_agent/core/`
- Configuration: Add to `src/amazon_dataprocessing_agent/config/`
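A new tool module under `tools/` boils down to a well-documented Python function plus some way for the agent to discover it. The sketch below is framework-agnostic and hypothetical (the function name, module path, and `TOOLS` registry are illustrative, not the Strands SDK's actual registration mechanism; check the existing `tools/` modules for the real pattern):

```python
# Hypothetical new tool module, e.g.
# src/amazon_dataprocessing_agent/tools/row_counter.py


def count_csv_rows(csv_text: str) -> int:
    """Count data rows in a CSV payload, excluding the header line.

    The docstring matters: agent frameworks typically surface it to the
    model so it knows when to call the tool.
    """
    lines = [line for line in csv_text.strip().splitlines() if line]
    return max(len(lines) - 1, 0)


# A simple name-to-function registry lets the agent look up tools.
TOOLS = {"count_csv_rows": count_csv_rows}

print(TOOLS["count_csv_rows"]("id,name\n1,a\n2,b\n3,c"))  # prints 3
```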
- Launch the application using one of the methods above
- Initialize the agent using the sidebar controls (connects to aws-dataprocessing-mcp-server)
- Select your preferred Claude model (Claude-3.7 Sonnet or Claude-4 Sonnet)
- Start chatting with the agent about your data processing needs
User: "Look at all the tables from my account federated across Glue Data Catalog"
Agent: Iterates over Glue databases and tables and provides a summary of each database, its tables, and their schemas
User: "Help me create a Glue job to transform JSON data from S3 to Parquet format"
Agent: Generates optimized PySpark script, uploads to S3, creates Glue job with best practices
User: "Show me the top 10 customers by revenue for the month of July"
Agent: Converts to SQL, executes via Athena, provides results with cost analysis
User: "Identify EMR clusters which are sitting idle and can be terminated"
Agent: Analyzes cluster utilization, identifies idle clusters, and recommends termination along with estimated cost savings
- Data Catalog Management: Create, update, and manage databases, tables, and partitions
- ETL Job Development: Generate optimized PySpark scripts using Glue Version 5 (Spark 3.5.1)
- Crawler Operations: Automate schema discovery and metadata management
- Workflow Orchestration: Design and manage complex ETL workflows with triggers
- Cost Optimization: Right-size DPU allocation and implement job bookmarks
- Natural Language to SQL: Convert business questions into optimized SQL queries
- Query Optimization: Analyze and improve query performance
- Cost Management: Monitor and optimize query costs with partitioning strategies
- Schema Discovery: Automatically understand table structures and relationships
- Result Analysis: Provide insights and visualizations from query results
- Cluster Management: Create, configure, and manage EMR clusters
- Big Data Processing: Handle large-scale data processing workloads (>10TB)
- Cost Optimization: Implement spot instances and auto-scaling strategies
- Performance Tuning: Optimize cluster configurations for specific workloads
- Step Management: Execute and monitor Spark/Hadoop jobs
- Bucket Management: List, analyze, and manage S3 buckets for data processing
- Script Deployment: Upload generated scripts to S3 for Glue job execution
- Usage Analysis: Identify idle buckets and optimize storage costs
- Data Format Optimization: Recommend optimal formats (Parquet, ORC) for analytics
The agent connects to the aws-dataprocessing-mcp-server which provides:
- Real-time AWS API Integration: Direct connection to AWS services
- Asynchronous Operation Monitoring: Track job status and completion
- Error Handling & Retries: Intelligent error recovery and user guidance
- Cost Tracking: Real-time cost analysis and optimization recommendations
- Security Best Practices: Automated security configuration and compliance checks
- Real-time streaming responses for immediate feedback
- Context-aware conversations with sliding window memory
- Usage statistics tracking for cost monitoring
- Export/import chat history for session management
- Responsive UI with modern styling
- Error handling and retry logic for robust operation
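Export/import of chat history can be as simple as round-tripping the message list through JSON. A sketch with hypothetical helper names (not the project's actual `chat_history_manager` API); the `version` field is an assumption added so future format changes stay detectable:

```python
import json


def export_history(messages: list[dict]) -> str:
    """Serialize chat messages to a JSON string suitable for download."""
    return json.dumps({"version": 1, "messages": messages}, indent=2)


def import_history(payload: str) -> list[dict]:
    """Restore chat messages from a previously exported JSON string."""
    data = json.loads(payload)
    return data["messages"]


msgs = [{"role": "user", "content": "List my Glue databases"}]
# Round-trip: exporting then importing yields the original messages.
assert import_history(export_history(msgs)) == msgs
```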
- Python >= 3.12
- Strands Agents
- Valid AWS credentials
- Access to Anthropic Claude models on Amazon Bedrock
- Ensure AWS credentials are properly configured
- Verify network connectivity to AWS services
- Check the Streamlit logs for detailed error messages
The Amazon Data Processing Agent creates various AWS resources during operation, including:
- AWS Glue: Databases, tables, crawlers, ETL jobs, and workflows
- Amazon Athena: Query results stored in S3, workgroups, and data catalogs
- Amazon EMR: EC2 clusters, security groups, and associated storage
- Amazon S3: Buckets, objects, and lifecycle policies
Ask the agent to help clean up resources:
"Please help me clean up all the AWS resources we created today"
Alternatively, clean up resources manually through the AWS Console or CLI.
