This course teaches production-grade multi-agent systems for large-scale incident management, log analysis, and automated triage.
It is aimed at support and meta engineers who handle:
- Large-scale log & metric analysis
- Incident triage, deduplication, clustering
- Root cause analysis with evidence
- Policy/runbook/SLA compliance
- Safe automation with strict guardrails
Day 1:
- Multi-agent architecture patterns
- Agent communication & coordination
- Building an automated incident triage system
- Hands-on: 3-agent triage pipeline
Day 2:
- Hierarchical agent systems
- Evidence-based root cause analysis
- Guardrails & safety mechanisms
- Hands-on: 5-agent RCA system
# Clone the repository
git clone <repository-url>
cd advanced-multi-agent-ai-systems-2-half-days
# Install dependencies
pip install -r requirements.txt
The labs work out of the box with MockLLM - no API keys required!
# Just start Jupyter
jupyter notebook
All notebooks will automatically use MockLLM with deterministic responses.
To use real LLMs:
- Copy the example environment file:
cp .env.example .env
- Edit .env and add your OpenAI API key:
OPENAI_API_KEY=sk-...your-actual-key...
OPENAI_MODEL=gpt-4o-mini
- Start Jupyter:
jupyter notebook
The system will automatically detect the API key and use OpenAI.
IMPORTANT: This course follows security best practices:
- ✅ NO hardcoded API keys anywhere
- ✅ All keys read from environment variables only
- ✅ Automatic fallback to MockLLM if no key present
- ✅ .env file is gitignored
- ✅ Only .env.example is committed (with placeholders)
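The key detection and MockLLM fallback described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in src/llm/llm_factory.py: the names `select_backend` and `BackendChoice` are hypothetical, and only `OPENAI_API_KEY`, `OPENAI_MODEL`, and the `gpt-4o-mini` default come from this README.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BackendChoice:
    """Hypothetical result type: which backend to use and with which model."""
    backend: str            # "openai" or "mock"
    model: Optional[str]    # model name for OpenAI, None for the mock


def select_backend(env: Optional[Mapping[str, str]] = None) -> BackendChoice:
    """Pick a backend from environment variables only (no hardcoded keys)."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # No key present: fall back to the mock backend.
        return BackendChoice(backend="mock", model=None)
    # Key present: use OpenAI, defaulting to the low-cost model.
    return BackendChoice(backend="openai", model=env.get("OPENAI_MODEL", "gpt-4o-mini"))


if __name__ == "__main__":
    print(select_backend())
```

Passing the environment in as a mapping keeps the selection logic easy to test without touching the real process environment.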
- day1_foundations_and_triage.ipynb
  - Multi-agent fundamentals
  - Communication patterns
  - 3-agent incident triage system
  - Exercises with real-world scenarios
- day2_advanced_patterns_and_rca.ipynb
  - Hierarchical coordination
  - Evidence-based reasoning
  - 5-agent root cause analysis system
  - Production guardrails
The built-in MockLLM provides:
- Deterministic mode (default): Same inputs → same outputs
- Probabilistic mode: Controlled randomness for evaluation
- Zero dependencies: No API calls, no network
- Same interface: Drop-in replacement for real LLMs
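To make the two modes concrete, here is a small sketch of how a mock could behave deterministically or with controlled randomness. It is an illustrative stand-in, not the course's MockLLM: the class name `TinyMockLLM`, its `generate` method, and the hash-based selection are all assumptions made for this example.

```python
import hashlib
import random
from typing import Optional, Sequence


class TinyMockLLM:
    """Illustrative stand-in, not the MockLLM shipped in src/llm/mock_llm.py."""

    def __init__(self, responses: Sequence[str], deterministic: bool = True,
                 seed: Optional[int] = None) -> None:
        if not responses:
            raise ValueError("responses must not be empty")
        self.responses = list(responses)
        self.deterministic = deterministic
        self._rng = random.Random(seed)

    def generate(self, prompt: str) -> str:
        if self.deterministic:
            # Hash the prompt so the same input always maps to the same canned reply.
            digest = hashlib.sha256(prompt.encode("utf-8")).digest()
            index = int.from_bytes(digest[:8], "big") % len(self.responses)
        else:
            # Seeded RNG: replies vary across calls but are reproducible per seed.
            index = self._rng.randrange(len(self.responses))
        return self.responses[index]
```

Neither mode makes network calls, so a sketch like this can stand in for a real client in tests as long as both expose the same method.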
3-agent triage system:
- Classifier Agent: Categorizes incidents (P0-P4)
- Deduplication Agent: Finds similar incidents
- Router Agent: Routes to correct team
5-agent RCA system:
- Log Parser Agent: Extracts structured data
- Pattern Detector Agent: Finds anomalies
- Correlation Agent: Links related events
- Hypothesis Agent: Generates RCA hypotheses
- Validator Agent: Validates against evidence
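As a rough picture of how the classifier, deduplication, and router roles chain together, the sketch below wires three plain functions into a pipeline. It is a simplified assumption, not the notebook implementation: keyword rules and string similarity stand in for LLM calls, and the `Incident` fields, team names, and similarity threshold are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class Incident:
    id: str
    title: str
    priority: Optional[str] = None
    duplicate_of: Optional[str] = None
    team: Optional[str] = None


def classify(incident: Incident) -> Incident:
    """Classifier stand-in: keyword rules instead of an LLM call."""
    words = set(incident.title.lower().split())
    if words & {"outage", "down"}:
        incident.priority = "P0"
    elif words & {"error", "timeout"}:
        incident.priority = "P2"
    else:
        incident.priority = "P4"
    return incident


def deduplicate(incident: Incident, seen: List[Incident],
                threshold: float = 0.8) -> Incident:
    """Deduplication stand-in: title similarity against earlier incidents."""
    for other in seen:
        ratio = SequenceMatcher(None, incident.title.lower(), other.title.lower()).ratio()
        if ratio >= threshold:
            incident.duplicate_of = other.id
            break
    return incident


def route(incident: Incident) -> Incident:
    """Router stand-in: map keywords to hypothetical team names."""
    words = set(incident.title.lower().split())
    incident.team = "database" if words & {"db", "database"} else "platform"
    return incident


def triage(incidents: List[Incident]) -> List[Incident]:
    """Run classify -> deduplicate -> route over each incident in order."""
    seen: List[Incident] = []
    for incident in incidents:
        route(deduplicate(classify(incident), seen))
        if incident.duplicate_of is None:
            seen.append(incident)  # only originals become dedup candidates
    return incidents
```

Keeping each role as a separate function with the same input and output type is one way to let a rule-based step be swapped for an LLM-backed agent later without changing the pipeline.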
After this course, you will:
- ✅ Design multi-agent systems for production incidents
- ✅ Implement agent communication & coordination
- ✅ Build evidence-based reasoning pipelines
- ✅ Apply guardrails for safe automation
- ✅ Handle large-scale log analysis with agents
- ✅ Deploy production-ready agent systems
.
├── README.md
├── requirements.txt
├── .env.example
├── day1_foundations_and_triage.ipynb
├── day2_advanced_patterns_and_rca.ipynb
├── src/
│ ├── llm/
│ │ ├── mock_llm.py # MockLLM implementation
│ │ ├── openai_llm.py # OpenAI wrapper
│ │ └── llm_factory.py # Factory pattern
│ ├── agents/
│ │ ├── base_agent.py # Agent base class
│ │ ├── communication.py # Message passing
│ │ └── orchestrator.py # Coordination logic
│ └── utils/
│ ├── log_parser.py # Log parsing utilities
│ └── metrics.py # Evaluation metrics
└── data/
├── sample_incidents.json # Sample incident data
└── sample_logs.txt # Sample log files
- Start with MockLLM: Let students understand the architecture without API costs
- Switch to real LLMs: For advanced exercises, enable OpenAI to show real behavior
- Cost control: Use gpt-4o-mini (default) to minimize costs
- Exercises: Each notebook has 3-5 hands-on exercises
- Time management: Each half-day is designed for 4 hours (3h teaching + 1h exercises)
Make sure you're running Jupyter from the project root directory.
If the OpenAI API key is not detected, check that:
- .env file exists (not .env.example)
- OPENAI_API_KEY is set in .env
- The key starts with sk-
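The checks above can also be run as a small script. This is a hedged sketch rather than a tool shipped with the course: the function names `parse_env` and `check_env` are hypothetical, and the parser only handles simple `NAME=value` lines.

```python
from pathlib import Path
from typing import Dict, List


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple NAME=value lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        values[name.strip()] = value.strip().strip('"').strip("'")
    return values


def check_env(path: Path = Path(".env")) -> List[str]:
    """Return a list of problems found; an empty list means all checks pass."""
    if not path.is_file():
        return [f"{path} not found (copy .env.example to .env)"]
    values = parse_env(path.read_text(encoding="utf-8"))
    key = values.get("OPENAI_API_KEY", "")
    problems: List[str] = []
    if not key:
        problems.append("OPENAI_API_KEY is not set")
    elif not key.startswith("sk-"):
        problems.append("OPENAI_API_KEY does not start with sk-")
    return problems


if __name__ == "__main__":
    # Run from the project root so the relative .env path resolves.
    for problem in check_env() or ["all checks passed"]:
        print(problem)
```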
Seeing MockLLM responses is normal: MockLLM is the default. To use OpenAI, set up .env as described above.
This training material is provided for educational purposes.
Feedback and improvements welcome! Please open an issue or PR.