A rigorous, hands-on program to master enterprise AI architecture through production-grade projects.
This repository documents the engineering journey of building real AI systems—not toy examples—with a focus on provider-agnostic design, RAG, autonomous agents, and enterprise integration. Code, architectural decisions, trade-offs, and failures are documented openly.
> "Don't write code you can't defend on a whiteboard."
Master AI systems architecture at an enterprise level, with a focus on:
- Provider-agnostic design — Learn the pattern, then implement across multiple providers
- Production-grade systems — No notebooks, everything deployable
- Complex real projects — Not tutorials, not toy examples
- Defensible decisions — Every architectural choice documented and justified
| Area | What You'll Build |
|---|---|
| RAG Architecture | Multi-provider retrieval systems with evaluation pipelines |
| Agent Orchestration | Autonomous agents with tool calling, memory, and multi-agent collaboration |
| Cloud Infrastructure | Production deployments with observability, CI/CD, and cost controls |
| Enterprise Integration | AI systems connected to real business platforms (ERPs, CRMs, custom APIs) |
| Category | Providers |
|---|---|
| Cloud | AWS Bedrock (Claude, Llama), OpenAI API, Google Vertex AI, Azure OpenAI |
| Enterprise | SAP AI Core, SAP Generative AI Hub |
| Open Source / Local | Ollama, vLLM, LocalAI |
- LangChain, LangGraph
- Semantic Kernel (optional)
- Custom orchestration patterns
| Type | Options |
|---|---|
| Managed | Knowledge Bases for Bedrock, Vertex AI Search |
| Self-hosted | PostgreSQL + pgvector, Chroma, Qdrant |
| Enterprise | SAP HANA Cloud Vector Engine |
- Compute: AWS ECS/Fargate, Lambda, Cloud Run
- IaC: Terraform, AWS CDK
- Observability: CloudWatch, X-Ray, OpenTelemetry
- CI/CD: GitHub Actions
- Primary: Python 3.11+
- Secondary: TypeScript (backend services)
### Phase 0: Setup

*Multi-environment development setup*
- Local development environment (Docker, VSCode, Python)
- Cloud accounts with budget controls (AWS)
- Local LLM setup (Ollama)
- Repository structure and principles
Deliverables:
- AWS-SETUP.md — Complete AWS baseline (budgets, IAM, CLI, Bedrock)
- OLLAMA-SETUP.md — Local LLM setup guide
### Phase 1: AI Foundations

*Core concepts without getting lost in theory*
- Neural networks, transformers (conceptual, not from scratch)
- How LLMs work (tokenization, attention, inference)
- PyTorch basics for understanding, not training
- Project: API serving a pre-trained model (Dockerized)
### Phase 2: LLMs + Provider Abstraction

*The pattern that prevents vendor lock-in*
- Provider abstraction architecture (interface + implementations)
- Implement: Ollama (local), Bedrock, OpenAI
- Prompt engineering patterns (few-shot, chain-of-thought, structured output)
- Streaming, error handling, retry patterns
- Project: Multi-provider chatbot with switchable backends
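The provider abstraction above can be sketched as an interface plus interchangeable implementations. This is a minimal illustration, not the repo's actual code: names like `LLMProvider`, `EchoProvider`, and `build_provider` are assumptions, and a real backend would call the Ollama, Bedrock, or OpenAI APIs where the echo stub returns a string.

```python
from abc import ABC, abstractmethod
import time


class LLMProvider(ABC):
    """Common interface every backend (Ollama, Bedrock, OpenAI, ...) implements."""

    @abstractmethod
    def complete(self, prompt: str) -> str:
        ...


def with_retries(fn, attempts: int = 3, base_delay: float = 0.05):
    """Wrap a provider call with exponential backoff on transient errors."""
    def wrapper(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return fn(*args, **kwargs)
            except RuntimeError:
                if attempt == attempts - 1:
                    raise  # out of retries, surface the error
                time.sleep(base_delay * 2 ** attempt)
    return wrapper


class EchoProvider(LLMProvider):
    """Stand-in backend for local tests; real implementations call a model API."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


def build_provider(name: str) -> LLMProvider:
    """Switchable backends: the rest of the app only ever sees LLMProvider."""
    registry = {"echo": EchoProvider}
    return registry[name]()
```

Because callers depend only on `LLMProvider`, swapping Bedrock for Ollama is a one-line configuration change rather than a refactor.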
### Phase 3: RAG Systems

*Retrieval-Augmented Generation across providers*
- Embeddings: concepts and model comparison
- Vector stores: pgvector, Chroma, managed options
- Chunking strategies, retrieval optimization
- Hybrid search (keyword + semantic)
- Evaluation: relevance, groundedness, faithfulness
- Project: Technical documentation system with evaluation pipeline
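Two of the concepts above, chunking and hybrid search, can be shown in a pure-Python sketch. The function names and the 0.5 blend weight are illustrative assumptions; a production system would use real embeddings and a vector store such as pgvector.

```python
import math


def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Fixed-size sliding-window chunking with overlap (the simplest baseline)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def hybrid_score(query: str, doc: str,
                 q_vec: list[float], d_vec: list[float],
                 alpha: float = 0.5) -> float:
    """Blend keyword overlap (lexical) with embedding similarity (semantic)."""
    q_terms, d_terms = set(query.lower().split()), set(doc.lower().split())
    keyword = len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0
    return alpha * keyword + (1 - alpha) * cosine(q_vec, d_vec)
```

Tuning `alpha` is itself an evaluation question: the relevance/groundedness pipeline above is what tells you whether lexical or semantic signals dominate for your corpus.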
### Phase 4: Autonomous Agents

*Systems that act, not just respond*
- Agent architectures (ReAct, Plan-and-Execute)
- Tool calling across providers
- LangGraph for stateful workflows
- Memory patterns (short-term, long-term, episodic)
- Human-in-the-loop, error recovery
- Project: Business process automation agent
- Project: Multi-agent collaborative system
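The agent loop at the heart of these projects can be reduced to a few lines: the model either requests a tool call or returns a final answer. This is a minimal ReAct-flavored sketch; `fake_model` is a deterministic stand-in for a real chat-completion call, and the `add` tool is a toy placeholder for real business APIs.

```python
TOOLS = {
    "add": lambda a, b: a + b,  # toy tool; real agents register business APIs
}


def fake_model(messages: list[dict]) -> dict:
    """Stand-in for an LLM: requests a tool on the first turn, then answers.
    A real implementation would call a provider's chat API here."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    result = next(m for m in messages if m["role"] == "tool")["content"]
    return {"answer": f"The result is {result}"}


def run_agent(user_input: str, model=fake_model, max_steps: int = 5) -> str:
    """Minimal agent loop: act (tool call) or finish (answer), up to max_steps."""
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):
        step = model(messages)
        if "answer" in step:
            return step["answer"]
        result = TOOLS[step["tool"]](**step["args"])  # execute requested tool
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("agent did not converge")
```

The `max_steps` cap and the explicit message log are the seeds of the error-recovery and memory patterns listed above; LangGraph formalizes the same loop as a stateful graph.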
### Phase 5: Infrastructure & MLOps

*Production-ready deployment*
- Containerized AI services (ECS, Cloud Run)
- Serverless inference (Lambda, Cloud Functions)
- CI/CD pipelines for ML
- Observability: tracing, metrics, alerts
- Cost optimization strategies
- Project: Production API with full observability
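A tiny taste of the observability work: a decorator that tags each call with a span id and logs its latency. This is a stdlib-only sketch under assumed names (`traced` is not from the repo); the production services would emit OpenTelemetry spans and ship them to X-Ray or CloudWatch instead of plain log lines.

```python
import functools
import logging
import time
import uuid


def traced(fn):
    """Log a span id and latency for each call (stand-in for real tracing)."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        span = uuid.uuid4().hex[:8]  # short correlation id for this call
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logging.info("span=%s fn=%s latency_ms=%.1f",
                         span, fn.__name__, elapsed_ms)
    return wrapper
```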
### Phase 6: Enterprise Integrations

*Connecting AI to real business systems*
- Integration patterns (sync, async, event-driven)
- Authentication and security
- SAP Integration (S/4HANA, BTP, AI Core)
- Other ERPs/CRMs patterns
- Project: Enterprise agent with real system connectivity
### Phase 7: Consolidation

*Portfolio, authority, and next steps*
- Fine-tuning techniques (LoRA, QLoRA) — when and why
- Security: prompt injection, data leakage, PII handling
- Compliance: GDPR, auditing, enterprise requirements
- Deliverable: Reference architecture document
- Deliverable: Professional portfolio
Every significant decision is documented as an ADR (Architecture Decision Record).
| ADR | Decision | Status |
|---|---|---|
| 001 | Provider abstraction as core pattern | Accepted |
| 002 | Multi-provider from Phase 2 | Accepted |
| 003 | pgvector as default vector store (cost-effective) | Accepted |
| 004 | Real projects over tutorials | Accepted |
| 005 | Ollama for local development | Accepted |
Full documentation: docs/DECISIONS.md
| Topic | Reason |
|---|---|
| Beginner Python/ML tutorial | Assumes programming experience |
| Prompt engineering tricks collection | Focus is architecture, not prompts |
| Single-provider deep dive | Intentionally multi-provider |
| Academic research | Goal is production systems |
| Training LLMs from scratch | Focus is using and orchestrating LLMs |
This project operates under realistic constraints.
| Threshold | Action |
|---|---|
| $10 USD/month | Early warning — review consumption |
| $25 USD/month | Caution — identify active resources |
| $50 USD/month | Action — shut down non-essential |
Strategy:
- Local development with Ollama (free)
- Use cloud free tiers aggressively
- Shut down resources when not in use
- Avoid always-on managed services
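The threshold table can be encoded as a simple guard. A sketch only, with an assumed function name; real enforcement would come from AWS Budgets alerts rather than application code.

```python
def budget_action(monthly_usd: float) -> str:
    """Map current monthly spend to the action defined in the cost-control table."""
    if monthly_usd >= 50:
        return "action: shut down non-essential resources"
    if monthly_usd >= 25:
        return "caution: identify active resources"
    if monthly_usd >= 10:
        return "early warning: review consumption"
    return "ok"
```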
```
ai-architecture-master/
├── docs/
│   ├── ROADMAP.md           # Detailed phase breakdown
│   ├── DECISIONS.md         # Architecture Decision Records
│   ├── PROGRESS.md          # Progress tracking
│   └── RESOURCES.md         # Curated learning resources
├── phase-0-setup/           # ✅ Complete
│   ├── README.md
│   ├── AWS-SETUP.md
│   ├── OLLAMA-SETUP.md
│   └── images/
├── phase-1-foundations/
├── phase-2-llms-providers/
├── phase-3-rag/
├── phase-4-agents/
├── phase-5-infrastructure/
├── phase-6-enterprise/
├── phase-7-consolidation/
└── shared/
    ├── abstractions/        # Provider abstraction interfaces
    ├── docker/
    ├── terraform/
    └── utils/
```
Current Phase: 1 - AI Foundations (Not Started)
| Phase | Status |
|---|---|
| Phase 0: Setup | ✅ Complete |
| Phase 1: AI Foundations | ⬜ Not Started |
| Phase 2: LLMs + Provider Abstraction | ⬜ Not Started |
| Phase 3: RAG Systems | ⬜ Not Started |
| Phase 4: Autonomous Agents | ⬜ Not Started |
| Phase 5: Infrastructure & MLOps | ⬜ Not Started |
| Phase 6: Enterprise Integrations | ⬜ Not Started |
| Phase 7: Consolidation | ⬜ Not Started |
Detailed progress: docs/PROGRESS.md
This is a personal learning journey documented publicly. While not accepting direct contributions, feedback and discussions are welcome through issues.
MIT License - see LICENSE
Last Updated: February 2026