---
marp: true
theme: vibeminds
paginate: true
style: |
  /* Mermaid diagram styling */
  .mermaid-container { display: flex; justify-content: center; align-items: center; width: 100%; margin: 0.5em 0; }
  .mermaid { text-align: center; }
  .mermaid svg { max-height: 280px; width: auto; }
  .mermaid .node rect, .mermaid .node polygon { rx: 5px; ry: 5px; }
  .mermaid .nodeLabel { padding: 0 10px; }
  /* Two-column layout */
  .columns { display: flex; gap: 40px; align-items: flex-start; }
  .column-left { flex: 1; }
  .column-right { flex: 1; }
  .column-left .mermaid svg { min-height: 400px; height: auto; max-height: 500px; }
  /* Section divider slides */
  section.section-divider { display: flex; flex-direction: column; justify-content: center; align-items: center; text-align: center; background: linear-gradient(135deg, #1a1a3e 0%, #4a3f8a 50%, #2d2d5a 100%); }
  section.section-divider h1 { font-size: 3.5em; margin-bottom: 0.2em; }
  section.section-divider h2 { font-size: 1.5em; color: #b39ddb; font-weight: 400; }
  section.section-divider p { font-size: 1.1em; color: #9575cd; margin-top: 1em; }
---
An AI-Assisted Development Case Study
Using Claude Opus 4.5 with Claude Code
What is Opik and how we approached the SDK
Opik is an open-source LLM observability platform by Comet ML
- Traces & Spans - Track LLM calls and application flow
- Datasets - Store test data for evaluation
- Experiments - Run and compare model evaluations
- Prompts - Version and manage prompt templates
- Feedback - Capture quality scores and user feedback
Goal: Build a comprehensive Go SDK matching the Python SDK's capabilities
| Component | Description |
|---|---|
| Core SDK | Client, traces, spans, context propagation |
| Data Management | Datasets, experiments, prompts |
| Evaluation | Heuristic metrics + LLM judges |
| Integrations | OpenAI, Anthropic, GoLLM |
| Infrastructure | CLI, middleware, streaming, batching |
| Testing | Comprehensive test suite with mocks |
| Documentation | MkDocs site, tutorials, README |
Source Analysis: 50+ Python files (~20K lines) | OpenAPI Spec: 201 operations (~15K lines)
Output: ~50+ Go source files (~15K lines) across 12 packages
```
go-opik/
├── *.go              # Core SDK (client, trace, span, config, etc.)
├── cmd/opik/         # CLI tool
├── evaluation/
│   ├── heuristic/    # BLEU, ROUGE, Levenshtein, etc.
│   └── llm/          # LLM-as-judge metrics
├── integrations/
│   ├── openai/       # OpenAI provider + tracing
│   ├── anthropic/    # Anthropic provider + tracing
│   └── gollm/        # GoLLM adapter
├── middleware/       # HTTP tracing middleware
├── internal/api/     # ogen-generated API client
└── testutil/         # Mock server + matchers
```
- Type-safe, no reflection
- Handles optional/nullable fields correctly
- Generated from OpenAPI spec (14,846 lines)
```go
client, err := opik.NewClient(
    opik.WithAPIKey("key"),
    opik.WithWorkspace("workspace"),
    opik.WithProjectName("my-project"),
)
```

- Idiomatic Go using `context.Context`
- Automatic parent-child span relationships
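The constructor above uses Go's functional options pattern. A minimal self-contained sketch of how such options compose — the `Config` fields and defaults here are illustrative, not the SDK's actual internals:

```go
package main

import "fmt"

// Config holds client settings; fields and defaults are illustrative.
type Config struct {
	APIKey      string
	Workspace   string
	ProjectName string
}

// Option mutates a Config during construction.
type Option func(*Config)

func WithAPIKey(k string) Option      { return func(c *Config) { c.APIKey = k } }
func WithWorkspace(w string) Option   { return func(c *Config) { c.Workspace = w } }
func WithProjectName(p string) Option { return func(c *Config) { c.ProjectName = p } }

// NewClient applies defaults first, then each option in order,
// so later options win over defaults and earlier options.
func NewClient(opts ...Option) *Config {
	c := &Config{ProjectName: "default"}
	for _, o := range opts {
		o(c)
	}
	return c
}

func main() {
	c := NewClient(WithAPIKey("key"), WithProjectName("my-project"))
	fmt.Println(c.APIKey, c.ProjectName)
}
```

The pattern keeps the constructor signature stable as new settings are added.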
Features, Testing, Documentation & DevOps
Tracing
- Traces and nested spans
- LLM, Tool, General span types
- Distributed trace propagation
- Streaming support
- Automatic batching
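Automatic batching can be sketched as a buffer that flushes whenever it reaches a size threshold. This is a simplified illustration only — a production batcher (including, presumably, the SDK's) would also flush on a timer and at shutdown:

```go
package main

import "fmt"

// batcher collects items and flushes them in groups of size max.
// Illustrative sketch, not the SDK's actual batching code.
type batcher struct {
	max   int
	buf   []string
	flush func([]string)
}

// add appends an item and flushes when the buffer is full.
func (b *batcher) add(item string) {
	b.buf = append(b.buf, item)
	if len(b.buf) >= b.max {
		b.flush(b.buf)
		b.buf = nil
	}
}

func main() {
	var flushes int
	b := &batcher{max: 2, flush: func(items []string) { flushes++ }}
	for _, s := range []string{"a", "b", "c", "d", "e"} {
		b.add(s)
	}
	fmt.Println(flushes) // 2 — two full batches flushed; "e" still buffered
}
```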
Data
- Dataset CRUD + items
- Experiment tracking
- Prompt versioning
- Template rendering
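Template rendering, at its simplest, substitutes variables into a prompt string. A sketch using the standard library's `text/template` — the SDK's actual template syntax may differ:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderPrompt fills a prompt template with variables using the
// standard library's text/template syntax.
func renderPrompt(tmplText string, vars map[string]string) (string, error) {
	tmpl, err := template.New("prompt").Parse(tmplText)
	if err != nil {
		return "", err
	}
	var out bytes.Buffer
	if err := tmpl.Execute(&out, vars); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	s, err := renderPrompt(
		"Summarize for {{.Audience}}: {{.Text}}",
		map[string]string{"Audience": "engineers", "Text": "trace data"},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // Summarize for engineers: trace data
}
```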
Evaluation
- 10+ heuristic metrics
- 8+ LLM judge metrics
- G-EVAL implementation
- Custom judge support
- Concurrent evaluation engine
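Levenshtein distance is one of the heuristic metrics listed above; the classic dynamic-programming formulation makes a compact example (a sketch, not the SDK's implementation):

```go
package main

import "fmt"

// levenshtein computes the edit distance between two strings using
// the standard two-row dynamic-programming recurrence.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			// deletion, insertion, or substitution
			cur[j] = min(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return prev[len(rb)]
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting")) // 3
}
```

The built-in `min` requires Go 1.21+, the oldest version in the CI matrix.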
Integrations
- OpenAI tracing transport
- Anthropic tracing transport
- GoLLM adapter for ADK
- HTTP middleware
Test Utilities (testutil/)
- `Matcher` interface: `Any()`, `AnyButNil()`, `AnyString()`, `AnyMap()`, `AnyFloat()`
- `MockServer` - HTTP mock server similar to Python's `respx`
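The matcher helpers amount to predicates over arbitrary values. A minimal sketch of such an interface — the names mirror the slide, but the real signatures may differ:

```go
package main

import "fmt"

// Matcher reports whether an arbitrary value satisfies a predicate.
type Matcher interface {
	Matches(v any) bool
}

// matcherFunc adapts a plain function to the Matcher interface.
type matcherFunc func(v any) bool

func (f matcherFunc) Matches(v any) bool { return f(v) }

// Any matches everything, including nil.
func Any() Matcher { return matcherFunc(func(any) bool { return true }) }

// AnyButNil matches any non-nil value.
func AnyButNil() Matcher { return matcherFunc(func(v any) bool { return v != nil }) }

// AnyString matches any value whose dynamic type is string.
func AnyString() Matcher {
	return matcherFunc(func(v any) bool { _, ok := v.(string); return ok })
}

func main() {
	fmt.Println(Any().Matches(nil))      // true
	fmt.Println(AnyButNil().Matches(nil)) // false
	fmt.Println(AnyString().Matches("x")) // true
}
```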
Test Coverage
| Package | Test Files | Key Tests |
|---|---|---|
| Core SDK | 15 files | Client, trace, span, config, context |
| Evaluation | 6 files | All metrics, scoring, engine |
| Integrations | 6 files | Providers, tracing transports |
| Test Utils | 2 files | Matchers, mock server |
```go
func TestProviderComplete(t *testing.T) {
	ctx := context.Background()

	ms := testutil.NewMockServer()
	defer ms.Close()

	ms.OnPost("/v1/chat/completions").RespondJSON(200, map[string]any{
		"choices": []map[string]any{{
			"message": map[string]any{"content": "Hello!"},
		}},
	})

	provider := openai.NewProvider(openai.WithBaseURL(ms.URL()))
	resp, err := provider.Complete(ctx, request)
	assert.NoError(t, err)

	// Verify response
	assert.Equal(t, "Hello!", resp.Content)
	// Verify request was made correctly
	assert.Equal(t, 1, ms.RequestCount())
}
```

- Getting Started: Installation, configuration, testing
- Core Concepts: Traces, spans, context, feedback
- Features: Datasets, experiments, prompts, streaming
- Evaluation: Heuristic metrics, LLM judges
- Integrations: OpenAI, Anthropic, GoLLM, middleware
- Tutorials: Agentic observability with Google ADK & Eino
- Comprehensive feature documentation
- Code examples for all major features
- CI badges, license, related links
Case study: Integrating Opik with multi-agent systems
Based on real-world stats-agent-team project:
*Mermaid pipeline diagram: entry (Eino) → 🔍 Research (Search) → 🧠 Synthesis (LLM + ADK) → ✅ Verification (LLM + ADK) → final output*
Covers:
- Eino workflow graph tracing
- Google ADK agent tracing
- Tool invocation spans
- Quality feedback scores
```yaml
jobs:
  test:
    strategy:
      matrix:
        go-version: ['1.21', '1.22', '1.23']
    steps:
      - run: go test -v -race -coverprofile=coverage.out ./...
  lint:
    steps:
      - uses: golangci/golangci-lint-action@v6
  build-cli:
    strategy:
      matrix:
        goos: [linux, darwin, windows]
        goarch: [amd64, arm64]
```

golangci-lint: 25+ linters enabled
Claude Opus 4.5 performance, insights & lessons learned
| Setting | Value |
|---|---|
| Model | Claude Opus 4.5 (claude-opus-4-5-20250514) |
| Effort | High |
| Context | Extended (with summarization) |
| Tools | Full Claude Code toolset |
- Iterative implementation with immediate testing
- Parallel file reads and tool calls for efficiency
- Todo tracking for complex multi-step tasks
| Milestone | Time |
|---|---|
| Session Start | ~19:00 |
| Core SDK Complete | ~20:15 |
| Test Suite Complete + | ~22:00 |
| Docs & CI Complete + | ~24:00 |
| Total Time + | ~4-5 hours |
+ includes human multi-tasking
| Source | Time Estimate | Cost Estimate |
|---|---|---|
| APIMatic | 4 weeks per SDK | ~$52K including maintenance |
| Speakeasy | Months | ~$90K per SDK |
| liblab | "Weeks or months" | $50K+ per language |
These companies sell SDK tools, so they emphasize manual costs—but the ballpark is consistent.
Building a production SDK manually takes weeks, not hours.
| Approach | Time | Source |
|---|---|---|
| Industry Benchmark | 4+ weeks per SDK | APIMatic, Speakeasy, liblab |
| Claude Opus 4.5 | 4-5 hours | This project |
**What Accounts for the Difference?**
- Parallel Processing - Claude reads multiple files simultaneously
- No Context Switching - Continuous focus on single project
- Pattern Recognition - Instantly applies Go idioms
- Reference Implementation - Quickly maps Python → Go patterns
- No Typing Delay - Generates code at output speed
- Integrated Testing - Writes tests alongside implementation
**Caveats**
- Human review still essential for production deployment
- Your mileage may vary based on API complexity and coverage requirements
- Large Codebase Navigation
  - Reading and understanding 50+ files
  - Cross-referencing Python SDK patterns
- Code Generation Quality
  - Idiomatic Go code
  - Proper error handling
  - Functional options pattern
- Test Development
  - Table-driven tests
  - Mock server implementation
  - Edge case coverage
- Documentation
  - MkDocs structure
  - Tutorial with code examples
  - README with badges
- Issue: Complex optional types in API responses
- Solution: Careful handling of `OptXxx` types with `.Set` checks
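The `OptXxx`/`.Set` pattern from ogen can be modeled with a small generic wrapper. An illustrative sketch only — ogen's actual generated types carry more methods and serialization logic:

```go
package main

import "fmt"

// Opt mimics ogen-style optional fields: a value plus a Set flag
// distinguishing "absent" from "present with zero value".
type Opt[T any] struct {
	Value T
	Set   bool
}

// Get returns the value and whether it was set.
func (o Opt[T]) Get() (T, bool) { return o.Value, o.Set }

func main() {
	var name Opt[string] // absent field: Set is false
	if v, ok := name.Get(); ok {
		fmt.Println("name:", v)
	} else {
		fmt.Println("name not set") // this branch runs
	}

	name = Opt[string]{Value: "trace-1", Set: true}
	v, _ := name.Get()
	fmt.Println("name:", v) // name: trace-1
}
```

Checking `Set` before reading `Value` is what distinguishes a genuinely absent API field from one that happens to hold Go's zero value.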
- Issue: Config format changed significantly
- Solution: Iterative fixes with immediate validation
- Issue: Different language idioms
- Solution: Adapted patterns (e.g., `context.Context` vs decorators)
- Issue: Need Python's `respx`-like mocking
- Solution: Built custom `MockServer` with route matching
```
$ golangci-lint run
dataset.go:174: duplicate of prompt.go:279-314 (dupl)
prompt.go:279: duplicate of dataset.go:174-209 (dupl)
```

Only 2 warnings - expected duplication in similar API patterns
```
$ go test ./...
ok  github.com/plexusone/opik-go                       0.070s
ok  github.com/plexusone/opik-go/evaluation
ok  github.com/plexusone/opik-go/evaluation/heuristic
ok  github.com/plexusone/opik-go/evaluation/llm
ok  github.com/plexusone/opik-go/integrations/...
```

All tests passing across 11 packages
- High effort setting enables deeper code analysis and better solutions
- Parallel tool calls significantly speed up exploration and implementation
- Todo tracking helps maintain focus on complex multi-file changes
- Iterative validation (run tests after each change) catches issues early
- Reference implementations (Python SDK) provide valuable patterns
A production-ready Go SDK built efficiently with AI assistance
Deliverables, future work & resources
| Deliverable | Status |
|---|---|
| Core SDK | ✅ Complete |
| CLI Tool | ✅ Complete |
| Evaluation Framework | ✅ Complete |
| LLM Integrations | ✅ Complete |
| Test Suite | ✅ Complete |
| MkDocs Documentation | ✅ Complete |
| CI/CD Pipeline | ✅ Complete |
| Agentic Tutorial | ✅ Complete |
Repository: github.com/plexusone/opik-go
- More LLM Providers: Gemini, Mistral, Cohere
- gRPC Support: For high-performance tracing
- OpenTelemetry Bridge: Export to OTel collectors
- More Tutorials: RAG applications, chatbots
- Benchmarks: Performance testing suite
- Open for contributions
- Issues and PRs welcome
- MIT License
- Repository: github.com/plexusone/opik-go
- Opik: github.com/comet-ml/opik
- Documentation: agentplexus.github.io/go-opik
- GitHub: @agentplexus
A Go SDK for LLM Observability
Built with Claude Opus 4.5 + Claude Code