---
marp: true
theme: vibeminds
paginate: true
style: |
  /* Mermaid diagram styling */
  .mermaid-container { display: flex; justify-content: center; align-items: center; width: 100%; margin: 0.5em 0; }
  .mermaid { text-align: center; }
  .mermaid svg { max-height: 280px; width: auto; }
  .mermaid .node rect, .mermaid .node polygon { rx: 5px; ry: 5px; }
  .mermaid .nodeLabel { padding: 0 10px; }
  /* Two-column layout */
  .columns { display: flex; gap: 40px; align-items: flex-start; }
  .column-left { flex: 1; }
  .column-right { flex: 1; }
  .column-left .mermaid svg { min-height: 400px; height: auto; max-height: 500px; }
  /* Section divider slides */
  section.section-divider { display: flex; flex-direction: column; justify-content: center; align-items: center; text-align: center; background: linear-gradient(135deg, #1a1a3e 0%, #4a3f8a 50%, #2d2d5a 100%); }
  section.section-divider h1 { font-size: 3.5em; margin-bottom: 0.2em; }
  section.section-divider h2 { font-size: 1.5em; color: #b39ddb; font-weight: 400; }
  section.section-divider p { font-size: 1.1em; color: #9575cd; margin-top: 1em; }
---
<script type="module"> import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs'; mermaid.initialize({ startOnLoad: true, theme: 'dark', themeVariables: { background: 'transparent', primaryColor: '#7c4dff', primaryTextColor: '#e8eaf6', primaryBorderColor: '#667eea', lineColor: '#b39ddb', secondaryColor: '#302b63', tertiaryColor: '#24243e' } }); </script>

Building go-opik

A Go SDK for LLM Observability

An AI-Assisted Development Case Study

Using Claude Opus 4.5 with Claude Code


Section 1

Introduction & Overview

What is Opik and how we approached the SDK


What is Opik? 🔭

Opik is an open-source LLM observability platform by Comet ML

  • Traces & Spans - Track LLM calls and application flow
  • Datasets - Store test data for evaluation
  • Experiments - Run and compare model evaluations
  • Prompts - Version and manage prompt templates
  • Feedback - Capture quality scores and user feedback

Goal: Build a comprehensive Go SDK matching the Python SDK's capabilities


Project Scope 📋

| Component | Description |
| --- | --- |
| Core SDK | Client, traces, spans, context propagation |
| Data Management | Datasets, experiments, prompts |
| Evaluation | Heuristic metrics + LLM judges |
| Integrations | OpenAI, Anthropic, GoLLM |
| Infrastructure | CLI, middleware, streaming, batching |
| Testing | Comprehensive test suite with mocks |
| Documentation | MkDocs site, tutorials, README |

Source Analysis: 50+ Python files (~20K lines) | OpenAPI Spec: 201 operations (~15K lines)

Output: 50+ Go source files (~15K lines) across 12 packages


Architecture Overview 🏗️

```
go-opik/
├── *.go                    # Core SDK (client, trace, span, config, etc.)
├── cmd/opik/               # CLI tool
├── evaluation/
│   ├── heuristic/          # BLEU, ROUGE, Levenshtein, etc.
│   └── llm/                # LLM-as-judge metrics
├── integrations/
│   ├── openai/             # OpenAI provider + tracing
│   ├── anthropic/          # Anthropic provider + tracing
│   └── gollm/              # GoLLM adapter
├── middleware/             # HTTP tracing middleware
├── internal/api/           # ogen-generated API client
└── testutil/               # Mock server + matchers
```

Key Design Decisions 🎯

1. ogen for API Client Generation

  • Type-safe, no reflection
  • Handles optional/nullable fields correctly
  • Generated from OpenAPI spec (14,846 lines)

2. Functional Options Pattern

```go
client, err := opik.NewClient(
    opik.WithAPIKey("key"),
    opik.WithWorkspace("workspace"),
    opik.WithProjectName("my-project"),
)
```
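Under the hood, this pattern is usually an option type closing over a config struct. A minimal sketch of that idea (names like `clientConfig` and the `"Default Project"` default are illustrative, not the SDK's actual internals):

```go
package main

import "fmt"

// clientConfig is an illustrative stand-in for the SDK's internal config.
type clientConfig struct {
	apiKey      string
	workspace   string
	projectName string
}

// Option mutates the config; each With* helper returns one.
type Option func(*clientConfig)

func WithAPIKey(key string) Option {
	return func(c *clientConfig) { c.apiKey = key }
}

func WithWorkspace(ws string) Option {
	return func(c *clientConfig) { c.workspace = ws }
}

func WithProjectName(name string) Option {
	return func(c *clientConfig) { c.projectName = name }
}

// NewClient applies defaults first, then caller options in order.
func NewClient(opts ...Option) *clientConfig {
	cfg := &clientConfig{projectName: "Default Project"}
	for _, opt := range opts {
		opt(cfg)
	}
	return cfg
}

func main() {
	c := NewClient(WithAPIKey("key"), WithWorkspace("workspace"))
	fmt.Println(c.workspace, c.projectName) // workspace Default Project
}
```

Applying defaults before the options means callers only specify what they want to override, which keeps the constructor backward-compatible as new options are added.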

3. Context-Based Tracing

  • Idiomatic Go using context.Context
  • Automatic parent-child span relationships

Section 2

Implementation Deep Dive

Features, Testing, Documentation & DevOps


Core Features Implemented ✨

Tracing

  • Traces and nested spans
  • LLM, Tool, General span types
  • Distributed trace propagation
  • Streaming support
  • Automatic batching

Data

  • Dataset CRUD + items
  • Experiment tracking
  • Prompt versioning
  • Template rendering
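Template rendering maps naturally onto the standard library. A sketch using `text/template`; note the `{{.var}}` placeholder syntax here is Go's, and the SDK's actual placeholder style may differ:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderPrompt fills a prompt template with variables.
func renderPrompt(tmpl string, vars map[string]string) (string, error) {
	t, err := template.New("prompt").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, vars); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderPrompt(
		"Summarize {{.topic}} in {{.n}} sentences.",
		map[string]string{"topic": "LLM observability", "n": "2"},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // Summarize LLM observability in 2 sentences.
}
```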

Evaluation

  • 10+ heuristic metrics
  • 8+ LLM judge metrics
  • G-EVAL implementation
  • Custom judge support
  • Concurrent evaluation engine
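A concurrent evaluation engine boils down to scoring dataset items in parallel under a worker limit. A minimal sketch with a semaphore channel; the `Metric` interface and `exactMatch` metric are illustrative, not the SDK's API:

```go
package main

import (
	"fmt"
	"sync"
)

// Metric scores one model output against a reference (illustrative).
type Metric interface {
	Name() string
	Score(output, reference string) float64
}

type exactMatch struct{}

func (exactMatch) Name() string { return "exact_match" }
func (exactMatch) Score(out, ref string) float64 {
	if out == ref {
		return 1.0
	}
	return 0.0
}

// evaluate scores every (output, reference) pair concurrently,
// bounded by `workers` goroutines at a time.
func evaluate(items [][2]string, m Metric, workers int) []float64 {
	scores := make([]float64, len(items))
	sem := make(chan struct{}, workers)
	var wg sync.WaitGroup
	for i, it := range items {
		wg.Add(1)
		sem <- struct{}{} // acquire a worker slot
		go func(i int, out, ref string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			scores[i] = m.Score(out, ref) // each goroutine writes its own index
		}(i, it[0], it[1])
	}
	wg.Wait()
	return scores
}

func main() {
	items := [][2]string{{"hi", "hi"}, {"foo", "bar"}}
	fmt.Println(evaluate(items, exactMatch{}, 4)) // [1 0]
}
```

Writing each score to a distinct slice index keeps the goroutines race-free without a mutex.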

Integrations

  • OpenAI tracing transport
  • Anthropic tracing transport
  • GoLLM adapter for ADK
  • HTTP middleware

Testing Strategy 🧪

Mirrored Python SDK's Testing Patterns

Test Utilities (testutil/)

  • Matcher interface: Any(), AnyButNil(), AnyString(), AnyMap(), AnyFloat()
  • MockServer - HTTP mock server similar to Python's respx
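The matcher helpers compose naturally from a one-method interface plus a function adapter. A sketch of how `Any()`, `AnyButNil()`, and `AnyString()` might be built; the exact interface in `testutil/` may differ:

```go
package main

import "fmt"

// Matcher reports whether a captured value satisfies an expectation.
type Matcher interface {
	Matches(v any) bool
}

// matcherFunc adapts a plain function to the Matcher interface.
type matcherFunc func(v any) bool

func (f matcherFunc) Matches(v any) bool { return f(v) }

// Any matches every value, including nil.
func Any() Matcher { return matcherFunc(func(any) bool { return true }) }

// AnyButNil matches every non-nil value.
func AnyButNil() Matcher {
	return matcherFunc(func(v any) bool { return v != nil })
}

// AnyString matches any value whose dynamic type is string.
func AnyString() Matcher {
	return matcherFunc(func(v any) bool { _, ok := v.(string); return ok })
}

func main() {
	fmt.Println(Any().Matches(nil), AnyButNil().Matches(nil), AnyString().Matches("x"))
	// true false true
}
```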

Test Coverage

| Package | Test Files | Key Tests |
| --- | --- | --- |
| Core SDK | 15 files | Client, trace, span, config, context |
| Evaluation | 6 files | All metrics, scoring, engine |
| Integrations | 6 files | Providers, tracing transports |
| Test Utils | 2 files | Matchers, mock server |

Test Example: Mock Server 💻

```go
func TestProviderComplete(t *testing.T) {
    ms := testutil.NewMockServer()
    defer ms.Close()

    ms.OnPost("/v1/chat/completions").RespondJSON(200, map[string]any{
        "choices": []map[string]any{{
            "message": map[string]any{"content": "Hello!"},
        }},
    })

    provider := openai.NewProvider(openai.WithBaseURL(ms.URL()))
    resp, err := provider.Complete(ctx, request)
    assert.NoError(t, err)

    // Verify response
    assert.Equal(t, "Hello!", resp.Content)

    // Verify request was made correctly
    assert.Equal(t, 1, ms.RequestCount())
}
```

Documentation Created 📚

MkDocs Site Structure

  • Getting Started: Installation, configuration, testing
  • Core Concepts: Traces, spans, context, feedback
  • Features: Datasets, experiments, prompts, streaming
  • Evaluation: Heuristic metrics, LLM judges
  • Integrations: OpenAI, Anthropic, GoLLM, middleware
  • Tutorials: Agentic observability with Google ADK & Eino

README.md

  • Comprehensive feature documentation
  • Code examples for all major features
  • CI badges, license, related links

Tutorial: Agentic Observability 🤖

Case study: Integrating Opik with multi-agent systems

Based on real-world stats-agent-team project:

<div class="mermaid">
flowchart LR
    D["🎯 Orchestrator<br/>(Eino)"] --> A["🔍 Research<br/>(Search)"]
    A --> B["🧠 Synthesis<br/>(LLM + ADK)"]
    B --> C["✅ Verification<br/>(LLM + ADK)"]
    C --> D
    style A fill:#667eea,stroke:#764ba2,color:#fff
    style B fill:#667eea,stroke:#764ba2,color:#fff
    style C fill:#667eea,stroke:#764ba2,color:#fff
    style D fill:#764ba2,stroke:#667eea,color:#fff
</div>

Covers:

  • Eino workflow graph tracing
  • Google ADK agent tracing
  • Tool invocation spans
  • Quality feedback scores

CI/CD Infrastructure ⚙️

GitHub Actions Workflow

```yaml
jobs:
  test:
    strategy:
      matrix:
        go-version: ['1.21', '1.22', '1.23']
    steps:
      - run: go test -v -race -coverprofile=coverage.out ./...

  lint:
    steps:
      - uses: golangci/golangci-lint-action@v6

  build-cli:
    strategy:
      matrix:
        goos: [linux, darwin, windows]
        goarch: [amd64, arm64]
```

golangci-lint: 25+ linters enabled


Section 3

AI-Assisted Development

Claude Opus 4.5 performance, insights & lessons learned


Claude Opus 4.5 DevEx 🧠

Session Configuration

| Setting | Value |
| --- | --- |
| Model | Claude Opus 4.5 (claude-opus-4-5-20250514) |
| Effort | High |
| Context | Extended (with summarization) |
| Tools | Full Claude Code toolset |

Development Approach

  • Iterative implementation with immediate testing
  • Parallel file reads and tool calls for efficiency
  • Todo tracking for complex multi-step tasks

Session Statistics 📊

Development Timeline

| Milestone | Time |
| --- | --- |
| Session Start | ~19:00 |
| Core SDK Complete | ~20:15 |
| Test Suite Complete | ~22:00 + |
| Docs & CI Complete | ~24:00 + |
| Total Time | ~4-5 hours + |

+ includes human multi-tasking

Token Usage (Estimated)

| Category | Estimate |
| --- | --- |
| Input Tokens | ~800K - 1M |
| Output Tokens | ~150K - 200K |
| Files Read (Python) | 50+ |
| Lines of Code Read | ~20,000+ |
| Files Created/Modified | 60+ |
| Lines of Code Written | ~15,000+ |
| Estimated Cost | ~$20 - $30 |

Industry Benchmarks ⏱️

What SDK Companies Report for Manual Development

| Source | Time Estimate | Cost Estimate |
| --- | --- | --- |
| APIMatic | 4 weeks per SDK | ~$52K including maintenance |
| Speakeasy | Months | ~$90K per SDK |
| liblab | "Weeks or months" | $50K+ per language |

These companies sell SDK tooling, so they have an incentive to emphasize manual costs, but the ballpark is consistent.

Building a production SDK manually takes weeks, not hours.


Productivity Comparison 🚀

Time Comparison

| Approach | Time | Source |
| --- | --- | --- |
| Industry Benchmark | 4+ weeks per SDK | APIMatic, Speakeasy, liblab |
| Claude Opus 4.5 | 4-5 hours | This project |

What Accounts for the Difference?

  1. Parallel Processing - Claude reads multiple files simultaneously
  2. No Context Switching - Continuous focus on a single project
  3. Pattern Recognition - Instantly applies Go idioms
  4. Reference Implementation - Quickly maps Python → Go patterns
  5. No Typing Delay - Generates code at output speed
  6. Integrated Testing - Writes tests alongside implementation

Caveats

  1. Human review is still essential for production deployment
  2. Your mileage may vary with API complexity and coverage requirements

What Claude Handled Well 💪

Strengths Demonstrated

  1. Large Codebase Navigation
    • Reading and understanding 50+ files
    • Cross-referencing Python SDK patterns
  2. Code Generation Quality
    • Idiomatic Go code
    • Proper error handling
    • Functional options pattern
  3. Test Development
    • Table-driven tests
    • Mock server implementation
    • Edge case coverage
  4. Documentation
    • MkDocs structure
    • Tutorial with code examples
    • README with badges

Challenges & Solutions 🔧

  1. Challenge 1: ogen Generated Code

    • Issue: Complex optional types in API responses
    • Solution: Careful handling of OptXxx types with .Set checks
  2. Challenge 2: golangci-lint v2 Config

    • Issue: Config format changed significantly
    • Solution: Iterative fixes with immediate validation
  3. Challenge 3: Matching Python SDK Patterns

    • Issue: Different language idioms
    • Solution: Adapted patterns (e.g., context vs decorators)
  4. Challenge 4: Test Utilities

    • Issue: Need Python's respx-like mocking
    • Solution: Built custom MockServer with route matching
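The `OptXxx` handling from Challenge 1 looks roughly like this. The struct below is a simplified stand-in for ogen's generated optional types, not the generated code itself; the key point is that `Set` distinguishes "field absent" from "field present but zero-valued":

```go
package main

import "fmt"

// OptString is a simplified stand-in for an ogen optional type.
type OptString struct {
	Value string
	Set   bool
}

// Get returns the value and whether it was actually present.
func (o OptString) Get() (string, bool) { return o.Value, o.Set }

// Or returns the value, or def when the field was absent.
func (o OptString) Or(def string) string {
	if o.Set {
		return o.Value
	}
	return def
}

func main() {
	absent := OptString{}                          // field missing in the response
	present := OptString{Value: "prod", Set: true} // field explicitly set
	fmt.Println(absent.Or("default"), present.Or("default")) // default prod
}
```

Checking `Set` (or using `Get`/`Or`) before reading `Value` is what keeps "missing" from being silently conflated with the empty string.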

Code Quality Results ✅

golangci-lint Output

```
$ golangci-lint run
dataset.go:174: duplicate of prompt.go:279-314 (dupl)
prompt.go:279: duplicate of dataset.go:174-209 (dupl)
```

Only 2 warnings - expected duplication in similar API patterns

Test Results

```
$ go test ./...
ok  github.com/plexusone/opik-go          0.070s
ok  github.com/plexusone/opik-go/evaluation
ok  github.com/plexusone/opik-go/evaluation/heuristic
ok  github.com/plexusone/opik-go/evaluation/llm
ok  github.com/plexusone/opik-go/integrations/...
```

All tests passing across 11 packages


Key Takeaways 💡

AI-Assisted Development Insights

  1. High effort setting enables deeper code analysis and better solutions
  2. Parallel tool calls significantly speed up exploration and implementation
  3. Todo tracking helps maintain focus on complex multi-file changes
  4. Iterative validation (run tests after each change) catches issues early
  5. Reference implementations (Python SDK) provide valuable patterns

Result

A production-ready Go SDK built efficiently with AI assistance


Section 4

Conclusion

Deliverables, future work & resources


Project Deliverables 📦

| Deliverable | Status |
| --- | --- |
| Core SDK | ✅ Complete |
| CLI Tool | ✅ Complete |
| Evaluation Framework | ✅ Complete |
| LLM Integrations | ✅ Complete |
| Test Suite | ✅ Complete |
| MkDocs Documentation | ✅ Complete |
| CI/CD Pipeline | ✅ Complete |
| Agentic Tutorial | ✅ Complete |

Repository: github.com/plexusone/opik-go


Future Enhancements 🔮

Potential Additions

  • More LLM Providers: Gemini, Mistral, Cohere
  • gRPC Support: For high-performance tracing
  • OpenTelemetry Bridge: Export to OTel collectors
  • More Tutorials: RAG applications, chatbots
  • Benchmarks: Performance testing suite

Community

  • Open for contributions
  • Issues and PRs welcome
  • MIT License

Resources 🔗

Links

  • Repository: github.com/plexusone/opik-go
  • Opik: github.com/comet-ml/opik
  • Documentation: agentplexus.github.io/go-opik

Contact

  • GitHub: @agentplexus

Thank You 🙏

go-opik

A Go SDK for LLM Observability

Built with Claude Opus 4.5 + Claude Code