---
marp: true
theme: vibeminds
paginate: true
style: |
  /* Mermaid diagram styling */
  .mermaid-container { display: flex; justify-content: center; align-items: center; width: 100%; margin: 0.5em 0; }
  .mermaid { text-align: center; }
  .mermaid svg { max-height: 280px; width: auto; }
  .mermaid .node rect, .mermaid .node polygon { rx: 5px; ry: 5px; }
  .mermaid .nodeLabel { padding: 0 10px; }
  /* Two-column layout */
  .columns { display: flex; gap: 40px; align-items: flex-start; }
  .column-left { flex: 1; }
  .column-right { flex: 1; }
  .column-left .mermaid svg { min-height: 400px; height: auto; max-height: 500px; }
  /* Section divider slides */
  section.section-divider { display: flex; flex-direction: column; justify-content: center; align-items: center; text-align: center; background: linear-gradient(135deg, #1a1a3e 0%, #4a3f8a 50%, #2d2d5a 100%); }
  section.section-divider h1 { font-size: 3.5em; margin-bottom: 0.2em; }
  section.section-divider h2 { font-size: 1.5em; color: #b39ddb; font-weight: 400; }
  section.section-divider p { font-size: 1.1em; color: #9575cd; margin-top: 1em; }
---
<script type="module"> import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs'; mermaid.initialize({ startOnLoad: true, theme: 'dark', themeVariables: { background: 'transparent', primaryColor: '#7c4dff', primaryTextColor: '#e8eaf6', primaryBorderColor: '#667eea', lineColor: '#b39ddb', secondaryColor: '#302b63', tertiaryColor: '#24243e' } }); </script>

Building go-opik

A Go SDK for LLM Observability

An AI-Assisted Development Case Study

Using Claude Opus 4.5 with Claude Code


Section 1

Introduction & Overview

What is Opik and how we approached the SDK


What is Opik? 🔭

Opik is an open-source LLM observability platform by Comet ML

  • Traces & Spans - Track LLM calls and application flow
  • Datasets - Store test data for evaluation
  • Experiments - Run and compare model evaluations
  • Prompts - Version and manage prompt templates
  • Feedback - Capture quality scores and user feedback

Goal: Build a comprehensive Go SDK matching the Python SDK's capabilities


Project Scope 📋

| Component | Description |
| --- | --- |
| Core SDK | Client, traces, spans, context propagation |
| Data Management | Datasets, experiments, prompts |
| Evaluation | Heuristic metrics + LLM judges |
| Integrations | OpenAI, Anthropic, GoLLM |
| Infrastructure | CLI, middleware, streaming, batching |
| Testing | Comprehensive test suite with mocks |
| Documentation | MkDocs site, tutorials, README |

Source Analysis: 50+ Python files (~20K lines) | OpenAPI Spec: 201 operations (~15K lines)

Output: 50+ Go source files (~15K lines) across 12 packages


Architecture Overview 🏗️

```
go-opik/
├── *.go                    # Core SDK (client, trace, span, config, etc.)
├── cmd/opik/               # CLI tool
├── evaluation/
│   ├── heuristic/          # BLEU, ROUGE, Levenshtein, etc.
│   └── llm/                # LLM-as-judge metrics
├── integrations/
│   ├── openai/             # OpenAI provider + tracing
│   ├── anthropic/          # Anthropic provider + tracing
│   └── gollm/              # GoLLM adapter
├── middleware/             # HTTP tracing middleware
├── internal/api/           # ogen-generated API client
└── testutil/               # Mock server + matchers
```

Key Design Decisions 🎯

1. ogen for API Client Generation

  • Type-safe, no reflection
  • Handles optional/nullable fields correctly
  • Generated from OpenAPI spec (14,846 lines)

2. Functional Options Pattern

```go
client, err := opik.NewClient(
    opik.WithAPIKey("key"),
    opik.WithWorkspace("workspace"),
    opik.WithProjectName("my-project"),
)
```
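Under the hood, this pattern is usually an option type closing over a config struct. A minimal sketch of that idea (names like `clientConfig` and the `"Default Project"` default are illustrative, not the SDK's actual internals):

```go
package main

import "fmt"

// clientConfig is an illustrative stand-in for the SDK's internal config.
type clientConfig struct {
	apiKey      string
	workspace   string
	projectName string
}

// Option mutates the config; each With* helper returns one.
type Option func(*clientConfig)

func WithAPIKey(key string) Option {
	return func(c *clientConfig) { c.apiKey = key }
}

func WithWorkspace(ws string) Option {
	return func(c *clientConfig) { c.workspace = ws }
}

func WithProjectName(name string) Option {
	return func(c *clientConfig) { c.projectName = name }
}

// NewClient applies defaults first, then caller options in order.
func NewClient(opts ...Option) *clientConfig {
	cfg := &clientConfig{projectName: "Default Project"}
	for _, opt := range opts {
		opt(cfg)
	}
	return cfg
}

func main() {
	c := NewClient(WithAPIKey("key"), WithWorkspace("workspace"))
	fmt.Println(c.workspace, c.projectName) // workspace Default Project
}
```

Applying defaults before the options means callers only specify what they want to override, which keeps the constructor backward-compatible as new options are added.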

3. Context-Based Tracing

  • Idiomatic Go using context.Context
  • Automatic parent-child span relationships

Section 2

Implementation Deep Dive

Features, Testing, Documentation & DevOps


Core Features Implemented ✨

Tracing

  • Traces and nested spans
  • LLM, Tool, General span types
  • Distributed trace propagation
  • Streaming support
  • Automatic batching

Data

  • Dataset CRUD + items
  • Experiment tracking
  • Prompt versioning
  • Template rendering
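Template rendering maps naturally onto the standard library. A sketch using `text/template`; note the `{{.var}}` placeholder syntax here is Go's, and the SDK's actual placeholder style may differ:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderPrompt fills a prompt template with variables.
func renderPrompt(tmpl string, vars map[string]string) (string, error) {
	t, err := template.New("prompt").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, vars); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderPrompt(
		"Summarize {{.topic}} in {{.n}} sentences.",
		map[string]string{"topic": "LLM observability", "n": "2"},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // Summarize LLM observability in 2 sentences.
}
```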

Evaluation

  • 10+ heuristic metrics
  • 8+ LLM judge metrics
  • G-EVAL implementation
  • Custom judge support
  • Concurrent evaluation engine
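A concurrent evaluation engine boils down to scoring dataset items in parallel under a worker limit. A minimal sketch with a semaphore channel; the `Metric` interface and `exactMatch` metric are illustrative, not the SDK's API:

```go
package main

import (
	"fmt"
	"sync"
)

// Metric scores one model output against a reference (illustrative).
type Metric interface {
	Name() string
	Score(output, reference string) float64
}

type exactMatch struct{}

func (exactMatch) Name() string { return "exact_match" }
func (exactMatch) Score(out, ref string) float64 {
	if out == ref {
		return 1.0
	}
	return 0.0
}

// evaluate scores every (output, reference) pair concurrently,
// bounded by `workers` goroutines at a time.
func evaluate(items [][2]string, m Metric, workers int) []float64 {
	scores := make([]float64, len(items))
	sem := make(chan struct{}, workers)
	var wg sync.WaitGroup
	for i, it := range items {
		wg.Add(1)
		sem <- struct{}{} // acquire a worker slot
		go func(i int, out, ref string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			scores[i] = m.Score(out, ref) // each goroutine writes its own index
		}(i, it[0], it[1])
	}
	wg.Wait()
	return scores
}

func main() {
	items := [][2]string{{"hi", "hi"}, {"foo", "bar"}}
	fmt.Println(evaluate(items, exactMatch{}, 4)) // [1 0]
}
```

Writing each score to a distinct slice index keeps the goroutines race-free without a mutex.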

Integrations

  • OpenAI tracing transport
  • Anthropic tracing transport
  • GoLLM adapter for ADK
  • HTTP middleware

Testing Strategy 🧪

Mirrored Python SDK's Testing Patterns

Test Utilities (testutil/)

  • Matcher interface: Any(), AnyButNil(), AnyString(), AnyMap(), AnyFloat()
  • MockServer - HTTP mock server similar to Python's respx
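The matcher helpers compose naturally from a one-method interface plus a function adapter. A sketch of how `Any()`, `AnyButNil()`, and `AnyString()` might be built; the exact interface in `testutil/` may differ:

```go
package main

import "fmt"

// Matcher reports whether a captured value satisfies an expectation.
type Matcher interface {
	Matches(v any) bool
}

// matcherFunc adapts a plain function to the Matcher interface.
type matcherFunc func(v any) bool

func (f matcherFunc) Matches(v any) bool { return f(v) }

// Any matches every value, including nil.
func Any() Matcher { return matcherFunc(func(any) bool { return true }) }

// AnyButNil matches every non-nil value.
func AnyButNil() Matcher {
	return matcherFunc(func(v any) bool { return v != nil })
}

// AnyString matches any value whose dynamic type is string.
func AnyString() Matcher {
	return matcherFunc(func(v any) bool { _, ok := v.(string); return ok })
}

func main() {
	fmt.Println(Any().Matches(nil), AnyButNil().Matches(nil), AnyString().Matches("x"))
	// true false true
}
```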

Test Coverage

| Package | Test Files | Key Tests |
| --- | --- | --- |
| Core SDK | 15 files | Client, trace, span, config, context |
| Evaluation | 6 files | All metrics, scoring, engine |
| Integrations | 6 files | Providers, tracing transports |
| Test Utils | 2 files | Matchers, mock server |

Test Example: Mock Server 💻

```go
func TestProviderComplete(t *testing.T) {
    ms := testutil.NewMockServer()
    defer ms.Close()

    ms.OnPost("/v1/chat/completions").RespondJSON(200, map[string]any{
        "choices": []map[string]any{{
            "message": map[string]any{"content": "Hello!"},
        }},
    })

    provider := openai.NewProvider(openai.WithBaseURL(ms.URL()))
    resp, err := provider.Complete(ctx, request)
    assert.NoError(t, err)

    // Verify response
    assert.Equal(t, "Hello!", resp.Content)

    // Verify request was made correctly
    assert.Equal(t, 1, ms.RequestCount())
}
```

Documentation Created 📚

MkDocs Site Structure

  • Getting Started: Installation, configuration, testing
  • Core Concepts: Traces, spans, context, feedback
  • Features: Datasets, experiments, prompts, streaming
  • Evaluation: Heuristic metrics, LLM judges
  • Integrations: OpenAI, Anthropic, GoLLM, middleware
  • Tutorials: Agentic observability with Google ADK & Eino

README.md

  • Comprehensive feature documentation
  • Code examples for all major features
  • CI badges, license, related links

Tutorial: Agentic Observability 🤖

Case study: Integrating Opik with multi-agent systems

Based on real-world stats-agent-team project:

<div class="mermaid">
flowchart LR
    D["🎯 Orchestrator<br/>(Eino)"] --> A["🔍 Research<br/>(Search)"]
    A --> B["🧠 Synthesis<br/>(LLM + ADK)"]
    B --> C["✅ Verification<br/>(LLM + ADK)"]
    C --> D
    style A fill:#667eea,stroke:#764ba2,color:#fff
    style B fill:#667eea,stroke:#764ba2,color:#fff
    style C fill:#667eea,stroke:#764ba2,color:#fff
    style D fill:#764ba2,stroke:#667eea,color:#fff
</div>

Covers:

  • Eino workflow graph tracing
  • Google ADK agent tracing
  • Tool invocation spans
  • Quality feedback scores

CI/CD Infrastructure ⚙️

GitHub Actions Workflow

```yaml
jobs:
  test:
    strategy:
      matrix:
        go-version: ['1.21', '1.22', '1.23']
    steps:
      - run: go test -v -race -coverprofile=coverage.out ./...

  lint:
    steps:
      - uses: golangci/golangci-lint-action@v6

  build-cli:
    strategy:
      matrix:
        goos: [linux, darwin, windows]
        goarch: [amd64, arm64]
```

golangci-lint: 25+ linters enabled


Section 3

AI-Assisted Development

Claude Opus 4.5 performance, insights & lessons learned


Claude Opus 4.5 DevEx 🧠

Session Configuration

| Setting | Value |
| --- | --- |
| Model | Claude Opus 4.5 (claude-opus-4-5-20250514) |
| Effort | High |
| Context | Extended (with summarization) |
| Tools | Full Claude Code toolset |

Development Approach

  • Iterative implementation with immediate testing
  • Parallel file reads and tool calls for efficiency
  • Todo tracking for complex multi-step tasks

Session Statistics 📊

Development Timeline

| Milestone | Time |
| --- | --- |
| Session Start | ~19:00 |
| Core SDK Complete | ~20:15 |
| Test Suite Complete | ~22:00 + |
| Docs & CI Complete | ~24:00 + |
| Total Time | ~4-5 hours + |

+ includes human multi-tasking

Token Usage (Estimated)

| Category | Estimate |
| --- | --- |
| Input Tokens | ~800K - 1M |
| Output Tokens | ~150K - 200K |
| Files Read (Python) | 50+ |
| Lines of Code Read | ~20,000+ |
| Files Created/Modified | 60+ |
| Lines of Code Written | ~15,000+ |
| Estimated Cost | ~$20 - $30 |

Industry Benchmarks ⏱️

What SDK Companies Report for Manual Development

| Source | Time Estimate | Cost Estimate |
| --- | --- | --- |
| APIMatic | 4 weeks per SDK | ~$52K including maintenance |
| Speakeasy | Months | ~$90K per SDK |
| liblab | "Weeks or months" | $50K+ per language |

These companies sell SDK tooling, so they have an incentive to emphasize manual costs, but the ballpark is consistent.

Building a production SDK manually takes weeks, not hours.


Productivity Comparison 🚀

Time Comparison

| Approach | Time | Source |
| --- | --- | --- |
| Industry Benchmark | 4+ weeks per SDK | APIMatic, Speakeasy, liblab |
| Claude Opus 4.5 | 4-5 hours | This project |

What Accounts for the Difference?

  1. Parallel Processing - Claude reads multiple files simultaneously
  2. No Context Switching - Continuous focus on a single project
  3. Pattern Recognition - Instantly applies Go idioms
  4. Reference Implementation - Quickly maps Python → Go patterns
  5. No Typing Delay - Generates code at output speed
  6. Integrated Testing - Writes tests alongside implementation

Caveats

  1. Human review is still essential for production deployment
  2. Your mileage may vary with API complexity and coverage requirements

What Claude Handled Well 💪

Strengths Demonstrated

  1. Large Codebase Navigation
    • Reading and understanding 50+ files
    • Cross-referencing Python SDK patterns
  2. Code Generation Quality
    • Idiomatic Go code
    • Proper error handling
    • Functional options pattern
  3. Test Development
    • Table-driven tests
    • Mock server implementation
    • Edge case coverage
  4. Documentation
    • MkDocs structure
    • Tutorial with code examples
    • README with badges

Challenges & Solutions 🔧

  1. Challenge 1: ogen Generated Code

    • Issue: Complex optional types in API responses
    • Solution: Careful handling of OptXxx types with .Set checks
  2. Challenge 2: golangci-lint v2 Config

    • Issue: Config format changed significantly
    • Solution: Iterative fixes with immediate validation
  3. Challenge 3: Matching Python SDK Patterns

    • Issue: Different language idioms
    • Solution: Adapted patterns (e.g., context vs decorators)
  4. Challenge 4: Test Utilities

    • Issue: Need Python's respx-like mocking
    • Solution: Built custom MockServer with route matching
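The `OptXxx` handling from Challenge 1 looks roughly like this. The struct below is a simplified stand-in for ogen's generated optional types, not the generated code itself; the key point is that `Set` distinguishes "field absent" from "field present but zero-valued":

```go
package main

import "fmt"

// OptString is a simplified stand-in for an ogen optional type.
type OptString struct {
	Value string
	Set   bool
}

// Get returns the value and whether it was actually present.
func (o OptString) Get() (string, bool) { return o.Value, o.Set }

// Or returns the value, or def when the field was absent.
func (o OptString) Or(def string) string {
	if o.Set {
		return o.Value
	}
	return def
}

func main() {
	absent := OptString{}                          // field missing in the response
	present := OptString{Value: "prod", Set: true} // field explicitly set
	fmt.Println(absent.Or("default"), present.Or("default")) // default prod
}
```

Checking `Set` (or using `Get`/`Or`) before reading `Value` is what keeps "missing" from being silently conflated with the empty string.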

Code Quality Results ✅

golangci-lint Output

```
$ golangci-lint run
dataset.go:174: duplicate of prompt.go:279-314 (dupl)
prompt.go:279: duplicate of dataset.go:174-209 (dupl)
```

Only 2 warnings - expected duplication in similar API patterns

Test Results

```
$ go test ./...
ok  github.com/plexusone/opik-go          0.070s
ok  github.com/plexusone/opik-go/evaluation
ok  github.com/plexusone/opik-go/evaluation/heuristic
ok  github.com/plexusone/opik-go/evaluation/llm
ok  github.com/plexusone/opik-go/integrations/...
```

All tests passing across 11 packages


Key Takeaways 💡

AI-Assisted Development Insights

  1. High effort setting enables deeper code analysis and better solutions
  2. Parallel tool calls significantly speed up exploration and implementation
  3. Todo tracking helps maintain focus on complex multi-file changes
  4. Iterative validation (run tests after each change) catches issues early
  5. Reference implementations (Python SDK) provide valuable patterns

Result

A production-ready Go SDK built efficiently with AI assistance


Section 4

Conclusion

Deliverables, future work & resources


Project Deliverables 📦

| Deliverable | Status |
| --- | --- |
| Core SDK | ✅ Complete |
| CLI Tool | ✅ Complete |
| Evaluation Framework | ✅ Complete |
| LLM Integrations | ✅ Complete |
| Test Suite | ✅ Complete |
| MkDocs Documentation | ✅ Complete |
| CI/CD Pipeline | ✅ Complete |
| Agentic Tutorial | ✅ Complete |

Repository: github.com/plexusone/opik-go


Future Enhancements 🔮

Potential Additions

  • More LLM Providers: Gemini, Mistral, Cohere
  • gRPC Support: For high-performance tracing
  • OpenTelemetry Bridge: Export to OTel collectors
  • More Tutorials: RAG applications, chatbots
  • Benchmarks: Performance testing suite

Community

  • Open for contributions
  • Issues and PRs welcome
  • MIT License

Resources 🔗

Links

  • Repository: github.com/plexusone/opik-go
  • Opik: github.com/comet-ml/opik
  • Documentation: agentplexus.github.io/go-opik

Contact

  • GitHub: @agentplexus

Thank You 🙏

go-opik

A Go SDK for LLM Observability

Built with Claude Opus 4.5 + Claude Code