23 Mar 21:27

hashnadz

24cdda8

2.1.477 Latest


🚀 Arthur Engine Release

March 23, 2026

This release delivers significant improvements to observability and experiment management, introducing the Arthur Observability SDK v1.0, advanced trace filtering capabilities, and streamlined experiment creation workflows for enhanced developer productivity.

Arthur Observability SDK Launch

Python SDK for LLM Tracing

Launched Arthur Observability SDK v1.0, a comprehensive Python package that automatically instruments 33 AI frameworks including OpenAI, Anthropic, LangChain, LlamaIndex, and CrewAI
Added server-side prompt management with centralized prompt versioning and rendering directly through the Arthur GenAI Engine
Implemented zero-configuration auto-instrumentation using OpenTelemetry standards with optional dependencies for frameworks you actually use
Enabled session and user context propagation across all traces for request-level debugging and analytics

The Arthur Observability SDK eliminates the complexity of LLM application monitoring by providing automatic tracing across your entire AI stack with a single installation, while offering integrated prompt management that treats prompts as versioned code artifacts.

Trace Visibility & Analysis Enhancements

Advanced Filtering and Sorting

Added span count and token count filtering to help users find traces matching specific complexity or resource usage criteria
Implemented comprehensive sorting capabilities for both traces and spans across multiple attributes
Improved trace navigation efficiency for large datasets with range-based filtering options
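Range-based filtering of this kind can be pictured with a minimal Python sketch. The `Trace` record, its field names, and `filter_traces` are hypothetical and exist only to illustrate the idea; they are not the Arthur API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Trace:
    # Hypothetical record; field names are assumptions for illustration.
    trace_id: str
    span_count: int
    token_count: int


def in_range(value: int, low: Optional[int], high: Optional[int]) -> bool:
    """Return True when value lies in the inclusive [low, high] range.

    A bound of None leaves that side of the range open.
    """
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True


def filter_traces(
    traces: Iterable[Trace],
    min_spans: Optional[int] = None,
    max_spans: Optional[int] = None,
    min_tokens: Optional[int] = None,
    max_tokens: Optional[int] = None,
) -> List[Trace]:
    """Keep traces whose span and token counts both fall in the given ranges."""
    return [
        t for t in traces
        if in_range(t.span_count, min_spans, max_spans)
        and in_range(t.token_count, min_tokens, max_tokens)
    ]


traces = [
    Trace("a", span_count=3, token_count=120),
    Trace("b", span_count=25, token_count=9000),
    Trace("c", span_count=12, token_count=400),
]
# Traces with at least 10 spans and at most 1000 tokens.
heavy_but_cheap = filter_traces(traces, min_spans=10, max_tokens=1000)
```

Leaving a bound as `None` keeps that side open, so the same function covers "at least", "at most", and "between" queries.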

Display and User Experience Improvements

Fixed annotation explanations so they display readable text instead of "[object Object]" in trace interfaces
Enhanced latency contrast in trace span details headers with color-coded duration badges for better accessibility
Resolved stale chunk loading failures when opening span drawers after deployments
Corrected LLM prompt token counts to include cached tokens, fixing highly inaccurate counts for heavily cached requests

These improvements significantly enhance the ability to navigate and analyze large trace datasets, reducing debugging time and enabling more targeted performance analysis.
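The cached-token correction listed above can be illustrated with a small Python sketch. It assumes a usage record that reports cached and uncached prompt tokens separately; `PromptUsage` and its field names are hypothetical and do not reflect the engine's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptUsage:
    # Hypothetical usage record: assumes cached and uncached prompt tokens
    # are reported as separate numbers.
    uncached_prompt_tokens: int
    cached_prompt_tokens: int = 0


def total_prompt_tokens(usage: PromptUsage) -> int:
    """Total prompt tokens, counting cached tokens as part of the prompt."""
    return usage.uncached_prompt_tokens + usage.cached_prompt_tokens


usage = PromptUsage(uncached_prompt_tokens=150, cached_prompt_tokens=9850)
naive = usage.uncached_prompt_tokens    # ignores the cached portion
corrected = total_prompt_tokens(usage)  # includes the cached portion
```

With a heavily cached request like this one, the uncached figure alone covers only a small fraction of the prompt, which is why omitting cached tokens skews the count so badly.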

Experiment & Notebook Experience

Streamlined Experiment Creation

Introduced wizard-based experiment creation modal with step-by-step guidance, form validation, and success feedback
Added auto-selection of single available prompt versions to eliminate unnecessary manual selection steps
Implemented pagination in version selector for projects with numerous prompt versions

Unified Notebook Interface

Standardized interaction patterns across Prompt, RAG, and Agent notebook types
Added inline rename functionality and unified table actions (Launch, View last run, Delete) across all notebook types
Fixed back-navigation issues so users return to the expected location in their workflow

These changes create a more intuitive and efficient experiment setup process while eliminating interface inconsistencies that previously required learning different interaction patterns for each notebook type.

User Experience & Personalization

Timezone and Time Format Preferences

Added user settings modal allowing customization of timezone and time format preferences (12-hour or 24-hour)
Implemented persistent preference storage with settings maintained across browser sessions
Applied consistent timestamp formatting throughout all application interfaces including tasks, experiments, datasets, and traces

Users can now personalize how timestamps appear throughout the application, providing a consistent experience tailored to individual preferences and eliminating timezone conversion friction.
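As a rough illustration of applying one timezone and clock-style preference to every timestamp, here is a minimal Python sketch. The function name and parameters are hypothetical, and a fixed UTC offset stands in for a named timezone to keep the example dependency-free.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def format_timestamp(ts: datetime, tz: tzinfo, use_24_hour: bool) -> str:
    """Render an aware datetime in the user's zone and preferred clock style."""
    local = ts.astimezone(tz)
    clock = "%H:%M" if use_24_hour else "%I:%M %p"
    return local.strftime("%Y-%m-%d " + clock)


utc_time = datetime(2026, 3, 23, 21, 27, tzinfo=timezone.utc)
# A fixed UTC-5 offset stands in for a named zone in this sketch.
user_zone = timezone(timedelta(hours=-5))

as_24h = format_timestamp(utc_time, user_zone, use_24_hour=True)
as_12h = format_timestamp(utc_time, user_zone, use_24_hour=False)
```

A named zone from the standard-library `zoneinfo` module could be passed as `tz` instead, which would also handle daylight-saving transitions.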

Deployment & Infrastructure Enhancements

Mac and Local Development Support

Added ARM64 CPU images for improved Docker deployment on Mac systems
Fixed local deployment configuration issues including secret formatting and GPU worker tuning
Reduced ECS alarm oversensitivity by adjusting CPU and memory monitoring thresholds to avoid false alerts during normal hourly job bursts

Claude Code Integration

Integrated Claude Code observability instrumentation using hook-based tracing to ship Claude Code sessions as OpenInference traces

These infrastructure improvements enhance developer experience for local development and ensure more reliable cloud deployments with appropriate monitoring thresholds.

Release notes generated by Louisa


11 Mar 20:25

hashnadz

2.1.456

898d5b3

2.1.456

🚀 Arthur Engine Release

March 12, 2026

This release delivers a comprehensive UI modernization, enhanced evaluation workflows, and improved agent task management alongside critical security updates and performance optimizations.

User Experience & Interface Enhancements

Navigation Consolidation

Unified all major product areas into streamlined tabbed interfaces, replacing scattered navigation with intuitive single-entry points
Consolidated RAG functionality into unified navigation with Notebooks, Experiments, and Configurations tabs
Merged Prompt capabilities into single entry point with Prompts, Notebooks, and Experiments tabs
Combined Evaluate features into tabbed view with Evals Management, Continuous Evals, and Results
Simplified Test section by merging Agentic Experiments and Agentic Notebooks into unified interface
Moved global settings (Model Providers, API Keys) from task sidebar to dedicated settings gear menu

Dark Mode & Theme Improvements

Fixed dark mode contrast issues across all UI components and standardized table styling consistency
Replaced all Tailwind color classes with MUI theme colors for automatic dark mode support
Converted native HTML elements to MUI components for better accessibility and consistent theming

The navigation redesign significantly reduces cognitive load while the theme improvements ensure a polished experience across all viewing modes.

Evaluation & Experiment Enhancements

Continuous Evaluation Workflows

Added visual span selector for continuous eval creation, allowing users to select data directly from trace viewer instead of manual typing
Introduced inline eval creation from trace viewer with side-by-side span inspection
Added evaluate traces modal accessible directly from trace overview with streamlined creation flow
Enabled submission of continuous evaluations without requiring description field
Added notification system to prompt users for review when evaluation versions are upgraded

Experiment Management

Improved how the experiment loading state is derived from experiment and test case status
Added clickable trace ID links in agentic experiment test cases that open trace viewer in new tab
Enhanced prompt experiment stability by converting to synchronous execution mode

These improvements streamline the evaluation creation process and provide better visibility into experiment progress.

Agent Task Management & Monitoring

Task Organization & Discovery

Added comprehensive filter, sort, and visibility controls to All Tasks page with activity window filters
Implemented task archival and unarchival capabilities with proper rule and metrics handling
Enhanced task metadata with enriched agent information including tools, sub-agents, models, and infrastructure
Added global polling system for agent tasks with proper duplicate execution prevention
Improved trace count accuracy in agent discovery with synchronous execution support

Task Interface Improvements

Merged task details into overview page as modal dialog for streamlined task management
Standardized page headers, action buttons, and empty states across all task views
Updated task navigation to show actual task names and improved subtitle copy

Agent task management is now more efficient with better filtering, archival capabilities, and comprehensive metadata visibility.

Data Management & Analysis

Dataset & Transform Operations

Enhanced dataset search to query full dataset instead of only current page results
Added wildcard transform support in both UI and backend with visual configuration
Implemented recursive search for span selector with expanded attribute matching
Prevented transform deletion when dependent entities exist with clear user messaging
Added confirmation messages after creating dataset transforms from trace viewer

Trace & Span Analysis

Made traces table sort arrows functional across traces, spans, sessions, and users tables
Fixed trace count deduplication when filters are active for accurate totals
Displayed skipped evaluations in gray in trace viewer to distinguish from failures
Enhanced span dataset addition with expanded search capabilities across multiple attributes

Data analysis workflows are now more intuitive with functional sorting, accurate counts, and better visual indicators.
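The difference between searching only the current page and searching the full dataset comes down to the order of filtering and pagination. The Python sketch below illustrates this with hypothetical helper names and toy data; it does not reproduce the engine's implementation.

```python
from typing import Dict, List, Sequence

Row = Dict[str, str]


def search_rows(rows: Sequence[Row], query: str) -> List[Row]:
    """Case-insensitive substring match across every value of every row."""
    needle = query.lower()
    return [r for r in rows if any(needle in str(v).lower() for v in r.values())]


def paginate(rows: Sequence[Row], page: int, page_size: int) -> List[Row]:
    """Return one zero-indexed page of rows."""
    start = page * page_size
    return list(rows[start:start + page_size])


# Toy dataset: rows 0, 4, and 8 contain the word "refund".
dataset = [
    {"id": str(i), "text": "refund" if i % 4 == 0 else "other"}
    for i in range(10)
]

# Page first, then search: only matches on the visible page are found.
current_page_hits = search_rows(paginate(dataset, page=0, page_size=3), "refund")

# Search first, then page: every match in the dataset is reachable.
full_dataset_hits = paginate(search_rows(dataset, "refund"), page=0, page_size=3)
```

Filtering before paginating is what lets a search surface rows that sit on pages the user has not opened yet.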

Infrastructure & Performance

Security Updates

Updated pypdf to address multiple security vulnerabilities including infinite loop and memory exhaustion attacks
Updated NLTK to patch critical vulnerability in downloader component
Updated Flask to latest security release

Configuration & Optimization

Added thread pool configuration with proper defaults and environment variable support
Fixed GCP span kind storage issues in span_kind column
Improved prompt playground to fetch all paginated prompts instead of only first page
Enhanced error handling for Anthropic API compatibility requirements

These infrastructure improvements ensure better security posture and more reliable performance across the platform.
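A thread pool whose size comes from an environment variable with a safe default could look like the Python sketch below. The variable name `EXAMPLE_THREAD_POOL_SIZE` and the default of 8 are placeholders chosen for illustration; the release notes do not state the actual names or values.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Mapping, Optional

# Both values are hypothetical; the real variable name and default are
# not given in the release notes.
THREAD_POOL_ENV_VAR = "EXAMPLE_THREAD_POOL_SIZE"
DEFAULT_MAX_WORKERS = 8


def resolve_max_workers(env: Optional[Mapping[str, str]] = None) -> int:
    """Read the pool size from the environment, falling back to the default.

    Missing, non-numeric, or non-positive values all fall back to the default.
    """
    env = os.environ if env is None else env
    raw = env.get(THREAD_POOL_ENV_VAR)
    if raw is None:
        return DEFAULT_MAX_WORKERS
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_MAX_WORKERS
    return value if value > 0 else DEFAULT_MAX_WORKERS


# Size the pool from an explicit mapping here; pass nothing to use os.environ.
workers = resolve_max_workers({THREAD_POOL_ENV_VAR: "4"})
with ThreadPoolExecutor(max_workers=workers) as pool:
    squares = list(pool.map(lambda n: n * n, range(5)))
```

Falling back to the default on bad input keeps a typo in deployment configuration from preventing startup, at the cost of silently ignoring the bad value; logging a warning in that branch would be a reasonable addition.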

Release notes generated by Louisa


20 Feb 16:38

hashnadz

2.1.386

5d31d10

2026 February_B (2.1.386)

🚀 Arthur Engine Release

February 18, 2026

This release strengthens evaluation workflows, task visibility, dataset intelligence, and enterprise deployment reliability across environments.

Evaluation & Experiment Enhancements

Improved Evaluation Configuration

Added a dedicated Evals input field for clearer configuration
Introduced a new filtering mechanism for Continuous Evals
Updated Trace Viewer to use a unified filtering system
Fixed span selector navigation issues

These updates make evaluations easier to configure, filter, and debug with greater precision.

Notebook & Transform Improvements

Added request form parameter configuration within notebooks
Introduced a “Fill from Object” button to accelerate transform setup
Added dataset transform relevance insights
Expanded dataset tracking visibility
Fixed issues with new columns disappearing when applying defaults

These changes improve iteration speed when building transforms and ensure dataset changes behave predictably.

Task Visibility & Workflow Intelligence

Task ID Overview Dashboard

Introduced a dedicated Task ID Overview Dashboard
Added enhanced KPI visibility in All Tasks tile cards
Improved automatic task assignment with Service Name Mapping
Fixed navbar and configuration edge cases

You now have clearer operational insight into task performance and ownership.

User Experience Improvements

Added system preferences for dark and light mode
Fixed tabs visibility when data is available
General UI refinements and stability improvements

These updates improve usability and consistency across the platform.

Deployment & Infrastructure Enhancements

GCP & Kubernetes Improvements

Improved GCP deployment to load GenAI models via GCS bucket mounting
Fixed MODEL_STORAGE_PATH handling across K8s and GCP model uploads
Added GCP environment variables and service account support to Helm chart
Improved container execution to support non-root environments
Resolved GCS prefix and model directory configuration conflicts
Fixed offline model upload issues

Arthur Engine is now more reliable across Cloud Run, Kubernetes, Helm-based deployments, and airgapped environments.

Connector & Data Support

Added IDA token exchange support for Databricks connector
Added support for datasets without time columns

These additions improve flexibility for enterprise data environments.

Agent Discovery Improvements

Added polling mechanism for Agent Discovery

Polling enhances real-time visibility into deployed agents and improves discovery responsiveness.

Security & Stability Improvements

Security update for Axios
Authentication library updates
Cache and dependency security patches
Oracle driver updated to modern oracledb
Multiple AWS SDK, React, and platform dependency upgrades
Stability fixes across configuration, trace filtering, and merge edge cases

These changes improve security posture, dependency health, and operational stability.

Observability & Documentation

Updated OpenTelemetry documentation
Improved README guidance
Added Claude Code GitHub workflow for CI improvements

What This Means for Customers

Stronger evaluation control
Improved filtering, clearer inputs, and better trace navigation reduce debugging time.
Better operational visibility
The new Task ID dashboard and enhanced KPIs provide clearer insight into task performance and ownership.
More reliable enterprise deployment
GCP, Kubernetes, Helm, and connector improvements strengthen production readiness.


17 Feb 19:50

hashnadz

2.1.355

eff619d

2026 February (2.1.355)

🚀 Arthur Engine Release

January 26 – February 5, 2026

This release significantly expands experimentation, trace visibility, model provider support, and deployment flexibility across the Agent Development Lifecycle.

Agent Experiments & RAG Evaluation

Agent Experiments

Introduced Agent Experiments with UI enhancements
Added configurable Session ID support for reproducible experiments
Added ability to overwrite dataset rows with experiment results
Added bulk column updates for datasets
Added JSON validation for experiment inputs
Added visual readiness indicators and unsaved changes prompts
Fixed prompt experiment stability issues and 5XX errors

You can now iterate on agent behavior with stronger controls, cleaner experiment flows, and more reliable execution.
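JSON validation of experiment inputs, as mentioned in the list above, might resemble the following Python sketch. The function name is hypothetical, and the requirement that the top-level value be an object is an assumption made for this example.

```python
import json
from typing import Optional, Tuple


def validate_json_input(raw: str) -> Tuple[bool, Optional[str]]:
    """Return (True, None) for a valid JSON object, else (False, message)."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        # lineno, colno, and msg are standard JSONDecodeError attributes.
        return False, f"Invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
    # Assumption for this sketch: inputs must be a JSON object, not a list or scalar.
    if not isinstance(parsed, dict):
        return False, "Expected a JSON object at the top level"
    return True, None


ok_result = validate_json_input('{"question": "What is the refund policy?"}')
bad_result = validate_json_input('{"question": }')
```

Returning a message with the line and column gives a form enough detail to point the user at the offending character before the experiment is launched.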

RAG Experiments & Notebooks

Added RAG notebooks and agentic notebooks
Improved RAG configuration empty states
Added test case outcome visibility for RAG experiments
Adjusted RAG panel run conditions with clearer status indicators
Fixed RAG experiment bugs and filtering issues

These improvements make it easier to design, test, and debug retrieval-based agents with clearer experiment results and stronger UX.

Trace Visibility & Debugging

Introduced a new Trace Viewer experience
Added new trace table for improved inspection
Added span status badges for quick failure identification
Improved span filtering and metadata filtering
Highlighted token counts and surfaced cost visibility
Fixed onboarding trace display and filtering issues
Added analytics instrumentation across tracing workflows

You now have clearer insight into what your agents are doing, how much they cost, and where failures occur.

Expanded Model Provider Support

Added Vertex AI provider support
Added AWS Bedrock provider support
Enabled vLLM as a model provider
Fixed Gemini chat completion issues
Made Bedrock fields optional for more flexible configuration
Improved provider handling and authentication flows

Arthur now supports a broader set of enterprise model backends with more reliable execution and configuration flexibility.

Model Upload & Deployment Enhancements

Added GCP model upload workflow with CI/CD support
Added OpenShift PVC version of model upload job
Improved Docker image tagging for genai-engine-models
Added option to skip downloading all models
Fixed duplicate model download logs
Enabled airgapped deployments for Gliner model loading

These updates improve deployment flexibility across cloud, Kubernetes, and airgapped enterprise environments.

Data, Connectors & Transform Improvements

Added CSV loading support for bucket-based connectors
Fixed parquet time type filtering in bucket connectors
Added Databricks connector support
Added transform table pagination
Removed 10-item limit in transform list
Added ability to map transform variables to evaluation variables
Continuous Evals now error on missing variables but skip when spans are missing

These changes improve reliability and flexibility when preparing data for experiments and evaluations.
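The continuous-eval rule in the list above (error on missing variables, skip on missing spans) can be sketched in Python as follows. The `EvalOutcome` enum, the function name, and the span representation are all hypothetical and only illustrate the distinction.

```python
from enum import Enum
from typing import Dict, Iterable, List, Optional, Tuple


class EvalOutcome(Enum):
    SKIPPED = "skipped"  # nothing to evaluate
    ERROR = "error"      # misconfiguration that needs attention
    READY = "ready"      # all inputs present


def check_eval_inputs(
    span: Optional[Dict[str, object]],
    required_variables: Iterable[str],
) -> Tuple[EvalOutcome, List[str]]:
    """Classify an evaluation attempt and list any missing variables."""
    if span is None:
        # A missing span is treated as "not applicable", not as a failure.
        return EvalOutcome.SKIPPED, []
    missing = [name for name in required_variables if name not in span]
    if missing:
        # The span exists but lacks required variables: surface an error.
        return EvalOutcome.ERROR, missing
    return EvalOutcome.READY, []


no_span = check_eval_inputs(None, ["question", "answer"])
missing_var = check_eval_inputs({"question": "hi"}, ["question", "answer"])
complete = check_eval_inputs({"question": "hi", "answer": "hello"}, ["question", "answer"])
```

Separating the two cases keeps traces that simply lack the targeted span from showing up as failures, while a variable mapping that cannot be resolved is still reported.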

Synthetic Data & Dataset Enhancements

Added synthetic data generation workflow
Added default dataset column values
Enabled dataset row overwrites from experiment outputs
Improved bulk editing capabilities
Fixed export bugs

This makes dataset management more powerful for evaluation, experimentation, and test case iteration.

Agent Discovery & Governance Foundations

Introduced early Agent Discovery service
Added agent metadata support
Added agentic annotation analytics endpoint

These capabilities lay the groundwork for better agent inventory management and governance across environments.

Security & Stability Improvements

Security updates to pypdf and python-multipart
Stability improvements across experiment execution, segmentation, filtering, and span metadata
Code quality improvements including stronger type validation and frontend validation workflows
Dependency upgrades across AWS SDK, React, Axios, TanStack, Material UI, and related libraries

These changes improve reliability, security posture, and overall system stability.

UX & Interface Improvements

Updated top header in Tasks view for better consistency
Improved evaluator detail view layout
Updated favicon
Improved skipped state visualization
Improved RAG experiment UX feedback and indicators

Cleaner workflows reduce friction across experimentation and debugging.

What This Means for You

Experiment with confidence
Agent Experiments and RAG workflows are more stable, configurable, and transparent.
Understand behavior and cost
Improved trace inspection, token visibility, and span diagnostics give you deeper insight into agent execution.
Deploy flexibly across environments
Expanded provider support, GCP uploads, OpenShift jobs, and airgap compatibility improve enterprise readiness.


14 Jan 17:40

madeleinelane

2.1.286

3f4c461

2026 January_A (2.1.286)

Enhancements:

Users can now configure where GenAI models are sourced from, enabling models to be pulled from an approved, customer-managed repository instead of the public Hugging Face Hub.
Metrics can now be segmented by user ID and conversation ID for more granular analysis.
Enhanced ODBC Connector Support: Improved handling of database views, more reliable primary key detection, and configurable connection and login timeouts.
Improved GenAI model bootstrapping reliability.


05 Dec 19:29

madeleinelane

2.1.237

2d02522

2025 December_A (2.1.237)

New Features:

Test & Preview Custom Metrics Before Saving: Users can now validate custom metrics directly within the creation and editing workflow by running the metric against available datasets to preview results and confirm the logic behaves as expected before saving.

Bug fixes:

Custom metrics:
- Sketch metrics can now be created and calculated without specifying any dimension columns.
- The frontend no longer overwrites user-defined metadata for reported metrics.


21 Nov 00:53

madeleinelane

2.1.209

6f53ff3

2025 November_B (2.1.209)

Bug Fix/Enhancements:

Fixed an issue where some metrics were missing from the selection list for custom datasets.
Increased the ML engine aggregation timeout to support segmentation of larger and more complex datasets.


06 Nov 18:15

madeleinelane

2.1.135

7a42eb0

2025 November_A (2.1.135)

Enhancements

Enhanced the PII detection model to improve date/time identification.
Updated the Docker configuration to use Postgres version 15, ensuring compatibility and preventing initialization errors during new engine setup.


15 Oct 13:49

madeleinelane

2.1.94

7109a20

2025 October_B (2.1.94)

Enhancements:

Updated telemetry ORM models and migrations to enforce non-null timestamps.
Improved pagination handling for MSSQL.
Added status_code and session_id to spans.


07 Oct 19:31

madeleinelane

2.1.93

dee7b10

2025 October_A (2.1.93)

New Features

Custom Metrics: You can now define and manage custom metrics using SQL. Custom metrics can be reused across models and projects, and integrate seamlessly with dashboards, alerts, and queries in the Arthur platform. Versioning ensures you can update metric logic while preserving historical data accuracy.

Enhancements

Agent Trace Viewer: Improved filters — users can now filter by metric evaluation results, span type, and more.
Snowflake Connector: Added support for selecting Snowflake as a data source in the connector workflow.
Added support for creating custom metrics on data with nested columns.
GenAI Engine now runs as a non-root user.


Releases: arthur-ai/arthur-engine
