Releases: arthur-ai/arthur-engine

2.1.477

23 Mar 21:27
24cdda8

🚀 Arthur Engine Release

March 23, 2026

This release improves observability and experiment management: it introduces the Arthur Observability SDK v1.0, adds trace filtering and sorting, and streamlines experiment creation.


Arthur Observability SDK Launch

Python SDK for LLM Tracing

  • Launched Arthur Observability SDK v1.0, a comprehensive Python package that automatically instruments 33 AI frameworks including OpenAI, Anthropic, LangChain, LlamaIndex, and CrewAI
  • Added server-side prompt management with centralized prompt versioning and rendering directly through the Arthur GenAI Engine
  • Implemented zero-configuration auto-instrumentation using OpenTelemetry standards, with optional dependencies so you install support only for the frameworks you actually use
  • Enabled session and user context propagation across all traces for request-level debugging and analytics

The Arthur Observability SDK eliminates the complexity of LLM application monitoring by providing automatic tracing across your entire AI stack with a single installation, while offering integrated prompt management that treats prompts as versioned code artifacts.
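
As a conceptual illustration of session and user context propagation, the sketch below uses Python's standard-library contextvars module rather than the Arthur Observability SDK itself. The names request_context and record_span, and the session_id and user_id fields, are hypothetical; they only show how context set once per request could be attached to every span recorded inside it.

```python
import contextvars
from contextlib import contextmanager

# Hypothetical context variable; not part of the Arthur Observability SDK.
_request_ctx = contextvars.ContextVar("request_ctx", default={})


@contextmanager
def request_context(session_id, user_id):
    """Attach session and user identifiers to everything recorded inside the block."""
    token = _request_ctx.set({"session_id": session_id, "user_id": user_id})
    try:
        yield
    finally:
        # Restore the previous context so identifiers do not leak across requests.
        _request_ctx.reset(token)


def record_span(name):
    """Return a span-like dict that inherits the current session and user context."""
    return {"name": name, **_request_ctx.get()}


with request_context(session_id="s-1", user_id="u-42"):
    spans = [record_span("retrieve"), record_span("generate")]
```

Every span created inside the with block carries the same identifiers, which is what makes request-level grouping possible; spans recorded outside the block carry none.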


Trace Visibility & Analysis Enhancements

Advanced Filtering and Sorting

  • Added span count and token count filtering to help users find traces matching specific complexity or resource usage criteria
  • Implemented comprehensive sorting capabilities for both traces and spans across multiple attributes
  • Improved trace navigation efficiency for large datasets with range-based filtering options

Display and User Experience Improvements

  • Fixed annotation explanations so they display readable text instead of "[object Object]" in trace interfaces
  • Enhanced latency contrast in trace span details headers with color-coded duration badges for better accessibility
  • Resolved stale chunk loading failures when opening span drawers after deployments
  • Corrected LLM prompt token counts to include cached tokens, fixing inaccurate counts for heavily cached requests

These improvements significantly enhance the ability to navigate and analyze large trace datasets, reducing debugging time and enabling more targeted performance analysis.
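
A minimal sketch of range-based filtering on span count and token count, with cached tokens included in the prompt total. The TraceSummary fields and the filter_traces parameters are assumptions made for illustration; they do not reflect the engine's actual schema or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceSummary:
    # Hypothetical fields for illustration only.
    trace_id: str
    span_count: int
    uncached_prompt_tokens: int
    cached_prompt_tokens: int
    completion_tokens: int

    @property
    def prompt_tokens(self) -> int:
        # Cached tokens count toward the prompt total.
        return self.uncached_prompt_tokens + self.cached_prompt_tokens

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


def in_range(value: int, low: Optional[int], high: Optional[int]) -> bool:
    """Inclusive range check; None leaves that bound open."""
    return (low is None or value >= low) and (high is None or value <= high)


def filter_traces(traces, min_spans=None, max_spans=None,
                  min_tokens=None, max_tokens=None):
    """Keep traces whose span count and total token count fall in the given ranges."""
    return [
        t for t in traces
        if in_range(t.span_count, min_spans, max_spans)
        and in_range(t.total_tokens, min_tokens, max_tokens)
    ]


traces = [
    TraceSummary("a", 3, 100, 0, 50),
    TraceSummary("b", 12, 200, 1800, 100),
    TraceSummary("c", 40, 500, 0, 300),
]
heavy = filter_traces(traces, min_spans=10, min_tokens=1000)
```

In this sample data, trace "b" passes the token threshold only because its 1800 cached tokens are counted; without them its total would be 300.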


Experiment & Notebook Experience

Streamlined Experiment Creation

  • Introduced wizard-based experiment creation modal with step-by-step guidance, form validation, and success feedback
  • Added auto-selection of single available prompt versions to eliminate unnecessary manual selection steps
  • Implemented pagination in version selector for projects with numerous prompt versions

Unified Notebook Interface

  • Standardized interaction patterns across Prompt, RAG, and Agent notebook types
  • Added inline rename functionality and unified table actions (Launch, View last run, Delete) across all notebook types
  • Fixed back-navigation issues so users return to the expected location in their workflow

These changes create a more intuitive and efficient experiment setup process while eliminating interface inconsistencies that previously required learning different interaction patterns for each notebook type.


User Experience & Personalization

Timezone and Time Format Preferences

  • Added user settings modal allowing customization of timezone and time format preferences (12-hour or 24-hour)
  • Implemented persistent preference storage with settings maintained across browser sessions
  • Applied consistent timestamp formatting throughout all application interfaces including tasks, experiments, datasets, and traces

Users can now personalize how timestamps appear throughout the application, providing a consistent experience tailored to individual preferences and eliminating timezone conversion friction.
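
The following sketch shows one way a stored UTC timestamp could be rendered according to a timezone and a 12-hour or 24-hour preference. The preference values "12h" and "24h" and the function name format_timestamp are hypothetical; the application's actual settings keys are not described in these notes.

```python
from datetime import datetime, timedelta, timezone, tzinfo

# Hypothetical preference values mapped to strftime patterns.
FORMATS = {
    "12h": "%Y-%m-%d %I:%M %p",
    "24h": "%Y-%m-%d %H:%M",
}


def format_timestamp(ts: datetime, tz: tzinfo, time_format: str = "24h") -> str:
    """Convert an aware timestamp to the preferred timezone and clock style."""
    return ts.astimezone(tz).strftime(FORMATS[time_format])


# A fixed UTC-4 offset keeps the example dependency-free. An IANA zone such as
# zoneinfo.ZoneInfo("America/New_York") can be passed instead where tz data is installed.
utc_minus_4 = timezone(timedelta(hours=-4))
stored = datetime(2026, 3, 23, 21, 27, tzinfo=timezone.utc)

as_24h = format_timestamp(stored, utc_minus_4, "24h")
as_12h = format_timestamp(stored, utc_minus_4, "12h")
```

Keeping stored timestamps in UTC and converting only at display time means a preference change never requires rewriting stored data.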


Deployment & Infrastructure Enhancements

Mac and Local Development Support

  • Added ARM64 CPU images for improved Docker deployment on Mac systems
  • Fixed local deployment configuration issues including secret formatting and GPU worker tuning
  • Reduced ECS alarm oversensitivity by adjusting CPU and memory monitoring thresholds to avoid false alerts during normal hourly job bursts

Claude Code Integration

  • Integrated Claude Code observability instrumentation using hook-based tracing to ship Claude Code sessions as OpenInference traces

These infrastructure improvements enhance developer experience for local development and ensure more reliable cloud deployments with appropriate monitoring thresholds.


Release notes generated by Louisa

2.1.456

11 Mar 20:25
898d5b3

🚀 Arthur Engine Release

March 12, 2026

This release delivers a comprehensive UI modernization, enhanced evaluation workflows, and improved agent task management alongside critical security updates and performance optimizations.


User Experience & Interface Enhancements

Navigation Consolidation

  • Unified all major product areas into streamlined tabbed interfaces, replacing scattered navigation with intuitive single-entry points
  • Consolidated RAG functionality into unified navigation with Notebooks, Experiments, and Configurations tabs
  • Merged Prompt capabilities into single entry point with Prompts, Notebooks, and Experiments tabs
  • Combined Evaluate features into tabbed view with Evals Management, Continuous Evals, and Results
  • Simplified Test section by merging Agentic Experiments and Agentic Notebooks into unified interface
  • Moved global settings (Model Providers, API Keys) from task sidebar to dedicated settings gear menu

Dark Mode & Theme Improvements

  • Fixed dark mode contrast issues across all UI components and standardized table styling consistency
  • Replaced all Tailwind color classes with MUI theme colors for automatic dark mode support
  • Converted native HTML elements to MUI components for better accessibility and consistent theming

The navigation redesign significantly reduces cognitive load while the theme improvements ensure a polished experience across all viewing modes.


Evaluation & Experiment Enhancements

Continuous Evaluation Workflows

  • Added visual span selector for continuous eval creation, allowing users to select data directly from trace viewer instead of manual typing
  • Introduced inline eval creation from trace viewer with side-by-side span inspection
  • Added evaluate traces modal accessible directly from trace overview with streamlined creation flow
  • Enabled submission of continuous evaluations without requiring description field
  • Added notification system to prompt users for review when evaluation versions are upgraded

Experiment Management

  • Improved experiment loading state derivation from experiment and test case status
  • Added clickable trace ID links in agentic experiment test cases that open trace viewer in new tab
  • Enhanced prompt experiment stability by converting to synchronous execution mode

These improvements streamline the evaluation creation process and provide better visibility into experiment progress.


Agent Task Management & Monitoring

Task Organization & Discovery

  • Added comprehensive filter, sort, and visibility controls to All Tasks page with activity window filters
  • Implemented task archival and unarchival capabilities with proper rule and metrics handling
  • Enhanced task metadata with enriched agent information including tools, sub-agents, models, and infrastructure
  • Added global polling system for agent tasks with proper duplicate execution prevention
  • Improved trace count accuracy in agent discovery with synchronous execution support

Task Interface Improvements

  • Merged task details into overview page as modal dialog for streamlined task management
  • Standardized page headers, action buttons, and empty states across all task views
  • Updated task navigation to show actual task names and improved subtitle copy

Agent task management is now more efficient with better filtering, archival capabilities, and comprehensive metadata visibility.


Data Management & Analysis

Dataset & Transform Operations

  • Enhanced dataset search to query full dataset instead of only current page results
  • Added wildcard transform support in both UI and backend with visual configuration
  • Implemented recursive search for span selector with expanded attribute matching
  • Prevented transform deletion when dependent entities exist with clear user messaging
  • Added confirmation messages after creating dataset transforms from trace viewer

Trace & Span Analysis

  • Made traces table sort arrows functional across traces, spans, sessions, and users tables
  • Fixed trace count deduplication when filters are active for accurate totals
  • Displayed skipped evaluations in gray in trace viewer to distinguish from failures
  • Enhanced span dataset addition with expanded search capabilities across multiple attributes

Data analysis workflows are now more intuitive with functional sorting, accurate counts, and better visual indicators.


Infrastructure & Performance

Security Updates

  • Updated pypdf to address multiple security vulnerabilities including infinite loop and memory exhaustion attacks
  • Updated NLTK to patch critical vulnerability in downloader component
  • Updated Flask to latest security release

Configuration & Optimization

  • Added thread pool configuration with proper defaults and environment variable support
  • Fixed GCP span kind storage issues in span_kind column
  • Improved prompt playground to fetch all paginated prompts instead of only first page
  • Enhanced error handling for Anthropic API compatibility requirements

These infrastructure improvements ensure better security posture and more reliable performance across the platform.
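
As a rough sketch of thread pool configuration with a default and an environment variable override: the variable name EXAMPLE_THREAD_POOL_WORKERS and the default of 8 are placeholders chosen for this example, not the engine's actual setting names or values.

```python
import os
from concurrent.futures import ThreadPoolExecutor

DEFAULT_WORKERS = 8  # assumed default for illustration


def thread_pool_size(env=os.environ, var="EXAMPLE_THREAD_POOL_WORKERS"):
    """Read the worker count from the environment, falling back to the default."""
    raw = env.get(var)
    if raw is None or not raw.strip():
        return DEFAULT_WORKERS
    try:
        value = int(raw)
    except ValueError:
        # Non-numeric values fall back rather than crash at startup.
        return DEFAULT_WORKERS
    return value if value > 0 else DEFAULT_WORKERS


def make_executor(env=os.environ):
    """Build a thread pool sized from the (possibly overridden) configuration."""
    return ThreadPoolExecutor(max_workers=thread_pool_size(env))
```

Falling back to the default on unset, empty, non-numeric, or non-positive values is one possible policy; a stricter deployment might prefer to fail fast on invalid input.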


Release notes generated by Louisa

2026 February_B (2.1.386)

20 Feb 16:38
5d31d10

🚀 Arthur Engine Release

February 18, 2026

This release strengthens evaluation workflows, task visibility, dataset intelligence, and enterprise deployment reliability across environments.


Evaluation & Experiment Enhancements

Improved Evaluation Configuration

  • Added a dedicated Evals input field for clearer configuration
  • Introduced a new filtering mechanism for Continuous Evals
  • Updated Trace Viewer to use a unified filtering system
  • Fixed span selector navigation issues

These updates make evaluations easier to configure, filter, and debug with greater precision.

Notebook & Transform Improvements

  • Added request form parameter configuration within notebooks
  • Introduced a “Fill from Object” button to accelerate transform setup
  • Added dataset transform relevance insights
  • Expanded dataset tracking visibility
  • Fixed issues with new columns disappearing when applying defaults

These updates improve iteration speed when building transforms and ensure dataset changes behave predictably.


Task Visibility & Workflow Intelligence

Task ID Overview Dashboard

  • Introduced a dedicated Task ID Overview Dashboard
  • Added enhanced KPI visibility in All Tasks tile cards
  • Improved automatic task assignment with Service Name Mapping
  • Fixed navbar and configuration edge cases

You now have clearer operational insight into task performance and ownership.


User Experience Improvements

  • Added system preferences for dark and light mode
  • Fixed tabs visibility when data is available
  • General UI refinements and stability improvements

These updates improve usability and consistency across the platform.


Deployment & Infrastructure Enhancements

GCP & Kubernetes Improvements

  • Improved GCP deployment to load GenAI models via GCS bucket mounting
  • Fixed MODEL_STORAGE_PATH handling across K8s and GCP model uploads
  • Added GCP environment variables and service account support to Helm chart
  • Improved container execution to support non-root environments
  • Resolved GCS prefix and model directory configuration conflicts
  • Fixed offline model upload issues

Arthur Engine is now more reliable across Cloud Run, Kubernetes, Helm-based deployments, and airgapped environments.

Connector & Data Support

  • Added IDA token exchange support for Databricks connector
  • Added support for datasets without time columns

These additions improve flexibility for enterprise data environments.


Agent Discovery Improvements

  • Added polling mechanism for Agent Discovery

This enhances real-time visibility into deployed agents and improves discovery responsiveness.


Security & Stability Improvements

  • Security update for Axios
  • Authentication library updates
  • Cache and dependency security patches
  • Oracle driver updated to modern oracledb
  • Multiple AWS SDK, React, and platform dependency upgrades
  • Stability fixes across configuration, trace filtering, and merge edge cases

These changes improve security posture, dependency health, and operational stability.


Observability & Documentation

  • Updated OpenTelemetry documentation
  • Improved README guidance
  • Added Claude Code GitHub workflow for CI improvements

What This Means for Customers

  1. Stronger evaluation control
    Improved filtering, clearer inputs, and better trace navigation reduce debugging time.
  2. Better operational visibility
    The new Task ID dashboard and enhanced KPIs provide clearer insight into task performance and ownership.
  3. More reliable enterprise deployment
    GCP, Kubernetes, Helm, and connector improvements strengthen production readiness.

2026 February (2.1.355)

17 Feb 19:50
eff619d

🚀 Arthur Engine Release

January 26 – February 5, 2026

This release significantly expands experimentation, trace visibility, model provider support, and deployment flexibility across the Agent Development Lifecycle.


Agent Experiments & RAG Evaluation

Agent Experiments

  • Introduced Agent Experiments with UI enhancements
  • Added configurable Session ID support for reproducible experiments
  • Added ability to overwrite dataset rows with experiment results
  • Added bulk column updates for datasets
  • Added JSON validation for experiment inputs
  • Added visual readiness indicators and unsaved changes prompts
  • Fixed prompt experiment stability issues and 5XX errors

You can now iterate on agent behavior with stronger controls, cleaner experiment flows, and more reliable execution.

RAG Experiments & Notebooks

  • Added RAG notebooks and agentic notebooks
  • Improved RAG configuration empty states
  • Added test case outcome visibility for RAG experiments
  • Adjusted RAG panel run conditions with clearer status indicators
  • Fixed RAG experiment bugs and filtering issues

These improvements make it easier to design, test, and debug retrieval-based agents with clearer experiment results and stronger UX.


Trace Visibility & Debugging

  • Introduced a new Trace Viewer experience
  • Added new trace table for improved inspection
  • Added span status badges for quick failure identification
  • Improved span filtering and metadata filtering
  • Highlighted token counts and surfaced cost visibility
  • Fixed onboarding trace display and filtering issues
  • Added analytics instrumentation across tracing workflows

You now have clearer insight into what your agents are doing, how much they cost, and where failures occur.


Expanded Model Provider Support

  • Added Vertex AI provider support
  • Added AWS Bedrock provider support
  • Enabled vLLM as a model provider
  • Fixed Gemini chat completion issues
  • Made Bedrock fields optional for more flexible configuration
  • Improved provider handling and authentication flows

Arthur now supports a broader set of enterprise model backends with more reliable execution and configuration flexibility.


Model Upload & Deployment Enhancements

  • Added GCP model upload workflow with CI/CD support
  • Added OpenShift PVC version of model upload job
  • Improved Docker image tagging for genai-engine-models
  • Added option to skip downloading all models
  • Fixed duplicate model download logs
  • Enabled airgapped deployments for Gliner model loading

These updates improve deployment flexibility across cloud, Kubernetes, and airgapped enterprise environments.


Data, Connectors & Transform Improvements

  • Added CSV loading support for bucket-based connectors
  • Fixed parquet time type filtering in bucket connectors
  • Added Databricks connector support
  • Added transform table pagination
  • Removed 10-item limit in transform list
  • Added ability to map transform variables to evaluation variables
  • Continuous Evals now error on missing variables but skip when spans are missing

These changes improve reliability and flexibility when preparing data for experiments and evaluations.
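
The behavior described for Continuous Evals (error on missing variables, skip when spans are missing) and the mapping of transform variables to evaluation variables can be sketched as follows. All names here, including resolve_eval_inputs, MissingVariableError, and the SKIPPED sentinel, are hypothetical and illustrate the rule rather than the engine's implementation.

```python
class MissingVariableError(ValueError):
    """Raised when a mapped variable is absent from the transform output."""


SKIPPED = "skipped"  # hypothetical sentinel for a skipped evaluation


def resolve_eval_inputs(spans, transform, variable_mapping):
    """Build evaluation inputs from spans.

    spans: list of span dicts selected for evaluation
    transform: callable that turns spans into a dict of transform variables
    variable_mapping: {evaluation_variable: transform_variable}
    """
    if not spans:
        # No matching spans: skip the evaluation instead of failing it.
        return SKIPPED
    extracted = transform(spans)
    missing = [tv for tv in variable_mapping.values() if tv not in extracted]
    if missing:
        # Spans exist but a required variable is absent: surface an error.
        raise MissingVariableError(f"missing transform variables: {sorted(missing)}")
    return {ev: extracted[tv] for ev, tv in variable_mapping.items()}


def example_transform(spans):
    return {"user_input": spans[0]["input"], "llm_output": spans[-1]["output"]}


mapping = {"question": "user_input", "answer": "llm_output"}
inputs = resolve_eval_inputs(
    [{"input": "hi", "output": "hello"}], example_transform, mapping
)
```

Separating the two cases keeps a trace that simply lacks the targeted spans from being reported as a failure, while a misconfigured mapping still fails visibly.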


Synthetic Data & Dataset Enhancements

  • Added synthetic data generation workflow
  • Added default dataset column values
  • Enabled dataset row overwrites from experiment outputs
  • Improved bulk editing capabilities
  • Fixed export bugs

This makes dataset management more powerful for evaluation, experimentation, and test case iteration.


Agent Discovery & Governance Foundations

  • Introduced early Agent Discovery service
  • Added agent metadata support
  • Added agentic annotation analytics endpoint

These capabilities lay the groundwork for better agent inventory management and governance across environments.


Security & Stability Improvements

  • Security updates to pypdf and python-multipart
  • Stability improvements across experiment execution, segmentation, filtering, and span metadata
  • Code quality improvements including stronger type validation and frontend validation workflows
  • Dependency upgrades across AWS SDK, React, Axios, TanStack, Material UI, and related libraries

These changes improve reliability, security posture, and overall system stability.


UX & Interface Improvements

  • Updated top header in Tasks view for better consistency
  • Improved evaluator detail view layout
  • Updated favicon
  • Improved skipped state visualization
  • Improved RAG experiment UX feedback and indicators

Cleaner workflows reduce friction across experimentation and debugging.


What This Means for You

  1. Experiment with confidence
    Agent Experiments and RAG workflows are more stable, configurable, and transparent.
  2. Understand behavior and cost
    Improved trace inspection, token visibility, and span diagnostics give you deeper insight into agent execution.
  3. Deploy flexibly across environments
    Expanded provider support, GCP uploads, OpenShift jobs, and airgap compatibility improve enterprise readiness.

2026 January_A (2.1.286)

14 Jan 17:40

Enhancements:

  • Users can now configure where GenAI models are sourced from, enabling models to be pulled from an approved, customer-managed repository instead of the public Hugging Face Hub.
  • Metrics can now be segmented by user ID and conversation ID for more granular analysis.
  • Enhanced ODBC Connector Support: Improved handling of database views, more reliable primary key detection, and configurable connection and login timeouts.
  • Improved GenAI model bootstrapping reliability.
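
To illustrate what segmenting a metric by user ID and conversation ID means, here is a small sketch that averages a value per (user_id, conversation_id) pair. The row layout, the latency_ms metric, and the segment_mean helper are invented for this example and are not the platform's data model.

```python
from collections import defaultdict


def segment_mean(rows, metric, keys=("user_id", "conversation_id")):
    """Average a metric within each segment defined by the given key columns."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[tuple(row[k] for k in keys)].append(row[metric])
    return {segment: sum(values) / len(values) for segment, values in buckets.items()}


rows = [
    {"user_id": "u1", "conversation_id": "c1", "latency_ms": 100},
    {"user_id": "u1", "conversation_id": "c1", "latency_ms": 300},
    {"user_id": "u1", "conversation_id": "c2", "latency_ms": 50},
    {"user_id": "u2", "conversation_id": "c3", "latency_ms": 80},
]
by_conversation = segment_mean(rows, "latency_ms")
by_user = segment_mean(rows, "latency_ms", keys=("user_id",))
```

Passing a shorter keys tuple, as in by_user, gives a coarser segmentation over the same rows.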

2025 December_A (2.1.237)

05 Dec 19:29
2d02522

New Features:

  • Test & Preview Custom Metrics Before Saving: Users can now validate their custom metrics directly within the creation and editing workflow. Users can run the metric against available datasets to preview results and confirm the logic behaves as expected before saving.

Bug fixes:

  • Custom metrics:
    • Sketch metrics can now be created and calculated without specifying any dimension columns.
    • The frontend no longer overwrites user-defined metadata for reported metrics.

2025 November_B (2.1.209)

21 Nov 00:53

Bug Fix/Enhancements:

  • Fixed an issue where some metrics were missing from the selection list for custom datasets.
  • Increased the ML engine aggregation timeout to support segmentation of larger and more complex datasets.

2025 November_A (2.1.135)

06 Nov 18:15
7a42eb0

Enhancements

  • Enhanced the PII detection model to improve date/time identification.
  • Docker configuration has been updated to use Postgres version 15, ensuring compatibility & preventing initialization errors during new engine setup.

2025 October_B (2.1.94)

15 Oct 13:49
7109a20

Enhancements:

  • Updated telemetry ORM models and migrations to enforce non-null timestamps.
  • Improved pagination handling for MSSQL.
  • Added status_code and session_id to spans.

2025 October_A (2.1.93)

07 Oct 19:31
dee7b10

New Features

  • Custom Metrics: You can now define and manage custom metrics using SQL. Custom metrics can be reused across models and projects, and integrate seamlessly with dashboards, alerts, and queries in the Arthur platform. Versioning ensures you can update metric logic while preserving historical data accuracy.

Enhancements

  • Agent Trace Viewer: Improved filters — users can now filter by metric evaluation results, span type, and more.
  • Snowflake Connector: Added support for selecting Snowflake as a data source in the connector workflow.
  • Added support for creating custom metrics on data with nested columns.
  • GenAI Engine now runs as a non-root user.