Releases: arthur-ai/arthur-engine
2.1.477
🚀 Arthur Engine Release
March 23, 2026
This release delivers significant improvements to observability and experiment management, introducing the Arthur Observability SDK v1.0, advanced trace filtering capabilities, and streamlined experiment creation workflows for enhanced developer productivity.
Arthur Observability SDK Launch
Python SDK for LLM Tracing
- Launched Arthur Observability SDK v1.0, a comprehensive Python package that automatically instruments 33 AI frameworks including OpenAI, Anthropic, LangChain, LlamaIndex, and CrewAI
- Added server-side prompt management with centralized prompt versioning and rendering directly through the Arthur GenAI Engine
- Implemented zero-configuration auto-instrumentation using OpenTelemetry standards with optional dependencies for frameworks you actually use
- Enabled session and user context propagation across all traces for request-level debugging and analytics
The Arthur Observability SDK eliminates the complexity of LLM application monitoring by providing automatic tracing across your entire AI stack with a single installation, while offering integrated prompt management that treats prompts as versioned code artifacts.
Trace Visibility & Analysis Enhancements
Advanced Filtering and Sorting
- Added span count and token count filtering to help users find traces matching specific complexity or resource usage criteria
- Implemented comprehensive sorting capabilities for both traces and spans across multiple attributes
- Improved trace navigation efficiency for large datasets with range-based filtering options
Display and User Experience Improvements
- Fixed annotation explanations displaying readable text instead of "[object Object]" in trace interfaces
- Enhanced latency contrast in trace span details headers with color-coded duration badges for better accessibility
- Resolved stale chunk loading failures when opening span drawers after deployments
- Corrected LLM prompt token counts to include cached tokens, fixing wildly inaccurate counts for heavily cached requests
These improvements significantly enhance the ability to navigate and analyze large trace datasets, reducing debugging time and enabling more targeted performance analysis.
Experiment & Notebook Experience
Streamlined Experiment Creation
- Introduced wizard-based experiment creation modal with step-by-step guidance, form validation, and success feedback
- Added auto-selection of single available prompt versions to eliminate unnecessary manual selection steps
- Implemented pagination in version selector for projects with numerous prompt versions
Unified Notebook Interface
- Standardized consistent interaction patterns across Prompt, RAG, and Agent notebook types
- Added inline rename functionality and unified table actions (Launch, View last run, Delete) across all notebook types
- Fixed back navigation issues ensuring users return to expected locations in their workflow
These changes create a more intuitive and efficient experiment setup process while eliminating interface inconsistencies that previously required learning different interaction patterns for each notebook type.
User Experience & Personalization
Timezone and Time Format Preferences
- Added user settings modal allowing customization of timezone and time format preferences (12-hour or 24-hour)
- Implemented persistent preference storage with settings maintained across browser sessions
- Applied consistent timestamp formatting throughout all application interfaces including tasks, experiments, datasets, and traces
Users can now personalize how timestamps appear throughout the application, providing a consistent experience tailored to individual preferences and eliminating timezone conversion friction.
Deployment & Infrastructure Enhancements
Mac and Local Development Support
- Added ARM64 CPU images for improved Docker deployment on Mac systems
- Fixed local deployment configuration issues including secret formatting and GPU worker tuning
- Resolved ECS alarm sensitivity by adjusting CPU and memory monitoring to avoid false alerts during normal hourly job bursts
Claude Code Integration
- Integrated Claude Code observability instrumentation using hook-based tracing to ship Claude Code sessions as OpenInference traces
These infrastructure improvements enhance developer experience for local development and ensure more reliable cloud deployments with appropriate monitoring thresholds.
Release notes generated by Louisa
2.1.456
🚀 Arthur Engine Release
March 12, 2026
This release delivers a comprehensive UI modernization, enhanced evaluation workflows, and improved agent task management alongside critical security updates and performance optimizations.
User Experience & Interface Enhancements
Navigation Consolidation
- Unified all major product areas into streamlined tabbed interfaces, replacing scattered navigation with intuitive single-entry points
- Consolidated RAG functionality into unified navigation with Notebooks, Experiments, and Configurations tabs
- Merged Prompt capabilities into single entry point with Prompts, Notebooks, and Experiments tabs
- Combined Evaluate features into tabbed view with Evals Management, Continuous Evals, and Results
- Simplified Test section by merging Agentic Experiments and Agentic Notebooks into unified interface
- Moved global settings (Model Providers, API Keys) from task sidebar to dedicated settings gear menu
Dark Mode & Theme Improvements
- Fixed dark mode contrast issues across all UI components and standardized table styling consistency
- Replaced all Tailwind color classes with MUI theme colors for automatic dark mode support
- Converted native HTML elements to MUI components for better accessibility and consistent theming
The navigation redesign significantly reduces cognitive load while the theme improvements ensure a polished experience across all viewing modes.
Evaluation & Experiment Enhancements
Continuous Evaluation Workflows
- Added visual span selector for continuous eval creation, allowing users to select data directly from trace viewer instead of manual typing
- Introduced inline eval creation from trace viewer with side-by-side span inspection
- Added evaluate traces modal accessible directly from trace overview with streamlined creation flow
- Enabled submission of continuous evaluations without requiring description field
- Added notification system to prompt users for review when evaluation versions are upgraded
Experiment Management
- Improved experiment loading state derivation from experiment and test case status
- Added clickable trace ID links in agentic experiment test cases that open trace viewer in new tab
- Enhanced prompt experiment stability by converting to synchronous execution mode
These improvements streamline the evaluation creation process and provide better visibility into experiment progress.
Agent Task Management & Monitoring
Task Organization & Discovery
- Added comprehensive filter, sort, and visibility controls to All Tasks page with activity window filters
- Implemented task archival and unarchival capabilities with proper rule and metrics handling
- Enhanced task metadata with enriched agent information including tools, sub-agents, models, and infrastructure
- Added global polling system for agent tasks with proper duplicate execution prevention
- Improved trace count accuracy in agent discovery with synchronous execution support
Task Interface Improvements
- Merged task details into overview page as modal dialog for streamlined task management
- Standardized page headers, action buttons, and empty states across all task views
- Updated task navigation to show actual task names and improved subtitle copy
Agent task management is now more efficient with better filtering, archival capabilities, and comprehensive metadata visibility.
Data Management & Analysis
Dataset & Transform Operations
- Enhanced dataset search to query full dataset instead of only current page results
- Added wildcard transform support in both UI and backend with visual configuration
- Implemented recursive search for span selector with expanded attribute matching
- Prevented transform deletion when dependent entities exist with clear user messaging
- Added confirmation messages after creating dataset transforms from trace viewer
Trace & Span Analysis
- Made traces table sort arrows functional across traces, spans, sessions, and users tables
- Fixed trace count deduplication when filters are active for accurate totals
- Displayed skipped evaluations in gray in trace viewer to distinguish from failures
- Enhanced span dataset addition with expanded search capabilities across multiple attributes
Data analysis workflows are now more intuitive with functional sorting, accurate counts, and better visual indicators.
Infrastructure & Performance
Security Updates
- Updated pypdf to address multiple security vulnerabilities including infinite loop and memory exhaustion attacks
- Updated NLTK to patch critical vulnerability in downloader component
- Updated Flask to latest security release
Configuration & Optimization
- Added thread pool configuration with proper defaults and environment variable support
- Fixed GCP span kind storage issues in span_kind column
- Improved prompt playground to fetch all paginated prompts instead of only first page
- Enhanced error handling for Anthropic API compatibility requirements
These infrastructure improvements ensure better security posture and more reliable performance across the platform.
Release notes generated by Louisa
2026 February_B (2.1.386)
🚀 Arthur Engine Release
February 18, 2026
This release strengthens evaluation workflows, task visibility, dataset intelligence, and enterprise deployment reliability across environments.
Evaluation & Experiment Enhancements
Improved Evaluation Configuration
- Added a dedicated Evals input field for clearer configuration
- Introduced a new filtering mechanism for Continuous Evals
- Updated Trace Viewer to use a unified filtering system
- Fixed span selector navigation issues
These updates make evaluations easier to configure, filter, and debug with greater precision.
Notebook & Transform Improvements
- Added request form parameter configuration within notebooks
- Introduced a “Fill from Object” button to accelerate transform setup
- Added dataset transform relevance insights
- Expanded dataset tracking visibility
- Fixed issues with new columns disappearing when applying defaults
Improves iteration speed when building transforms and ensures dataset changes behave predictably.
Task Visibility & Workflow Intelligence
Task ID Overview Dashboard
- Introduced a dedicated Task ID Overview Dashboard
- Added enhanced KPI visibility in All Tasks tile cards
- Improved automatic task assignment with Service Name Mapping
- Fixed navbar and configuration edge cases
You now have clearer operational insight into task performance and ownership.
User Experience Improvements
- Added system preferences for dark and light mode
- Fixed tabs visibility when data is available
- General UI refinements and stability improvements
These updates improve usability and consistency across the platform.
Deployment & Infrastructure Enhancements
GCP & Kubernetes Improvements
- Improved GCP deployment to load GenAI models via GCS bucket mounting
- Fixed MODEL_STORAGE_PATH handling across K8s and GCP model uploads
- Added GCP environment variables and service account support to Helm chart
- Improved container execution to support non-root environments
- Resolved GCS prefix and model directory configuration conflicts
- Fixed offline model upload issues
Arthur Engine is now more reliable across Cloud Run, Kubernetes, Helm-based deployments, and airgapped environments.
Connector & Data Support
- Added IDA token exchange support for Databricks connector
- Added support for datasets without time columns
Improves flexibility for enterprise data environments.
Agent Discovery Improvements
- Added polling mechanism for Agent Discovery
Enhances real-time visibility into deployed agents and improves discovery responsiveness.
Security & Stability Improvements
- Security update for Axios
- Authentication library updates
- Cache and dependency security patches
- Oracle driver updated to modern
oracledb - Multiple AWS SDK, React, and platform dependency upgrades
- Stability fixes across configuration, trace filtering, and merge edge cases
These changes improve security posture, dependency health, and operational stability.
Observability & Documentation
- Updated OpenTelemetry documentation
- Improved README guidance
- Added Claude Code GitHub workflow for CI improvements
What This Means for Customers
-
Stronger evaluation control
Improved filtering, clearer inputs, and better trace navigation reduce debugging time. -
Better operational visibility
The new Task ID dashboard and enhanced KPIs provide clearer insight into task performance and ownership. -
More reliable enterprise deployment
GCP, Kubernetes, Helm, and connector improvements strengthen production readiness.
2026 February (2.1.355)
🚀 Arthur Engine Release
January 26 – February 5, 2026
This release significantly expands experimentation, trace visibility, model provider support, and deployment flexibility across the Agent Development Lifecycle.
Agent Experiments & RAG Evaluation
Agent Experiments
- Introduced Agent Experiments with UI enhancements
- Added configurable Session ID support for reproducible experiments
- Added ability to overwrite dataset rows with experiment results
- Added bulk column updates for datasets
- Added JSON validation for experiment inputs
- Added visual readiness indicators and unsaved changes prompts
- Fixed prompt experiment stability issues and 5XX errors
You can now iterate on agent behavior with stronger controls, cleaner experiment flows, and more reliable execution.
RAG Experiments & Notebooks
- Added RAG notebooks and agentic notebooks
- Improved RAG configuration empty states
- Added test case outcome visibility for RAG experiments
- Adjusted RAG panel run conditions with clearer status indicators
- Fixed RAG experiment bugs and filtering issues
These improvements make it easier to design, test, and debug retrieval-based agents with clearer experiment results and stronger UX.
Trace Visibility & Debugging
- Introduced a new Trace Viewer experience
- Added new trace table for improved inspection
- Added span status badges for quick failure identification
- Improved span filtering and metadata filtering
- Highlighted token counts and surfaced cost visibility
- Fixed onboarding trace display and filtering issues
- Added analytics instrumentation across tracing workflows
You now have clearer insight into what your agents are doing, how much they cost, and where failures occur.
Expanded Model Provider Support
- Added Vertex AI provider support
- Added AWS Bedrock provider support
- Enabled vLLM as a model provider
- Fixed Gemini chat completion issues
- Made Bedrock fields optional for more flexible configuration
- Improved provider handling and authentication flows
Arthur now supports a broader set of enterprise model backends with more reliable execution and configuration flexibility.
Model Upload & Deployment Enhancements
- Added GCP model upload workflow with CI/CD support
- Added OpenShift PVC version of model upload job
- Improved Docker image tagging for genai-engine-models
- Added option to skip downloading all models
- Fixed duplicate model download logs
- Enabled airgapped deployments for Gliner model loading
These updates improve deployment flexibility across cloud, Kubernetes, and airgapped enterprise environments.
Data, Connectors & Transform Improvements
- Added CSV loading support for bucket-based connectors
- Fixed parquet time type filtering in bucket connectors
- Added Databricks connector support
- Added transform table pagination
- Removed 10-item limit in transform list
- Added ability to map transform variables to evaluation variables
- Continuous Evals now error on missing variables but skip when spans are missing
Improves reliability and flexibility when preparing data for experiments and evaluations.
Synthetic Data & Dataset Enhancements
- Added synthetic data generation workflow
- Added default dataset column values
- Enabled dataset row overwrites from experiment outputs
- Improved bulk editing capabilities
- Fixed export bugs
This makes dataset management more powerful for evaluation, experimentation, and test case iteration.
Agent Discovery & Governance Foundations
- Introduced early Agent Discovery service
- Added agent metadata support
- Added agentic annotation analytics endpoint
These capabilities lay the groundwork for better agent inventory management and governance across environments.
Security & Stability Improvements
- Security updates to
pypdfandpython-multipart - Stability improvements across experiment execution, segmentation, filtering, and span metadata
- Code quality improvements including stronger type validation and frontend validation workflows
- Dependency upgrades across AWS SDK, React, Axios, TanStack, Material UI, and related libraries
These changes improve reliability, security posture, and overall system stability.
UX & Interface Improvements
- Updated top header in Tasks view for better consistency
- Improved evaluator detail view layout
- Updated favicon
- Improved skipped state visualization
- Improved RAG experiment UX feedback and indicators
Cleaner workflows reduce friction across experimentation and debugging.
What This Means for You
-
Experiment with confidence
Agent Experiments and RAG workflows are more stable, configurable, and transparent. -
Understand behavior and cost
Improved trace inspection, token visibility, and span diagnostics give you deeper insight into agent execution. -
Deploy flexibly across environments
Expanded provider support, GCP uploads, OpenShift jobs, and airgap compatibility improve enterprise readiness.
2026 January_A (2.1.286)
Enhancements:
- Users can now configure where GenAI models are sourced from, enabling models to be pulled from an approved, customer-managed repository instead of the public Hugging Face Hub.
- Metrics can now be segmented by user ID and conversation ID for more granular analysis.
- Enhanced ODBC Connector Support: Improved handling of database views, more reliable primary key detection, and configurable connection and login timeouts.
- Improved GenAI model bootstrapping reliability.
2025 December_A (2.1.237)
New Features:
- Test & Preview Custom Metrics Before Saving: Users can now validate their custom metrics directly within the creation and editing workflow. Users can run the metric against available datasets to preview results and confirm the logic behaves as expected before saving.
Bug fixes:
- Custom metrics:
- Sketch metrics can now be created and calculated without specifying any dimension columns.
- Frontend No Longer Overwrites User-Defined Metadata for Reported Metrics.
2025 November_B (2.1.209)
Bug Fix/Enhancements:
- Fixed an issue where some metrics were missing from the selection list for custom datasets.
- Increase ML engine aggregation timeout to support segmentation of larger & more complex datasets.
2025 November_A (2.1.135)
Enhancements
- Made enhancements to PII detection model to improve date/time identification.
- Docker configuration has been updated to use Postgres version 15, ensuring compatibility & preventing initialization errors during new engine setup.
2025 October_B (2.1.94)
Enhancements:
- Updated telemetry ORM models, update migrations to enforce non-null timestamps.
- Improved pagination handling for MSSQL.
- Added
status_codeandsession_idto spans.
2025 October_A (2.1.93)
New Features
- Custom Metrics: You can now define and manage custom metrics using SQL. Custom metrics can be reused across models and projects, and integrate seamlessly with dashboards, alerts, and queries in the Arthur platform. Versioning ensures you can update metric logic while preserving historical data accuracy. [Learn more]
Enhancements
- Agent Trace Viewer: Improved filters — users can now filter by metric evaluation results, span type, and more.
- Snowflake Connector: Added support for selecting Snowflake as a data source in the connector workflow.
- Added support for creating custom metrics on data with nested columns.
- GenAI Engine now runs as a non-root user.