Releases: lablup/backend.ai
25.14.0rc1
Breaking Changes
- Implement password hashing system with multiple algorithms (#5753)
Features
- Add data migration script from VFolder to RBAC tables (#5340)
- Migrate existing user/project records to RBAC data (#5417)
- Expand RBAC tables by adding `permission_groups` table to group permissions with the same target (#5465)
- Add `reservoir_registries` DB table, service, and CRUD GQL mutations (#5644)
- Add `artifact_registries` DB table to store common information of the various artifact registries (#5656)
- Implement Reservoir registry, and Sync APIs between Managers (#5660)
- Add config to set GPFS fileset name prefix (#5684)
- Add RBAC repository functions to manage scopes and entity DB records (#5699)
- Allow `__typename` type query for advanced GraphQL features (GQL Federation, `@connection` directive) by introducing custom introspection rule (#5705)
- Align the reported agent memory size so that the reserved memory absorbs tiny memory size deviation originating from various firmware/kernel settings and make the memory size of same-hardware agents consistent, preventing inadvertent resource allocation failures in large clusters (#5729)
- Set expiry to set records of Bgtask metadata ids (#5736)
- Add `defaultArtifactRegistry` GQL resolver to fetch the default artifact registry information (#5739)
- Add `Artifact`, `ArtifactRegistry` REST API (#5747)
- Add session mode (Client, Proxy) based error handling in FetchContextManager (#5774)
- Add VFolder delete Background Task (#5778)
- Apply cache layer for resource presets (#5781)
- Add PBKDF2-SHA3-256 password hashing implementation and update supported algorithms (#5785)
- Remove obsolete max slot limit validation during session creation (#5807)
- Implement health check functionality for route management (#5811)
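
The PBKDF2-SHA3-256 entry (#5785) names a standard key-derivation construction. As a rough illustration only, and not Backend.AI's actual implementation, the sketch below derives and verifies a hash with Python's standard `hashlib.pbkdf2_hmac`. The function names, the iteration count, the salt length, and the `algorithm$iterations$salt$digest` storage format are all assumptions made for this example, and SHA3 support in `pbkdf2_hmac` depends on the OpenSSL build that Python links against.

```python
import hashlib
import hmac
import secrets

# Hypothetical parameters chosen for illustration; not taken from Backend.AI.
ALGORITHM = "pbkdf2_sha3_256"
ITERATIONS = 100_000
SALT_BYTES = 16


def hash_password(password: str, iterations: int = ITERATIONS) -> str:
    """Return an encoded string: algorithm$iterations$salt_hex$digest_hex."""
    salt = secrets.token_bytes(SALT_BYTES)
    digest = hashlib.pbkdf2_hmac("sha3_256", password.encode("utf-8"), salt, iterations)
    return f"{ALGORITHM}${iterations}${salt.hex()}${digest.hex()}"


def verify_password(password: str, encoded: str) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    algorithm, iterations, salt_hex, digest_hex = encoded.split("$")
    if algorithm != ALGORITHM:
        return False
    digest = hashlib.pbkdf2_hmac(
        "sha3_256", password.encode("utf-8"), bytes.fromhex(salt_hex), int(iterations)
    )
    return hmac.compare_digest(digest.hex(), digest_hex)
```

Storing the algorithm name and iteration count next to the digest is one common way to let several hashing algorithms coexist, since each stored record states how it should be verified.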
Improvements
- Migrate legacy redis clients to valkey clients in App Proxy (#5741)
Fixes
- Add missing `restart: unless-stopped` policy to all services in docker compose file (#5694)
- Add missing session live stat update (#5762)
- Make overlay network creation idempotent so that retries caused by errors after this step do not end up in an infinite retry loop (#5765)
- Clean up dangling docker networks (#5770)
- Add missing update when app proxy registered endpoint (#5783)
- Restrict `limit` to `scan_artifacts` API (#5801)
- Update client SDK to reflect UUID-only restriction in session dependencies (#5809)
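
The idempotent overlay network creation fix (#5765) follows a general pattern: when a create call fails because the resource already exists, reuse it instead of failing again on every retry. The sketch below shows that pattern against an in-memory stand-in. `FakeNetworkBackend`, `NetworkExistsError`, and `ensure_network` are hypothetical names invented for this example and do not correspond to Backend.AI or Docker APIs.

```python
class NetworkExistsError(Exception):
    """Raised by the fake backend when the network name is already taken."""


class FakeNetworkBackend:
    """In-memory stand-in for a container runtime's network API (hypothetical)."""

    def __init__(self) -> None:
        self._networks: dict[str, str] = {}

    def create(self, name: str) -> str:
        if name in self._networks:
            raise NetworkExistsError(name)
        network_id = f"net-{len(self._networks) + 1}"
        self._networks[name] = network_id
        return network_id

    def lookup(self, name: str) -> str:
        return self._networks[name]


def ensure_network(backend: FakeNetworkBackend, name: str) -> str:
    """Create the network, or reuse it when an earlier attempt already made it."""
    try:
        return backend.create(name)
    except NetworkExistsError:
        return backend.lookup(name)
```

Calling `ensure_network` twice with the same name returns the same identifier, so a retry triggered by a later failure does not trip over the network left behind by the first attempt.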
Miscellaneous
- Refactor scheduler handlers: Split into individual files and create a base handler class (#5766)
Full Changelog
Check out the full changelog until this release (25.14.0rc1).
Full Commit Logs
Check out the full commit logs between release (25.13.4) and (25.14.0rc1).
25.13.4
Fixes
- Add missing scheduler options to AllowedScalingGroup and update related components (#5730)
Full Changelog
Check out the full changelog until this release (25.13.4).
Full Commit Logs
Check out the full commit logs between release (25.13.3) and (25.13.4).
25.13.3
Fixes
- Improve HTTP request proxying in the webserver to be transparent with content-encoding (#5709)
- Add null-user check in resource usage query (#5712)
- Ensure id parameter of chown function is an int (#5713)
- Refresh agent fields in kernel when rescheduling (#5717)
- Fix issue where App-Proxy failed to query worker circuits due to incorrect variable reference (#5718)
- Add missing network cleanup when creating overlay network (#5721)
Full Changelog
Check out the full changelog until this release (25.13.3).
Full Commit Logs
Check out the full commit logs between release (25.13.2) and (25.13.3).
25.13.2
Features
- The mouse-selected or copy-mode selected texts in the intrinsic ttyd app with tmux are now directly copied to the user-side clipboard, without needing to `set mouse=off` in the tmux session (#5688)
- Improve the manager CLI by switching the redis keys command to scan_iter (#5704)
Fixes
- Add missing all-smi manpage file in the wheel packages (#5685)
- Updated RedisProfileTarget to handle cases where 'addr' is missing or None in the input data, preventing errors during address parsing. (#5695)
- Fix a duplicate joins issue during serialization when using pydantic by removing the join filter from the TOMLStringListField's _transform method. (#5700)
- Fix coordinator not performing health check for all endpoints (#5702)
- Fix session creation failing with `not allowed scaling group` error (#5706)
- Enhance endpoint creation logic to update existing records and handle circuits (#5707)
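
The RedisProfileTarget fix (#5695) concerns input where `addr` is missing or `None`. A minimal sketch of that kind of defensive parsing is given below. `HostPort` and `parse_addr` are hypothetical names, and the `host:port` string format is an assumption; the real RedisProfileTarget may accept other shapes.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class HostPort:
    host: str
    port: int


def parse_addr(data: Mapping[str, Any]) -> Optional[HostPort]:
    """Return the parsed address, or None when 'addr' is absent or None."""
    raw = data.get("addr")
    if raw is None:
        return None
    # Split on the last colon so the port is always the final component.
    host, sep, port = str(raw).rpartition(":")
    if not sep or not host:
        raise ValueError(f"expected 'host:port', got {raw!r}")
    return HostPort(host, int(port))
```

Returning `None` for both the missing-key and explicit-`None` cases lets callers treat the address as optional without wrapping every lookup in error handling.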
Full Changelog
Check out the full changelog until this release (25.13.2).
Full Commit Logs
Check out the full commit logs between release (25.13.1) and (25.13.2).
25.13.1
Fixes
- Fix session ordering in session_pending_queue query resolver (#5682)
- Ensure redis address is nullable (#5683)
Full Changelog
Check out the full changelog until this release (25.13.1).
Full Commit Logs
Check out the full commit logs between release (25.13.0) and (25.13.1).
25.13.0
Features
- Introduce `strawberry`, and strawberry-based `ArtifactRegistry` GQL types (#5232)
- Add `ModelDeployment`, `ModelRevision` strawberry GQL types migrated from existing federated graphene schema (#5249)
- Open-source and integrate Backend.AI App Proxy into the main codebase (#5275)
- Add `storages` API to storage proxy (#5286)
- Add OpenTelemetry and service discovery configuration to appproxy (#5296)
- Implement connection monitoring and reconnection logic in ValkeyStandaloneClient (#5298)
- Implement Sokovan orchestrator architecture (#5361)
- Add `HuggingFace` scanner, and API to storage proxy (#5362)
- Split out container log processing to a more concrete `ValkeyContainerLogClient` (based on `ValkeyClient` with default behavior) and use a separate Redis instance dedicated for log streaming (#5375)
- Implement scheduling prioritizers (#5378)
- Add validators for scheduling (#5380)
- Ship `all-smi` so that users can execute it inside any session container (#5381)
- Implement sokovan scheduler agent selectors (#5383)
- Integrate Agent selector with allocator in sokovan orchestrator (#5393)
- Add `UserNode` as a field of `ComputeSessionNode` (#5403)
- Enhance Scheduler allocation logic and add comprehensive tests (#5404)
- Add allocation methods in scheduler repository (#5406)
- Add TTL support to Redis key operations in AppProxy (#5416)
- Unify separate GraphQL subgraph endpoints into single Apollo Router supergraph with web-server proxy integration to enable single endpoint access for clients (#5419)
- Integrate sokovan orchestrator in manager (#5421)
- Add `source` field to roles table to distinguish system-defined roles from custom-defined roles, enabling automatic permission grants for system roles when new entity types or operations are introduced (#5440)
- Add phase tracking in scheduling (#5441)
- Implement scheduler coordinator in sokovan orchestrator (#5455)
- Changed the behavior to terminate "terminating session" in batch processing (#5467)
- Implement session sweeping functionality and related handlers (#5485)
- Inject `storages` config to storage-proxy (#5491)
- Add `object_storages` table to DB (#5498)
- Add request_timeout configuration for Redis clients (#5502)
- Add decrement_keypair_concurrencies method and update session termination logic (#5504)
- Add `hugging_registries` DB table, and GQL schema (#5508)
- Replace the existing `ArtifactGroup` model with `Artifact`, and replace `Artifact` with `ArtifactRevision` (#5510)
- Integrate `Artifact` service to Manager (#5514)
- Add Valkey client for Background Task Manager (#5519)
- Improve `logging.BraceStyleAdapter` to support user-defined kwargs and access to `extra` data including contextual fields. (#5523)
- Add Background Task heartbeat loop to refresh TTL (#5531)
- Modify value reading to avoid cache-based scheduling (#5533)
- Implement scheduling controller (#5547)
- Implement kernel state engine (#5551)
- Add Background Task retry loop (#5555)
- Allow specifying multiple endpoint addresses in the etcd config (#5564)
- Update session limits to allow None and 0 as indicators for unlimited concurrent sessions (#5567)
- Add configuration option for Sokovan orchestrator usage (#5568)
- Implement health monitoring for scheduling operations (#5569)
- Enhance session management by adding checks for truly stuck pulling and creating sessions (#5570)
- Add Valkey Client TLS configuration (#5573)
- Implement Generalized pagination on Strawberry GQL API (#5575)
- Implement session transition hooks for various session types (#5579)
- Implement deployment management with Sokovan integration (#5580)
- Implement batch scheduling events and event propagation through Event Hub (#5589)
- Apply centralized distributed locking for Sokovan scheduling operations (#5592)
- Implement cache-through pattern for keypair concurrency management in SchedulerRepository (#5594)
- Apply READ COMMITTED isolation level for scheduler operations (#5600)
- Add Volume Pool field to `RootContext` of Storage-Proxy (#5603)
- Add Bgtask handler Registry (#5606)
- Implement Valkey-based leader election in manager (#5607)
- Apply retry feature to VFolder clone bgtask (#5611)
- Add `object_storage_meta` DB table for managing buckets (#5617)
- Add operation metrics observer for session termination tracking (#5623)
- Implement EventPropagatorMetricObserver for tracking event propagator metrics (#5630)
- Apply cache propagator when broadcasting scheduling event (#5638)
- Implement deployment controller and integrate with sokovan orchestrator (#5639)
- Added automated GraphQL supergraph generation using rover CLI to CI pipeline for improved schema management (#5645)
- Add `--wait` option to `backend.ai events` command for easier scripting and automation (#5650)
- Implement session wait logic in AgentRegistry for improved scheduling handling (#5659)
- Manage object storage buckets using `storage_namespace` (#5667)
- Add scheduling detail info for pending sessions (#5676)
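
The session limit change (#5567) treats both `None` and `0` as "unlimited". The check below is a minimal sketch of that rule. The function name `can_start_session` and its parameters are hypothetical and are not taken from the Backend.AI codebase.

```python
from typing import Optional


def can_start_session(active_sessions: int, max_concurrent: Optional[int]) -> bool:
    """Return True when one more session fits under the limit.

    Both None and 0 are interpreted as "no limit" in this sketch.
    """
    if max_concurrent is None or max_concurrent == 0:
        return True
    return active_sessions < max_concurrent
```

Handling the two sentinel values in one place keeps callers from having to distinguish an unset limit from an explicit zero.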
Fixes
- Correct the asyncio connection sharing pattern in alembic `env.py` so that we could use `alembic-rebase.py` script and other alembic-based automation seamlessly. (#5151)
- Use persistent `aiohttp.ClientSession` instances per route in App Proxy circuits to benefit from keep-alive connections and resource reuse (#5287)
- Add missing resolver of VFolder permissions field in Compute session node (#5322)
- Let `inspect.signature` handle stringified types generated by `__future__` annotations by setting the `eval_str` option to True (#5325)
- Handle None user when request context setup in auth middleware (#5327)
- Add missing database transaction retry logic when setting network ID of new sessions (#5329)
- Apply memoization to the scheduler plugin loaders to reduce runtime overheads when running the scheduler loop (#5342)
- Broken Agent, Webserver in HA development environment (#5343)
- Add missing components in HA development environment (#5345)
- Make `--log-level` and `--debug` flag behavior and description consistent across all `start-...
25.11.3
No significant changes.
Full Changelog
Check out the full changelog until this release (25.11.3).
Full Commit Logs
Check out the full commit logs between release (25.11.2) and (25.11.3).
24.09.12
Features
- Add expiration time to login history Redis keys to reduce Redis memory usage. (#4939)
- Built-in WSProxy exposes advertised address (#4975)
Fixes
- Status code is missing when the `Accept` header is not set to `application/json` in the wsproxy exception middleware (#4788)
- Fix Agent Memory plugin to handle multiple IO device stat (#4789)
- Fix invalid state error when setting kernel termination future (#4791)
- Fix wrong `Accept` Header on `HarborRegistryV2._process_oci_index()` (#4807)
- Prevent model service creation with project type vfolder (#4852)
- Handle `NoSuchProcess` properly when gathering process memory stat (#4922)
- Skip kernel destroy when agent shutdown (#4923)
- Check if Agent is daemon process before query docker netstat (#4929)
- Wrong indent in Agent container stat function (#4946)
- Calculate correct VFolder permissions when admins query (#4963)
- Fix issue preventing admins from leaving invited vfolders (#4964)
Full Changelog
Check out the full changelog until this release (24.09.12).
Full Commit Logs
Check out the full commit logs between release (24.09.11) and (24.09.12).
25.13.0rc1
Features
- Introduce `strawberry`, and strawberry-based `ArtifactRegistry` GQL types (#5232)
- Add `ModelDeployment`, `ModelRevision` strawberry GQL types migrated from existing federated graphene schema (#5249)
- Open-source and integrate Backend.AI App Proxy into the main codebase (#5275)
- Add `storages` API to storage proxy (#5286)
- Add OpenTelemetry and service discovery configuration to appproxy (#5296)
- Implement connection monitoring and reconnection logic in ValkeyStandaloneClient (#5298)
- Implement Sokovan orchestrator architecture (#5361)
- Implement scheduling prioritizers (#5378)
- Add validators for scheduling (#5380)
- Ship `all-smi` so that users can execute it inside any session container (#5381)
- Implement sokovan scheduler agent selectors (#5383)
- Integrate Agent selector with allocator in sokovan orchestrator (#5393)
- Add `UserNode` as a field of `ComputeSessionNode` (#5403)
- Enhance Scheduler allocation logic and add comprehensive tests (#5404)
- Add allocation methods in scheduler repository (#5406)
- Add TTL support to Redis key operations in AppProxy (#5416)
- Integrate sokovan orchestrator in manager (#5421)
Fixes
- Correct the asyncio connection sharing pattern in alembic `env.py` so that we could use `alembic-rebase.py` script and other alembic-based automation seamlessly. (#5151)
- Add missing resolver of VFolder permissions field in Compute session node (#5322)
- Let `inspect.signature` handle stringified types generated by `__future__` annotations by setting the `eval_str` option to True (#5325)
- Handle None user when request context setup in auth middleware (#5327)
- Add missing database transaction retry logic when setting network ID of new sessions (#5329)
- Apply memoization to the scheduler plugin loaders to reduce runtime overheads when running the scheduler loop (#5342)
- Broken Agent, Webserver in HA development environment (#5343)
- Add missing components in HA development environment (#5345)
- Make `--log-level` and `--debug` flag behavior and description consistent across all `start-server` commands (#5366)
- Defer imports in the CLI and server entrypoints to reduce CLI startup times and avoid unnecessary cross-component imports (#5372)
- Fix and improve the optimization of glob-based BUILD file scanning when loading CLI entrypoints, reducing the CLI command initialization latency by about 15% (e.g., 3.5 sec -> 3.0 sec) (#5377)
- Fix missing `event_logs` table creation when populating the database schema with `mgr schema oneshot`, which may have caused issues in fresh installations (#5391)
- Add Docker image rescan exception handling logic when the image config is `None` (#5394)
Miscellaneous
- Refactor the import structure for `RepositoryArgs` by moving it to a dedicated `ai.backend.manager.repositories.types` module (#5409)
Full Changelog
Check out the full changelog until this release (25.13.0rc1).
Full Commit Logs
Check out the full commit logs between release (25.12.1) and (25.13.0rc1).
25.12.1
Features
- Agent heartbeat handler queries Kernel ids instead of Agent id (#4766)
- Implement ActionValidator (#5244)
- Implement reconnection logic in ValkeySentinelClient (#5276)
Improvements
- Apply simple model query pattern for readability (#4767)
Fixes
- Fix model service creation failure when `service-definition.toml` is missing (#5264)
- Fix model service deletion failure for non super-admin users (#5266)
- Broken VFolder `Clone` service (#5269)
- Fixed a problem with deserializing dataclass (#5271)
- Fix broken VFolder `GetTaskLogs` service (#5272)
- Add missing TRACE log-level option in ai.backend.logging package (#5274)
- `status_data` not initialized properly when creating multi node session (#5280)
- Apply a workaround to avoid segfault upon fast termination of `mgr etcd` CLI commands that query and update etcd configurations (#5283)
Full Changelog
Check out the full changelog until this release (25.12.1).
Full Commit Logs
Check out the full commit logs between release (25.12.0) and (25.12.1).