Releases: ai-dynamo/grove
v0.1.0-alpha.6
What's Changed
- fix startup ordering checks to match implementation by @gflarity in #323
- Auto-MNNVL: add MNNVL configuration and startup validation by @shmuel-runai in #346
- only warn about MTU/PMTU issues by @gflarity in #360
- add tests for updateObservedGeneration functions by @gflarity in #354
- add command for creating k3d debug cluster identical to one used in E2E by @gflarity in #361
- retry cluster creation when it fails by @gflarity in #327
- fix typo in SO1 and SO2 tests by @gflarity in #365
- Introduce GREP template and a refactored TAS GREP by @unmarshall in #362
- Danbar/e2e rolling update 10 by @danbar2 in #356
- test: add TAS simple level and constraint tests by @Ronkahn21 in #349
- Auto-MNNVL: Update PCS's webhook support auto-mnnvl by @shmuel-runai in #370
- Auto-MNNVL: add ComputeDomain component for PCS controller by @shmuel-runai in #363
- doc: Add pod-naming and environment-variables docs by @nvrohanv in #355
- Fix tas label by @kangclzjc in #380
- run e2e only if code changed and build pass by @danbar2 in #381
- fix: use patch for topology label replay on replaced nodes by @Ronkahn21 in #384
- Auto-MNNVL: add PodSpec injection for MNNVL resourceClaims by @shmuel-runai in #385
- Gather all operator logs on e2e test failure by @gflarity in #358
- Enable ai-dynamo copy-pr-bot for Grove by @sanjaychatterjee in #382
- Auto-MNNVL: use correct ComputeDomain CRD field paths by @shmuel-runai in #391
- Support external certificate management for webhooks by @gflarity in #344
- Auto-MNNVL: validate annotation values and sync design doc by @shmuel-runai in #386
Full Changelog: v0.1.0-alpha.5...v0.1.0-alpha.6
v0.1.0-alpha.5
What's Changed
- skip patch ObservedGeneration if no change by @xulinfei1996 in #337
- handle clean up failures better by @gflarity in #326
- fix: add PCS topology constraints to scaled PodGangs by @Ronkahn21 in #347
- fix: correct PCSG topology constraint handling for scaled PodGangs by @Ronkahn21 in #357
- test: add TAS e2e test infrastructure and basic tests by @Ronkahn21 in #348
New Contributors
- @xulinfei1996 made their first contribution in #337
Full Changelog: v0.1.0-alpha.4...v0.1.0-alpha.5
v0.1.0-alpha.4
What's Changed
- document internal/utils by @gflarity in #211
- document internal/logger and internal/utils by @gflarity in #208
- document internal/controller by @gflarity in #205
- Changes for migration from @NVIDIA to @ai-dynamo by @renormalize in #225
- Introduce badges in README.md. by @renormalize in #227
- Remove Ask DeepWiki badge from README and add Go report badge by @unmarshall in #234
- bump CRD_REF_DOCS_VERSION by @gflarity in #232
- E2E Test Foundations by @gflarity in #207
- Grove proposal/topology by @Ronkahn21 in #224
- api: add Topology aware support by @Ronkahn21 in #235
- Bump github.com/docker/docker from 28.2.2+incompatible to 28.3.3+incompatible in /operator by @dependabot[bot] in #237
- added missed PR feedback by @gflarity in #238
- check for existing cluster and delete if it already exists by @gflarity in #239
- test coverage for internal/logger by @gflarity in #229
- Add Core Concepts Tutorial by @nvrohanv in #217
- Update Grove discord link with a permanent link. by @renormalize in #249
- Bump github.com/containerd/containerd from 1.7.28 to 1.7.29 in /operator by @dependabot[bot] in #253
- Fixed documentation links and formatting in README and installation by @sanjaychatterjee in #250
- test coverage for internal/webhooks by @gflarity in #230
- test coverage for internal/utils by @gflarity in #231
- Disallow reducing PodCliqueSetTemplateSpec.PodCliqueScalingGroupConfig.Replicas to 0. by @renormalize in #256
- add support for prepulling images to speed up tests on slow networks by @gflarity in #241
- Fix indentation in docs/designs/topology.md. by @renormalize in #257
- Feat/Topology Configuration Infrastructure by @Ronkahn21 in #247
- Add validation webhook for ClusterTopology resource by @shmuel-runai in #251
- Dependency version upgrades and fixes by @unmarshall in #263
- fix(charts): correct webhook configuration scope and metadata by @shmuel-runai in #258
- docs: update topology configuration naming in design doc by @Ronkahn21 in #266
- Bump golang.org/x/crypto from 0.44.0 to 0.45.0 in /operator by @dependabot[bot] in #268
- e2e tests gang scheduling by @gflarity in #242
- prepend a g to the GitHub hash in package version to avoid semver issues by @gflarity in #272
- Update Go image in Dockerfile to 1.25.3, and other tools. by @renormalize in #254
- E2E tests for startup ordering by @gflarity in #269
- ci: add e2e tests to GitHub Actions by @shmuel-runai in #282
- E2E: Fix flaky Helm installation failures due to "cannot re-use a nam… by @shmuel-runai in #284
- improve test coverage for internal/controller by @gflarity in #252
- Stabilize E2E Tests by @gflarity in #287
- New TAS Design by @Ronkahn21 in #288
- feat: chart add ns by @ls-2018 in #290
- Cleaned and updated indirect go mod deps by @unmarshall in #295
- feat: remove useless branch condition by @ls-2018 in #289
- Feat/create cluster topology and KAI topology by @Ronkahn21 in #298
- Api/add name to topology constraint group by @Ronkahn21 in #299
- add mnnvl requirements GREP file by @danbar2 in #296
- Remove YEAR in generated files, adhering to the community convention. by @renormalize in #307
- Added more code owners for Grove by @sanjaychatterjee in #306
- E2e tests rolling updates by @gflarity in #280
- Added new code owner for Grove by @sanjaychatterjee in #308
- E2E stability fixes by @gflarity in #312
- Reconcile PodClique TopologyConstraints by @unmarshall in #302
- Introduce validations for TopologyConstraints in PodCliqueSet by @unmarshall in #317
- MNNVL support design doc by @shmuel-runai in #297
- cancel stale E2E runs by @gflarity in #322
- Fix get selector labels for pod by @gflarity in #318
- E2E Failure Diagnostics by @gflarity in #314
- Fixes for topology aware scheduling validation webhook by @unmarshall in #324
- Fixes TopologyConstraints for scaled PodGangs by @unmarshall in #340
Full Changelog: https://github.com/ai-dynamo/grove/commits/v0.1.0-alpha.4
v0.1.0-alpha.3
What's Changed
- Gflarity/allow kai by @gflarity in #219
- Allow system:kube-controller-manager to update init container secret by @unmarshall in #221
- fixed the owner name to be lower case nvidia by @unmarshall in #222
- Disable Authorizer webhook by default. by @renormalize in #223
Full Changelog: v0.1.0-alpha.2...v0.1.0-alpha.3
v0.1.0-alpha.2
What's Changed
- Add an attribution file for all the licenses used in Grove by @sanjaychatterjee in #189
- Remove LastOperation from CRDs and restructure component operators by @unmarshall in #192
- Remove scheduler development doc by @sanjaychatterjee in #197
- Remove validation which prevents setting NodeSelector on PodSpec by @unmarshall in #203
- Increase Default Value of TerminationDelay by @nvrohanv in #199
- Add unit-tests for initc and improve in-line doc strings by @gflarity in #204
- remove redundancy in initial grove readme paragraphs by @nvrohanv in #213
- Remove deadlock when deploying PCS with ComputeDomain by @unmarshall in #215
- document in internal/webhooks by @gflarity in #210
- Introduce the Authorizer Webhook. by @renormalize in #214
- Rename leftover *podgangset*\.go to *podcliqueset*\.go from #186. by @renormalize in #216
Full Changelog: v0.1.0-alpha.1...v0.1.0-alpha.2
v0.1.0-alpha.1
What's Changed
- Skeleton code and scripts for grove operator by @unmarshall in #5
- Adding the operator config api which got overwritten by @unmarshall in #6
- update license files and headers by @dmitsh in #7
- adapted license header to include The Grove Authors by @unmarshall in #8
- Added Dockerfile, Skaffold, Helm Charts and other misc changes by @unmarshall in #13
- allow setting object meta in PodClique; fix typos in types.go by @dmitsh in #14
- Removed PodGang CRD by @unmarshall in #15
- implement podgangset validating webhook by @dmitsh in #9
- Bump github.com/opencontainers/runc from 1.1.13 to 1.1.14 in /scheduler-plugins by @dependabot[bot] in #3
- Bump golang.org/x/crypto from 0.24.0 to 0.31.0 in /scheduler-plugins by @dependabot[bot] in #16
- Small fixes to hack scripts directory by @unmarshall in #20
- API changes and changes to validating webhook by @unmarshall in #23
- simplify PodCliqueSpec by @dmitsh in #24
- Introduced scheduler-api, modified PodGangSet API, re-generated code by @unmarshall in #25
- Bump golang.org/x/net from 0.26.0 to 0.33.0 in /scheduler-plugins by @dependabot[bot] in #26
- update podgangset crd by @dmitsh in #28
- Add Mutating Webhooks for PodGangSet by @ritikasrivastava in #17
- Configuration and deployment of webhooks by @unmarshall in #29
- Sample NIM LLM deployment specs using LWS and Grove by @sanjaychatterjee in #30
- update API by @dmitsh in #31
- implement validation for update operation by @dmitsh in #27
- Adds skeleton reconciler code and minor modifications to API by @dmitsh in #21
- Fixes for defaulting and validating webhooks by @unmarshall in #32
- fixed typos by @dmitsh in #33
- Add Default webhook unit test by @ritikasrivastava in #35
- Introduces miscellaneous changes by @unmarshall in #38
- Fixed helm charts and added default for podclique reconciler by @unmarshall in #39
- Fixes API, controller-runtime manager scheme and charts by @unmarshall in #42
- Bump k8s.io/kubernetes from 1.31.1 to 1.31.6 in /scheduler-plugins by @dependabot[bot] in #34
- implement basic reconciliation loop by @dmitsh in #40
- Bump golang.org/x/net from 0.33.0 to 0.36.0 in /scheduler-plugins by @dependabot[bot] in #46
- Refactor PodGangSet reconciler by @unmarshall in #51
- fixed roles for events and fixed test by @unmarshall in #52
- implement pclq status update by @dmitsh in #48
- move pclq status update to reconciler by @dmitsh in #53
- Added validation for pclq metadata by @unmarshall in #54
- Added changes to the PodGang API spec by @unmarshall in #55
- Update scheduler API by @unmarshall in #56
- Added TerminationDelay to PodGangTemplateSpec by @unmarshall in #58
- Ritika/headlessservice by @ritikasrivastava in #50
- Bump golang.org/x/net from 0.34.0 to 0.36.0 in /operator by @dependabot[bot] in #47
- Renamed PodClique to PodGroup in scheduler-api by @sanjaychatterjee in #61
- Refactoring operator by @unmarshall in #62
- Added ServiceAccount, Role and RoleBinding components by @unmarshall in #63
- Bump golang.org/x/net from 0.34.0 to 0.36.0 in /scheduler-api by @dependabot[bot] in #59
- Add stub functions for Grove scheduler plugin by @sanjaychatterjee in #60
- Fix broken test, upgrade to golangci-lint@v2.1.1 and fix numerous lint errors, upgrade tool versions, remove hack/tools.go, etc. by @renormalize in #64
- Replaced types.NamespacedName with own NamespacedName by @unmarshall in #65
- Corrections to API by @unmarshall in #67
- Bump golang.org/x/net from 0.35.0 to 0.38.0 in /operator by @dependabot[bot] in #66
- Introduced scheduling policy configuration in PodGangTemplateSpec by @unmarshall in #69
- Introduces scheduler policy config and other changes in operator and scheduler API by @unmarshall in #71
- Fix broken build targets, ld-flags during docker builds, work-tree state during docker builds, etc. by @renormalize in #70
- Added test utilities and other misc changes by @unmarshall in #74
- Reorg scheduler-plugins dir by @sanjaychatterjee in #75
- Upgraded k8s dependencies by @unmarshall in #76
- Updated operator and scheduler-api by @unmarshall in #77
- Introduced PodCliqueScalingGroup and refactored API modules by @unmarshall in #78
- PodCliqueScalingGroupConfig enhancement and validations by @unmarshall in #79
- Introduce make targets and charts for the development, and deployment of grove-kube-scheduler. by @renormalize in #80
- Introduce the reset-scheduler target which resets the kube-scheduler running in kind with the default. by @renormalize in #82
- introducing generated scheduler client by @unmarshall in #81
- Corrections in PGS components by @unmarshall in #83
- MinReplicas moves out of AutoScalingConfig to PodCliqueSpec by @unmarshall in #84
- Enhancements to API and reconcilers by @unmarshall in #86
- Allow usage of hyphen in the pgs and pclq names by @unmarshall in #87
- Specify defaults using annotations, use validating functions exposed by apimachinery, etc. by @renormalize in #89
- Misc fixes and partial implementation of Pod component by @unmarshall in #90
- Add validation to ensure all PodSpecs specify the same schedulerName. by @renormalize in #91
- Implement the init container. by @renormalize in #88
- Fix scale-in issues when PodGangSet replicas are changed. by @renormalize in #94
- Update Readme by @nvrohanv in #93
- Enhance PGS and PCLQ reconcilers to support PodGang lifecycle management by @unmarshall in #95
- Docs and API updates by @unmarshall in #97
- Updated diagrams and docs by @unmarshall in #99
- Introduce docs/getting-started.md. by @renormalize in #100
- Refactor PodGangSet and PodGang APIs by @unmarshall in #101
- Add test and coverage targets to Makefile by @Ronkahn21 in #102
- Modify PodCliqueScalingGroup behavior to create new PodCliques for each replica by @unmarshall in #103
- Fix HPA selector labels for PodCliques created by PodCliqueScalingGroups. by @renormalize in #104
- Enable GitHub Actions by @renormalize in #106
- API changes for Gang termination by @unmarshall in #107
- Pod name validation pgs by @Ronkahn21 in #105
- fix: fix example by @julienmancuso in #108
- Correct example in docs/getting-started.md. by @renormalize in #110
- Bump k8s.io/kubernetes from 1.33.1 to 1.33.2 in /scheduler by @dependabot[bot] in #92
- Add validation target to Makefile and update build-and-test.yaml by @Ronkahn21 in #111
- Gang Termination by @unmarshall in #114
- Pod discovery env vars by @Ronkahn21 in #119
- update release schedule to reflect dynamo alignment by @nvrohanv in #120
- feat: Add replicas and minAvailable fields for PodCliquesScalingGroups by @julienmancuso in #116
- Integrate `grove-initc...