Merged
62 commits
4ad793f
feat(regionserver): add graceful shutdown configuration
razvan Oct 2, 2024
cb232df
Make UnifiedRoleConfiguration a sub-trait of Send
razvan Oct 2, 2024
dea179d
Replace trait with enum.
razvan Oct 2, 2024
eecaf23
implement region mover command
razvan Oct 2, 2024
0b14f92
fix: crd field names
razvan Oct 14, 2024
71793ea
unit tests and shell escaping
razvan Oct 14, 2024
1644aff
update docs
razvan Oct 14, 2024
1903f36
spelling
razvan Oct 14, 2024
5e8201f
cargo update
razvan Oct 14, 2024
e76166a
added shutdown test & hbase-entrypoint.sh
razvan Oct 16, 2024
3c63da1
cleanup and set region mover opts env var
razvan Oct 17, 2024
69a6f49
main merge
razvan Oct 17, 2024
8dbde9b
first successful integration test
razvan Oct 17, 2024
68756ab
main merge
razvan Oct 17, 2024
43abf6d
fix image pull policy for the kerberos tests
razvan Oct 17, 2024
2b6e89b
add RUN_REGION_MOVER env var
razvan Oct 17, 2024
c53497a
Update docs/modules/hbase/pages/usage-guide/operations/graceful-shutd…
razvan Oct 17, 2024
4e31a3c
remove trailing whitespace in docs
razvan Oct 17, 2024
a10caa0
rust : remove unused dep
razvan Oct 17, 2024
f42ab05
fix shellcheck lint
razvan Oct 17, 2024
0e9e37e
update shutdown test and run it successfuly
razvan Oct 18, 2024
c2c92c5
update docs
razvan Oct 18, 2024
8d7265e
Update rust/crd/src/lib.rs
razvan Oct 18, 2024
28a1395
fix const arithmetic
razvan Oct 18, 2024
f059e7f
switch to LazyLock
razvan Oct 18, 2024
67f3f1b
configure gracefulShutdownTimeout in (almost) all tests
razvan Oct 18, 2024
7e118ab
region mover args
razvan Oct 21, 2024
34a5ddb
Merge branch 'main' into feat/region-mover
razvan Oct 23, 2024
f9a769b
Update CHANGELOG.md
razvan Oct 23, 2024
420ba36
Update rust/crd/src/lib.rs
razvan Oct 24, 2024
2b0d63b
Update rust/crd/src/lib.rs
razvan Oct 24, 2024
5d5d5e9
Update rust/crd/src/lib.rs
razvan Oct 24, 2024
228ad4f
Update docs/modules/hbase/pages/usage-guide/operations/graceful-shutd…
razvan Oct 24, 2024
039c22a
Update docs/modules/hbase/pages/usage-guide/operations/graceful-shutd…
razvan Oct 24, 2024
60b9dc8
Update docs/modules/hbase/pages/usage-guide/operations/graceful-shutd…
razvan Oct 24, 2024
fd8331e
Update docs/modules/hbase/pages/usage-guide/operations/graceful-shutd…
razvan Oct 24, 2024
5378f11
Update rust/crd/src/lib.rs
razvan Oct 24, 2024
7b08a26
main merge
razvan Oct 25, 2024
6f087db
note on constant paths and the entrypoint script
razvan Oct 25, 2024
0f32e59
remove unnecessary configOverrides
razvan Oct 25, 2024
109e877
wip: use Fragment for the RegionMover
razvan Oct 25, 2024
05f4303
fix crd generation
razvan Oct 25, 2024
19fed55
test: fail if the regionmover fails (only with 2.6)
razvan Oct 28, 2024
8a8d26a
refactor to reduce (some) duplication
razvan Oct 28, 2024
e0aaa27
tests: use dev images
razvan Oct 28, 2024
eb52267
feat: remove hard-coded cluster.local from the domain name
razvan Oct 29, 2024
c051fb5
main merge
razvan Oct 29, 2024
40ae497
Merge branch 'main' into feat/region-mover
razvan Oct 29, 2024
d6d5fe4
fix: RegionMover fields should not be Optional
razvan Oct 30, 2024
fa239e5
main merge
razvan Jan 15, 2025
cb76f4e
add STACKABLE_LOG_DIR env var
razvan Jan 15, 2025
e86b446
ref introduce const CONTAINERDEBUG_LOG_DIRECTORY
razvan Jan 17, 2025
cd22ba8
main merge
razvan Jan 30, 2025
ab00b89
make shutdown test more resilient
razvan Jan 31, 2025
19db6a9
main merge
razvan Feb 3, 2025
a6facad
Merge branch 'main' into feat/region-mover
razvan Feb 3, 2025
6cbe265
tmp test def
razvan Feb 3, 2025
7c9a5bd
update rustfmt
razvan Feb 4, 2025
2123fbc
main merge
razvan Feb 4, 2025
ace488a
Update tests/templates/kuttl/shutdown/30-install-hbase.yaml.j2
razvan Feb 5, 2025
ca3f734
revert test definition
razvan Feb 5, 2025
76001c5
update changelog
razvan Feb 5, 2025
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -26,12 +26,17 @@

## [24.11.1] - 2025-01-09

### Changed

- Support moving regions to other Pods during graceful shutdown of region servers ([#570]).

### Fixed

- BREAKING: Use distinct ServiceAccounts for the Stacklets, so that multiple Stacklets can be
deployed in one namespace. Existing Stacklets will use the newly created ServiceAccounts after
restart ([#594]).

[#570]: https://github.com/stackabletech/hbase-operator/pull/570
[#594]: https://github.com/stackabletech/hbase-operator/pull/594

## [24.11.0] - 2024-11-18
44 changes: 25 additions & 19 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

23 changes: 19 additions & 4 deletions Cargo.nix

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions Cargo.toml
@@ -21,6 +21,7 @@ rstest = "0.24"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
serde_yaml = "0.9"
shell-escape = "0.1"
snafu = "0.8"
stackable-operator = { git = "https://github.com/stackabletech/operator-rs.git", tag = "stackable-operator-0.85.0" }
product-config = { git = "https://github.com/stackabletech/product-config.git", tag = "0.7.0" }
62 changes: 62 additions & 0 deletions deploy/helm/hbase-operator/crds/crds.yaml
@@ -688,6 +688,9 @@ spec:
description: Time period Pods have to gracefully shut down, e.g. `30m`, `1h` or `2d`. Consult the operator documentation for details.
nullable: true
type: string
hbaseOpts:
nullable: true
type: string
Comment on lines +691 to +693
I think this should not be added again (applies to all roles).
#620 removed it, can you please remove it again?

hbaseRootdir:
nullable: true
type: string
@@ -775,6 +778,34 @@ spec:
nullable: true
type: boolean
type: object
regionMover:
  default:
    ack: null
    maxThreads: null
    runBeforeShutdown: null
  description: Before terminating a region server pod, the RegionMover tool can be invoked to transfer local regions to other servers. This may cause a lot of network traffic in the Kubernetes cluster if the entire HBase stacklet is being restarted. The operator will compute a timeout period for the region move that will not exceed the graceful shutdown timeout.
  properties:
    ack:
      description: If enabled (default), the region mover will confirm that regions are available on the source as well as the target pods before and after the move.
      nullable: true
      type: boolean
    additionalMoverOptions:
      default: []
      description: Additional options to pass to the region mover.
      items:
        type: string
      type: array
    maxThreads:
      description: Maximum number of threads to use for moving regions.
      format: uint16
      minimum: 0.0
      nullable: true
      type: integer
    runBeforeShutdown:
      description: Move local regions to other servers before terminating a region server's pod.
      nullable: true
      type: boolean
  type: object
requestedSecretLifetime:
description: Request secret (currently only autoTls certificates) lifetime from the secret operator, e.g. `7d`, or `30d`. Please note that this can be shortened by the `maxCertificateLifetime` setting on the SecretClass issuing the TLS certificate.
nullable: true
@@ -938,6 +969,9 @@ spec:
description: Time period Pods have to gracefully shut down, e.g. `30m`, `1h` or `2d`. Consult the operator documentation for details.
nullable: true
type: string
hbaseOpts:
nullable: true
type: string
hbaseRootdir:
nullable: true
type: string
@@ -1025,6 +1059,34 @@ spec:
nullable: true
type: boolean
type: object
regionMover:
  default:
    ack: null
    maxThreads: null
    runBeforeShutdown: null
  description: Before terminating a region server pod, the RegionMover tool can be invoked to transfer local regions to other servers. This may cause a lot of network traffic in the Kubernetes cluster if the entire HBase stacklet is being restarted. The operator will compute a timeout period for the region move that will not exceed the graceful shutdown timeout.
  properties:
    ack:
      description: If enabled (default), the region mover will confirm that regions are available on the source as well as the target pods before and after the move.
      nullable: true
      type: boolean
    additionalMoverOptions:
      default: []
      description: Additional options to pass to the region mover.
      items:
        type: string
      type: array
    maxThreads:
      description: Maximum number of threads to use for moving regions.
      format: uint16
      minimum: 0.0
      nullable: true
      type: integer
    runBeforeShutdown:
      description: Move local regions to other servers before terminating a region server's pod.
      nullable: true
      type: boolean
  type: object
requestedSecretLifetime:
description: Request secret (currently only autoTls certificates) lifetime from the secret operator, e.g. `7d`, or `30d`. Please note that this can be shortened by the `maxCertificateLifetime` setting on the SecretClass issuing the TLS certificate.
nullable: true
@@ -1,6 +1,6 @@
= Graceful shutdown

-You can configure the graceful shutdown as described in xref:concepts:operations/graceful_shutdown.adoc[].
+You can configure the graceful shutdown grace period as described in xref:concepts:operations/graceful_shutdown.adoc[].

== Masters

@@ -15,7 +15,7 @@ However, there is no message in the log acknowledging the graceful shutdown.

== RegionServers

-As a default, RegionServers have `60 minutes` to shut down gracefully.
+By default, RegionServers have `60 minutes` to shut down gracefully.

They use the same mechanism described above.
In contrast to the Master servers, they will, however, acknowledge the graceful shutdown with a message in the logs:
@@ -26,6 +26,61 @@ In contrast to the Master servers, they will, however, acknowledge the graceful
2023-10-11 12:38:05,060 INFO [shutdown-hook-0] regionserver.HRegionServer: ***** STOPPING region server 'test-hbase-regionserver-default-0.test-hbase-regionserver-default.kuttl-test-topical-parakeet.svc.cluster.local,16020,1697027870348' *****
----

The operator allows for finer control over the shutdown process of region servers.
For each region server pod, the region mover tool can be invoked before the pod is terminated.
The affected regions are transferred to other pods, ensuring that the data remains available.

Here is an example:

[source,yaml]
----
spec:
  regionServers:
    config:
      regionMover:
        runBeforeShutdown: true # <1>
        maxThreads: 5 # <2>
        ack: false # <3>
        additionalMoverOptions: ["--designatedfile", "/path/to/designatedFile"] # <4>
----
<1> Run the region mover tool before shutting down the region server. Default is `false`.
<2> Maximum number of threads to use for moving regions. Default is `1`.
<3> Enable or disable confirmation that regions are available on the source and target servers. Default is `true`.
<4> Extra options to pass to the region mover tool.
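
As a rough illustration of how these settings could translate into a RegionMover invocation, here is a minimal Rust sketch. It is not the operator's actual code: `RegionMoverConfig`, `shell_quote`, and `region_mover_command` are hypothetical names, the host and timeout arguments are placeholders, and the quoting helper is only a stand-in for a crate such as `shell-escape`. The defaults follow the callouts above, and the flags are the ones listed in the tool's `--help` output.

```rust
/// Hypothetical mirror of the `regionMover` fields (not the operator's real type).
struct RegionMoverConfig {
    run_before_shutdown: bool,
    max_threads: u16,
    ack: bool,
    additional_mover_options: Vec<String>,
}

impl Default for RegionMoverConfig {
    /// Defaults as documented: mover disabled, one thread, ack enabled.
    fn default() -> Self {
        Self {
            run_before_shutdown: false,
            max_threads: 1,
            ack: true,
            additional_mover_options: Vec::new(),
        }
    }
}

/// Quote a single argument for a POSIX shell.
/// Assumption: a simplified stand-in for a crate such as `shell-escape`.
fn shell_quote(arg: &str) -> String {
    let safe = !arg.is_empty()
        && arg
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || "-_./:=,".contains(c));
    if safe {
        arg.to_string()
    } else {
        // Close the quote, emit an escaped single quote, reopen the quote.
        format!("'{}'", arg.replace('\'', r"'\''"))
    }
}

/// Build the RegionMover command line, or `None` when the mover is disabled.
fn region_mover_command(cfg: &RegionMoverConfig, host: &str, timeout_secs: u64) -> Option<String> {
    if !cfg.run_before_shutdown {
        return None;
    }
    let mut args: Vec<String> = vec![
        "/stackable/hbase/bin/hbase".into(),
        "org.apache.hadoop.hbase.util.RegionMover".into(),
        "--regionserverhost".into(),
        host.into(),
        "--operation".into(),
        "unload".into(),
        "--maxthreads".into(),
        cfg.max_threads.to_string(),
        "--timeout".into(),
        timeout_secs.to_string(),
    ];
    if !cfg.ack {
        // `ack: false` maps to the tool's No-Ack mode.
        args.push("--noack".into());
    }
    args.extend(cfg.additional_mover_options.iter().cloned());
    Some(args.iter().map(|a| shell_quote(a)).collect::<Vec<_>>().join(" "))
}

fn main() {
    let cfg = RegionMoverConfig {
        run_before_shutdown: true,
        max_threads: 5,
        ack: false,
        additional_mover_options: vec!["--designatedfile".into(), "/path/to/designatedFile".into()],
    };
    if let Some(cmd) = region_mover_command(&cfg, "localhost:16020", 3000) {
        println!("{cmd}");
    }
}
```

Quoting every user-supplied option before it is joined into a shell command keeps values with spaces or quotes from being split or interpreted by the shell.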

For a list of additional options accepted by the region mover, run it with the `--help` option:

[source]
----
$ /stackable/hbase/bin/hbase org.apache.hadoop.hbase.util.RegionMover --help
usage: hbase org.apache.hadoop.hbase.util.RegionMover <options>
Options:
-r,--regionserverhost <arg> region server <hostname>|<hostname:port>
-o,--operation <arg> Expected: load/unload/unload_from_rack/isolate_regions
-m,--maxthreads <arg> Define the maximum number of threads to use to unload and reload the regions
-i,--isolateRegionIds <arg> Comma separated list of Region IDs hash to isolate on a RegionServer and put region
server in draining mode. This option should only be used with '-o isolate_regions'. By
putting region server in decommission/draining mode, master can't assign any new region
on this server. If one or more regions are not found OR failed to isolate successfully,
utility will exist without putting RS in draining/decommission mode. Ex.
--isolateRegionIds id1,id2,id3 OR -i id1,id2,id3
-x,--excludefile <arg> File with <hostname:port> per line to exclude as unload targets; default excludes only
target host; useful for rack decommisioning.
-d,--designatedfile <arg> File with <hostname:port> per line as unload targets;default is all online hosts
-f,--filename <arg> File to save regions list into unloading, or read from loading; default
/tmp/<usernamehostname:port>
-n,--noack Turn on No-Ack mode(default: false) which won't check if region is online on target
RegionServer, hence best effort. This is more performant in unloading and loading but
might lead to region being unavailable for some time till master reassigns it in case the
move failed
-t,--timeout <arg> timeout in seconds after which the tool will exit irrespective of whether it finished or
not;default Integer.MAX_VALUE
----

NOTE: There is no need to explicitly specify a timeout for the region movement. The operator will compute an appropriate timeout that cannot exceed the `gracefulShutdownTimeout` for region servers.
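
The following Rust sketch shows one way such a bounded timeout could be derived. It is an assumption for illustration only: the function name `region_mover_timeout` and the 60-second `SHUTDOWN_MARGIN` are hypothetical and are not taken from the operator. The only property it demonstrates is the one stated in the note, namely that the mover timeout never exceeds the graceful shutdown timeout.

```rust
use std::time::Duration;

/// Hypothetical margin reserved for the region server process itself to stop
/// after the region mover has finished. The real operator may use another value.
const SHUTDOWN_MARGIN: Duration = Duration::from_secs(60);

/// Derive the region mover timeout from the graceful shutdown timeout.
/// `saturating_sub` clamps at zero, so the result can never exceed the input.
fn region_mover_timeout(graceful_shutdown_timeout: Duration) -> Duration {
    graceful_shutdown_timeout.saturating_sub(SHUTDOWN_MARGIN)
}

fn main() {
    // 60 minutes is the documented default for RegionServers.
    let graceful = Duration::from_secs(60 * 60);
    let mover = region_mover_timeout(graceful);
    // The value would be passed to the tool's `--timeout` option, in seconds.
    println!("--timeout {}", mover.as_secs());
}
```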

IMPORTANT: ZooKeeper must remain reachable for as long as the region mover is running, otherwise the graceful shutdown process cannot succeed.

== RestServers

As a default, RestServers have `5 minutes` to shut down gracefully.
1 change: 1 addition & 0 deletions rust/crd/Cargo.toml
@@ -12,6 +12,7 @@ publish = false
product-config.workspace = true
serde.workspace = true
serde_json.workspace = true
shell-escape.workspace = true
snafu.workspace = true
stackable-operator.workspace = true
strum.workspace = true