-
Notifications
You must be signed in to change notification settings - Fork 50
add preStop hook for autoscaling down #142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements a preStop hook for ingester pods to support graceful autoscaling down by allowing data to be properly flushed before pod termination. The key changes include adding a preStop lifecycle hook that disables the node, flushes data from memory to WAL, and waits for S3 synchronization, along with extending termination grace periods to accommodate the longer shutdown process.
- Adds preStop lifecycle hook to ingester pods for graceful shutdown during autoscaling
- Extends terminationGracePeriodSeconds from 30 seconds to 20 minutes for data flushing
- Minor documentation improvements to chart README
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
File | Description |
---|---|
charts/openobserve/values.yaml | Increases termination grace periods to allow time for data flushing |
charts/openobserve/templates/ingester-statefulset.yaml | Implements preStop hook with node disable, data flush, and wait logic |
charts/openobserve/README.md | Minor formatting improvements to headings |
charts/openobserve/values.yaml
Outdated
terminationGracePeriodSeconds: 1200 # 20 minutes for now, since we are using pre-stop hook to flush data andit takes up to 10 minutes to flush data to s3 | ||
livenessProbe: | ||
initialDelaySeconds: 10 | ||
periodSeconds: 10 | ||
timeoutSeconds: 5 | ||
successThreshold: 1 | ||
failureThreshold: 3 | ||
terminationGracePeriodSeconds: 30 | ||
terminationGracePeriodSeconds: 1200 # 20 minutes for now, since we are using pre-stop hook to flush data andit takes up to 10 minutes to flush data to s3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a spelling error in the comment: 'andit' should be 'and it'.
Copilot uses AI. Check for mistakes.
# Wait for 900 seconds after flush to ensure data is moved to s3 | ||
# 15 minutes for now, since file movement to s3 may take up to 10 minutes | ||
echo "Waiting 900 seconds to flush data..." | ||
sleep 900 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sleep duration of 900 seconds is a magic number that should be made configurable through values.yaml to allow for environment-specific tuning.
Copilot uses AI. Check for mistakes.
curl -X PUT "http://localhost:{{ .Values.config.ZO_HTTP_PORT }}/node/enable?value=false" \ | ||
-H "Authorization: Basic ${AUTH_HEADER}" | ||
|
||
# returns 200 if successful and "true" if the node is disabled | ||
|
||
# Flush all data from memory to WAL. This does not flush data from ingester to s3. | ||
echo "Flushing data from ingester..." | ||
curl -X PUT "http://localhost:{{ .Values.config.ZO_HTTP_PORT }}/node/flush" \ | ||
-H "Authorization: Basic ${AUTH_HEADER}" | ||
|
||
# returns 200 if successful and "true" if the node is flushed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The curl commands lack error handling. If the API calls fail, the script continues without knowing if the operations succeeded, which could lead to data loss.
curl -X PUT "http://localhost:{{ .Values.config.ZO_HTTP_PORT }}/node/enable?value=false" \ | |
-H "Authorization: Basic ${AUTH_HEADER}" | |
# returns 200 if successful and "true" if the node is disabled | |
# Flush all data from memory to WAL. This does not flush data from ingester to s3. | |
echo "Flushing data from ingester..." | |
curl -X PUT "http://localhost:{{ .Values.config.ZO_HTTP_PORT }}/node/flush" \ | |
-H "Authorization: Basic ${AUTH_HEADER}" | |
# returns 200 if successful and "true" if the node is flushed | |
RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" -X PUT "http://localhost:{{ .Values.config.ZO_HTTP_PORT }}/node/enable?value=false" \ | |
-H "Authorization: Basic ${AUTH_HEADER}") | |
if [ "$RESPONSE" -ne 200 ]; then | |
echo "Error: Failed to disable ingester node. HTTP response code: $RESPONSE" | |
exit 1 | |
fi | |
# Flush all data from memory to WAL. This does not flush data from ingester to s3. | |
echo "Flushing data from ingester..." | |
RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" -X PUT "http://localhost:{{ .Values.config.ZO_HTTP_PORT }}/node/flush" \ | |
-H "Authorization: Basic ${AUTH_HEADER}") | |
if [ "$RESPONSE" -ne 200 ]; then | |
echo "Error: Failed to flush data from ingester. HTTP response code: $RESPONSE" | |
exit 1 | |
fi |
Copilot uses AI. Check for mistakes.
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
add preStop hook for autoscaling down