pipeline: outputs: opensearch: Document troubleshooting for index shard availability issues (#1160)
* pipeline: outputs: opensearch: document how to troubleshoot issues with OpenSearch cluster index shard availability
Signed-off-by: Adam DePollo <[email protected]>
* Edit slightly
Signed-off-by: Adam DePollo <[email protected]>
* Updated with feedback from maintainers
Signed-off-by: Adam DePollo <[email protected]>
* Add link to debug mode docs
Signed-off-by: Adam DePollo <[email protected]>
---------
Signed-off-by: Adam DePollo <[email protected]>
Co-authored-by: Adam DePollo <[email protected]>
pipeline/outputs/opensearch.md (+48 lines: 48 additions & 0 deletions)
aoss:UpdateIndex
aoss:WriteDocument
```

With data access permissions, IAM policies are not needed to access the collection.

### Issues with the OpenSearch cluster

Occasionally the Fluent Bit service may generate errors without any additional detail in the logs to explain the source of the issue, even with the service's `log_level` attribute set to [Debug](https://docs.fluentbit.io/manual/administration/configuring-fluent-bit/classic-mode/configuration-file).

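For reference, debug logging can be enabled in the `[SERVICE]` section of the classic-mode configuration file. A minimal sketch (the rest of the configuration is omitted):

```
[SERVICE]
    log_level debug
```
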
For example, in this scenario the logs show that a connection was successfully established with the OpenSearch domain, and yet an error is still returned:

```
[2023/07/10 19:26:00] [debug] [http_client] not using http_proxy for header
[2023/07/10 19:26:00] [debug] [output:opensearch:opensearch.5] Signing request with AWS Sigv4
[2023/07/10 19:26:00] [debug] [aws_credentials] Requesting credentials from the EC2 provider..
```

This behavior can indicate a hard-to-detect issue with index shard usage in the OpenSearch domain.

While OpenSearch index shards and disk space are related, they are not directly tied to one another.

OpenSearch domains are limited to 1,000 index shards per data node, regardless of node size. Importantly, shard usage is not proportional to disk usage: a single index shard can hold anywhere from a few kilobytes to dozens of gigabytes of data.

In other words, depending on how index creation and shard allocation are configured in the OpenSearch domain, all of the available index shards can be used up long before the data nodes run out of disk space and begin exhibiting disk-related problems such as nodes crashing, data corruption, or the dashboard going offline.

The primary issue when a domain runs out of available index shards is that new indexes can no longer be created (though logs can still be added to existing indexes).

When that happens, the Fluent Bit OpenSearch output may begin behaving in confusing ways. For example:

- Errors suddenly appear even though outputs were previously working and the Fluent Bit configuration was unchanged
- Errors occur inconsistently (some logs still reach the OpenSearch domain)
- The Fluent Bit service logs show errors, but without any detail about the root cause

If any of those symptoms are present, consider using the OpenSearch domain's API endpoints to troubleshoot possible shard issues.

Running this command will show both the shard count and disk usage on all of the nodes in the domain:

```
GET _cat/allocation?v
```
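The output can also be inspected from a terminal. A sketch of totaling the per-node shard counts, using illustrative output values (the numbers below are made up for this example and do not come from a real cluster; the curl endpoint is a placeholder):

```shell
# The same data can be fetched with curl, for example:
#   curl -s 'https://<domain-endpoint>/_cat/allocation?v'
# Here we use an illustrative copy of the output saved to a file.
cat <<'EOF' > allocation.txt
shards disk.indices disk.used disk.avail disk.total disk.percent host     ip       node
   980        1.2tb     1.5tb      500gb        2tb           75 10.0.0.1 10.0.0.1 node-1
   995        1.1tb     1.4tb      600gb        2tb           70 10.0.0.2 10.0.0.2 node-2
EOF

# Sum the first (shards) column, skipping the header row; per-node counts
# approaching 1,000 are the warning sign for index-creation failures.
awk 'NR > 1 { total += $1 } END { print "total shards: " total }' allocation.txt
```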

Index creation issues will begin to appear if any hot data node has around 1,000 shards, or if the total number of shards across the hot and UltraWarm data nodes exceeds 1,000 times the number of nodes (for example, a cluster with 6 nodes has a maximum of 6,000 shards).

Alternatively, running this command to manually create a new index will return an explicit shard-count error if the maximum has been exceeded:

```
PUT <index-name>
```
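When the limit has been reached, the error response typically includes a validation message about "maximum shards open". A sketch of checking for it, using an illustrative response body (the shape is approximated and the exact wording may differ by OpenSearch version; this was not captured from a real domain):

```shell
# Illustrative error body from a PUT request on a cluster at its shard limit
cat <<'EOF' > error.json
{"error":{"type":"validation_exception","reason":"Validation Failed: 1: this action would add [2] total shards, but this cluster currently has [6000]/[6000] maximum shards open;"},"status":400}
EOF

# Quick check for the shard-limit error in the response body
grep -o 'maximum shards open' error.json
```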

There are multiple ways to resolve excessive shard usage in an OpenSearch domain, such as deleting or combining indexes, adding more data nodes to the cluster, or updating the domain's index creation and sharding strategy. Consult the OpenSearch documentation for more information on these strategies.
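For example, deleting an index that is no longer needed immediately frees its shards (the index name here is a placeholder):

```
DELETE <index-name>
```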