diff --git a/docs/reference/elasticsearch/rest-apis/reindex-data-stream.md b/docs/reference/elasticsearch/rest-apis/reindex-data-stream.md
index 60638dbfac987..920d0ae348d4a 100644
--- a/docs/reference/elasticsearch/rest-apis/reindex-data-stream.md
+++ b/docs/reference/elasticsearch/rest-apis/reindex-data-stream.md
@@ -1,62 +1,29 @@
---
-navigation_title: "Reindex data stream"
+navigation_title: "Reindex data streams"
mapped_pages:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/data-stream-reindex-api.html
- # That link will 404 until 8.18 is current
- # (see https://www.elastic.co/guide/en/elasticsearch/reference/8.18/data-stream-reindex-api.html)
applies_to:
stack: all
---
-# Reindex data stream API [data-stream-reindex-api]
+# Reindex data stream examples [data-stream-reindex-api]
-
-::::{admonition} New API reference
For the most up-to-date API details, refer to [Migration APIs](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-migration).
-::::
-
-
::::{tip}
-These APIs are designed for indirect use by {{kib}}'s **Upgrade Assistant**. We strongly recommend you use the **Upgrade Assistant** to upgrade from 8.17 to {{version}}. For upgrade instructions, refer to [Upgrading to Elastic {{version}}](docs-content://deploy-manage/upgrade/deployment-or-cluster.md).
+The reindex data stream API is designed for indirect use by {{kib}}'s **Upgrade Assistant**. We strongly recommend you use the **Upgrade Assistant** to perform upgrades. Refer to [](docs-content://deploy-manage/upgrade.md).
::::
-
The reindex data stream API is used to upgrade the backing indices of a data stream to the most recent major version. It works by reindexing each backing index into a new index, then replacing the original backing index with its replacement and deleting the original backing index. The settings and mappings from the original backing indices are copied to the resulting backing indices.
-This api runs in the background because reindexing all indices in a large data stream is expected to take a large amount of time and resources. The endpoint will return immediately and a persistent task will be created to run in the background. The current status of the task can be checked with the [reindex status API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-get-migrate-reindex-status). This status will be available for 24 hours after the task completes, whether it finished successfully or failed. If the status is still available for a task, the task must be cancelled before it can be re-run. A running or recently completed data stream reindex task can be cancelled using the [reindex cancel API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-cancel-migrate-reindex).
-
-## {{api-request-title}} [data-stream-reindex-api-request]
-
-`POST /_migration/reindex`
-
-
-## {{api-prereq-title}} [data-stream-reindex-api-prereqs]
-
-* If the {{es}} {{security-features}} are enabled, you must have the `manage` [index privilege](docs-content://deploy-manage/users-roles/cluster-or-deployment-auth/elasticsearch-privileges.md#privileges-list-indices) for the data stream.
-
-
-## {{api-request-body-title}} [data-stream-reindex-body]
-
-`source`
-: `index`
-: (Required, string) The name of the data stream to upgrade.
-
-
-`mode`
-: (Required, enum) Set to `upgrade` to upgrade the data stream in-place, using the same source and destination data stream. Each out-of-date backing index will be reindexed.
Then the new backing index is swapped into the data stream and the old index is deleted. Currently, the only allowed value for this parameter is `upgrade`.
-
+This API runs in the background because reindexing all indices in a large data stream is expected to take a large amount of time and resources. The endpoint will return immediately and a persistent task will be created to run in the background. The current status of the task can be checked with the [reindex status API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-get-migrate-reindex-status). This status will be available for 24 hours after the task completes, whether it finished successfully or failed. If the status is still available for a task, the task must be cancelled before it can be re-run. A running or recently completed data stream reindex task can be cancelled using the [reindex cancel API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-cancel-migrate-reindex).

## Settings [reindex-data-stream-api-settings]

-You can use the following settings to control the behavior of the reindex data stream API:
-
+You can use settings to control the behavior of the reindex data stream API.
+Refer to [](/reference/elasticsearch/configuration-reference/data-stream-lifecycle-settings.md).

$$$migrate_max_concurrent_indices_reindexed_per_data_stream-setting$$$
-`migrate.max_concurrent_indices_reindexed_per_data_stream` ([Dynamic](docs-content://deploy-manage/deploy/self-managed/configure-elasticsearch.md#dynamic-cluster-setting)) The number of backing indices within a given data stream which will be reindexed concurrently. Defaults to `1`.
-
$$$migrate_data_stream_reindex_max_request_per_second-setting$$$
-`migrate.data_stream_reindex_max_request_per_second` ([Dynamic](docs-content://deploy-manage/deploy/self-managed/configure-elasticsearch.md#dynamic-cluster-setting)) The average maximum number of documents within a given backing index to reindex per second. Defaults to `1000`, though can be any decimal number greater than `0`. To remove throttling, set to `-1`. This setting can be used to throttle the reindex process and manage resource usage. Consult the [reindex throttle docs](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-reindex#docs-reindex-throttle) for more information.
-

## {{api-examples-title}} [reindex-data-stream-api-example]

@@ -66,7 +33,7 @@ Assume we have a data stream `my-data-stream` with the following backing indices
* .ds-my-data-stream-2025.01.23-000002
* .ds-my-data-stream-2025.01.23-000003

-Let’s also assume that `.ds-my-data-stream-2025.01.23-000003` is the write index. If {{es}} is version 8.x and we wish to upgrade to major version 9.x, the version 7.x indices must be upgraded in preparation. We can use this API to reindex a data stream with version 7.x backing indices and make them version 8 backing indices.
+Let's also assume that `.ds-my-data-stream-2025.01.23-000003` is the write index. If {{es}} is version 8.x and we wish to upgrade to major version 9.x, the version 7.x indices must be upgraded in preparation. We can use this API to reindex a data stream with version 7.x backing indices and make them version 8 backing indices.

Start by calling the API:

@@ -84,7 +51,7 @@ POST _migration/reindex
As this task runs in the background this API will return immediately. The task will do the following.

-First, the data stream is rolled over.
So that no documents are lost during the reindex, we add [write blocks](/reference/elasticsearch/index-settings/index-block.md) to the existing backing indices before reindexing them. Since a data stream’s write index cannot have a write block, the data stream is must be rolled over. This will produce a new write index, `.ds-my-data-stream-2025.01.23-000003`; which has an 8.x version and thus does not need to be upgraded.
+First, the data stream is rolled over. So that no documents are lost during the reindex, we add [write blocks](/reference/elasticsearch/index-settings/index-block.md) to the existing backing indices before reindexing them. Since a data stream's write index cannot have a write block, the data stream must be rolled over. This will produce a new write index, `.ds-my-data-stream-2025.01.23-000004`, which has an 8.x version and thus does not need to be upgraded.

Once the data stream has a write index with an 8.x version we can proceed with reindexing the old indices. For each of the version 7.x indices, we now do the following:

@@ -97,7 +64,7 @@ Once the data stream has a write index with an 8.x version we can proceed with r
* Replace the old index in the data stream with the new index, using the [modify data streams API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-modify-data-stream).
* Finally, the old backing index is deleted.

-By default only one backing index will be processed at a time. This can be modified using the [`migrate_max_concurrent_indices_reindexed_per_data_stream-setting` setting](#migrate_max_concurrent_indices_reindexed_per_data_stream-setting).
+By default, only one backing index will be processed at a time. This can be modified using the [`migrate.max_concurrent_indices_reindexed_per_data_stream` setting](/reference/elasticsearch/configuration-reference/data-stream-lifecycle-settings.md).

While the reindex data stream task is running, we can inspect the current status using the [reindex status API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-get-migrate-reindex-status):

@@ -126,11 +93,11 @@ For the above example, the following would be a possible status:
}
```

-This output means that the first backing index, `.ds-my-data-stream-2025.01.23-000001`, is currently being processed, and none of the backing indices have yet completed. Notice that `total_indices_in_data_stream` has a value of `4`, because after the rollover, there are 4 indices in the data stream. But the new write index has an 8.x version, and thus doesn’t need to be reindexed, so `total_indices_requiring_upgrade` is only 3.
+This output means that the first backing index, `.ds-my-data-stream-2025.01.23-000001`, is currently being processed, and none of the backing indices have yet completed. Notice that `total_indices_in_data_stream` has a value of `4`, because after the rollover, there are 4 indices in the data stream. But the new write index has an 8.x version, and thus doesn't need to be reindexed, so `total_indices_requiring_upgrade` is only 3.

-### Cancelling and Restarting [reindex-data-stream-cancel-restart]
+### Cancelling and restarting [reindex-data-stream-cancel-restart]

-The [reindex datastream settings](#reindex-data-stream-api-settings) provide a few ways to control the performance and resource usage of a reindex task. This example shows how we can stop a running reindex task, modify the settings, and restart the task.
+The reindex data stream settings provide a few ways to control the performance and resource usage of a reindex task. This example shows how we can stop a running reindex task, modify the settings, and restart the task.

Continuing with the above example, assume the reindexing task has not yet completed, and the [reindex status API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-get-migrate-reindex-status) returns the following:

@@ -153,9 +120,9 @@ Continuing with the above example, assume the reindexing task has not yet comple
}
```

-Let’s assume the task has been running for a long time. By default, we throttle how many requests the reindex operation can execute per second. This keeps the reindex process from consuming too many resources. But the default value of `1000` request per second will not be correct for all use cases. The [`migrate.data_stream_reindex_max_request_per_second` setting](#migrate_data_stream_reindex_max_request_per_second-setting) can be used to increase or decrease the number of requests per second, or to remove the throttle entirely.
+Let's assume the task has been running for a long time. By default, we throttle how many requests the reindex operation can execute per second. This keeps the reindex process from consuming too many resources. But the default value of `1000` requests per second will not be correct for all use cases. The [`migrate.data_stream_reindex_max_request_per_second` setting](/reference/elasticsearch/configuration-reference/data-stream-lifecycle-settings.md) can be used to increase or decrease the number of requests per second, or to remove the throttle entirely.

-Changing this setting won’t have an effect on the backing index that is currently being reindexed. For example, changing the setting won’t have an effect on `.ds-my-data-stream-2025.01.23-000002`, but would have an effect on the next backing index.
+Changing this setting won't have an effect on the backing index that is currently being reindexed. For example, changing the setting won't affect `.ds-my-data-stream-2025.01.23-000002`, but it will affect the next backing index.

But in the above status, `.ds-my-data-stream-2025.01.23-000002` has values of 1000 and 10M for the `reindexed_doc_count` and `total_doc_count`, respectively. This means it has only reindexed 0.01% of the documents in the index. It might be a good time to cancel the run and optimize some settings without losing much work. So we call the [cancel API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-cancel-migrate-reindex):

@@ -174,7 +141,7 @@ PUT /_cluster/settings
}
```

-The [original reindex command](#reindex-data-stream-start) can now be used to restart reindexing. Because the first backing index, `.ds-my-data-stream-2025.01.23-000001`, has already been reindexed and thus is already version 8.x, it will be skipped. The task will start by reindexing `.ds-my-data-stream-2025.01.23-000002` again from the beginning.
+The original reindex command can now be used to restart reindexing. Because the first backing index, `.ds-my-data-stream-2025.01.23-000001`, has already been reindexed and thus is already version 8.x, it will be skipped. The task will start by reindexing `.ds-my-data-stream-2025.01.23-000002` again from the beginning.
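+For reference, a minimal sketch of that restart call, assuming the same `my-data-stream` source and the required `upgrade` mode used at the start of this example:
+
+```console
+POST /_migration/reindex
+{
+  "source": {
+    "index": "my-data-stream"
+  },
+  "mode": "upgrade"
+}
+```
+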
Later, once all the backing indices have finished, the [reindex status API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-get-migrate-reindex-status) will return something like the following:

@@ -226,7 +193,7 @@ which returns:
}
```

-Index `.ds-my-data-stream-2025.01.23-000004` is the write index and didn’t need to be upgraded because it was created with version 8.x. The other three backing indices are now prefixed with `.migrated` because they have been upgraded.
+Index `.ds-my-data-stream-2025.01.23-000004` is the write index and didn't need to be upgraded because it was created with version 8.x. The other three backing indices are now prefixed with `.migrated` because they have been upgraded.

We can now check the indices and verify that they have version 8.x:

@@ -250,8 +217,7 @@ which returns:
}
```

-
-### Handling Failures [reindex-data-stream-handling-failure]
+### Handling failures [reindex-data-stream-handling-failure]

Since the reindex data stream API runs in the background, failure information can be obtained through the [reindex status API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-get-migrate-reindex-status). For example, if the backing index `.ds-my-data-stream-2025.01.23-000002` was accidentally deleted by a user, we would see a status like the following:

@@ -273,4 +239,4 @@ Since the reindex data stream API runs in the background, failure information ca
}
```

-Once the issue has been fixed, the failed reindex task can be re-run. First, the failed run’s status must be cleared using the [reindex cancel API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-cancel-migrate-reindex). Then the [original reindex command](#reindex-data-stream-start) can be called to pick up where it left off.
+Once the issue has been fixed, the failed reindex task can be re-run. First, the failed run's status must be cleared using the [reindex cancel API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-cancel-migrate-reindex). Then the original reindex command can be called to pick up where it left off.
diff --git a/docs/reference/elasticsearch/rest-apis/reindex-indices.md b/docs/reference/elasticsearch/rest-apis/reindex-indices.md
new file mode 100644
index 0000000000000..c9825ca99f0c8
--- /dev/null
+++ b/docs/reference/elasticsearch/rest-apis/reindex-indices.md
@@ -0,0 +1,632 @@
+---
+navigation_title: "Reindex indices"
+mapped_pages:
+  - https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
+applies_to:
+  stack: all
+---
+
+# Reindex indices examples
+
+For the most up-to-date API details, refer to [Document APIs](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-document).
+
+## Running reindex asynchronously [docs-reindex-task-api]
+
+If the request contains `wait_for_completion=false`, {{es}} performs some preflight checks, launches the request, and returns a `task` you can use to cancel or get the status of the task. {{es}} creates a record of this task as a document at `_tasks/`.
+
+## Reindex from multiple sources [docs-reindex-from-multiple-sources]
+
+If you have many sources to reindex, it is generally better to reindex them one at a time rather than using a glob pattern to pick up multiple sources.
+That way you can resume the process if there are any errors by removing the partially completed source and starting over.
+It also makes parallelizing the process fairly simple: split the list of sources to reindex and run each list in parallel.
+
+One-off bash scripts seem to work nicely for this:
+
+```bash
+for index in i1 i2 i3 i4 i5; do
+  curl -HContent-Type:application/json -XPOST localhost:9200/_reindex?pretty -d'{
+    "source": {
+      "index": "'$index'"
+    },
+    "dest": {
+      "index": "'$index'-reindexed"
+    }
+  }'
+done
+```
+
+## Throttling [docs-reindex-throttle]
+
+Set `requests_per_second` to any positive decimal number (for example, `1.4`, `6`, or `1000`) to throttle the rate at which the reindex API issues batches of index operations.
+Requests are throttled by padding each batch with a wait time.
+To disable throttling, set `requests_per_second` to `-1`.
+
+The throttling is done by waiting between batches so that the `scroll` that the reindex API uses internally can be given a timeout that takes into account the padding. The padding time is the difference between the batch size divided by the `requests_per_second` and the time spent writing. By default, the batch size is `1000`, so if `requests_per_second` is set to `500`:
+
+```txt
+target_time = 1000 / 500 per second = 2 seconds
+wait_time = target_time - write_time = 2 seconds - .5 seconds = 1.5 seconds
+```
+
+Since the batch is issued as a single `_bulk` request, large batch sizes cause {{es}} to create many requests and then wait for a while before starting the next set. This is "bursty" instead of "smooth".
+
+## Rethrottling [docs-reindex-rethrottle]
+
+The value of `requests_per_second` can be changed on a running reindex using the `_rethrottle` API. For example:
+
+```console
+POST _reindex/r1A2WoRbTwKZ516z6NEs5A:36619/_rethrottle?requests_per_second=-1
+```
+
+The task ID can be found using the [task management APIs](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-tasks).
+
+Just like when setting it on the reindex API, `requests_per_second` can be either `-1` to disable throttling or any decimal number like `1.7` or `12` to throttle to that level.
+Rethrottling that speeds up the query takes effect immediately, but rethrottling that slows down the query will take effect after completing the current batch.
+This prevents scroll timeouts.
+
+## Slicing [docs-reindex-slice]
+
+Reindex supports [sliced scroll](paginate-search-results.md#slice-scroll) to parallelize the reindexing process.
+This parallelization can improve efficiency and provide a convenient way to break the request down into smaller parts.
+
+::::{note}
+Reindexing from remote clusters does not support manual or automatic slicing.
+::::
+
+### Manual slicing [docs-reindex-manual-slice]
+
+Slice a reindex request manually by providing a slice ID and total number of slices to each request:
+
+```console
+POST _reindex
+{
+  "source": {
+    "index": "my-index-000001",
+    "slice": {
+      "id": 0,
+      "max": 2
+    }
+  },
+  "dest": {
+    "index": "my-new-index-000001"
+  }
+}
+POST _reindex
+{
+  "source": {
+    "index": "my-index-000001",
+    "slice": {
+      "id": 1,
+      "max": 2
+    }
+  },
+  "dest": {
+    "index": "my-new-index-000001"
+  }
+}
+```
+
+You can verify this works by:
+
+```console
+GET _refresh
+POST my-new-index-000001/_search?size=0&filter_path=hits.total
+```
+
+which results in a sensible `total` like this one:
+
+```console-result
+{
+  "hits": {
+    "total" : {
+      "value": 120,
+      "relation": "eq"
+    }
+  }
+}
+```
+
+### Automatic slicing [docs-reindex-automatic-slice]
+
+You can also let the reindex API automatically parallelize using [sliced scroll](paginate-search-results.md#slice-scroll) to slice on `_id`.
+Use `slices` to specify the number of slices to use:
+
+```console
+POST _reindex?slices=5&refresh
+{
+  "source": {
+    "index": "my-index-000001"
+  },
+  "dest": {
+    "index": "my-new-index-000001"
+  }
+}
+```
+
+You can also verify this works by:
+
+```console
+POST my-new-index-000001/_search?size=0&filter_path=hits.total
+```
+
+which results in a sensible `total` like this one:
+
+```console-result
+{
+  "hits": {
+    "total" : {
+      "value": 120,
+      "relation": "eq"
+    }
+  }
+}
+```
+
+Setting `slices` to `auto` will let {{es}} choose the number of slices to use.
+This setting will use one slice per shard, up to a certain limit.
+If there are multiple sources, it will choose the number of slices based on the index or backing index with the smallest number of shards.
+
+Adding `slices` to the reindex API just automates the manual process used in the section above, creating sub-requests, which means it has some quirks:
+
+* You can see these requests in the [task management APIs](https://www.elastic.co/docs/api/doc/elasticsearch/group/endpoint-tasks). These sub-requests are "child" tasks of the task for the request with `slices`.
+* The status of the task for the request with `slices` only contains the status of completed slices.
+* These sub-requests are individually addressable for things like cancellation and rethrottling.
+* Rethrottling the request with `slices` will rethrottle the unfinished sub-request proportionally.
+* Canceling the request with `slices` will cancel each sub-request.
+* Due to the nature of `slices`, each sub-request won't get a perfectly even portion of the documents. All documents will be addressed, but some slices may be larger than others. Expect larger slices to have a more even distribution.
+* Parameters like `requests_per_second` and `max_docs` on a request with `slices` are distributed proportionally to each sub-request. Combine that with the point above about distribution being uneven and you should conclude that using `max_docs` with `slices` might not result in exactly `max_docs` documents being reindexed.
+* Each sub-request gets a slightly different snapshot of the source, though these are all taken at approximately the same time.
+
+### Picking the number of slices [docs-reindex-picking-slices]
+
+If slicing automatically, setting `slices` to `auto` will choose a reasonable number for most indices. If slicing manually or otherwise tuning automatic slicing, use these guidelines.
+
+Query performance is most efficient when the number of `slices` is equal to the number of shards in the index. If that number is large (for example, 500), choose a lower number as too many `slices` will hurt performance. Setting `slices` higher than the number of shards generally does not improve efficiency and adds overhead.
+
+Indexing performance scales linearly across available resources with the number of slices.
+
+Whether query or indexing performance dominates the runtime depends on the documents being reindexed and cluster resources.
+
+## Reindex routing [docs-reindex-routing]
+
+By default, if the reindex API sees a document with routing then the routing is preserved unless it's changed by the script. You can set `routing` on the `dest` request to change this.
+For example:
+
+`keep`
+: Sets the routing on the bulk request sent for each match to the routing on the match. This is the default value.
+
+`discard`
+: Sets the routing on the bulk request sent for each match to `null`.
+
+`=<some text>`
+: Sets the routing on the bulk request sent for each match to all text after the `=`.
+
+You can use the following request to copy all documents from the `source` with the company name `cat` into the `dest` with routing set to `cat`:
+
+```console
+POST _reindex
+{
+  "source": {
+    "index": "source",
+    "query": {
+      "match": {
+        "company": "cat"
+      }
+    }
+  },
+  "dest": {
+    "index": "dest",
+    "routing": "=cat"
+  }
+}
+```
+
+By default, the reindex API uses scroll batches of 1000. You can change the batch size with the `size` field in the `source` element:
+
+```console
+POST _reindex
+{
+  "source": {
+    "index": "source",
+    "size": 100
+  },
+  "dest": {
+    "index": "dest",
+    "routing": "=cat"
+  }
+}
+```
+
+## Reindex with an ingest pipeline [reindex-with-an-ingest-pipeline]
+
+Reindex can also use the [ingest pipelines](docs-content://manage-data/ingest/transform-enrich/ingest-pipelines.md) feature by specifying a `pipeline` like this:
+
+```console
+POST _reindex
+{
+  "source": {
+    "index": "source"
+  },
+  "dest": {
+    "index": "dest",
+    "pipeline": "some_ingest_pipeline"
+  }
+}
+```
+
+## Reindex select documents with a query [docs-reindex-select-query]
+
+You can limit the documents by adding a query to the `source`. For example, the following request only copies documents with a `user.id` of `kimchy` into `my-new-index-000001`:
+
+```console
+POST _reindex
+{
+  "source": {
+    "index": "my-index-000001",
+    "query": {
+      "term": {
+        "user.id": "kimchy"
+      }
+    }
+  },
+  "dest": {
+    "index": "my-new-index-000001"
+  }
+}
+```
+
+### Reindex select documents with `max_docs` [docs-reindex-select-max-docs]
+
+You can limit the number of processed documents by setting `max_docs`.
+For example, this request copies a single document from `my-index-000001` to `my-new-index-000001`:
+
+```console
+POST _reindex
+{
+  "max_docs": 1,
+  "source": {
+    "index": "my-index-000001"
+  },
+  "dest": {
+    "index": "my-new-index-000001"
+  }
+}
+```
+
+### Reindex from multiple sources [docs-reindex-multiple-sources]
+
+The `index` attribute in `source` can be a list, allowing you to copy from lots of sources in one request.
+This will copy documents from the `my-index-000001` and `my-index-000002` indices:
+
+```console
+POST _reindex
+{
+  "source": {
+    "index": ["my-index-000001", "my-index-000002"]
+  },
+  "dest": {
+    "index": "my-new-index-000002"
+  }
+}
+```
+
+::::{note}
+The reindex API makes no effort to handle ID collisions, so the last document written will "win". However, the order isn't usually predictable, so it is not a good idea to rely on this behavior. Instead, make sure that IDs are unique using a script.
+::::
+
+### Reindex select fields with a source filter [docs-reindex-filter-source]
+
+You can use source filtering to reindex a subset of the fields in the original documents.
+For example, the following request only reindexes the `user.id` and `_doc` fields of each document:
+
+```console
+POST _reindex
+{
+  "source": {
+    "index": "my-index-000001",
+    "_source": ["user.id", "_doc"]
+  },
+  "dest": {
+    "index": "my-new-index-000001"
+  }
+}
+```
+
+### Reindex to change the name of a field [docs-reindex-change-name]
+
+The reindex API can be used to build a copy of an index with renamed fields.
+Say you create an index containing documents that look like this:
+
+```console
+POST my-index-000001/_doc/1?refresh
+{
+  "text": "words words",
+  "flag": "foo"
+}
+```
+
+If you don't like the name `flag` and want to replace it with `tag`, the reindex API can create the other index for you:
+
+```console
+POST _reindex
+{
+  "source": {
+    "index": "my-index-000001"
+  },
+  "dest": {
+    "index": "my-new-index-000001"
+  },
+  "script": {
+    "source": "ctx._source.tag = ctx._source.remove(\"flag\")"
+  }
+}
+```
+
+Now you can get the new document:
+
+```console
+GET my-new-index-000001/_doc/1
+```
+
+...which will return:
+
+```console-result
+{
+  "found": true,
+  "_id": "1",
+  "_index": "my-new-index-000001",
+  "_version": 1,
+  "_seq_no": 44,
+  "_primary_term": 1,
+  "_source": {
+    "text": "words words",
+    "tag": "foo"
+  }
+}
+```
+
+## Reindex daily indices [docs-reindex-daily-indices]
+
+You can use the reindex API in combination with [Painless](/reference/scripting-languages/painless/painless.md) to reindex daily indices to apply a new template to the existing documents.
+
+Assuming you have indices that contain documents like:
+
+```console
+PUT metricbeat-2016.05.30/_doc/1?refresh
+{"system.cpu.idle.pct": 0.908}
+PUT metricbeat-2016.05.31/_doc/1?refresh
+{"system.cpu.idle.pct": 0.105}
+```
+
+The new template for the `metricbeat-*` indices is already loaded into {{es}}, but it applies only to the newly created indices. Painless can be used to reindex the existing documents and apply the new template.
+
+The script below extracts the date from the index name and creates a new index with `-1` appended. All data from `metricbeat-2016.05.31` will be reindexed into `metricbeat-2016.05.31-1`.
+
+```console
+POST _reindex
+{
+  "source": {
+    "index": "metricbeat-*"
+  },
+  "dest": {
+    "index": "metricbeat"
+  },
+  "script": {
+    "lang": "painless",
+    "source": "ctx._index = 'metricbeat-' + (ctx._index.substring('metricbeat-'.length(), ctx._index.length())) + '-1'"
+  }
+}
+```
+
+All documents from the previous metricbeat indices can now be found in the `*-1` indices.
+
+```console
+GET metricbeat-2016.05.30-1/_doc/1
+GET metricbeat-2016.05.31-1/_doc/1
+```
+
+The previous method can also be used in conjunction with [changing a field name](#docs-reindex-change-name) to load only the existing data into the new index and rename any fields if needed, as sketched below.
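+
+For example, a minimal sketch combining the index-renaming script above with a field rename. The destination field name `cpu_idle` is hypothetical, used only to illustrate the combination:
+
+```console
+POST _reindex
+{
+  "source": {
+    "index": "metricbeat-*"
+  },
+  "dest": {
+    "index": "metricbeat"
+  },
+  "script": {
+    "lang": "painless",
+    "source": "ctx._index = 'metricbeat-' + (ctx._index.substring('metricbeat-'.length(), ctx._index.length())) + '-1'; ctx._source.cpu_idle = ctx._source.remove('system.cpu.idle.pct')"
+  }
+}
+```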
+
+## Extract a random subset of the source [docs-reindex-api-subset]
+
+The reindex API can be used to extract a random subset of the source for testing:
+
+```console
+POST _reindex
+{
+  "max_docs": 10,
+  "source": {
+    "index": "my-index-000001",
+    "query": {
+      "function_score" : {
+        "random_score" : {},
+        "min_score" : 0.9 <1>
+      }
+    }
+  },
+  "dest": {
+    "index": "my-new-index-000001"
+  }
+}
+```
+
+1. You may need to adjust the `min_score` depending on the relative amount of data extracted from the source.
+
+## Modify documents during reindexing [reindex-scripts]
+
+Like `_update_by_query`, the reindex API supports a script that modifies the document.
+Unlike `_update_by_query`, the script is allowed to modify the document's metadata.
+This example bumps the version of the source document:
+
+```console
+POST _reindex
+{
+  "source": {
+    "index": "my-index-000001"
+  },
+  "dest": {
+    "index": "my-new-index-000001",
+    "version_type": "external"
+  },
+  "script": {
+    "source": "if (ctx._source.foo == 'bar') {ctx._version++; ctx._source.remove('foo')}",
+    "lang": "painless"
+  }
+}
+```
+
+Just as in `_update_by_query`, you can set `ctx.op` to change the operation that is run on the destination:
+
+`noop`
+: Set `ctx.op = "noop"` if your script decides that the document doesn't have to be indexed in the destination. This no operation will be reported in the `noop` counter in the response body.
+
+`delete`
+: Set `ctx.op = "delete"` if your script decides that the document must be deleted from the destination. The deletion will be reported in the `deleted` counter in the response body.
+
+Setting `ctx.op` to anything else will return an error, as will setting any other field in `ctx`.
+
+Think of the possibilities! Just be careful; you are able to change:
+
+* `_id`
+* `_index`
+* `_version`
+* `_routing`
+
+Setting `_version` to `null` or clearing it from the `ctx` map is just like not sending the version in an indexing request; it will cause the document to be overwritten in the destination regardless of the version on the target or the version type you use in the reindex API request.
+
+## Reindex from remote [reindex-from-remote]
+
+Reindex supports reindexing from a remote {{es}} cluster:
+
+```console
+POST _reindex
+{
+  "source": {
+    "remote": {
+      "host": "http://otherhost:9200",
+      "username": "user",
+      "password": "pass"
+    },
+    "index": "my-index-000001",
+    "query": {
+      "match": {
+        "test": "data"
+      }
+    }
+  },
+  "dest": {
+    "index": "my-new-index-000001"
+  }
+}
+```
+
+The `host` parameter must contain a scheme, host, port (for example, `https://otherhost:9200`), and optional path (for example, `https://otherhost:9200/proxy`).
+The `username` and `password` parameters are optional, and when they are present the reindex API will connect to the remote {{es}} node using basic auth.
+Be sure to use `https` when using basic auth or the password will be sent in plain text. There is a range of settings available to configure the behavior of the `https` connection.
+
+When using {{ecloud}}, it is also possible to authenticate against the remote cluster through the use of a valid API key:
+
+```console
+POST _reindex
+{
+  "source": {
+    "remote": {
+      "host": "http://otherhost:9200",
+      "headers": {
+        "Authorization": "ApiKey API_KEY_VALUE"
+      }
+    },
+    "index": "my-index-000001",
+    "query": {
+      "match": {
+        "test": "data"
+      }
+    }
+  },
+  "dest": {
+    "index": "my-new-index-000001"
+  }
+}
+```
+
+Remote hosts have to be explicitly allowed in `elasticsearch.yml` using the `reindex.remote.whitelist` property.
+It can be set to a comma-delimited list of allowed remote `host` and `port` combinations.
+The scheme is ignored; only the host and port are used. For example:
+
+```yaml
+reindex.remote.whitelist: [otherhost:9200, another:9200, 127.0.10.*:9200, localhost:*]
+```
+
+The list of allowed hosts must be configured on any nodes that will coordinate the reindex.
+
+This feature should work with remote clusters of any version of {{es}} you are likely to find. This should allow you to upgrade from any version of {{es}} to the current version by reindexing from a cluster of the old version.
+
+::::{warning}
+{{es}} does not support forward compatibility across major versions. For example, you cannot reindex from a 7.x cluster into a 6.x cluster.
+::::
+
+To enable queries sent to older versions of {{es}}, the `query` parameter is sent directly to the remote host without validation or modification.
+
+::::{note}
+Reindexing from remote clusters does not support manual or automatic slicing.
+::::
+
+Reindexing from a remote server uses an on-heap buffer that defaults to a maximum size of 100mb.
+If the remote index includes very large documents, you'll need to use a smaller batch size.
+The example below sets the batch size to `10`, which is very, very small.
+
+```console
+POST _reindex
+{
+  "source": {
+    "remote": {
+      "host": "http://otherhost:9200",
+      ...
+    },
+    "index": "source",
+    "size": 10,
+    "query": {
+      "match": {
+        "test": "data"
+      }
+    }
+  },
+  "dest": {
+    "index": "dest"
+  }
+}
+```
+
+It is also possible to set the socket read timeout on the remote connection with the `socket_timeout` field and the connection timeout with the `connect_timeout` field.
+Both default to 30 seconds.
+This example sets the socket read timeout to one minute and the connection timeout to 10 seconds:
+
+```console
+POST _reindex
+{
+  "source": {
+    "remote": {
+      "host": "http://otherhost:9200",
+      ...,
+      "socket_timeout": "1m",
+      "connect_timeout": "10s"
+    },
+    "index": "source",
+    "query": {
+      "match": {
+        "test": "data"
+      }
+    }
+  },
+  "dest": {
+    "index": "dest"
+  }
+}
```
+
+## Configuring SSL parameters [reindex-ssl]
+
+Reindex from remote supports configurable SSL settings.
+These must be specified in the `elasticsearch.yml` file, with the exception of the secure settings, which you add in the {{es}} keystore.
+It is not possible to configure SSL in the body of the reindex API request.
+Refer to [Reindex settings](/reference/elasticsearch/configuration-reference/index-management-settings.md#reindex-settings).
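+
+For illustration, a minimal `elasticsearch.yml` sketch, assuming PEM-encoded files. The paths are placeholders, and the linked reference is the authoritative list of `reindex.ssl.*` options:
+
+```yaml
+# CA certificates used to validate the remote cluster's certificate
+reindex.ssl.certificate_authorities: ["/path/to/ca.crt"]
+# Client certificate and key to present to the remote cluster
+reindex.ssl.certificate: "/path/to/client.crt"
+reindex.ssl.key: "/path/to/client.key"
+```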
diff --git a/docs/reference/elasticsearch/toc.yml b/docs/reference/elasticsearch/toc.yml
index 625ec7069a024..4714057c6d193 100644
--- a/docs/reference/elasticsearch/toc.yml
+++ b/docs/reference/elasticsearch/toc.yml
@@ -95,6 +95,7 @@ toc:
- file: rest-apis/reciprocal-rank-fusion.md
- file: rest-apis/retrievers.md
- file: rest-apis/reindex-data-stream.md
+ - file: rest-apis/reindex-indices.md
- file: rest-apis/create-index-from-source.md
- file: rest-apis/shard-request-cache.md
- file: rest-apis/search-suggesters.md