|
| 1 | +/* |
| 2 | + * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one |
| 3 | + * or more contributor license agreements. Licensed under the "Elastic License |
| 4 | + * 2.0", the "GNU Affero General Public License v3.0 only", and the "Server Side |
| 5 | + * Public License v 1"; you may not use this file except in compliance with, at |
| 6 | + * your election, the "Elastic License 2.0", the "GNU Affero General Public |
| 7 | + * License v3.0 only", or the "Server Side Public License, v 1". |
| 8 | + */ |
| 9 | + |
| 10 | +/** |
| 11 | + * The features infrastructure in Elasticsearch is responsible for two things: |
| 12 | + * <ol> |
| 13 | + * <li> |
| 14 | + * Determining when all nodes in a cluster have been upgraded to support some new functionality. |
| 15 | + * This is used to only utilise new behavior when all nodes in the cluster support it. |
| 16 | + * </li> |
| 17 | + * <li> |
| 18 | + * Ensuring nodes only join a cluster if they support all features already present on that cluster. |
| 19 | + * This is to ensure that once a cluster supports a feature, it then never drops support. |
| 20 | + * Conversely, when a feature is defined, it can then never be removed (but see Assumed features below). |
| 21 | + * </li> |
| 22 | + * </ol> |
| 23 | + * |
| 24 | + * <h2>Functionality</h2> |
| 25 | + * This functionality starts with {@link org.elasticsearch.features.NodeFeature}. This is a single id representing |
| 26 | + * new or changed functionality - exactly what functionality that feature represents is up to the developer. These are expected |
| 27 | + * to be {@code public static final} variables on a relevant class. Each area of code then exposes its features |
| 28 | + * through an implementation of {@link org.elasticsearch.features.FeatureSpecification#getFeatures}, registered as an SPI implementation. |
| 29 | + * <p> |
| 30 | + * All the features exposed by a node are included in the {@link org.elasticsearch.cluster.coordination.JoinTask.NodeJoinTask} information |
| 31 | + * processed by {@link org.elasticsearch.cluster.coordination.NodeJoinExecutor}, when a node attempts to join a cluster. This checks |
| 32 | + * the joining node has all the features already present on the cluster, and then records the set of features against that node |
| 33 | + * in cluster state (in the {@link org.elasticsearch.cluster.ClusterFeatures} object). |
| 34 | + * The calculated effective cluster features are not persisted, only the per-node feature set. |
| 35 | + * <p> |
| 36 | + * Informally, the features supported by a particular node are 'node features'; when all nodes in a cluster support a particular |
| 37 | + * feature, that is then a 'cluster feature'. |
| 38 | + * <p> |
| 39 | + * Node features can then be checked by code to determine if all nodes in the cluster support that particular feature. |
| 40 | + * This is done using {@link org.elasticsearch.features.FeatureService#clusterHasFeature}. This is a fast operation - the first |
| 41 | + * time this method is called on a particular cluster state, the cluster features for a cluster are calculated from all the |
| 42 | + * node feature information, and cached in the {@link org.elasticsearch.cluster.ClusterFeatures} object. |
| 43 | + * Henceforth, all cluster feature checks are fast hash set lookups, at least until the nodes or the master change. |
| 44 | + * |
| 45 | + * <h2>Features test infrastructure</h2> |
| 46 | + * Features can be specified as conditions in YAML tests, as well as checks and conditions in code-defined rolling upgrade tests |
| 47 | + * (see the Elasticsearch development documentation for more information). |
| 48 | + * These checks are performed by the {@code TestFeatureService} interface, and its standard implementation {@code ESRestTestFeatureService}. |
| 49 | + * |
| 50 | + * <h3>Test features</h3> |
| 51 | + * Sometimes, you want to define a feature for nodes, but the only checks you need to do are as part of a test. In this case, |
| 52 | + * the feature doesn't need to be included in the production feature set, it only needs to be present for automated tests. |
| 53 | + * So alongside {@link org.elasticsearch.features.FeatureSpecification#getFeatures}, there is |
| 54 | + * {@link org.elasticsearch.features.FeatureSpecification#getTestFeatures}. This can be used to expose node features, |
| 55 | + * but only for automated tests. It is ignored in production uses. This is determined by the {@link org.elasticsearch.features.FeatureData} |
| 56 | + * class, which uses a system property (set by the test infrastructure) to decide whether to include test features or not, |
| 57 | + * when gathering all the registered {@code FeatureSpecification} instances. |
| 58 | + * <p> |
| 59 | + * Test features can be removed at will (with appropriate backports), |
| 60 | + * as there are no long-term upgrade guarantees required for clusters in automated tests. |
| 61 | + * |
| 62 | + * <h3>Synthetic version features</h3> |
| 63 | + * Cluster functionality checks performed on code built from the {@code main} branch can only use features to check functionality, |
| 64 | + * but we also have branch releases with a longer release cadence. Sometimes tests need to be conditional on older versions |
| 65 | + * (where there isn't a feature already defined in the right place), determined at some point after the release has been finalized. |
| 66 | + * This is where synthetic version features come in. These can be used in tests where it is sensible to use |
| 67 | + * a release version number (eg 8.12.3). The presence of these features is determined solely by the minimum |
| 68 | + * node version present in the test cluster; no actual cluster features are defined nor checked. |
| 69 | + * This is done by {@code ESRestTestFeatureService}, matching on features of the form {@code gte_v8.12.3}. |
| 70 | + * For more information on their use, see the Elasticsearch developer documentation. |
| 71 | + * |
| 72 | + * <h2>Assumed features</h2> |
| 73 | + * Once a feature is defined on a cluster, it can never be removed - this is to ensure that functionality that is available |
| 74 | + * on a cluster then never stops being available. However, this can lead to the list of features in cluster state growing ever larger. |
| 75 | + * It is possible to remove defined cluster features, but only on a compatibility boundary (normally a new major release). |
| 76 | + * To see how this can be so, it may be helpful to start with the compatibility guarantees we provide: |
| 77 | + * <ul> |
| 78 | + * <li> |
| 79 | + * The first version of a new major (eg v9.0) can only form a cluster with the highest minor |
| 80 | + * of the previous major (eg v8.18). |
| 81 | + * </li> |
| 82 | + * <li> |
| 83 | + * This means that any cluster feature that was added <em>before</em> 8.18.0 was cut will <em>always</em> be present |
| 84 | + * on any cluster that has at least one v9 node in it (as we don't support mixed-version clusters of more than two versions). |
| 85 | + * </li> |
| 86 | + * <li> |
| 87 | + * This means that the code checks for those features can be completely removed from the code in v9, |
| 88 | + * and the new behavior used all the time. |
| 89 | + * </li> |
| 90 | + * <li> |
| 91 | + * This means that the node features themselves are not required, as they are never checked in the v9 codebase. |
| 92 | + * </li> |
| 93 | + * </ul> |
| 94 | + * So a fresh v9 cluster does not need to have any knowledge of features added before 8.18, as the cluster |
| 95 | + * will always have the new functionality. |
| 96 | + * <p> |
| 97 | + * So then how do we do a rolling upgrade from 8.18 to 9.0, if features have been removed? Normally, that would prevent a 9.0 |
| 98 | + * node from joining an 8.18 cluster, as it will not have all the required features published. However, we can make use |
| 99 | + * of the major version difference to allow the rolling upgrade to proceed. |
| 100 | + * <p> |
| 101 | + * This is where the {@link org.elasticsearch.features.NodeFeature#assumedAfterNextCompatibilityBoundary()} field comes in. On 8.18, |
| 102 | + * we can mark all the features that will be removed in 9.0 as assumed. This means that when the features infrastructure sees a |
| 103 | + * 9.x node, it will deem that node to have all the assumed features, even if the 9.0 node doesn't actually have those features |
| 104 | + * in its published set. It will allow 9.0 nodes to join the cluster missing assumed features, |
| 105 | + * and it will say the cluster supports a particular assumed feature even if it is missing from any 9.0 nodes in the cluster. |
| 106 | + * <p> |
| 107 | + * Essentially, 8.18 nodes (or any other version that can form a cluster with 8.x or 9.x nodes) can mediate |
| 108 | + * between the 8.x and 9.x feature sets, using {@code assumedAfterNextCompatibilityBoundary} |
| 109 | + * to mark features that have been removed from 9.x, and know that 9.x nodes still meet the requirements for those features. |
| 110 | + * These assumed features need to be defined before 8.18 and 9.0 are released. |
| 111 | + * <p> |
| 112 | + * To go into more detail on what happens during a rolling upgrade: |
| 113 | + * <ol> |
| 114 | + * <li>Start with a homogeneous 8.18 cluster, with an 8.18 cluster feature set (including assumed features)</li> |
| 115 | + * <li> |
| 116 | + * The first 9.0 node joins the cluster. Even though it is missing the features marked as assumed in 8.18, |
| 117 | + * the 8.18 master lets the 9.0 node join because all the missing features are marked as assumed, |
| 118 | + * and it is of the next major version. |
| 119 | + * </li> |
| 120 | + * <li> |
| 121 | + * At this point, any feature checks that happen on 8.18 nodes for assumed features pass, despite the 9.0 node |
| 122 | + * not publishing those features, as the 9.0 node is assumed to meet the requirements for that feature. |
| 123 | + * 9.0 nodes do not have those checks at all, and the corresponding code running on 9.0 uses the new behaviour without checking. |
| 124 | + * </li> |
| 125 | + * <li>More 8.18 nodes get swapped for 9.0 nodes</li> |
| 126 | + * <li> |
| 127 | + * At some point, the master will change from an 8.18 node to a 9.0 node. The 9.0 node does not have the assumed |
| 128 | + * features at all, so the new cluster feature set as calculated by the 9.0 master will only contain the features |
| 129 | + * that 9.0 knows about (the calculated feature set is not persisted anywhere). |
| 130 | + * The cluster has effectively dropped all the 8.18 features assumed in 9.0, whilst maintaining all behaviour. |
| 131 | + * The upgrade carries on. |
| 132 | + * </li> |
| 133 | + * <li> |
| 134 | + * If an 8.18 node were to quit and re-join the cluster still as 8.18 at this point |
| 135 | + * (and there are other 8.18 nodes not yet upgraded), it will be able to join the cluster despite the master being 9.0. |
| 136 | + * The 8.18 node publishes all the assumed features that 9.0 does not have - but that doesn't matter, because nodes can join |
| 137 | + * with more features than are present in the cluster as a whole. The additional features are not added |
| 138 | + * to the cluster feature set because not all the nodes in the cluster have those features |
| 139 | + * (as there is at least one 9.0 node in the cluster - itself). |
| 140 | + * </li> |
| 141 | + * <li> |
| 142 | + * At some point, the last 8.18 node leaves the cluster, and the cluster is a homogeneous 9.0 cluster |
| 143 | + * with only the cluster features known about by 9.0. |
| 144 | + * </li> |
| 145 | + * </ol> |
| 146 | + * |
| 147 | + * For any dynamic releases that occur from main, the cadence is much quicker - once a feature is present in a cluster, |
| 148 | + * you then only need one completed release to mark a feature as assumed, and a subsequent release to remove it from the codebase |
| 149 | + * and elide the corresponding check. |
| 150 | + */ |
| 151 | +package org.elasticsearch.features; |
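
The join check and the assumed-features behaviour described in the Javadoc above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the Elasticsearch implementation: `FeatureSketch`, `Node`, `clusterFeatures` and `canJoin` are hypothetical names, features are modelled as plain strings, and versions are reduced to a major number. It assumes that a master treats a node of a higher major version as having every feature the master has marked as assumed.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the join check and cluster-feature calculation (hypothetical, not the real classes). */
public class FeatureSketch {

    /** Hypothetical stand-in for a node: its major version and the feature ids it publishes. */
    record Node(int major, Set<String> features) {}

    /**
     * Cluster features as calculated by a master of major version masterMajor:
     * the features that every node either publishes or is assumed to have.
     */
    static Set<String> clusterFeatures(List<Node> nodes, Set<String> assumed, int masterMajor) {
        Set<String> result = null;
        for (Node node : nodes) {
            Set<String> effective = new HashSet<>(node.features());
            if (node.major() > masterMajor) {
                // Nodes past the compatibility boundary are deemed to have all assumed features.
                effective.addAll(assumed);
            }
            if (result == null) {
                result = effective;
            } else {
                result.retainAll(effective);
            }
        }
        return result == null ? Set.of() : result;
    }

    /** A node may join if it publishes, or is assumed to have, every current cluster feature. */
    static boolean canJoin(Node joining, Set<String> clusterFeatures, Set<String> assumed, int masterMajor) {
        for (String feature : clusterFeatures) {
            boolean assumedOk = joining.major() > masterMajor && assumed.contains(feature);
            if (!joining.features().contains(feature) && !assumedOk) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> assumedOn818 = Set.of("old_feature");
        Node n818 = new Node(8, Set.of("old_feature", "new_feature"));
        Node n90 = new Node(9, Set.of("new_feature"));

        // 8.18 master: the 9.0 node can join although it does not publish old_feature.
        System.out.println(canJoin(n90, Set.of("old_feature", "new_feature"), assumedOn818, 8)); // true
        // 8.18 master: checks for the assumed feature still pass in the mixed cluster.
        System.out.println(clusterFeatures(List.of(n818, n90), assumedOn818, 8).contains("old_feature")); // true
        // 9.0 master: it knows nothing about old_feature, so the feature drops out of the cluster set.
        System.out.println(clusterFeatures(List.of(n818, n90), Set.of(), 9)); // [new_feature]
    }
}
```

In this model an 8.18 node can still rejoin under a 9.0 master, because a node that publishes more features than the cluster set always passes `canJoin`.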