articles/service-fabric/service-fabric-cluster-region-move.md (3 additions, 3 deletions)
@@ -36,20 +36,20 @@ Before engaging in any regional migration, we recommend establishing a testbed a
- For all services:
* <p>Ensure that any communication paths between clients and the services are configured similarly to the source cluster. For example, this validation may include ensuring that intermediaries like Event Hubs, Network Load Balancers, App Gateways, or API Management are set up with the rules necessary to allow traffic to flow to the cluster.</p>
3. Redirect traffic from the old region to the new region. We recommend using [Azure Traffic Manager](../traffic-manager/traffic-manager-overview.md) for migration as it offers a range of [routing methods](../traffic-manager/traffic-manager-routing-methods.md). How you update your traffic routing rules depends on whether you plan to keep or deprecate the existing region, and on how traffic flows within your application. You may need to investigate whether private/public IPs or DNS names can be moved between Azure resources in different regions. Service Fabric is not aware of this part of your system, so investigate and, if necessary, involve the teams that own the Azure services in your traffic path, particularly if that path is complex or your workload is latency-critical. Documents such as [Configure Custom Domain](../api-management/configure-custom-domain.md), [Public IP Addresses](../virtual-network/ip-services/public-ip-addresses.md), and [DNS Zones and Records](../dns/dns-zones-records.md) may be useful to review. Here are two example scenarios demonstrating how you could approach updating traffic routing:
* If you do not plan to keep the existing source region and you have a DNS/CNAME associated with the public IP of a Network Load Balancer delivering calls to your original source cluster, update the DNS/CNAME to point to the new public IP of the Network Load Balancer in the new region. Completing that transfer causes clients using the existing cluster to switch to using the new cluster.
* If you do plan to keep the existing source region and you have a DNS/CNAME associated with the public IP of a Network Load Balancer delivering calls to your original source cluster, set up an instance of Azure Traffic Manager and associate the DNS name with it. Traffic Manager can then be configured to route to the individual Network Load Balancers within each region.
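A quick way to confirm a DNS cutover like the first scenario is to compare what the name currently resolves to against the known front-end IPs of each region. This is a minimal sketch, not an official tool; the IP addresses and the domain name in the commented-out live check are placeholders you would replace with your own:

```python
import socket

# Hypothetical values: replace with the public IPs of the load
# balancers in your source and target regions.
OLD_REGION_IP = "203.0.113.10"   # original region's load balancer
NEW_REGION_IP = "198.51.100.20"  # new region's load balancer

def cutover_status(resolved_ip: str, old_ip: str, new_ip: str) -> str:
    """Classify where a resolved IP points during a DNS cutover."""
    if resolved_ip == new_ip:
        return "cutover complete"
    if resolved_ip == old_ip:
        return "still on old region (DNS may not have propagated yet)"
    return "unexpected address"

# Live check (requires network access and your real DNS name):
# resolved = socket.gethostbyname("myapp.contoso.com")
# print(cutover_status(resolved, OLD_REGION_IP, NEW_REGION_IP))
```

Running such a check from several networks helps catch stale DNS caches before you decommission anything in the source region.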
4. If you do plan to keep both regions, then you will usually have some sort of “back sync,” where the source of truth is kept in a remote store, such as SQL, Cosmos DB, or Blob or File Storage, and then synced between the regions. If this applies to your workload, we recommend confirming that data is flowing between the regions as expected.
## Final Validation
1. As a final validation, verify that traffic is flowing as expected and that the services in the new region (and potentially the old region) are operating normally.
2. If you do not plan to keep the original source region, then at this point the resources in that region can be removed. We recommend waiting for some time before deleting resources, in case some issue is discovered that requires a rollback to the original source region.
## Next Steps
Now that you've moved your cluster and applications to a new region, validate that backups are set up to protect any required data.
> [!div class="nextstepaction"]
> [Set up backups after migration](service-fabric-backuprestoreservice-quickstart-azurecluster.md)
articles/service-fabric/service-fabric-cluster-resource-manager-advanced-placement-rules-placement-policies.md (1 addition, 1 deletion)
@@ -100,7 +100,7 @@ Replicas are _normally_ distributed across fault and upgrade domains when the cl
> For more information on constraints and constraint priorities generally, check out [this topic](service-fabric-cluster-resource-manager-management-integration.md#constraint-priorities).
>
If you've ever seen a health message such as "`The Load Balancer has detected a Constraint Violation for this Replica:fabric:/<some service name> Secondary Partition <some partition ID> is violating the Constraint: FaultDomain`", then you've hit this condition or something like it. Usually only one or two replicas are packed together temporarily. As long as fewer than a quorum of replicas are packed into a given domain, you're safe. Packing is rare, but it can happen, and these situations are usually transient since the nodes come back. If the nodes do stay down and the Cluster Resource Manager needs to build replacements, there are usually other nodes available in the ideal fault domains.
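The quorum safety check above can be sketched as a small calculation. This is a simplified model of the reasoning, not Service Fabric's actual implementation:

```python
def quorum(total_replicas: int) -> int:
    """Minimum number of replicas that must survive: a strict majority."""
    return total_replicas // 2 + 1

def packing_is_safe(replicas_in_one_domain: int, total_replicas: int) -> bool:
    """Losing one domain is survivable as long as fewer than a quorum
    of the replicas are packed into that domain."""
    return replicas_in_one_domain < quorum(total_replicas)

# With 5 replicas, quorum is 3: two replicas packed into one fault
# domain is tolerable, three is not.
print(packing_is_safe(2, 5))  # True
print(packing_is_safe(3, 5))  # False
```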
Some workloads would prefer always having the target number of replicas, even if they are packed into fewer domains. These workloads are betting against total simultaneous permanent domain failures and can usually recover local state. Other workloads would rather take the downtime earlier than risk correctness or loss of data. Most production workloads run with more than three replicas, more than three fault domains, and many valid nodes per fault domain. Because of this, domain packing is allowed by default, letting normal balancing and failover handle these extreme cases, even if that means temporary domain packing.
articles/service-fabric/service-fabric-cluster-resource-manager-balancing.md (2 additions, 2 deletions)
@@ -19,7 +19,7 @@ There are three different categories of work that the Cluster Resource Manager p
3. Balancing – this stage checks to see if rebalancing is necessary based on the configured desired level of balance for different metrics. If so, it attempts to find an arrangement in the cluster that is more balanced.
## Configuring Cluster Resource Manager Timers
The first set of controls around balancing is a set of timers. These timers govern how often the Cluster Resource Manager examines the cluster and takes corrective actions.
Each of the different types of corrections the Cluster Resource Manager can make is controlled by its own timer, which governs its frequency. When each timer fires, the task is scheduled. By default the Resource Manager:
@@ -165,7 +165,7 @@ via ClusterConfig.json for Standalone deployments or Template.json for Azure hos
]
```
Balancing and activity thresholds are both tied to a specific metric: balancing is triggered only if both the Balancing Threshold and Activity Threshold are exceeded for the same metric.
> [!NOTE]
> When not specified, the Balancing Threshold for a metric is 1, and the Activity Threshold is 0. This means that the Cluster Resource Manager will try to keep that metric perfectly balanced for any given load. If you are using custom metrics, we recommend explicitly defining your own balancing and activity thresholds for your metrics.
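The trigger condition can be sketched as a small predicate. This is a simplified model under the assumption that the Balancing Threshold is compared against the ratio of the most-loaded to least-loaded node and the Activity Threshold against the maximum node load; the exact internal checks may differ:

```python
def should_rebalance(node_loads: list,
                     balancing_threshold: float,
                     activity_threshold: float) -> bool:
    """Rebalance a metric only when BOTH thresholds are exceeded:
    the max/min node-load ratio exceeds the Balancing Threshold,
    and the maximum node load exceeds the Activity Threshold."""
    max_load, min_load = max(node_loads), min(node_loads)
    ratio = max_load / min_load if min_load > 0 else float("inf")
    return ratio > balancing_threshold and max_load > activity_threshold

# Imbalanced (ratio 4 > threshold 2), but total activity is low: no rebalance.
print(should_rebalance([200, 50, 100], balancing_threshold=2, activity_threshold=500))   # False
# Same imbalance with the activity threshold exceeded: rebalance.
print(should_rebalance([2000, 500, 1000], balancing_threshold=2, activity_threshold=500))  # True
```

This is why the defaults (threshold 1, activity 0) effectively mean "always try to balance": any imbalance at any load level satisfies both conditions.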
articles/service-fabric/service-fabric-cluster-resource-manager-cluster-description.md (1 addition, 1 deletion)
@@ -74,7 +74,7 @@ The following diagram shows three upgrade domains striped across three fault dom
There are pros and cons to having large numbers of upgrade domains. More upgrade domains mean each step of the upgrade is more granular and affects a smaller number of nodes or services. Fewer services have to move at a time, introducing less churn into the system. This tends to improve reliability, because less of the service is affected by any issue introduced during the upgrade. More upgrade domains also mean that you need less available buffer on other nodes to handle the impact of the upgrade.
For example, if you have five upgrade domains, the nodes in each are handling roughly 20 percent of your traffic. If you need to take down that upgrade domain for an upgrade, the load usually needs to go somewhere. Because you have four remaining upgrade domains, each must have room for about 25 percent of the total traffic. More upgrade domains mean that you need less buffer on the nodes in the cluster.
Consider if you had 10 upgrade domains instead. In that case, each upgrade domain would be handling only about 10 percent of the total traffic. When an upgrade steps through the cluster, each domain would need to have room for only about 11 percent of the total traffic. More upgrade domains generally allow you to run your nodes at higher utilization, because you need less reserved capacity. The same is true for fault domains.
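The buffer arithmetic above is easy to check: with N upgrade domains, each handles roughly 1/N of the traffic, and while one domain is down the remaining N-1 domains must absorb the total. A small sketch:

```python
def upgrade_buffer(num_upgrade_domains: int):
    """Return each upgrade domain's share of total traffic, as
    percentages: (normal share, share while one domain is down)."""
    normal = 100 / num_upgrade_domains
    during_upgrade = 100 / (num_upgrade_domains - 1)
    return normal, during_upgrade

# 5 upgrade domains: 20% each normally, 25% while one is down.
print(upgrade_buffer(5))   # (20.0, 25.0)
# 10 upgrade domains: 10% each normally, about 11.1% while one is down.
print(upgrade_buffer(10))
```

The gap between the two numbers is the reserved headroom each node needs, which is why more upgrade domains let you run nodes at higher utilization.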
articles/service-fabric/service-fabric-cluster-resource-manager-defragmentation-metrics.md (2 additions, 2 deletions)
@@ -21,7 +21,7 @@ If there are many services and state to move around, then it could take a long t
You can configure defragmentation metrics to have the Cluster Resource Manager proactively try to condense the load of services onto fewer nodes. This helps ensure that there is almost always room for large services without reorganizing the cluster. Not having to reorganize the cluster allows large workloads to be created quickly.
Most people don’t need defragmentation. Services are usually small, so it’s not hard to find room for them in the cluster. When reorganization is possible, it goes quickly, again because most services are small and can be moved quickly and in parallel. However, if you have large services and need them created quickly, then the defragmentation strategy is for you. We'll discuss the tradeoffs of using defragmentation next.
## Defragmentation tradeoffs
Defragmentation can increase the impact of failures, since more services are running on the nodes that fail. Defragmentation can also increase costs, since resources in the cluster must be held in reserve, waiting for the creation of large workloads.
@@ -44,7 +44,7 @@ So what are those other conceptual tradeoffs? Here’s a quick table of things t
| Enables lower data movement during creation | Failures can impact more services and cause more churn |
| Allows rich description of requirements and reclamation of space | More complex overall Resource Management configuration |
You can mix defragmented and normal metrics in the same cluster. The Cluster Resource Manager tries to consolidate the defragmentation metrics as much as possible while spreading out the others. The results of mixing defragmentation and balancing strategies depend on several factors, including:
- the number of balancing metrics vs. the number of defragmentation metrics
articles/service-fabric/service-fabric-cluster-resource-manager-introduction.md (1 addition, 1 deletion)
@@ -18,7 +18,7 @@ Suddenly managing your environment is not so simple as managing a few machines d
Because your app is no longer a series of monoliths spread across several tiers, you now have many more combinations to deal with. Who decides what types of workloads can run on which hardware, or how many? Which workloads work well on the same hardware, and which conflict? When a machine goes down, how do you know what was running on it? Who is in charge of making sure that workload starts running again? Do you wait for the (virtual?) machine to come back, or do your workloads automatically fail over to other machines and keep running? Is human intervention required? What about upgrades in this environment?
As developers and operators dealing in this environment, we’re going to want help with managing this complexity. A hiring binge and trying to hide the complexity with people is probably not the right answer, so what do we do?
## Introducing orchestrators
An “Orchestrator” is the general term for a piece of software that helps administrators manage these types of environments. Orchestrators are the components that take in requests like “I would like five copies of this service running in my environment." They try to make the environment match the desired state, no matter what happens.
articles/service-fabric/service-fabric-cluster-resource-manager-management-integration.md (1 addition, 1 deletion)
@@ -68,7 +68,7 @@ Here's what this health message is telling us is:
5. The distribution policy for this service: "Distribution Policy -- Packing". This is governed by the `RequireDomainDistribution` [placement policy](service-fabric-cluster-resource-manager-advanced-placement-rules-placement-policies.md#requiring-replica-distribution-and-disallowing-packing). *Packing* indicates that in this case DomainDistribution was _not_ required, so we know that placement policy was not specified for this service.
6. When the report happened - 8/10/2015 7:13:02 PM
Information like this powers alerting. You can use alerts in production to let you know something has gone wrong. Alerting is also used to detect and halt bad upgrades. In this case, we’d want to see if we can figure out why the Resource Manager had to pack the replicas into the Upgrade Domain. Usually packing is transient because the nodes in the other Upgrade Domains were down, for example.
Let’s say the Cluster Resource Manager is trying to place some services, but there aren't any solutions that work. When services can't be placed, it is usually for one of the following reasons:
articles/service-fabric/service-fabric-cluster-resource-manager-metrics.md (1 addition, 1 deletion)
@@ -148,7 +148,7 @@ The whole point of defining metrics is to represent some load. *Load* is how muc
All of these strategies can be used within the same service over its lifetime.
## Default load
*Default load* is how much of the metric each service object (stateless instance or stateful replica) of this service consumes. The Cluster Resource Manager uses this number for the load of the service object until it receives other information, such as a dynamic load report. For simpler services, the default load is a static definition. The default load is never updated and is used for the lifetime of the service. Default loads work great for simple capacity planning scenarios where certain amounts of resources are dedicated to different workloads and do not change.
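The "default until reported" behavior described above can be sketched as a lookup that falls back to the static default when no dynamic report has arrived. This is a simplified illustration; the metric name and values are hypothetical, not a real Service Fabric API:

```python
# Hypothetical metric with a static default load per service object.
DEFAULT_LOADS = {"MemoryInMb": 21}

# Dynamic load reports, keyed by (replica_id, metric); empty until a
# service object reports its actual consumption.
reported_loads = {}

def current_load(replica_id: str, metric: str) -> int:
    """Use the reported load if one exists; otherwise fall back to
    the static default load defined for the metric."""
    return reported_loads.get((replica_id, metric), DEFAULT_LOADS[metric])

print(current_load("replica-1", "MemoryInMb"))  # 21: nothing reported yet
reported_loads[("replica-1", "MemoryInMb")] = 35
print(current_load("replica-1", "MemoryInMb"))  # 35: dynamic report wins
```

For workloads that never report dynamically, the fallback value is all the Cluster Resource Manager ever sees, which is why static defaults suit fixed capacity-planning scenarios.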
> [!NOTE]
> For more information on capacity management and defining capacities for the nodes in your cluster, please see [this article](service-fabric-cluster-resource-manager-cluster-description.md#capacity).