You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The `XrdAdaptor` package is the CMSSW implementation of CMS' AAA infrastructure. The main features on top of the stock XRootD client library are
6
+
* Recovery from some errors via re-tries
7
+
* Use of multiple XRootD sources (described further [here](doc/multisource_algorithm_design.md))
8
+
9
+
## Short description of components
10
+
11
+
### `ClientRequest`
12
+
13
+
The `ClientRequest` implements `XrdCl::ResponseHandler`, and represents a single read request(?).
14
+
15
+
### `QualityMetric`
16
+
17
+
The `QualityMetric` implements a measurement of the server's recent performance for the client. It's based on the time taken to service requests (under the assumption that, since requests are split into potentially smaller-chunks by the request manager, the time to service them should be roughly the same) with a small amount of exponential weighting to prefer data points from the most recent request.
18
+
19
+
Since the metric is based on time to serve requests, a lower value is better.
20
+
21
+
Potential improvements include:
22
+
* Actually weighting the scores based on the size (or complexity) of the reads. The assumption that latency dominates transfer time may be OK in some cases -- but we've seen for large files (e.g., heavy ion), that bandwidth really is relevant and that large vector reads can cause much more server stress due to read amplification for erasure-coded systems than a simple read.
23
+
* Switching from the hand-calculated exponential weighting to a more typical exponentially weighted moving average setup.
24
+
25
+
26
+
### `RequestManager`
27
+
28
+
The `RequestManager` containes the actual implementation of the retries and the multi-source algorithm. There is one `RequestManager` object for one PFN, and it contains one or more `Source` objects.
29
+
30
+
#### `RequestManager::OpenHandler`
31
+
32
+
The `OpenHandler` implements XRootD's `XrdCl::ResponseHandler` in an asynchronous way. An instance is created in `RequestManager::initialize()`, and used when additional Sources are opened, either as part of the multi-source comparisons (`RequestManager::checkSourcesImpl()`) or read error recovery (`RequestManager::requestFailure()`).
33
+
34
+
Most of the internal operations for the xrootd client are asynchronous while ROOT expects a synchronous interface from the adaptor. The difference between the two here is the asynchronous one is used for "background probing" to find additional sources.
35
+
36
+
### `Source`
37
+
38
+
The `Source` represents a connection to one storage server. There can be more than one `Source` for one PFN.
39
+
40
+
### `SyncHostResponseHandler`
41
+
42
+
The `SyncHostResponseHandler` implements XRootD's `XrdCl::ResponseHandler` in a synchronous way(?). It is used in `RequestManager::initialize()` for the initial file open.
43
+
44
+
### `XrdFile`
45
+
46
+
The `XrdFile` implements `edm::storage::Storage` (see [Utilities/StorageFactory/README.md](../../Utilities/StorageFactory/README.md). In CMS' terminology it represents one Physical File Name (PFN), and acts as a glue between the `edm::storage::Storage` API and `RequestManager`.
47
+
48
+
### `XrdStatistics`
49
+
50
+
The `XrdStatistics` provides per-"site" counters (bytes read, total time), providing the `XrdStatisticsService` with a viewpoint of how individual sites are performing for a given client. The intent is to provide more visibility into the performance encountered by grouping I/O operations into a per-site basis, under the theory that performance within a site is similar but could differ between two different sites.
51
+
52
+
### `XrdStatisticsService`
53
+
54
+
The `XrdStatisticsService` is a Service to report XrootD-related statistics centrally. It is one of the default Services that are enabled in `cmsRun`.
55
+
56
+
### `XrdStorageMaker`
57
+
58
+
The `XrdStorageMaker` is a plugin in the `StorageMaker` hierarchy. See [Utilities/StorageFactory/README.md](../../Utilities/StorageFactory/README.md) for more information. Among other things it creates `XrdFile` objects.
Copy file name to clipboardExpand all lines: Utilities/XrdAdaptor/doc/multisource_algorithm_design.md
+42-26Lines changed: 42 additions & 26 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -1,26 +1,31 @@
1
-
Introduction
1
+
# Multi-source algorithm design
2
+
3
+
## Introduction
4
+
2
5
The existing AAA infrastructure has a relatively high penalty for poor redirection decisions. A client which is redirected to a "poor" server in terms of network locality (in terms of high latency or low bandwidth) will continue to use that server as long as it doesn't fail outright, even if there is a much better source available.
3
6
4
7
Improving network locality argues for improving the redirector's redirection logic. This is certainly needed, however network locality is not the only concern. A site's storage can degrade in the middle of a file transfer (due to a device loss in the RAID, or sudden spurt of new transfers), making a "reasonable choice" at redirection time perform poorly overall.
5
8
6
9
To reduce the penalty for poor redirection decisions, we aim to upgrade the CMSSW XrdAdaptor to be multi-source capable. By actively using multiple sources and probing for additional ones, the client will be able to discover the fastest sources, even if the designation of "fastest" changes through the job's lifetime.
7
10
8
-
Design Goals
11
+
## Design Goals
12
+
9
13
Design goals of the multi-source Xrootd client:
10
-
(0) Determine a metric for quality of the source server.
11
-
(1) Actively balance transfers over multiple links in order to determine several high-quality sources of the file.
12
-
(2) Recover from transient IO errors at a single source.
13
-
(3) Probe for additional possible sources during the file transfer.
14
-
(4) Minimize the impact on the source site versus a single-source client. Understand both average case and the worst case scenarios.
15
-
- In particular, actively utilizing too many sources can cause TCP windows to stay small and OS read-ahead to be inefficient.
16
-
- Any speculative probes for additional sources should be a small percentage of the total traffic.
17
-
- Any retry mechanisms must have reasonable delays prior to failure.
18
-
(5) Have the number of requests per source be proportional to source quality.
19
-
- Servers or sites experiencing a "soft failure" causing a degrade in quality will receive the least amount of traffic.
20
-
21
-
Implementation
22
-
23
-
Quality metric
14
+
0. Determine a metric for quality of the source server.
15
+
1. Actively balance transfers over multiple links in order to determine several high-quality sources of the file.
16
+
2. Recover from transient IO errors at a single source.
17
+
3. Probe for additional possible sources during the file transfer.
18
+
4. Minimize the impact on the source site versus a single-source client. Understand both average case and the worst case scenarios.
19
+
- In particular, actively utilizing too many sources can cause TCP windows to stay small and OS read-ahead to be inefficient.
20
+
- Any speculative probes for additional sources should be a small percentage of the total traffic.
21
+
- Any retry mechanisms must have reasonable delays prior to failure.
22
+
5. Have the number of requests per source be proportional to source quality.
23
+
- Servers or sites experiencing a "soft failure" causing a degrade in quality will receive the least amount of traffic.
24
+
25
+
## Implementation
26
+
27
+
### Quality metric
28
+
24
29
A source's quality is to be defined, per-file, to be the average request response time over the last 5 time intervals. Each time interval is one minute long; time intervals with no data points are discarded.
25
30
26
31
If there is no previously recorded data for a given source, it is assumed to have an average of 260ms (assumes a 256KB request, 1MB/s server speed, and 10ms of latency) in one time interval. When a new file is opened on a source, the prior average for that source is used for the first time interval.
@@ -29,7 +34,8 @@ Notes:
29
34
- The request splitting algorithm outlined below will split all client requests into a series of requests at most 256KB (similar to what the Xrootd client does internally). Since the request has a maximum size, it makes looking at the unweighted time per request more reasonable.
30
35
- It may seem strange to the reader to not differentiate between bandwidth and latency, or somehow factoring in request size to the quality metric. I believe it is acceptable to ignore this as the distribution of small and large requests will remain approximately constant throughout the job lifetime.
31
36
32
-
Source selection algorithm
37
+
### Source selection algorithm
38
+
33
39
The client will maintain a set of up to two "active servers" and an arbitrary number of inactive servers. When the client opens a file, the initial data server it receives from the redirector becomes the first active server.
34
40
35
41
For a 5-second grace period, this initial server remains the only active one. (The use of "5 seconds" as the grace period is motivated by the redirector's implementation; any internal file location requests triggered by the initial file open should be finished after 5 seconds.) After the grace period, the client enters source search mode.
@@ -46,50 +52,60 @@ If a source encounters an error (either a file IO error or a disconnect), then i
46
52
47
53
If an inactive source's quality metric is better than an active source's metric, the two are swapped. This swap is not performed if the inactive source itself has been removed from the active set in the last two minutes. The "Active probe algorithm" section below describes one mechanism for updating an inactive source's quality metric.
48
54
49
-
Request splitting algorithm
55
+
### Request splitting algorithm
56
+
50
57
When a client performs a new request, the request is balanced amongst the active servers using the following algorithm:
51
58
52
-
1) Source A removes 256KB from the beginning of the request and places it at the end of its request queue.
53
-
2) Source B removes 256KB from the end 256KB of the request and places it at the beginning of its request queue.
54
-
3) Steps (1) and (2) are repeated until the original request has been completely split into the two queues.
59
+
1. Source A removes 256KB from the beginning of the request and places it at the end of its request queue.
60
+
2. Source B removes 256KB from the end 256KB of the request and places it at the beginning of its request queue.
61
+
3. Steps (1) and (2) are repeated until the original request has been completely split into the two queues.
55
62
56
63
After each client request has been fully split, the labels "A" and "B" are swapped. This allows a series of small client requests to be load-balanced between the two sources.
57
64
58
65
Examples:
59
66
60
67
For example, suppose a client requests to read 1024KB starting at offset 0. The client request and queues look like:
61
-
68
+
```
62
69
A: []
63
70
B: []
64
71
client request: [1024KB @ 0]
72
+
```
65
73
66
74
After steps 1 and 2, the sources have the following queues:
67
-
75
+
```
68
76
A: [256KB @ 0]
69
77
B: [256KB @ 768KB]
70
78
client request: [512KB @ 256KB]
79
+
```
71
80
72
81
After steps 1 and 2 are applied again, we have the following queues:
73
-
82
+
```
74
83
A: [256KB @ 0, 256KB @ 256KB]
75
84
B: [256KB @ 512KB, 256KB @ 768KB]
76
85
client request: []
86
+
```
77
87
78
88
Consider the following example for a vectored read request:
- The algorithm halts because, after each iteration, the client request is reduced by 512KB and the algorithm terminates once the client request is empty.
@@ -98,7 +114,7 @@ Notes:
98
114
- If the client request, when performed in order, results in the server performing non-overlapping reads with monotonically-increasing offsets, then the split requests will also have this property.
99
115
- TODO: We should have the initial assignment done with respect to the quality metric. When the two servers are heavily out-of-balance, one may steal a lot of work from the other. When work is stolen, it is performed via backward reads, which won't cause ideal filesystem performance.
100
116
101
-
Load-balance algorithm
117
+
### Load-balance algorithm
102
118
103
119
When a client request is made, it is split into two sets of requests as described previously if there are two active servers. Each source performs the IO operations in its queue in order to completion.
104
120
@@ -113,7 +129,7 @@ Notes:
113
129
- TODO: calculate the worst case over-read for very slow sources if they get their work stolen all the time.
114
130
- The over-read is probably not as bad as having to take quite some time to give up on the source. We can probably introduce a penalty for sources that are the "victim" of a successful speculative read.
115
131
116
-
Active probe algorithm
132
+
### Active probe algorithm
117
133
118
134
If the client is not in search mode, for every 1024KB of data read, generate a random number such that there will be a .25% chance the request will also be issued as a speculative read to an inactive source. If it has been at least 2 minutes since the last file-open request to the redirector, there is also a .25% chance the request will trigger a file-open and a speculative read in the redirector.
0 commit comments