Skip to content

Commit c80b7ef

Browse files
authored
Merge pull request #47524 from makortel/xrdAdaptorDocumentation
Small improvements to XrdAdaptor documentation
2 parents 69d052f + a2c1aa5 commit c80b7ef

File tree

4 files changed

+117
-28
lines changed

4 files changed

+117
-28
lines changed

Utilities/XrdAdaptor/README.md

Lines changed: 58 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,58 @@
1+
# XrdAdaptor
2+
3+
## Introduction
4+
5+
The `XrdAdaptor` package is the CMSSW implementation of CMS' AAA infrastructure. The main features on top of the stock XRootD client library are
6+
* Recovery from some errors via re-tries
7+
* Use of multiple XRootD sources (described further [here](doc/multisource_algorithm_design.md))
8+
9+
## Short description of components
10+
11+
### `ClientRequest`
12+
13+
The `ClientRequest` implements `XrdCl::ResponseHandler`, and represents a single read request(?).
14+
15+
### `QualityMetric`
16+
17+
The `QualityMetric` implements a measurement of the server's recent performance for the client. It's based on the time taken to service requests (under the assumption that, since requests are split into potentially smaller-chunks by the request manager, the time to service them should be roughly the same) with a small amount of exponential weighting to prefer data points from the most recent request.
18+
19+
Since the metric is based on time to serve requests, a lower value is better.
20+
21+
Potential improvements include:
22+
* Actually weighting the scores based on the size (or complexity) of the reads. The assumption that latency dominates transfer time may be OK in some cases -- but we've seen for large files (e.g., heavy ion), that bandwidth really is relevant and that large vector reads can cause much more server stress due to read amplification for erasure-coded systems than a simple read.
23+
* Switching from the hand-calculated exponential weighting to a more typical exponentially weighted moving average setup.
24+
25+
26+
### `RequestManager`
27+
28+
The `RequestManager` containes the actual implementation of the retries and the multi-source algorithm. There is one `RequestManager` object for one PFN, and it contains one or more `Source` objects.
29+
30+
#### `RequestManager::OpenHandler`
31+
32+
The `OpenHandler` implements XRootD's `XrdCl::ResponseHandler` in an asynchronous way. An instance is created in `RequestManager::initialize()`, and used when additional Sources are opened, either as part of the multi-source comparisons (`RequestManager::checkSourcesImpl()`) or read error recovery (`RequestManager::requestFailure()`).
33+
34+
Most of the internal operations for the xrootd client are asynchronous while ROOT expects a synchronous interface from the adaptor. The difference between the two here is the asynchronous one is used for "background probing" to find additional sources.
35+
36+
### `Source`
37+
38+
The `Source` represents a connection to one storage server. There can be more than one `Source` for one PFN.
39+
40+
### `SyncHostResponseHandler`
41+
42+
The `SyncHostResponseHandler` implements XRootD's `XrdCl::ResponseHandler` in a synchronous way(?). It is used in `RequestManager::initialize()` for the initial file open.
43+
44+
### `XrdFile`
45+
46+
The `XrdFile` implements `edm::storage::Storage` (see [Utilities/StorageFactory/README.md](../../Utilities/StorageFactory/README.md). In CMS' terminology it represents one Physical File Name (PFN), and acts as a glue between the `edm::storage::Storage` API and `RequestManager`.
47+
48+
### `XrdStatistics`
49+
50+
The `XrdStatistics` provides per-"site" counters (bytes read, total time), providing the `XrdStatisticsService` with a viewpoint of how individual sites are performing for a given client. The intent is to provide more visibility into the performance encountered by grouping I/O operations into a per-site basis, under the theory that performance within a site is similar but could differ between two different sites.
51+
52+
### `XrdStatisticsService`
53+
54+
The `XrdStatisticsService` is a Service to report XrootD-related statistics centrally. It is one of the default Services that are enabled in `cmsRun`.
55+
56+
### `XrdStorageMaker`
57+
58+
The `XrdStorageMaker` is a plugin in the `StorageMaker` hierarchy. See [Utilities/StorageFactory/README.md](../../Utilities/StorageFactory/README.md) for more information. Among other things it creates `XrdFile` objects.

Utilities/XrdAdaptor/doc/multisource_algorithm_design.txt renamed to Utilities/XrdAdaptor/doc/multisource_algorithm_design.md

Lines changed: 42 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -1,26 +1,31 @@
1-
Introduction
1+
# Multi-source algorithm design
2+
3+
## Introduction
4+
25
The existing AAA infrastructure has a relatively high penalty for poor redirection decisions. A client which is redirected to a "poor" server in terms of network locality (in terms of high latency or low bandwidth) will continue to use that server as long as it doesn't fail outright, even if there is a much better source available.
36

47
Improving network locality argues for improving the redirector's redirection logic. This is certainly needed, however network locality is not the only concern. A site's storage can degrade in the middle of a file transfer (due to a device loss in the RAID, or sudden spurt of new transfers), making a "reasonable choice" at redirection time perform poorly overall.
58

69
To reduce the penalty for poor redirection decisions, we aim to upgrade the CMSSW XrdAdaptor to be multi-source capable. By actively using multiple sources and probing for additional ones, the client will be able to discover the fastest sources, even if the designation of "fastest" changes through the job's lifetime.
710

8-
Design Goals
11+
## Design Goals
12+
913
Design goals of the multi-source Xrootd client:
10-
(0) Determine a metric for quality of the source server.
11-
(1) Actively balance transfers over multiple links in order to determine several high-quality sources of the file.
12-
(2) Recover from transient IO errors at a single source.
13-
(3) Probe for additional possible sources during the file transfer.
14-
(4) Minimize the impact on the source site versus a single-source client. Understand both average case and the worst case scenarios.
15-
- In particular, actively utilizing too many sources can cause TCP windows to stay small and OS read-ahead to be inefficient.
16-
- Any speculative probes for additional sources should be a small percentage of the total traffic.
17-
- Any retry mechanisms must have reasonable delays prior to failure.
18-
(5) Have the number of requests per source be proportional to source quality.
19-
- Servers or sites experiencing a "soft failure" causing a degrade in quality will receive the least amount of traffic.
20-
21-
Implementation
22-
23-
Quality metric
14+
0. Determine a metric for quality of the source server.
15+
1. Actively balance transfers over multiple links in order to determine several high-quality sources of the file.
16+
2. Recover from transient IO errors at a single source.
17+
3. Probe for additional possible sources during the file transfer.
18+
4. Minimize the impact on the source site versus a single-source client. Understand both average case and the worst case scenarios.
19+
- In particular, actively utilizing too many sources can cause TCP windows to stay small and OS read-ahead to be inefficient.
20+
- Any speculative probes for additional sources should be a small percentage of the total traffic.
21+
- Any retry mechanisms must have reasonable delays prior to failure.
22+
5. Have the number of requests per source be proportional to source quality.
23+
- Servers or sites experiencing a "soft failure" causing a degrade in quality will receive the least amount of traffic.
24+
25+
## Implementation
26+
27+
### Quality metric
28+
2429
A source's quality is to be defined, per-file, to be the average request response time over the last 5 time intervals. Each time interval is one minute long; time intervals with no data points are discarded.
2530

2631
If there is no previously recorded data for a given source, it is assumed to have an average of 260ms (assumes a 256KB request, 1MB/s server speed, and 10ms of latency) in one time interval. When a new file is opened on a source, the prior average for that source is used for the first time interval.
@@ -29,7 +34,8 @@ Notes:
2934
- The request splitting algorithm outlined below will split all client requests into a series of requests at most 256KB (similar to what the Xrootd client does internally). Since the request has a maximum size, it makes looking at the unweighted time per request more reasonable.
3035
- It may seem strange to the reader to not differentiate between bandwidth and latency, or somehow factoring in request size to the quality metric. I believe it is acceptable to ignore this as the distribution of small and large requests will remain approximately constant throughout the job lifetime.
3136

32-
Source selection algorithm
37+
### Source selection algorithm
38+
3339
The client will maintain a set of up to two "active servers" and an arbitrary number of inactive servers. When the client opens a file, the initial data server it receives from the redirector becomes the first active server.
3440

3541
For a 5-second grace period, this initial server remains the only active one. (The use of "5 seconds" as the grace period is motivated by the redirector's implementation; any internal file location requests triggered by the initial file open should be finished after 5 seconds.) After the grace period, the client enters source search mode.
@@ -46,50 +52,60 @@ If a source encounters an error (either a file IO error or a disconnect), then i
4652

4753
If an inactive source's quality metric is better than an active source's metric, the two are swapped. This swap is not performed if the inactive source itself has been removed from the active set in the last two minutes. The "Active probe algorithm" section below describes one mechanism for updating an inactive source's quality metric.
4854

49-
Request splitting algorithm
55+
### Request splitting algorithm
56+
5057
When a client performs a new request, the request is balanced amongst the active servers using the following algorithm:
5158

52-
1) Source A removes 256KB from the beginning of the request and places it at the end of its request queue.
53-
2) Source B removes 256KB from the end 256KB of the request and places it at the beginning of its request queue.
54-
3) Steps (1) and (2) are repeated until the original request has been completely split into the two queues.
59+
1. Source A removes 256KB from the beginning of the request and places it at the end of its request queue.
60+
2. Source B removes 256KB from the end 256KB of the request and places it at the beginning of its request queue.
61+
3. Steps (1) and (2) are repeated until the original request has been completely split into the two queues.
5562

5663
After each client request has been fully split, the labels "A" and "B" are swapped. This allows a series of small client requests to be load-balanced between the two sources.
5764

5865
Examples:
5966

6067
For example, suppose a client requests to read 1024KB starting at offset 0. The client request and queues look like:
61-
68+
```
6269
A: []
6370
B: []
6471
client request: [1024KB @ 0]
72+
```
6573

6674
After steps 1 and 2, the sources have the following queues:
67-
75+
```
6876
A: [256KB @ 0]
6977
B: [256KB @ 768KB]
7078
client request: [512KB @ 256KB]
79+
```
7180

7281
After steps 1 and 2 are applied again, we have the following queues:
73-
82+
```
7483
A: [256KB @ 0, 256KB @ 256KB]
7584
B: [256KB @ 512KB, 256KB @ 768KB]
7685
client request: []
86+
```
7787

7888
Consider the following example for a vectored read request:
7989
Start:
90+
```
8091
A: []
8192
B: []
8293
client request: [192KB @ 0, 128KB @ 256KB, 128KB @ 512KB, 192KB @ 768KB]
94+
```
8395

8496
Iteration 1:
97+
```
8598
A: [192KB @ 0, 64KB @ 256KB]
8699
B: [64KB @ 576KB, 192KB @ 768KB]
87100
client request: [64KB @ 320KB, 64KB @ 512KB]
101+
```
88102

89103
Iteration 2:
104+
```
90105
A: [192KB @ 0, 64KB @ 256KB, 64KB @ 320KB, 64KB @ 512KB]
91106
B: [64KB @ 576KB, 192KB @ 768KB]
92107
client request: []
108+
```
93109

94110
Notes:
95111
- The algorithm halts because, after each iteration, the client request is reduced by 512KB and the algorithm terminates once the client request is empty.
@@ -98,7 +114,7 @@ Notes:
98114
- If the client request, when performed in order, results in the server performing non-overlapping reads with monotonically-increasing offsets, then the split requests will also have this property.
99115
- TODO: We should have the initial assignment done with respect to the quality metric. When the two servers are heavily out-of-balance, one may steal a lot of work from the other. When work is stolen, it is performed via backward reads, which won't cause ideal filesystem performance.
100116

101-
Load-balance algorithm
117+
### Load-balance algorithm
102118

103119
When a client request is made, it is split into two sets of requests as described previously if there are two active servers. Each source performs the IO operations in its queue in order to completion.
104120

@@ -113,7 +129,7 @@ Notes:
113129
- TODO: calculate the worst case over-read for very slow sources if they get their work stolen all the time.
114130
- The over-read is probably not as bad as having to take quite some time to give up on the source. We can probably introduce a penalty for sources that are the "victim" of a successful speculative read.
115131

116-
Active probe algorithm
132+
### Active probe algorithm
117133

118134
If the client is not in search mode, for every 1024KB of data read, generate a random number such that there will be a .25% chance the request will also be issued as a speculative read to an inactive source. If it has been at least 2 minutes since the last file-open request to the redirector, there is also a .25% chance the request will trigger a file-open and a speculative read in the redirector.
119135

Utilities/XrdAdaptor/src/XrdRequestManager.h

Lines changed: 14 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -41,6 +41,11 @@ namespace XrdAdaptor {
4141
uint16_t m_code;
4242
};
4343

44+
/**
45+
* The RequestManager manages the requests concerning one PFN.
46+
*
47+
* It implements retries, and can use multiple Sources
48+
*/
4449
class RequestManager {
4550
public:
4651
using IOSize = edm::storage::IOSize;
@@ -218,8 +223,13 @@ namespace XrdAdaptor {
218223
std::vector<std::shared_ptr<Source>> m_activeSources;
219224
std::vector<std::shared_ptr<Source>> m_inactiveSources;
220225

226+
/// Contains the "DataServer" property for disabled Sources and
227+
/// for connections for which the Open() call failed
221228
oneapi::tbb::concurrent_unordered_set<std::string> m_disabledSourceStrings;
229+
/// Contains Source::determineExcludeString() for disabled Sources and
230+
/// for connections for which the Open() call failed
222231
oneapi::tbb::concurrent_unordered_set<std::string> m_disabledExcludeStrings;
232+
/// Sources that have been disabled
223233
oneapi::tbb::concurrent_unordered_set<std::shared_ptr<Source>, SourceHash> m_disabledSources;
224234

225235
// StatisticsSenderService wants to know what our current server is;
@@ -260,8 +270,10 @@ namespace XrdAdaptor {
260270
~OpenHandler() override;
261271

262272
/**
263-
* Handle the file-open response
264-
*/
273+
* Handle the file-open response
274+
*
275+
* Called by XRootD
276+
*/
265277
void HandleResponseWithHosts(XrdCl::XRootDStatus *status,
266278
XrdCl::AnyObject *response,
267279
XrdCl::HostList *hostList) override;

Utilities/XrdAdaptor/src/XrdSource.h

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,9 @@ namespace XrdAdaptor {
2020
class XrdSiteStatistics;
2121
class XrdStatisticsService;
2222

23+
/**
24+
* A Source represents a connection to one storage server
25+
*/
2326
class Source : public std::enable_shared_from_this<Source> {
2427
public:
2528
Source(const Source &) = delete;

0 commit comments

Comments
 (0)