# Data Transfer
## Motivation

There are two endpoints that perform data transfer (upload and download) using FirecREST:

1. `filesystem/<system>/ops/upload[|download]`, which is meant for small file transfers and blocks the FirecREST interface, and
2. `filesystem/<system>/transfer/upload[|download]`, which is designed to handle large data transfers by delegating the operation to another API or transfer service, so it does not block the FirecREST API.
In this section we discuss the latter.
## Types of data transfer using FirecREST

FirecREST offers several types of data transfer, which can be selected using the [Data Operation](../../../setup/conf/#dataoperation) configuration.
### `S3DataTransfer`
!!! Note

    Configuration for S3 Data Transfer can be found at this [link](../../../setup/conf/#s3datatransfer)
FirecREST enables users to upload and download large data files of [up to 5TB each](https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html), utilizing S3 buckets as a data buffer.
Users requesting data uploads or downloads to the HPC infrastructure receive [presigned URLs](https://docs.aws.amazon.com/AmazonS3/latest/userguide/ShareObjectPreSignedURL.html) to transfer data to or from the S3 storage.

Ownership of buckets and data remains with the FirecREST service account, but FirecREST creates one [bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html#CoreConcepts) per user. Each transferred file (uploaded or downloaded) is stored as a uniquely identified [data object](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingObjects.html) in the user's bucket. Data objects within the buckets are retained for a configurable period, managed through S3's [lifecycle expiration](https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-expire-general-considerations.html) functionality. This expiration period, expressed in days, can be specified using the [`bucket_lifecycle_configuration`](../../../setup/conf/README.md#bucketlifecycleconfiguration) parameter.
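
For illustration, an expiration rule of the kind implied by `bucket_lifecycle_configuration` can be sketched as follows. The payload shape is the standard S3 lifecycle-configuration format; the exact rule FirecREST generates and the helper name `expiration_rule` are assumptions for this sketch:

```python
# Sketch: build an S3 lifecycle payload that expires objects after `days` days.
# The shape follows the standard S3 PutBucketLifecycleConfiguration payload;
# the concrete rule FirecREST installs may differ.
def expiration_rule(days: int) -> dict:
    return {
        "Rules": [
            {
                "ID": f"expire-after-{days}-days",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to every object in the bucket
                "Expiration": {"Days": days},
            }
        ]
    }

# With boto3, a payload like this would be applied roughly as:
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="<user-bucket>", LifecycleConfiguration=expiration_rule(7))
print(expiration_rule(7)["Rules"][0]["Expiration"]["Days"])  # → 7
```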

The S3 storage can be either on-premises or cloud-based. In either case, a valid service account with sufficient permissions to create buckets and generate presigned URLs is required.
#### `S3DataTransfer` Upload

Uploading data from outside the HPC infrastructure requires users to follow the [multipart upload protocol](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html); the data file must therefore be divided into parts. The size limit of each part is defined in FirecREST's response to the upload call.
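
The part arithmetic can be sketched as follows. The helper `split_plan` and the 5 GiB part size are illustrative only; the actual limit comes from FirecREST's response:

```python
# Sketch: compute the (offset, length) of each part for a multipart upload,
# given a part-size limit. The limit used below is illustrative only.
def split_plan(file_size: int, part_size: int) -> list[tuple[int, int]]:
    return [
        (offset, min(part_size, file_size - offset))
        for offset in range(0, file_size, part_size)
    ]

GiB = 2**30
parts = split_plan(12 * GiB, 5 * GiB)  # a 12 GiB file with 5 GiB parts
print(len(parts))  # → 3  (5 GiB + 5 GiB + 2 GiB)
```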

After the user completes the upload process, an already scheduled job transfers the incoming data from S3 to the destination specified by the user.
The diagram below illustrates the sequence of calls required to correctly process an upload.
1. The user calls API resource `transfer/upload` of endpoint `filesystem` with the parameters
    - `path`: destination of the file in the HPC cluster
6. The data transfer job detects the upload completion
7. The data transfer job downloads the incoming data from S3 to the destination specified by the user
#### `S3DataTransfer` Download

Exporting a large data file from the HPC cluster to an external system begins with the user's request to download data. FirecREST returns a presigned URL for accessing the S3 object and then schedules a job that uploads the data to an S3 object in the user's data bucket. The user must wait until the upload process within the HPC infrastructure is fully complete before accessing the data on S3.

Once the presigned URL is provided by FirecREST, users can access the S3 bucket to retrieve the data.
The diagram below illustrates the sequence of calls required to correctly process a download.
1. The user calls API resource `transfer/download` of the `filesystem` endpoint providing the following parameter
    - `path`: source of the file in the HPC cluster
Although single-file download is an option, S3 supports [HTTP Range Requests](https://www.rfc-editor.org/rfc/rfc9110.html#name-range-requests), which can be used to download chunks of a file stored in the S3 bucket in parallel.
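
As a sketch of the idea (the helper `range_headers` and the chunk size are assumptions), the inclusive `Range` header values covering an object can be computed like this, with each value then used in a separate parallel GET against the presigned URL:

```python
# Sketch: compute inclusive HTTP Range header values (RFC 9110) covering an
# object of the given size in fixed-size chunks, one header per parallel GET.
def range_headers(object_size: int, chunk_size: int) -> list[str]:
    return [
        f"bytes={start}-{min(start + chunk_size, object_size) - 1}"
        for start in range(0, object_size, chunk_size)
    ]

# e.g. each value would be sent as:  curl -H "Range: <value>" "<download_url>"
print(range_headers(10, 4))  # → ['bytes=0-3', 'bytes=4-7', 'bytes=8-9']
```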
### `StreamerDataTransfer`
!!! Note

    Configuration for Streamer Data Transfer can be found at this [link](../../../setup/conf/#streamerdatatransfer)
When requested, FirecREST creates a scheduler job that opens a [websocket](https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API) on a port from a range of available ports on a compute node of the cluster.
Once opened, this websocket is able to receive or transmit chunks of data using the [`firecrest-streamer`](https://pypi.org/project/firecrest-streamer/) Python library, which is developed and maintained by the FirecREST team.
#### Features

Compared to `S3DataTransfer`, this method has several advantages:

- Data transfer is performed point to point between the user and the target remote filesystem
97
+
- The staging area is no longer needed, which prevents writing the data twice for one operation
98
+
- There is no limit on the amount of data to be transferred, this is an improvement compared with the 5TB of `S3DataTransfer`
99
+
- There is no need for splitting the file before the upload when it's larger than 5 GB
100
+
- To avoid that an idle transfer occupies a shared resource such as a compute node, an `wait_timeout` parameter can be configured. Once this timeout is achieved, the job is cancelled automatically.
101
+
- Additionally, to prevent that the data transferred exceeds the capacity supported by the HPC centre, the parameter `inbound_transfer_limit` limits the amount of data that can be received.
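
As a purely illustrative sketch of how these two parameters might appear in a configuration fragment (the parameter names come from this page, but the surrounding keys and values are assumptions; see the linked configuration reference for the real schema):

```yaml
# Illustrative only: wait_timeout / inbound_transfer_limit as named on this
# page; the surrounding structure is an assumption, not the real schema.
data_transfer:
  name: StreamerDataTransfer
  wait_timeout: 600                    # seconds an idle transfer may wait before the job is cancelled
  inbound_transfer_limit: 5368709120   # maximum bytes accepted per inbound transfer (5 GiB here)
```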
102
+
103
+
#### Limitations
104
+
105
+
It's important to mention that using this data transfer type assumes that the compute nodes where the websocket are opened has public IP or DNS address and the range of ports selected for the data streaming are opened to external networks as well.
106
+
107
+
Additionally, users must use the `firecrest-streamer` python library (or CLI tool) in order to perform the data transfer.

1. The user calls the API resource `transfer/download` or `transfer/upload`, requesting the data transfer of a file.
2. FirecREST creates a data transfer job via the scheduler, launching the `firecrest-streamer` server.
3. This server opens an available port (from the configured range) and returns a unique "coordinate", which acts as a shared secret between the server and the user's client.
4. The FirecREST response holds the "coordinates" needed to perform the point-to-point data transfer between the user and the remote filesystem.
5. Using the `firecrest-streamer` client, the user performs the upload or download.
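
The flow above can be mimicked in miniature with a plain TCP socket standing in for the websocket. This is a sketch only: the real `firecrest-streamer` uses websockets, secret coordinates, and a configured port range, none of which are modelled here.

```python
# Sketch: chunked point-to-point transfer over a local TCP socket, standing in
# for the streamer's websocket. No staging area: bytes go straight from the
# sender to the receiver, as in StreamerDataTransfer.
import socket
import threading

CHUNK = 64 * 1024  # stream in fixed-size chunks

def serve_file(srv: socket.socket, data: bytes) -> None:
    """Job-side stand-in: accept one client, stream the file, close."""
    conn, _ = srv.accept()
    with conn:
        for i in range(0, len(data), CHUNK):
            conn.sendall(data[i:i + CHUNK])
    srv.close()

def download(port: int) -> bytes:
    """User-side stand-in: read chunks until the server closes the stream."""
    buf = bytearray()
    with socket.create_connection(("127.0.0.1", port)) as conn:
        while chunk := conn.recv(CHUNK):
            buf.extend(chunk)
    return bytes(buf)

payload = b"\x42" * 200_000
server = socket.create_server(("127.0.0.1", 0))  # port 0: pick any free port
port = server.getsockname()[1]
worker = threading.Thread(target=serve_file, args=(server, payload))
worker.start()
received = download(port)
worker.join()
print(received == payload)  # → True
```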
### `WormholeDataTransfer`
!!! Note

    Configuration for Wormhole Data Transfer can be found at this [link](../../../setup/conf/#wormholedatatransfer)
This data transfer type enables [Magic Wormhole](https://magic-wormhole.readthedocs.io/en/latest/) integration for data transfer through FirecREST.

The idea is the same as with `StreamerDataTransfer`: a job is created on the scheduler that starts a Magic Wormhole server, which can `receive` or `send` chunks of data using a Magic Wormhole Relay Server and a Rendezvous Server.
!!! Note
129
+
For more information on Magic Wormhole servers, refer to this [link](https://magic-wormhole.readthedocs.io/en/latest/welcome.html#relays)

As with the streamer data transfer type, users must use a Python client or a CLI, in this case provided by the developers of Magic Wormhole, to establish point-to-point communication between client and server.

All asynchronous endpoints are located under `/transfer` and follow this path structure.
## File transfer
FirecREST provides two resources for transferring files:

- [`/filesystem/{system_name}/ops/download`](https://eth-cscs.github.io/firecrest-v2/openapi/#/filesystem/get_download_filesystem__system_name__ops_download_get)[`[|upload]`](https://eth-cscs.github.io/firecrest-v2/openapi/#/filesystem/post_upload_filesystem__system_name__ops_upload_post) for small files (up to 5MB by [default](../setup/conf/#dataoperation)) that can be uploaded or downloaded directly, and
- [`/filesystem/{system_name}/transfer/download`](https://eth-cscs.github.io/firecrest-v2/openapi/#/filesystem/post_download_filesystem__system_name__transfer_download_post)[`[|upload]`](https://eth-cscs.github.io/firecrest-v2/openapi/#/filesystem/post_upload_filesystem__system_name__transfer_upload_post) for large files, which are transferred depending on the `transfer_method` chosen (if configured in the FirecREST installation).

The latter creates a job in the scheduler to perform an asynchronous data transfer managed by the HPC center. Supported values for `transfer_method` are:

- `s3`: files must first be transferred to a staging storage system (e.g., S3) before being moved to their final location on the HPC filesystem.
- `streamer`: a point-to-point data transfer using the [`firecrest-streamer`](https://pypi.org/project/firecrest-streamer/) client
- `wormhole`: a point-to-point data transfer using the [Magic Wormhole](https://magic-wormhole.readthedocs.io/en/latest/welcome.html) client
!!! Note

    Availability of the transfer methods depends on the configuration of the FirecREST installation. You can check the [`status/systems`](https://eth-cscs.github.io/firecrest-v2/openapi/#/status/get_systems_status_systems_get) endpoint to see which `data_transfer` method is supported by your HPC provider.

When requesting a large file download, FirecREST returns a `jobId` and information about how to download the file. The information shown depends on the transfer method used:
### Using `s3` transfer method
#### S3 download

Once the remote job is completed, the file is temporarily stored in the S3 object storage. Users can then retrieve the file using the provided `download_url` directly from the S3 interface.
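
Retrieving the file from the `download_url` is then an ordinary HTTP GET, which can be streamed to disk in chunks. A minimal sketch (the helper `fetch` is hypothetical, and a local `file://` URL stands in for the presigned URL, which is issued per request):

```python
# Sketch: stream a URL (e.g. a presigned download_url) to disk in chunks,
# without loading the whole file into memory.
import tempfile
import urllib.request
from pathlib import Path

def fetch(url: str, dest: Path, chunk: int = 1 << 20) -> int:
    """Download `url` to `dest` chunk by chunk; return total bytes written."""
    written = 0
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        while block := resp.read(chunk):
            out.write(block)
            written += len(block)
    return written

# Demo with a file:// URL standing in for the presigned download_url.
workdir = Path(tempfile.mkdtemp())
src = workdir / "src.bin"
src.write_bytes(b"hello" * 1000)
dest = workdir / "copy.bin"
print(fetch(src.as_uri(), dest))  # → 5000
```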
!!! example "Download a file using `s3` transfer method"

    ```sh
    $ curl --request POST <firecrest_url>/filesystem/<system>/transfer/download \
           --header 'Authorization: Bearer <access_token>' \
           --header 'Content-Type: application/json' \
           --data '{"path": "<path_of_the_file_on_the_cluster>"}'
    ```

Given that FirecREST uses a storage service based on [S3 as a staging area](../setup/arch/external_storage/), uploads are limited by the constraints of the S3 server. In particular, files larger than 5GB must be split into chunks, which complicates the upload.

To address this, we have created a set of examples in different programming and scripting languages, described below:

- `s3` Upload with Python3: this is the easiest way of using FirecREST. See the [FirecREST SDK section](#firecrest-sdk) below for more information and detailed examples.
- `s3` Upload with Bash: [Detailed example.](file_transfer_bash/README.md)
- `s3` Upload with .NET: [Detailed example.](file_transfer_dotnet/README.md)
!!! info "Need more examples?"
201
+
If you need examples for your particular S3 use case (ie, using a different language than the listed above), feel free to open an [issue on GitHub](https://github.com/eth-cscs/firecrest-v2/issues/new). We'd be happy to create one for you.
### Using `streamer` transfer method
#### Streamer download {#streamer-download}

To use the `streamer` transfer method, users must install the [`firecrest-streamer`](https://pypi.org/project/firecrest-streamer/) tool.
!!! example "Download a file using `streamer` transfer method"

    ```sh
    $ curl --request POST <firecrest_url>/filesystem/<system>/transfer/download \
           --header 'Authorization: Bearer <access_token>' \
           --header 'Content-Type: application/json' \
           --data '{"path": "<path_of_the_file_on_the_cluster>"}'
    ```
!!! info

    The selected file remains available for download for as long as the job is running in the scheduler. Additionally, users can check the `waitTimeout` and `inboundTransferLimit` parameters in the response of `GET /status/systems` to better plan the data transfer.
After getting the response, you can use the secret `coordinates` when executing the `streamer` command to complete the download to the local system.
!!! warning
239
+
Keep the secret `coordinates` secured: these are used to uniquely transfer data between a `streamer` client and a specific file in the remote filesystem. If you share the credentials with somebody else, they could move the data on your behalf.
!!! example "Using `firecrest-streamer` tool to download a file from a remote system"

Using the same method as for the [download](#streamer-download), you can `send` data to upload files from your local system to the cluster.
After receiving the secret `coordinates`, you can use the `streamer` to upload the file to the requested target: