Commit 125a6bd

ekpghjswoodward
authored and committed
avere-vfxt: fix mrsync etc
1 parent ca42834 commit 125a6bd

1 file changed: +12 -13 lines changed

articles/avere-vfxt/avere-vfxt-data-ingest.md

Lines changed: 12 additions & 13 deletions
@@ -20,12 +20,12 @@ The ``cp`` or ``copy`` commands that are commonly used to using to transfer data

 This article explains strategies for creating a multi-client, multi-threaded file copying system to move data to the Avere vFXT cluster. It explains file transfer concepts and decision points that can be used for efficient data copying using multiple clients and simple copy commands.

-It also explains some utilities that can help. The ``msrsync`` utility can be used to partially automate the process of dividing a dataset into buckets and using rsync commands. The ``parallelcp`` script is another utility that reads the source directory and issues copy commands automatically.
+It also explains some utilities that can help. The ``msrsync`` utility can be used to partially automate the process of dividing a dataset into buckets and using ``rsync`` commands. The ``parallelcp`` script is another utility that reads the source directory and issues copy commands automatically. Also, the ``rsync`` tool can be used in two phases to provide a quicker copy that still provides data consistency.

 Click the link to jump to a section:

 * [Manual copy example](#manual-copy-example) - A thorough explanation using copy commands
-* [Two-phase rsync example](#use-a-two-phase-rsync-process-to-populate-cloud-storage)
+* [Two-phase rsync example](#use-a-two-phase-rsync-process)
 * [Partially automated (msrsync) example](#use-the-msrsync-utility)
 * [Parallel copy example](#use-the-parallel-copy-script)

@@ -253,11 +253,11 @@ The above will give you *N* files, each with a copy command per line, that can b

 The goal is to run multiple threads of these scripts concurrently per client in parallel on multiple clients.

-## Use a two-phase rsync process to populate cloud storage
+## Use a two-phase rsync process

-The standard ``rsync`` utility does not work well for populating cloud storage through the Avere vFXT for Azure system because it uses a large number of file create and rename operations to ensure data integrity. However, you can safely use the ``--inplace`` option to skip the more careful copying procedure and follow that with a second run that checks file integrity.
+The standard ``rsync`` utility does not work well for populating cloud storage through the Avere vFXT for Azure system because it generates a large number of file create and rename operations to guarantee data integrity. However, you can safely use the ``--inplace`` option with ``rsync`` to skip the more careful copying procedure if you follow that with a second run that checks file integrity.

-A standard rsync copy operation creates a temporary file and fills it with data. If the data transfer completes successfully, the temporary file is renamed to the original filename. This method guarantees consistency even if the files are accessed during copy. But this method generates more write operations, which slows file movement through the cache.
+A standard ``rsync`` copy operation creates a temporary file and fills it with data. If the data transfer completes successfully, the temporary file is renamed to the original filename. This method guarantees consistency even if the files are accessed during copy. But this method generates more write operations, which slows file movement through the cache.

 The option ``--inplace`` writes the new file directly in its final location. Files are not guaranteed to be consistent during transfer, but that is not important if you are priming a storage system for use later.
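Reviewer note: the two-phase process described in this hunk could be sketched in shell roughly as follows. This is a minimal sketch under stated assumptions, not the article's own command. The function name `two_phase_rsync`, the `RSYNC` override variable, and the choice of the `-a` (archive) flag are illustrative; only ``--inplace`` and the run-twice pattern come from the text above.

```bash
#!/usr/bin/env bash
# Two-phase copy sketch: a fast in-place pass, then a standard pass as a check.
# RSYNC is a hypothetical override (for example RSYNC=echo) to preview commands.
two_phase_rsync() {
  local src="$1" dst="$2"
  local rsync_bin="${RSYNC:-rsync}"
  # Phase 1: write each file directly at its final location
  # (skips the temporary-file-plus-rename procedure).
  "$rsync_bin" -a --inplace "$src" "$dst" || return 1
  # Phase 2: standard run without --inplace; rsync re-transfers
  # any file it finds differing between source and destination.
  "$rsync_bin" -a "$src" "$dst"
}
```

Calling it as `RSYNC=echo two_phase_rsync /path/to/source/ /path/to/destination` prints both command lines without copying anything; the paths are placeholders.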

@@ -279,14 +279,13 @@ The ``msrsync`` tool also can be used to move data to a backend core filer for t

 Preliminary testing using a four-core VM showed best efficiency when using 64 processes. Use the ``msrsync`` option ``-p`` to set the number of processes to 64.

-You also can use the ``--inplace`` argument with msrsync commands. If you use this option, consider running a second command (as with [rsync](#use-a-two-phase-rsync-process-to-populate-cloud-storage
-), described above) to ensure data integrity.
+You also can use the ``--inplace`` argument with ``msrsync`` commands. If you use this option, consider running a second command (as with [rsync](#use-a-two-phase-rsync-process), described above) to ensure data integrity.

 Note that ``msrsync`` can only write to and from local volumes. The source and destination must be accessible as local mounts in the cluster’s virtual network.

-To use msrsync to populate an Azure cloud volume with an Avere cluster, follow these instructions:
+To use ``msrsync`` to populate an Azure cloud volume with an Avere cluster, follow these instructions:

-1. Install msrsync and its prerequisites (rsync and Python 2.6 or later)
+1. Install ``msrsync`` and its prerequisites (rsync and Python 2.6 or later)
 1. Determine the total number of files and directories to be copied.

 For example, use the Avere utility ``prime.py`` with arguments ```prime.py --directory /path/to/some/directory``` (available by downloading url <https://github.com/Azure/Avere/blob/master/src/clientapps/dataingestor/prime.py>).
@@ -301,21 +300,21 @@ To use msrsync to populate an Azure cloud volume with an Avere cluster, follow t

 1. Divide the number of items by 64 to determine the number of items per process. Use this number with the ``-f`` option to set the size of the buckets when you run the command.

-1. Issue the msrsync command to copy files:
+1. Issue the ``msrsync`` command to copy files:

 ```bash
-msrsync -P --stats -p64 -f <ITEMS_DIV_64> --rsync "-ahv" <SOURCE_PATH> <DESTINATION_PATH>
+msrsync -P --stats -p 64 -f <ITEMS_DIV_64> --rsync "-ahv" <SOURCE_PATH> <DESTINATION_PATH>
 ```

 If using ``--inplace``, add a second execution without the option to check that the data is correctly copied:

 ```bash
-msrsync -P --stats -p64 -f <ITEMS_DIV_64> --rsync "-ahv --inplace" <SOURCE_PATH> <DESTINATION_PATH> && msrsync -P --stats -p64 -f <ITEMS_DIV_64> --rsync "-ahv" <SOURCE_PATH> <DESTINATION_PATH>
+msrsync -P --stats -p 64 -f <ITEMS_DIV_64> --rsync "-ahv --inplace" <SOURCE_PATH> <DESTINATION_PATH> && msrsync -P --stats -p 64 -f <ITEMS_DIV_64> --rsync "-ahv" <SOURCE_PATH> <DESTINATION_PATH>
 ```

 For example, this command is designed to move 11,000 files in 64 processes from /test/source-repository to /mnt/vfxt/repository:

-``mrsync -P --stats -p64 -f 170 --rsync "-ahv --inplace" /test/source-repository/ /mnt/vfxt/repository && mrsync -P --stats -p64 -f 170 --rsync "-ahv --inplace" /test/source-repository/ /mnt/vfxt/repository``
+``msrsync -P --stats -p 64 -f 170 --rsync "-ahv --inplace" /test/source-repository/ /mnt/vfxt/repository && msrsync -P --stats -p 64 -f 170 --rsync "-ahv --inplace" /test/source-repository/ /mnt/vfxt/repository``

 ## Use the parallel copy script
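Reviewer note on the ``msrsync`` hunk above: the ``-f`` value is the item count divided by the process count, so 11,000 items over 64 processes gives about 171 (the committed example rounds this to 170). The committed example also keeps ``--inplace`` in both commands, although the preceding text says the second execution is run without the option. The sketch below follows the text rather than the example. The helper names `bucket_size` and `msrsync_two_pass_cmd` are hypothetical, and the script only prints the command line; it does not run ``msrsync``.

```bash
#!/usr/bin/env bash
# Items per bucket for msrsync -f: total item count divided by process count.
# Shell integer division rounds down.
bucket_size() {
  local items="$1" procs="${2:-64}"
  echo $(( items / procs ))
}

# Print (do not run) a two-pass msrsync command line for review.
msrsync_two_pass_cmd() {
  local items="$1" src="$2" dst="$3" procs="${4:-64}"
  local f base
  f="$(bucket_size "$items" "$procs")"
  base="msrsync -P --stats -p $procs -f $f"
  # First pass uses --inplace; second pass omits it, as the article text says.
  echo "$base --rsync \"-ahv --inplace\" $src $dst && $base --rsync \"-ahv\" $src $dst"
}

msrsync_two_pass_cmd 11000 /test/source-repository/ /mnt/vfxt/repository
```

Printing the command instead of executing it leaves room to check the bucket size and paths before starting a long copy.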
