---
title: Introduction to Microsoft Spark utilities
description: "Tutorial: MSSparkutils in Azure Synapse Analytics notebooks"
author: JeneZhang
ms.service: synapse-analytics
ms.topic: reference
ms.subservice: spark
ms.date: 09/10/2020
ms.author: jingzh
zone_pivot_groups: programming-languages-spark-all-minus-sql
ms.custom: subject-rbac-steps, devx-track-python
---

```python
mssparkutils.fs.cp('source file or directory', 'destination file or directory', True) # Set the third parameter to True to copy files and directories recursively
```
::: zone-end

### Performant copy file

This method provides a faster way of copying or moving files, especially large volumes of data.

```python
mssparkutils.fs.fastcp('source file or directory', 'destination file or directory', True) # Set the third parameter to True to copy all files and directories recursively
```

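For example, here's a hedged sketch of a recursive directory copy; the storage account, container, and folder names are placeholders for your own:

```python
# Placeholder ABFSS paths; replace with your own storage account and container.
source = 'abfss://mycontainer@mystorageaccount.dfs.core.windows.net/data/raw'
destination = 'abfss://mycontainer@mystorageaccount.dfs.core.windows.net/data/backup'

# Recursively copy every file and subdirectory under the source path.
mssparkutils.fs.fastcp(source, destination, True)
```
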
> [!NOTE]
> This method is only supported in Spark 3.3 and Spark 3.4.

### Preview file content

Returns up to the first 'maxBytes' bytes of the given file as a String encoded in UTF-8.

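In Python this corresponds to `mssparkutils.fs.head()`. A minimal sketch, assuming a placeholder file path:

```python
# Placeholder path; replace with your own storage account and container.
file_path = 'abfss://mycontainer@mystorageaccount.dfs.core.windows.net/data/sample.csv'

# Preview up to the first 1024 bytes of the file as UTF-8 text.
print(mssparkutils.fs.head(file_path, 1024))
```
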
After the run finishes, you'll see a snapshot link named '**View notebook run: *Notebook Name***' in the cell output:

![image](https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/media/microsoft-spark-utilities/spark-utilities-run-notebook.png)

### Reference run multiple notebooks in parallel

The method `mssparkutils.notebook.runMultiple()` allows you to run multiple notebooks in parallel or with a predefined topological structure. The API uses a multithreaded implementation within a Spark session, which means the compute resources are shared by the reference notebook runs.

With `mssparkutils.notebook.runMultiple()`, you can:

- Execute multiple notebooks simultaneously, without waiting for each one to finish.

- Specify the dependencies and order of execution for your notebooks, using a simple JSON format.

- Optimize the use of Spark compute resources and reduce the cost of your Synapse projects.

- View the snapshots of each notebook run record in the output, and debug or monitor your notebook tasks conveniently.

- Get the exit value of each executed activity and use it in downstream tasks.

You can also run `mssparkutils.notebook.help("runMultiple")` to find examples and detailed usage.

Here's a simple example of running a list of notebooks in parallel using this method:

```python
mssparkutils.notebook.runMultiple(["NotebookSimple", "NotebookSimple2"])
```

The execution result from the root notebook is as follows:

:::image type="content" source="media/microsoft-spark-utilities/spark-utilities-run-notebook-list.png" alt-text="Screenshot of referencing a list of notebooks." lightbox="media/microsoft-spark-utilities/spark-utilities-run-notebook-list.png":::

The following is an example of running notebooks with a topological structure using `mssparkutils.notebook.runMultiple()`. Use this method to easily orchestrate notebooks through a code experience.

```python
# Run multiple notebooks with parameters
DAG = {
    "activities": [
        {
            "name": "NotebookSimple", # activity name, must be unique
            "path": "NotebookSimple", # notebook path
            "timeoutPerCellInSeconds": 90, # max timeout for each cell, defaults to 90 seconds
            "args": {"p1": "changed value", "p2": 100}, # notebook parameters
        },
        {
            "name": "NotebookSimple2",
            "path": "NotebookSimple2",
            "timeoutPerCellInSeconds": 120,
            "args": {"p1": "changed value 2", "p2": 200}
        },
        {
            "name": "NotebookSimple2.2",
            "path": "NotebookSimple2",
            "timeoutPerCellInSeconds": 120,
            "args": {"p1": "changed value 3", "p2": 300},
            "retry": 1, # number of retries on failure
            "retryIntervalInSeconds": 10, # wait time between retries
            "dependencies": ["NotebookSimple"] # list of activity names that this activity depends on
        }
    ]
}
mssparkutils.notebook.runMultiple(DAG)
```

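To further illustrate the `dependencies` field, here's a hedged sketch of a diamond-shaped graph. The notebook names are hypothetical, and the optional fields (`timeoutPerCellInSeconds`, `args`, `retry`) are assumed to fall back to the defaults shown above:

```python
# Hypothetical notebooks: "CleanA" and "CleanB" both wait for "Ingest" and
# then run in parallel; "Merge" starts only after both of them have finished.
DAG = {
    "activities": [
        {"name": "Ingest", "path": "Ingest"},
        {"name": "CleanA", "path": "CleanA", "dependencies": ["Ingest"]},
        {"name": "CleanB", "path": "CleanB", "dependencies": ["Ingest"]},
        {"name": "Merge", "path": "Merge", "dependencies": ["CleanA", "CleanB"]}
    ]
}
mssparkutils.notebook.runMultiple(DAG)
```

Because neither cleaning notebook depends on the other, they can share the session's compute in parallel while the overall run still respects the topological order.
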
> [!NOTE]
>
> - This method is only supported in Spark 3.3 and Spark 3.4.
> - The degree of parallelism of the multiple notebook runs is restricted to the total available compute resources of the Spark session.

### Exit a notebook

Exits a notebook with a value. You can run nested function calls in a notebook interactively or in a pipeline.

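For example, the basic pattern looks like the following; the notebook name `Sample1` mentioned afterward is hypothetical:

```python
# In the child notebook: stop execution and hand a value back to the caller.
mssparkutils.notebook.exit("value string")
```

When the child notebook is invoked with `mssparkutils.notebook.run("Sample1", 90)`, the string `"value string"` becomes the return value of `run`, so the caller can use it in downstream logic.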