
Commit 44f06de

Merge pull request #119147 from SharonZhang1/update0124

add method

2 parents: cd5099e + 5f75509

2 files changed: +82 −2 lines changed
articles/synapse-analytics/spark/microsoft-spark-utilities.md

Lines changed: 82 additions & 2 deletions
@@ -1,12 +1,12 @@
 ---
 title: Introduction to Microsoft Spark utilities
 description: "Tutorial: MSSparkutils in Azure Synapse Analytics notebooks"
-author: ruixinxu
+author: JeneZhang
 ms.service: synapse-analytics
 ms.topic: reference
 ms.subservice: spark
 ms.date: 09/10/2020
-ms.author: ruxu
+ms.author: jingzh
 zone_pivot_groups: programming-languages-spark-all-minus-sql
 ms.custom: subject-rbac-steps, devx-track-python
 ---
@@ -390,6 +390,17 @@ mssparkutils.fs.cp('source file or directory', 'destination file or directory',

::: zone-end

### Performant copy file

This method provides a faster way of copying or moving files, especially large volumes of data.

```python
mssparkutils.fs.fastcp('source file or directory', 'destination file or directory', True) # Set the third parameter to True to copy all files and directories recursively
```

> [!NOTE]
> This method is supported only in Spark 3.3 and Spark 3.4.
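As a local illustration of what the recursive flag controls (plain Python with `shutil` on an ordinary filesystem, not the Synapse API; the function name `copy_path` is hypothetical):

```python
import os
import shutil

def copy_path(src: str, dst: str, recurse: bool = False) -> None:
    """Hypothetical local sketch: copy a single file, or a whole directory
    tree when recurse=True (analogous to the third parameter above)."""
    if os.path.isdir(src):
        if not recurse:
            raise ValueError("copying a directory requires recurse=True")
        shutil.copytree(src, dst)  # copies the directory and everything under it
    else:
        shutil.copy2(src, dst)     # single file: metadata-preserving copy
```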
### Preview file content

Returns up to the first 'maxBytes' bytes of the given file as a String encoded in UTF-8.
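The byte-truncation behavior described here can be sketched in plain Python (an illustration of the semantics on a local file, not the mssparkutils implementation; `errors="replace"` is a defensive choice for a multibyte character split at the boundary, and the real API may behave differently):

```python
def head_bytes(path: str, max_bytes: int = 102400) -> str:
    """Sketch of the described semantics: read up to the first
    max_bytes bytes of a file and decode them as UTF-8."""
    with open(path, "rb") as f:
        return f.read(max_bytes).decode("utf-8", errors="replace")
```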
@@ -605,6 +616,75 @@ After the run finished, you will see a snapshot link named '**View notebook run:

![Screenshot of a snap link python](./media/microsoft-spark-utilities/spark-utilities-run-notebook-snap-link-sample-python.png)
### Reference run multiple notebooks in parallel

The method `mssparkutils.notebook.runMultiple()` allows you to run multiple notebooks in parallel or with a predefined topological structure. The API uses a multithreaded implementation within a Spark session, which means the compute resources are shared by the reference notebook runs.

With `mssparkutils.notebook.runMultiple()`, you can:

- Execute multiple notebooks simultaneously, without waiting for each one to finish.
- Specify the dependencies and order of execution for your notebooks, using a simple JSON format.
- Optimize the use of Spark compute resources and reduce the cost of your Synapse projects.
- View the snapshots of each notebook run record in the output, and debug or monitor your notebook tasks conveniently.
- Get the exit value of each activity and use it in downstream tasks.

You can also run `mssparkutils.notebook.help("runMultiple")` to find the example and detailed usage.

Here's a simple example of running a list of notebooks in parallel using this method:

```python
mssparkutils.notebook.runMultiple(["NotebookSimple", "NotebookSimple2"])
```

The execution result from the root notebook is as follows:

:::image type="content" source="media\microsoft-spark-utilities\spark-utilities-run-notebook-list.png" alt-text="Screenshot of referencing a list of notebooks." lightbox="media\microsoft-spark-utilities\spark-utilities-run-notebook-list.png":::

The following is an example of running notebooks with a topological structure using `mssparkutils.notebook.runMultiple()`. Use this method to easily orchestrate notebooks through a code experience.

```python
# Run multiple notebooks with parameters
DAG = {
    "activities": [
        {
            "name": "NotebookSimple", # activity name, must be unique
            "path": "NotebookSimple", # notebook path
            "timeoutPerCellInSeconds": 90, # max timeout for each cell; defaults to 90 seconds
            "args": {"p1": "changed value", "p2": 100}, # notebook parameters
        },
        {
            "name": "NotebookSimple2",
            "path": "NotebookSimple2",
            "timeoutPerCellInSeconds": 120,
            "args": {"p1": "changed value 2", "p2": 200}
        },
        {
            "name": "NotebookSimple2.2",
            "path": "NotebookSimple2",
            "timeoutPerCellInSeconds": 120,
            "args": {"p1": "changed value 3", "p2": 300},
            "retry": 1,
            "retryIntervalInSeconds": 10,
            "dependencies": ["NotebookSimple"] # list of activity names that this activity depends on
        }
    ]
}
mssparkutils.notebook.runMultiple(DAG)
```

> [!NOTE]
>
> - This method is supported only in Spark 3.3 and Spark 3.4.
> - The degree of parallelism of the multiple notebook runs is restricted to the total available compute resources of the Spark session.
### Exit a notebook

Exits a notebook with a value. You can run nesting function calls in a notebook interactively or in a pipeline.
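As a plain-Python sketch of these exit semantics (an analogy, not the mssparkutils implementation; all names here are hypothetical): a referenced notebook stops at the exit call, and its value is surfaced to the caller.

```python
class _NotebookExit(Exception):
    """Carries the exit value of a simulated notebook run."""
    def __init__(self, value):
        super().__init__(value)
        self.value = value

def exit_notebook(value):
    # Simulates stopping notebook execution and returning a value.
    raise _NotebookExit(value)

def run_notebook(body):
    # Simulates a reference run: execute the body, capture any exit value.
    try:
        body()
    except _NotebookExit as e:
        return e.value
    return None

def sample_notebook():
    exit_notebook("my return value")
    print("never reached")  # statements after the exit call do not run

result = run_notebook(sample_notebook)  # result == "my return value"
```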