
Commit 6959953

Merge pull request #122706 from RobinLin666/dev/update-mssparkutils-doc

Update mssparkutils doc

2 parents d842946 + 29827f8

1 file changed: +36 −37 lines changed
articles/synapse-analytics/spark/synapse-file-mount-api.md

Lines changed: 36 additions & 37 deletions
@@ -38,7 +38,7 @@ The example assumes that you have one Data Lake Storage Gen2 account named `stor
 
 ![Screenshot of a Data Lake Storage Gen2 storage account.](./media/synapse-file-mount-api/gen2-storage-account.png)
 
-To mount the container called `mycontainer`, `mssparkutils` first needs to check whether you have the permission to access the container. Currently, Azure Synapse Analytics supports three authentication methods for the trigger mount operation: `LinkedService`, `accountKey`, and `sastoken`.
+To mount the container called `mycontainer`, `mssparkutils` first needs to check whether you have the permission to access the container. Currently, Azure Synapse Analytics supports three authentication methods for the trigger mount operation: `linkedService`, `accountKey`, and `sastoken`.
 
 ### Mount by using a linked service (recommended)
 
@@ -72,7 +72,7 @@ After you create linked service successfully, you can easily mount the container
 mssparkutils.fs.mount(
     "abfss://mycontainer@<accountname>.dfs.core.windows.net",
     "/test",
-    {"LinkedService":"mygen2account"}
+    {"linkedService": "mygen2account"}
 )
 ```
 
@@ -81,22 +81,22 @@ mssparkutils.fs.mount(
 > ```python
 > from notebookutils import mssparkutils
 > ```
-> Mount parameters:
-> - fileCacheTimeout: Blobs will be cached in the local temp folder for 120 seconds by default. During this time, blobfuse won't check whether the file is up to date or not. The parameter could be set to change the default timeout time. When multiple clients modify files at the same time, in order to avoid inconsistencies between local and remote files, we recommend shortening the cache time, or even changing it to 0, and always getting the latest files from the server.
-> - timeout: The mount operation timeout is 120 seconds by default. The parameter could be set to change the default timeout time. When there are too many executors or when the mount times out, we recommend increasing the value.
-> - scope: The scope parameter is used to specify the scope of the mount. The default value is "job." If the scope is set to "job," the mount is visible only to the current cluster. If the scope is set to "workspace," the mount is visible to all notebooks in the current workspace, and the mount point is automatically created if it doesn't exist. Add the same parameters to the unmount API to unmount the mount point. The workspace level mount is only supported for linked service authentication.
->
-> You can use these parameters like this:
-> ```python
-> mssparkutils.fs.mount(
->     "abfss://mycontainer@<accountname>.dfs.core.windows.net",
->     "/test",
->     {"linkedService":"mygen2account", "fileCacheTimeout": 120, "timeout": 120}
-> )
-> ```
->
 > We don't recommend that you mount a root folder, no matter which authentication method you use.
 
+Mount parameters:
+- fileCacheTimeout: Blobs will be cached in the local temp folder for 120 seconds by default. During this time, blobfuse won't check whether the file is up to date or not. The parameter could be set to change the default timeout time. When multiple clients modify files at the same time, in order to avoid inconsistencies between local and remote files, we recommend shortening the cache time, or even changing it to 0, and always getting the latest files from the server.
+- timeout: The mount operation timeout is 120 seconds by default. The parameter could be set to change the default timeout time. When there are too many executors or when the mount times out, we recommend increasing the value.
+- scope: The scope parameter is used to specify the scope of the mount. The default value is "job." If the scope is set to "job," the mount is visible only to the current cluster. If the scope is set to "workspace," the mount is visible to all notebooks in the current workspace, and the mount point is automatically created if it doesn't exist. Add the same parameters to the unmount API to unmount the mount point. The workspace level mount is only supported for linked service authentication.
+
+You can use these parameters like this:
+```python
+mssparkutils.fs.mount(
+    "abfss://mycontainer@<accountname>.dfs.core.windows.net",
+    "/test",
+    {"linkedService":"mygen2account", "fileCacheTimeout": 120, "timeout": 120}
+)
+```
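As an illustration of how these three options combine, here is a small pure-Python sketch (a hypothetical helper, not part of `mssparkutils`) that merges user-supplied options with the documented defaults of 120 seconds for `fileCacheTimeout` and `timeout` and `"job"` for `scope`:

```python
# Hypothetical helper: merge user-supplied mount options with the
# documented defaults (fileCacheTimeout=120, timeout=120, scope="job").
def resolve_mount_options(options=None):
    defaults = {"fileCacheTimeout": 120, "timeout": 120, "scope": "job"}
    resolved = {**defaults, **(options or {})}
    if resolved["scope"] not in ("job", "workspace"):
        raise ValueError("scope must be 'job' or 'workspace'")
    return resolved

print(resolve_mount_options({"linkedService": "mygen2account"}))
```

Inside Synapse the merging happens in the service itself; the sketch only makes the defaults and the two valid `scope` values explicit.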
+
 
 ### Mount via shared access signature token or account key
 
@@ -166,47 +166,44 @@ f.close()
 
 The main purpose of the mount operation is to let customers access the data stored in a remote storage account by using a local file system API. You can also access the data by using the `mssparkutils fs` API with a mounted path as a parameter. The path format used here is a little different.
 
-Assume that you mounted the Data Lake Storage Gen2 container `mycontainer` to `/test` by using the mount API. When you access the data by using a local file system API, the path format is like this:
-
-`/synfs/{jobId}/test/{filename}`
+Assuming you've mounted the Data Lake Storage Gen2 container mycontainer to /test using the mount API. When accessing the data through a local file system API:
+- For Spark versions less than or equal to 3.3, the path format is `/synfs/{jobId}/test/{filename}`.
+- For Spark versions greater than or equal to 3.4, the path format is `/synfs/notebook/{jobId}/test/{filename}`.
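To make the version split concrete, here is a tiny illustrative helper (not a `mssparkutils` API) that builds the documented local path for a given Spark version; on a real cluster the job ID would come from `mssparkutils.env.getJobId()`:

```python
# Illustrative only: build the documented local mount path per Spark version.
def local_mount_path(job_id: str, mount_point: str, spark_version: str) -> str:
    major, minor = (int(p) for p in spark_version.split(".")[:2])
    if (major, minor) >= (3, 4):
        return f"/synfs/notebook/{job_id}{mount_point}"  # Spark >= 3.4
    return f"/synfs/{job_id}{mount_point}"               # Spark <= 3.3

print(local_mount_path("49", "/test", "3.3"))  # /synfs/49/test
print(local_mount_path("49", "/test", "3.4"))  # /synfs/notebook/49/test
```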
 
 We recommend using a `mssparkutils.fs.getMountPath()` to get the accurate path:
 
 ```python
-path = mssparkutils.fs.getMountPath("/test") # equals to /synfs/{jobId}/test
+path = mssparkutils.fs.getMountPath("/test")
 ```
 
-When you want to access the data by using the `mssparkutils fs` API, the path format is like this:
-
-`synfs:/{jobId}/test/{filename}`
+> [!NOTE]
+> When you mount the storage with `workspace` scope, the mount point is created under the `/synfs/workspace` folder. And you need to use `mssparkutils.fs.getMountPath("/test", "workspace")` to get the accurate path.
 
-You can see that `synfs` is used as the schema in this case, instead of a part of the mounted path.
+When you want to access the data by using the `mssparkutils fs` API, the path format is like this: `synfs:/notebook/{jobId}/test/{filename}`. You can see that `synfs` is used as the schema in this case, instead of a part of the mounted path. Of course, you can also use the local file system schema to access the data. For example, `file:/synfs/notebook/{jobId}/test/{filename}`.
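The two addressing styles can be sketched side by side (illustrative helpers, assuming the Spark 3.4 `notebook` layout described above; `49` stands in for a job ID):

```python
# Illustrative: the same mounted file addressed via the synfs scheme
# and via the local file system scheme (Spark >= 3.4 layout).
def synfs_uri(job_id: str, mount_point: str, filename: str) -> str:
    return f"synfs:/notebook/{job_id}{mount_point}/{filename}"

def file_uri(job_id: str, mount_point: str, filename: str) -> str:
    return f"file:/synfs/notebook/{job_id}{mount_point}/{filename}"

print(synfs_uri("49", "/test", "myFile.txt"))  # synfs:/notebook/49/test/myFile.txt
print(file_uri("49", "/test", "myFile.txt"))   # file:/synfs/notebook/49/test/myFile.txt
```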
 
-The following three examples show how to access a file with a mount point path by using `mssparkutils fs`. In the examples, `49` is a Spark job ID that we got from calling `mssparkutils.env.getJobId()`.
+The following three examples show how to access a file with a mount point path by using `mssparkutils fs`.
 
 + List directories:
 
 ```python
-mssparkutils.fs.ls("synfs:/49/test")
+mssparkutils.fs.ls(f'file:{mssparkutils.fs.getMountPath("/test")}')
 ```
 
 + Read file content:
 
 ```python
-mssparkutils.fs.head("synfs:/49/test/myFile.txt")
+mssparkutils.fs.head(f'file:{mssparkutils.fs.getMountPath("/test")}/myFile.csv')
 ```
 
 + Create a directory:
 
 ```python
-mssparkutils.fs.mkdirs("synfs:/49/test/newdir")
+mssparkutils.fs.mkdirs(f'file:{mssparkutils.fs.getMountPath("/test")}/myDir')
 ```
 
 ## Access files under the mount point by using the Spark read API
 
-You can provide a parameter to access the data through the Spark read API. The path format here is the same when you use the `mssparkutils fs` API:
-
-`synfs:/{jobId}/test/{filename}`
+You can provide a parameter to access the data through the Spark read API. The path format here is the same when you use the `mssparkutils fs` API.
 
 <a id="read-file-from-a-mounted-gen2-storage-account"></a>
 ### Read a file from a mounted Data Lake Storage Gen2 storage account
@@ -216,7 +213,7 @@ The following example assumes that a Data Lake Storage Gen2 storage account was
 ```python
 %%pyspark
 
-df = spark.read.load("synfs:/49/test/myFile.csv", format='csv')
+df = spark.read.load(f'file:{mssparkutils.fs.getMountPath("/test")}/myFile.csv', format='csv')
 df.show()
 ```
 
@@ -242,15 +239,15 @@ If you mounted a Blob Storage account and want to access it by using `mssparkuti
 mssparkutils.fs.mount(
     "wasbs://mycontainer@<blobStorageAccountName>.blob.core.windows.net",
     "/test",
-    Map("LinkedService" -> "myblobstorageaccount")
+    Map("linkedService" -> "myblobstorageaccount")
 )
 ```
 
 3. Mount the Blob Storage container, and then read the file by using a mount path through the local file API:
 
 ```python
 # mount the Blob Storage container, and then read the file by using a mount path
-with open("/synfs/64/test/myFile.txt") as f:
+with open(mssparkutils.fs.getMountPath("/test") + "/myFile.txt") as f:
     print(f.read())
 ```
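Outside Synapse, this local-file pattern can be simulated with a temporary directory standing in for the mount path (`mssparkutils.fs.getMountPath` exists only inside Synapse, so this sketch substitutes `tempfile`):

```python
import os
import tempfile

# Simulate the mounted container with a temp dir standing in for the
# path that mssparkutils.fs.getMountPath("/test") would return.
with tempfile.TemporaryDirectory() as mount_path:
    target = os.path.join(mount_path, "myFile.txt")
    with open(target, "w") as f:
        f.write("hello from the mounted container")
    # Read it back through the plain local file API, as in the example above.
    with open(target) as f:
        print(f.read())  # hello from the mounted container
```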

@@ -259,7 +256,7 @@ If you mounted a Blob Storage account and want to access it by using `mssparkuti
 ```python
 %%spark
 // mount blob storage container and then read file using mount path
-val df = spark.read.text("synfs:/49/test/myFile.txt")
+val df = spark.read.text(f'file:{mssparkutils.fs.getMountPath("/test")}/myFile.txt')
 df.show()
 ```
 
@@ -273,12 +270,14 @@ mssparkutils.fs.unmount("/test")
 
 ## Known limitations
 
-+ The `mssparkutils fs help` function hasn't added the description about the mount/unmount part yet.
-
 + The unmount mechanism is not automatic. When the application run finishes, to unmount the mount point to release the disk space, you need to explicitly call an unmount API in your code. Otherwise, the mount point will still exist in the node after the application run finishes.
 
 + Mounting a Data Lake Storage Gen1 storage account is not supported for now.
 
+## Known issues:
+
++ In Spark 3.4, the mount points might be unavailable when there are multiple active sessions running in parallel in the same cluster. You can mount with `workspace` scope to avoid this issue.
+
 ## Next steps
 
 - [Get started with Azure Synapse Analytics](../get-started.md)
