
Commit 6959953

Merge pull request #122706 from RobinLin666/dev/update-mssparkutils-doc

Update mssparkutils doc

2 parents d842946 + 29827f8

1 file changed: +36 −37 lines changed
articles/synapse-analytics/spark/synapse-file-mount-api.md

Lines changed: 36 additions & 37 deletions
@@ -38,7 +38,7 @@ The example assumes that you have one Data Lake Storage Gen2 account named `stor
 
 ![Screenshot of a Data Lake Storage Gen2 storage account.](./media/synapse-file-mount-api/gen2-storage-account.png)
 
-To mount the container called `mycontainer`, `mssparkutils` first needs to check whether you have the permission to access the container. Currently, Azure Synapse Analytics supports three authentication methods for the trigger mount operation: `LinkedService`, `accountKey`, and `sastoken`.
+To mount the container called `mycontainer`, `mssparkutils` first needs to check whether you have the permission to access the container. Currently, Azure Synapse Analytics supports three authentication methods for the trigger mount operation: `linkedService`, `accountKey`, and `sastoken`.
 
 ### Mount by using a linked service (recommended)
 
@@ -72,7 +72,7 @@ After you create linked service successfully, you can easily mount the container
 mssparkutils.fs.mount(
     "abfss://mycontainer@<accountname>.dfs.core.windows.net",
     "/test",
-    {"LinkedService":"mygen2account"}
+    {"linkedService": "mygen2account"}
 )
 ```
 
@@ -81,22 +81,22 @@ mssparkutils.fs.mount(
 > ```python
 > from notebookutils import mssparkutils
 > ```
-> Mount parameters:
-> - fileCacheTimeout: Blobs will be cached in the local temp folder for 120 seconds by default. During this time, blobfuse won't check whether the file is up to date or not. The parameter could be set to change the default timeout time. When multiple clients modify files at the same time, in order to avoid inconsistencies between local and remote files, we recommend shortening the cache time, or even changing it to 0, and always getting the latest files from the server.
-> - timeout: The mount operation timeout is 120 seconds by default. The parameter could be set to change the default timeout time. When there are too many executors or when the mount times out, we recommend increasing the value.
-> - scope: The scope parameter is used to specify the scope of the mount. The default value is "job." If the scope is set to "job," the mount is visible only to the current cluster. If the scope is set to "workspace," the mount is visible to all notebooks in the current workspace, and the mount point is automatically created if it doesn't exist. Add the same parameters to the unmount API to unmount the mount point. The workspace level mount is only supported for linked service authentication.
->
-> You can use these parameters like this:
-> ```python
-> mssparkutils.fs.mount(
->     "abfss://mycontainer@<accountname>.dfs.core.windows.net",
->     "/test",
->     {"linkedService":"mygen2account", "fileCacheTimeout": 120, "timeout": 120}
-> )
-> ```
->
 > We don't recommend that you mount a root folder, no matter which authentication method you use.
 
+Mount parameters:
+- fileCacheTimeout: Blobs will be cached in the local temp folder for 120 seconds by default. During this time, blobfuse won't check whether the file is up to date or not. The parameter could be set to change the default timeout time. When multiple clients modify files at the same time, in order to avoid inconsistencies between local and remote files, we recommend shortening the cache time, or even changing it to 0, and always getting the latest files from the server.
+- timeout: The mount operation timeout is 120 seconds by default. The parameter could be set to change the default timeout time. When there are too many executors or when the mount times out, we recommend increasing the value.
+- scope: The scope parameter is used to specify the scope of the mount. The default value is "job." If the scope is set to "job," the mount is visible only to the current cluster. If the scope is set to "workspace," the mount is visible to all notebooks in the current workspace, and the mount point is automatically created if it doesn't exist. Add the same parameters to the unmount API to unmount the mount point. The workspace level mount is only supported for linked service authentication.
+
+You can use these parameters like this:
+```python
+mssparkutils.fs.mount(
+    "abfss://mycontainer@<accountname>.dfs.core.windows.net",
+    "/test",
+    {"linkedService":"mygen2account", "fileCacheTimeout": 120, "timeout": 120}
+)
+```
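As an illustration of how these three options combine, here is a small pure-Python sketch (a hypothetical helper, not part of `mssparkutils`) that merges user-supplied options with the documented defaults of 120 seconds for `fileCacheTimeout` and `timeout` and `"job"` for `scope`:

```python
# Hypothetical helper: merge user-supplied mount options with the
# documented defaults (fileCacheTimeout=120, timeout=120, scope="job").
def resolve_mount_options(options=None):
    defaults = {"fileCacheTimeout": 120, "timeout": 120, "scope": "job"}
    resolved = {**defaults, **(options or {})}
    if resolved["scope"] not in ("job", "workspace"):
        raise ValueError("scope must be 'job' or 'workspace'")
    return resolved

print(resolve_mount_options({"linkedService": "mygen2account"}))
```

Inside Synapse the merging happens in the service itself; the sketch only makes the defaults and the two valid `scope` values explicit.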
+
 
 ### Mount via shared access signature token or account key
 
@@ -166,47 +166,44 @@ f.close()
 
 The main purpose of the mount operation is to let customers access the data stored in a remote storage account by using a local file system API. You can also access the data by using the `mssparkutils fs` API with a mounted path as a parameter. The path format used here is a little different.
 
-Assume that you mounted the Data Lake Storage Gen2 container `mycontainer` to `/test` by using the mount API. When you access the data by using a local file system API, the path format is like this:
-
-`/synfs/{jobId}/test/{filename}`
+Assuming you've mounted the Data Lake Storage Gen2 container mycontainer to /test using the mount API. When accessing the data through a local file system API:
+- For Spark versions less than or equal to 3.3, the path format is `/synfs/{jobId}/test/{filename}`.
+- For Spark versions greater than or equal to 3.4, the path format is `/synfs/notebook/{jobId}/test/{filename}`.
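To make the version split concrete, here is a tiny illustrative helper (not a `mssparkutils` API) that builds the documented local path for a given Spark version; on a real cluster the job ID would come from `mssparkutils.env.getJobId()`:

```python
# Illustrative only: build the documented local mount path per Spark version.
def local_mount_path(job_id: str, mount_point: str, spark_version: str) -> str:
    major, minor = (int(p) for p in spark_version.split(".")[:2])
    if (major, minor) >= (3, 4):
        return f"/synfs/notebook/{job_id}{mount_point}"  # Spark >= 3.4
    return f"/synfs/{job_id}{mount_point}"               # Spark <= 3.3

print(local_mount_path("49", "/test", "3.3"))  # /synfs/49/test
print(local_mount_path("49", "/test", "3.4"))  # /synfs/notebook/49/test
```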
 
 We recommend using a `mssparkutils.fs.getMountPath()` to get the accurate path:
 
 ```python
-path = mssparkutils.fs.getMountPath("/test") # equals to /synfs/{jobId}/test
+path = mssparkutils.fs.getMountPath("/test")
 ```
 
-When you want to access the data by using the `mssparkutils fs` API, the path format is like this:
-
-`synfs:/{jobId}/test/{filename}`
+> [!NOTE]
+> When you mount the storage with `workspace` scope, the mount point is created under the `/synfs/workspace` folder. And you need to use `mssparkutils.fs.getMountPath("/test", "workspace")` to get the accurate path.
 
-You can see that `synfs` is used as the schema in this case, instead of a part of the mounted path.
+When you want to access the data by using the `mssparkutils fs` API, the path format is like this: `synfs:/notebook/{jobId}/test/{filename}`. You can see that `synfs` is used as the schema in this case, instead of a part of the mounted path. Of course, you can also use the local file system schema to access the data. For example, `file:/synfs/notebook/{jobId}/test/{filename}`.
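The two addressing styles can be sketched side by side (illustrative helpers, assuming the Spark 3.4 `notebook` layout described above; `49` stands in for a job ID):

```python
# Illustrative: the same mounted file addressed via the synfs scheme
# and via the local file system scheme (Spark >= 3.4 layout).
def synfs_uri(job_id: str, mount_point: str, filename: str) -> str:
    return f"synfs:/notebook/{job_id}{mount_point}/{filename}"

def file_uri(job_id: str, mount_point: str, filename: str) -> str:
    return f"file:/synfs/notebook/{job_id}{mount_point}/{filename}"

print(synfs_uri("49", "/test", "myFile.txt"))  # synfs:/notebook/49/test/myFile.txt
print(file_uri("49", "/test", "myFile.txt"))   # file:/synfs/notebook/49/test/myFile.txt
```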
 
-The following three examples show how to access a file with a mount point path by using `mssparkutils fs`. In the examples, `49` is a Spark job ID that we got from calling `mssparkutils.env.getJobId()`.
+The following three examples show how to access a file with a mount point path by using `mssparkutils fs`.
 
 + List directories:
 
 ```python
-mssparkutils.fs.ls("synfs:/49/test")
+mssparkutils.fs.ls(f'file:{mssparkutils.fs.getMountPath("/test")}')
 ```
 
 + Read file content:
 
 ```python
-mssparkutils.fs.head("synfs:/49/test/myFile.txt")
+mssparkutils.fs.head(f'file:{mssparkutils.fs.getMountPath("/test")}/myFile.csv')
 ```
 
 + Create a directory:
 
 ```python
-mssparkutils.fs.mkdirs("synfs:/49/test/newdir")
+mssparkutils.fs.mkdirs(f'file:{mssparkutils.fs.getMountPath("/test")}/myDir')
 ```
 
 ## Access files under the mount point by using the Spark read API
 
-You can provide a parameter to access the data through the Spark read API. The path format here is the same when you use the `mssparkutils fs` API:
-
-`synfs:/{jobId}/test/{filename}`
+You can provide a parameter to access the data through the Spark read API. The path format here is the same when you use the `mssparkutils fs` API.
 
 <a id="read-file-from-a-mounted-gen2-storage-account"></a>
 ### Read a file from a mounted Data Lake Storage Gen2 storage account
@@ -216,7 +213,7 @@ The following example assumes that a Data Lake Storage Gen2 storage account was
 ```python
 %%pyspark
 
-df = spark.read.load("synfs:/49/test/myFile.csv", format='csv')
+df = spark.read.load(f'file:{mssparkutils.fs.getMountPath("/test")}/myFile.csv', format='csv')
 df.show()
 ```
 
@@ -242,15 +239,15 @@ If you mounted a Blob Storage account and want to access it by using `mssparkuti
 mssparkutils.fs.mount(
     "wasbs://mycontainer@<blobStorageAccountName>.blob.core.windows.net",
     "/test",
-    Map("LinkedService" -> "myblobstorageaccount")
+    Map("linkedService" -> "myblobstorageaccount")
 )
 ```
 
 3. Mount the Blob Storage container, and then read the file by using a mount path through the local file API:
 
 ```python
 # mount the Blob Storage container, and then read the file by using a mount path
-with open("/synfs/64/test/myFile.txt") as f:
+with open(mssparkutils.fs.getMountPath("/test") + "/myFile.txt") as f:
     print(f.read())
 ```
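Outside Synapse, this local-file pattern can be simulated with a temporary directory standing in for the mount path (`mssparkutils.fs.getMountPath` exists only inside Synapse, so this sketch substitutes `tempfile`):

```python
import os
import tempfile

# Simulate the mounted container with a temp dir standing in for the
# path that mssparkutils.fs.getMountPath("/test") would return.
with tempfile.TemporaryDirectory() as mount_path:
    target = os.path.join(mount_path, "myFile.txt")
    with open(target, "w") as f:
        f.write("hello from the mounted container")
    # Read it back through the plain local file API, as in the example above.
    with open(target) as f:
        print(f.read())  # hello from the mounted container
```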

@@ -259,7 +256,7 @@ If you mounted a Blob Storage account and want to access it by using `mssparkuti
 ```python
 %%spark
 // mount blob storage container and then read file using mount path
-val df = spark.read.text("synfs:/49/test/myFile.txt")
+val df = spark.read.text(f'file:{mssparkutils.fs.getMountPath("/test")}/myFile.txt')
 df.show()
 ```
 
@@ -273,12 +270,14 @@ mssparkutils.fs.unmount("/test")
 
 ## Known limitations
 
-+ The `mssparkutils fs help` function hasn't added the description about the mount/unmount part yet.
-
 + The unmount mechanism is not automatic. When the application run finishes, to unmount the mount point to release the disk space, you need to explicitly call an unmount API in your code. Otherwise, the mount point will still exist in the node after the application run finishes.
 
 + Mounting a Data Lake Storage Gen1 storage account is not supported for now.
 
+## Known issues:
+
++ In Spark 3.4, the mount points might be unavailable when there are multiple active sessions running in parallel in the same cluster. You can mount with `workspace` scope to avoid this issue.
+
 ## Next steps
 
 - [Get started with Azure Synapse Analytics](../get-started.md)
