|
1 | 1 | ---
|
2 | 2 | title: Introduction to file APIs in Azure Synapse Analytics
|
3 | 3 | description: This tutorial describes how to use the file mount and file unmount APIs in Azure Synapse Analytics, for both Azure Data Lake Storage Gen2 and Azure Blob Storage.
|
4 |   | -author: ruixinxu |
  | 4 | +author: JeneZhang |
5 | 5 | services: synapse-analytics
|
6 | 6 | ms.service: synapse-analytics
|
7 | 7 | ms.topic: reference
|
8 | 8 | ms.subservice: spark
|
9 | 9 | ms.date: 07/27/2022
|
10 |    | -ms.author: ruxu |
   | 10 | +ms.author: jingzh |
11 | 11 | ms.reviewer: wiassaf
|
12 | 12 | ms.custom: subject-rbac-steps
|
13 | 13 | ---
|
@@ -80,8 +80,22 @@ mssparkutils.fs.mount(
|
80 | 80 | > [!NOTE]
|
81 | 81 | > You might need to import `mssparkutils` if it's not available:
|
82 | 82 | > ```python
|
83 |    | -> From notebookutils import mssparkutils |
84 |    | -> ``` |
   | 83 | +> from notebookutils import mssparkutils |
   | 84 | +> ``` |
| 85 | +> Mount parameters: |
| 86 | +> - `fileCacheTimeout`: Blobs are cached in the local temp folder for 120 seconds by default. During this time, blobfuse doesn't check whether the file is up to date. You can set this parameter to change the default timeout. When multiple clients modify files at the same time, we recommend shortening the cache time, or even setting it to 0, so that the latest files are always fetched from the server and local and remote files stay consistent. |
| 87 | +> - `timeout`: The mount operation times out after 120 seconds by default. You can set this parameter to change the default timeout. We recommend increasing the value when there are many executors or when the mount operation times out. |
| 88 | +> - `scope`: Specifies the visibility of the mount. The default value is "job": the mount is visible only to the current cluster. If the scope is set to "workspace", the mount is visible to all notebooks in the current workspace, and the mount point is created automatically if it doesn't exist. Pass the same parameters to the unmount API to unmount the mount point. Workspace-level mounts are supported only with linked service authentication. |
| 89 | +> |
| 90 | +> You can use these parameters like this: |
| 91 | +> ```python |
| 92 | +> mssparkutils.fs.mount( |
| 93 | +> "abfss://mycontainer@<accountname>.dfs.core.windows.net", |
| 94 | +> "/test", |
| 95 | +> {"linkedService":"mygen2account", "fileCacheTimeout": 120, "timeout": 120} |
| 96 | +> ) |
| 97 | +> ``` |
| 98 | +> |
85 | 99 | > We don't recommend that you mount a root folder, no matter which authentication method you use.
|
86 | 100 |
|
87 | 101 |
|
@@ -149,14 +163,20 @@ f.close()
|
149 | 163 | ```
|
150 | 164 | --->
|
151 | 165 |
|
152 |     | -## Access files under the mount point by using the mssparktuils fs API |
    | 166 | +## Access files under the mount point by using the mssparkutils fs API |
153 | 167 |
|
154 | 168 | The main purpose of the mount operation is to let customers access the data stored in a remote storage account by using a local file system API. You can also access the data by using the `mssparkutils fs` API with a mounted path as a parameter. The path format used here is a little different.
|
155 | 169 |
|
156 | 170 | Assume that you mounted the Data Lake Storage Gen2 container `mycontainer` to `/test` by using the mount API. When you access the data by using a local file system API, the path format is like this:
|
157 | 171 |
|
158 | 172 | `/synfs/{jobId}/test/{filename}`
|
159 | 173 |
|
| 174 | +We recommend using `mssparkutils.fs.getMountPath()` to get the exact path: |
| 175 | + |
| 176 | +```python |
| 177 | +path = mssparkutils.fs.getMountPath("/test") # equivalent to /synfs/{jobId}/test |
| 178 | +``` |
| 179 | + |
160 | 180 | When you want to access the data by using the `mssparkutils fs` API, the path format is like this:
|
161 | 181 |
|
162 | 182 | `synfs:/{jobId}/test/{filename}`
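The two formats differ only in their prefix. As a minimal, self-contained sketch, both forms can be derived from the job ID, mount point, and file name (the `mount_paths` helper below is hypothetical, not part of `mssparkutils`; the job ID `49` matches the Spark read example in this article):

```python
def mount_paths(job_id: str, mount_point: str, rel_path: str) -> tuple[str, str]:
    """Build the local-file-system path and the mssparkutils fs path for a mounted file."""
    mount_point = mount_point.strip("/")
    rel_path = rel_path.lstrip("/")
    local = f"/synfs/{job_id}/{mount_point}/{rel_path}"   # local file system API format
    synfs = f"synfs:/{job_id}/{mount_point}/{rel_path}"   # mssparkutils fs / Spark API format
    return local, synfs

print(mount_paths("49", "/test", "myFile.csv"))
# → ('/synfs/49/test/myFile.csv', 'synfs:/49/test/myFile.csv')
```

In practice, prefer `mssparkutils.fs.getMountPath()` over hand-building the local path, because the job ID isn't known until runtime.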
|
@@ -201,6 +221,9 @@ df = spark.read.load("synfs:/49/test/myFile.csv", format='csv')
|
201 | 221 | df.show()
|
202 | 222 | ```
|
203 | 223 |
|
| 224 | +> [!NOTE] |
| 225 | +> When you mount storage by using a linked service, always explicitly set the Spark linked service configuration before you use the `synfs` scheme to access the data. For details, see [ADLS Gen2 storage with linked services](./apache-spark-secure-credentials-with-tokenlibrary.md#adls-gen2-storage-without-linked-services). |
| 226 | +
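As a sketch of that setup, assuming a linked service named `mygen2account` (reused from the mount example earlier) and the linked-service token provider configuration keys described in the article linked above (verify both against your workspace):

```python
# Sketch only: point Spark at the linked service that was used for the mount,
# and select the linked-service-based token provider, before reading via synfs.
spark.conf.set("spark.storage.synapse.linkedServiceName", "mygen2account")
spark.conf.set(
    "fs.azure.account.oauth.provider.type",
    "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider",
)
```

With this configuration in place, reads such as `spark.read.load("synfs:/{jobId}/test/myFile.csv", format='csv')` authenticate through the linked service.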
|
204 | 227 | ### Read a file from a mounted Blob Storage account
|
205 | 228 |
|
206 | 229 | If you mounted a Blob Storage account and want to access it by using `mssparkutils` or the Spark API, you need to explicitly configure the SAS token via Spark configuration before you try to mount the container by using the mount API:
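For illustration, that configuration might look like the following sketch. The linked service name `myblobstorageaccount`, the container `mycontainer`, and the use of `mssparkutils.credentials.getConnectionStringOrCreds` to obtain the SAS token are assumptions here; substitute the names from your own workspace:

```python
# Sketch only: obtain a SAS token for the Blob Storage linked service and
# register it in the Spark configuration so the container can be accessed.
blob_sas_token = mssparkutils.credentials.getConnectionStringOrCreds("myblobstorageaccount")

spark.conf.set(
    "fs.azure.sas.mycontainer.<blobStorageAccountName>.blob.core.windows.net",
    blob_sas_token,
)
```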
|
|