Commit 6a40966

feat: azure backend (#143)
* feat: azure backend
* fix(azure): use adl (gen2 system) in constructor, clean up
* fix(azure): wait abfs is gen2, change back
* docs: add board_azure to docs
1 parent a67f622 commit 6a40966

10 files changed: +56 -18 lines changed

.github/workflows/ci.yml

Lines changed: 2 additions & 0 deletions
@@ -69,6 +69,8 @@ jobs:
 AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
 AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
 AWS_REGION: "us-east-1"
+AZURE_STORAGE_ACCOUNT_NAME: ${{ secrets.AZURE_STORAGE_ACCOUNT_NAME }}
+AZURE_STORAGE_ACCOUNT_KEY: ${{ secrets.AZURE_STORAGE_ACCOUNT_KEY }}
 PYTEST_OPTS: ${{ matrix.pytest_opts }}
 REQUIREMENTS: ${{ matrix.requirements }}
 ACTION_OS: ${{ matrix.os }}

docs/api/constructors.rst

Lines changed: 1 addition & 0 deletions
@@ -11,5 +11,6 @@ Board Constructors
 ~board_temp
 ~board_s3
 ~board_gcs
+~board_azure
 ~board_rsconnect
 ~board

docs/api/index.rst

Lines changed: 4 additions & 0 deletions
@@ -38,6 +38,10 @@ Boards abstract over different storage backends, making it easy to share data in
   - Use RStudio Connect as a board
 * - :func:`.board_s3`
   - Use an S3 bucket as a board
+* - :func:`.board_gcs`
+  - Use a Google Cloud Storage bucket as a board
+* - :func:`.board_azure`
+  - Use an Azure Datalake storage container as a board.
 * - :func:`.board`
   - Generic board constructor

docs/getting_started.Rmd

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ Getting Started
 
 The pins package helps you publish data sets, models, and other Python objects, making it easy to share them across projects and with your colleagues.
 You can pin objects to a variety of "boards", including local folders (to share on a networked drive or with DropBox), RStudio connect, Amazon S3,
-Google Cloud Storage, and more.
+Google Cloud Storage, Azure Datalake, and more.
 This vignette will introduce you to the basics of pins.
 
 ```{python}

docs/intro.md

Lines changed: 3 additions & 2 deletions
@@ -21,7 +21,7 @@ kernelspec:
 ```
 
 The pins package publishes data, models, and other Python objects, making it easy to share
-You can pin objects to a variety of pin *boards*, including folders (to share on a networked drive or with services like DropBox), RStudio Connect, Amazon S3, and Google Cloud Storage.
+You can pin objects to a variety of pin *boards*, including folders (to share on a networked drive or with services like DropBox), RStudio Connect, Amazon S3, Google Cloud Storage, and Azure Datalake.
 Pins can be automatically versioned, making it straightforward to track changes, re-run analyses on historical data, and undo mistakes.
 
 ## Installation
@@ -96,5 +96,6 @@ board.pin_read("hadley/sales-summary")
 
 You can easily control who gets to access the data using the RStudio Connect permissions pane.
 
-The pins package also includes boards that allow you to share data on services like Amazon's S3 (`board_s3()`) and Google Cloud Storage (`board_gcs()`).
+The pins package also includes boards that allow you to share data on services like
+Amazon's S3 (`board_s3()`), Google Cloud Storage (`board_gcs()`), and Azure Datalake (`board_azure()`).
 Learn more in [getting started](getting_started.Rmd).

pins/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@
     board_github,
     board_urls,
     board_rsconnect,
+    board_azure,
     board_s3,
     board_gcs,
     board,

pins/constructors.py

Lines changed: 35 additions & 8 deletions
@@ -111,8 +111,8 @@ def board(
     board_factory:
         An optional board class to use as the constructor.
 
-    Note
-    ----
+    Notes
+    -----
     Many fsspec implementations of filesystems cache the searching of files, which may
     cause you to not see pins saved by other people. Disable this on these file systems
     with `storage_options = {"listings_expiry_time": 0}` on s3, or `{"cache_timeout": 0}`
@@ -256,8 +256,8 @@ def board_github(
     **kwargs:
         Passed to the pins.board function.
 
-    Note
-    ----
+    Notes
+    -----
     This board is read only.
 
 
@@ -374,12 +374,14 @@ def board_s3(path, versioned=True, cache=DEFAULT, allow_pickle_read=None):
     **kwargs:
         Passed to the pins.board function.
 
-    Note
-    ----
+    Notes
+    -----
     The s3 board uses the fsspec library (s3fs) to handle interacting with s3.
     In order to authenticate, set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,
     and (optionally) AWS_REGION environment variables.
 
+    See https://github.com/fsspec/s3fs
+
     """
     # TODO: user should be able to specify storage options here?
 
@@ -397,8 +399,8 @@ def board_gcs(path, versioned=True, cache=DEFAULT, allow_pickle_read=None):
     **kwargs:
         Passed to the pins.board function.
 
-    Note
-    ----
+    Notes
+    -----
     The gcs board uses the fsspec library (gcsfs) to handle interacting with
     google cloud storage. Currently, its default mode of authentication
     is supported.
@@ -410,3 +412,28 @@ def board_gcs(path, versioned=True, cache=DEFAULT, allow_pickle_read=None):
     # fixes it under the hood
     opts = {"cache_timeout": 0}
     return board("gcs", path, versioned, cache, allow_pickle_read, storage_options=opts)
+
+
+def board_azure(path, versioned=True, cache=DEFAULT, allow_pickle_read=None):
+    """Create a board to read and write pins from an Azure Datalake Filesystem folder.
+
+    Parameters
+    ----------
+    path:
+        Path of form <bucket_name>/<optional>/<subdirectory>.
+    **kwargs:
+        Passed to the pins.board function.
+
+    Notes
+    -----
+    The azure board uses the fsspec library (adlfs) to handle interacting with
+    Azure Datalake Filesystem (abfs). Currently, its default mode of authentication
+    is supported.
+
+    See https://github.com/fsspec/adlfs
+    """
+
+    opts = {"use_listings_cache": False}
+    return board(
+        "abfs", path, versioned, cache, allow_pickle_read, storage_options=opts
+    )
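
A minimal usage sketch for the new constructor, assuming adlfs picks up credentials from the `AZURE_STORAGE_ACCOUNT_NAME` and `AZURE_STORAGE_ACCOUNT_KEY` environment variables that this commit adds to the CI workflow. The helper names `missing_azure_vars` and `make_azure_board` are hypothetical and not part of pins.

```python
import os

# Environment variables this commit adds to the CI workflow.
REQUIRED_VARS = ("AZURE_STORAGE_ACCOUNT_NAME", "AZURE_STORAGE_ACCOUNT_KEY")


def missing_azure_vars(environ=None):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


def make_azure_board(path):
    """Hypothetical wrapper: fail early when credentials are missing."""
    missing = missing_azure_vars()
    if missing:
        raise RuntimeError(f"missing Azure credentials: {', '.join(missing)}")
    # Imported lazily so the check above works without pins installed.
    from pins import board_azure

    return board_azure(path)
```

With both variables set, something like `make_azure_board("ci-pins")` should return a board rooted at the `ci-pins` container, the default test path used in this commit.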

pins/tests/conftest.py

Lines changed: 3 additions & 2 deletions
@@ -14,13 +14,14 @@
 
 
 # Based on https://github.com/machow/siuba/blob/main/siuba/tests/helpers.py
-BACKEND_MARKS = ["fs_s3", "fs_file", "fs_gcs", "fs_rsc"]
+BACKEND_MARKS = ["fs_s3", "fs_file", "fs_gcs", "fs_abfs", "fs_rsc"]
 
 # parameters that can be used more than once per session
 params_safe = [
     pytest.param(lambda: BoardBuilder("file"), id="file", marks=m.fs_file),
     pytest.param(lambda: BoardBuilder("s3"), id="s3", marks=m.fs_s3),
-    pytest.param(lambda: BoardBuilder("gcs"), id="s3", marks=m.fs_gcs),
+    pytest.param(lambda: BoardBuilder("gcs"), id="gcs", marks=m.fs_gcs),
+    pytest.param(lambda: BoardBuilder("abfs"), id="abfs", marks=m.fs_abfs),
 ]
 
 # rsc should only be used once, because users are created at docker setup time

pins/tests/helpers.py

Lines changed: 2 additions & 5 deletions
@@ -25,11 +25,8 @@
     "file": {"path": ["PINS_TEST_FILE__PATH", None]},
     "s3": {"path": ["PINS_TEST_S3__PATH", "ci-pins"]},
     "gcs": {"path": ["PINS_TEST_GCS__PATH", "ci-pins"]},
+    "abfs": {"path": ["PINS_TEST_AZURE__PATH", "ci-pins"]},
     "rsc": {"path": ["PINS_TEST_RSC__PATH", RSC_SERVER_URL]},
-    # TODO(question): R pins has the whole server a board
-    # but it's a bit easier to test by (optionally) allowing a user
-    # or something else to be a board
-    # "rsc": {"path": ["PINS_TEST_RSC__PATH", ""]}
 }
 
 # TODO: Backend initialization should be independent of helpers, but these
@@ -121,7 +118,7 @@ def create_tmp_board(self, src_board=None) -> BaseBoard:
         if self.fs_name == "gcs":
             opts = {"cache_timeout": 0}
         else:
-            opts = {"listings_expiry_time": 0}
+            opts = {"use_listings_cache": False}
 
         fs = filesystem(self.fs_name, **opts)
         temp_name = str(uuid.uuid4())
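
The branch in `create_tmp_board` can be read as a small lookup from fsspec protocol name to the storage options that disable listing caches. The sketch below restates it as a standalone function; `listing_cache_opts` is a hypothetical name, and the option values are the ones that appear in this diff.

```python
def listing_cache_opts(fs_name):
    """Storage options that turn off directory-listing caching, per the diff.

    gcs uses a zero cache timeout; every other backend in the helper
    (including abfs) gets use_listings_cache=False.
    """
    if fs_name == "gcs":
        return {"cache_timeout": 0}
    return {"use_listings_cache": False}
```

The result would presumably be passed on as `filesystem(fs_name, **listing_cache_opts(fs_name))`, as the helper does with `opts`.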

pins/tests/test_constructors.py

Lines changed: 4 additions & 0 deletions
@@ -45,6 +45,8 @@ def construct_from_board(board):
         board = c.board_rsconnect(
             server_url=board.fs.api.server_url, api_key=board.fs.api.api_key
         )
+    elif fs_name == "abfs":
+        board = c.board_azure(board.board)
     else:
         board = getattr(c, f"board_{fs_name}")(board.board)
 
@@ -214,6 +216,8 @@ def test_constructor_boards_multi_user(board2, df_csv, tmp_cache):
         # TODO: RSConnect writes pin names like <user>/<name>, so would need to
         # modify test
         pytest.skip()
+    elif fs_name == "abfs":
+        fs_name = "azure"
 
     first = construct_from_board(board2)
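
Both hunks special-case `abfs` because the fsspec protocol name differs from the constructor suffix (`board_azure`), so the generic `board_{fs_name}` lookup would not find it. One way to express that mapping in a single place is sketched below; `PROTOCOL_TO_SUFFIX` and `constructor_name` are hypothetical names, not part of the test suite.

```python
# Protocol names whose constructor suffix differs (only abfs in this commit).
PROTOCOL_TO_SUFFIX = {"abfs": "azure"}


def constructor_name(fs_name):
    """Map an fsspec protocol name to the matching pins constructor name."""
    return f"board_{PROTOCOL_TO_SUFFIX.get(fs_name, fs_name)}"
```

This only covers the filesystem-backed boards; `rsc` still needs its own branch, since `board_rsconnect` takes a server URL and API key rather than a path.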
