
Commit 7c2c81f

Merge branch 'main' into feat-rsc-public
2 parents: d3027eb + 6a40966

File tree: 12 files changed, +106 −33 lines


.github/workflows/ci.yml

Lines changed: 11 additions & 2 deletions
@@ -22,7 +22,7 @@ jobs:
       matrix:
         python: ["3.7", "3.8", "3.9", "3.10"]
         os: ["ubuntu-latest"]
-        pytest_opts: [""]
+        pytest_opts: ["--workers 4 --tests-per-worker 1"]
         requirements: [""]
         include:
           - os: "ubuntu-latest"
@@ -32,13 +32,18 @@ jobs:
             python: "3.10"
             # ignore doctests, as they involve calls to github, and all mac machines
             # use the same IP address
+            pytest_opts: "--workers 4 --tests-per-worker 1 -k pins/tests"
+          - os: "windows-latest"
+            python: "3.10"
+            # ignore doctests
             pytest_opts: "-k pins/tests"
     steps:
       - uses: actions/checkout@v2
       - uses: actions/setup-python@v2
         with:
           python-version: ${{ matrix.python }}
       - name: Install dependencies
+        shell: bash
         run: |
           python -m pip install --upgrade pip
@@ -57,14 +62,18 @@ jobs:
           export_default_credentials: true

       - name: Run tests
+        shell: bash
         run: |
-          pytest pins -m 'not fs_rsc and not skip_on_github' --workers 4 --tests-per-worker 1 $PYTEST_OPTS
+          pytest pins -m 'not fs_rsc and not skip_on_github' $PYTEST_OPTS
         env:
           AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
           AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
           AWS_REGION: "us-east-1"
+          AZURE_STORAGE_ACCOUNT_NAME: ${{ secrets.AZURE_STORAGE_ACCOUNT_NAME }}
+          AZURE_STORAGE_ACCOUNT_KEY: ${{ secrets.AZURE_STORAGE_ACCOUNT_KEY }}
           PYTEST_OPTS: ${{ matrix.pytest_opts }}
           REQUIREMENTS: ${{ matrix.requirements }}
+          ACTION_OS: ${{ matrix.os }}
           # fixes error on macosx virtual machine with pytest-parallel
           # https://github.com/browsertron/pytest-parallel/issues/93
           no_proxy: "*"
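The env block now also exports `ACTION_OS`, though nothing in this diff shows how it is consumed. A hypothetical sketch (not from this commit; the marker name is invented) of the kind of OS-conditional skip it enables in the test suite:

```python
import os

import pytest

# ACTION_OS is set from matrix.os in the workflow's env block above.
skip_on_windows = pytest.mark.skipif(
    os.environ.get("ACTION_OS") == "windows-latest",
    reason="not supported on the windows CI runner",
)


@skip_on_windows
def test_something_unix_only():
    assert os.name == "posix"
```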

README.md

Lines changed: 1 addition & 1 deletion
@@ -55,7 +55,7 @@ board.pin_write(mtcars.head(), "mtcars", type="csv")

 Above, we saved the data as a CSV, but depending on
 what you’re saving and who else you want to read it, you might use the
-`type` argument to instead save it as a `joblib` or `arrow` file (NOTE: arrow is not yet supported).
+`type` argument to instead save it as a `joblib` or `arrow` file.

 You can later retrieve the pinned data with `.pin_read()`:

docs/api/constructors.rst

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ Board Constructors
    ~board_temp
    ~board_s3
    ~board_gcs
+   ~board_azure
    ~board_rsconnect
    ~board_url
    ~board

docs/getting_started.Rmd

Lines changed: 2 additions & 2 deletions
@@ -22,7 +22,7 @@ Getting Started

 The pins package helps you publish data sets, models, and other Python objects, making it easy to share them across projects and with your colleagues.
 You can pin objects to a variety of "boards", including local folders (to share on a networked drive or with DropBox), RStudio Connect, Amazon S3,
-Google Cloud Storage, and more.
+Google Cloud Storage, Azure Datalake, and more.
 This vignette will introduce you to the basics of pins.

 ```{python}
@@ -70,10 +70,10 @@ But you can choose another option depending on your goals:

 - `type = "csv"` uses `to_csv()` from pandas to create a `.csv` file. CSVs can be read by any application, but only support simple columns (e.g. numbers, strings, dates), can take up a lot of disk space, and can be slow to read.
 - `type = "joblib"` uses `joblib.dump()` to create a binary python data file. See the [joblib docs](https://joblib.readthedocs.io/en/latest/) for more information.
+- `type = "arrow"` uses `pyarrow` to create an arrow/feather file. [Arrow](https://arrow.apache.org) is a modern, language-independent, high-performance file format designed for data science. Not every tool can read arrow files, but support is growing rapidly.

 🚧 Data formats TODO 🚧

-- `type = "arrow"` uses `arrow::write_feather()` to create an arrow/feather file. [Arrow](https://arrow.apache.org) is a modern, language-independent, high-performance file format designed for data science. Not every tool can read arrow files, but support is growing rapidly.
 - `type = "json"` uses `jsonlite::write_json()` to create a `.json` file. Pretty much every programming language can read json files, but they only work well for nested lists.

 After you've pinned an object, you can read it back with `pin_read()`:
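With `type = "arrow"` moved out of the TODO list, a round trip in that format looks roughly like this (a sketch assuming `pandas` and `pyarrow` are installed; `board_temp()` is one of the constructors listed in docs/api/constructors.rst):

```python
import pandas as pd
import pins

board = pins.board_temp()
df = pd.DataFrame({"mpg": [21.0, 22.8], "hp": [110, 93]})

# type="arrow" writes an arrow/feather file via pyarrow, per the bullet above
board.pin_write(df, "mtcars", type="arrow")
board.pin_read("mtcars")
```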

docs/intro.md

Lines changed: 3 additions & 2 deletions
@@ -21,7 +21,7 @@ kernelspec:
 ```

 The pins package publishes data, models, and other Python objects, making it easy to share them across projects and with your colleagues.
-You can pin objects to a variety of pin *boards*, including folders (to share on a networked drive or with services like DropBox), RStudio Connect, Amazon S3, and Google Cloud Storage.
+You can pin objects to a variety of pin *boards*, including folders (to share on a networked drive or with services like DropBox), RStudio Connect, Amazon S3, Google Cloud Storage, and Azure Datalake.
 Pins can be automatically versioned, making it straightforward to track changes, re-run analyses on historical data, and undo mistakes.

 ## Installation
@@ -96,5 +96,6 @@ board.pin_read("hadley/sales-summary")

 You can easily control who gets to access the data using the RStudio Connect permissions pane.

-The pins package also includes boards that allow you to share data on services like Amazon's S3 (`board_s3()`) and Google Cloud Storage (`board_gcs()`).
+The pins package also includes boards that allow you to share data on services like
+Amazon's S3 (`board_s3()`), Google Cloud Storage (`board_gcs()`), and Azure Datalake (`board_azure()`).
 Learn more in [getting started](getting_started.Rmd).

pins/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@
     board_urls,  # DEPRECATED
     board_url,
     board_rsconnect,
+    board_azure,
     board_s3,
     board_gcs,
     board,

pins/constructors.py

Lines changed: 35 additions & 8 deletions
@@ -111,8 +111,8 @@ def board(
     board_factory:
         An optional board class to use as the constructor.

-    Note
-    ----
+    Notes
+    -----
     Many fsspec implementations of filesystems cache the searching of files, which may
     cause you to not see pins saved by other people. Disable this on these file systems
     with `storage_options = {"listings_expiry_time": 0}` on s3, or `{"cache_timeout": 0}`
@@ -256,8 +256,8 @@ def board_github(
     **kwargs:
         Passed to the pins.board function.

-    Note
-    ----
+    Notes
+    -----
     This board is read only.

@@ -410,12 +410,14 @@ def board_s3(path, versioned=True, cache=DEFAULT, allow_pickle_read=None):
     **kwargs:
         Passed to the pins.board function.

-    Note
-    ----
+    Notes
+    -----
     The s3 board uses the fsspec library (s3fs) to handle interacting with s3.
     In order to authenticate, set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,
     and (optionally) AWS_REGION environment variables.

+    See https://github.com/fsspec/s3fs
+
     """
     # TODO: user should be able to specify storage options here?
@@ -433,8 +435,8 @@ def board_gcs(path, versioned=True, cache=DEFAULT, allow_pickle_read=None):
     **kwargs:
         Passed to the pins.board function.

-    Note
-    ----
+    Notes
+    -----
     The gcs board uses the fsspec library (gcsfs) to handle interacting with
     google cloud storage. Currently, its default mode of authentication
     is supported.
@@ -446,3 +448,28 @@ def board_gcs(path, versioned=True, cache=DEFAULT, allow_pickle_read=None):
     # fixes it under the hood
     opts = {"cache_timeout": 0}
     return board("gcs", path, versioned, cache, allow_pickle_read, storage_options=opts)
+
+
+def board_azure(path, versioned=True, cache=DEFAULT, allow_pickle_read=None):
+    """Create a board to read and write pins from an Azure Datalake folder.
+
+    Parameters
+    ----------
+    path:
+        Path of form <bucket_name>/<optional>/<subdirectory>.
+    **kwargs:
+        Passed to the pins.board function.
+
+    Notes
+    -----
+    The azure board uses the fsspec library (adlfs) to handle interacting with
+    Azure Datalake Filesystem (abfs). Currently, its default mode of authentication
+    is supported.
+
+    See https://github.com/fsspec/adlfs
+    """
+
+    opts = {"use_listings_cache": False}
+    return board(
+        "abfs", path, versioned, cache, allow_pickle_read, storage_options=opts
+    )
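A usage sketch for the new constructor, assuming `adlfs` is installed and the `AZURE_STORAGE_ACCOUNT_NAME`/`AZURE_STORAGE_ACCOUNT_KEY` variables exported in the CI workflow above are set; the container path here is made up for illustration:

```python
import pandas as pd
import pins

# "my-container/pins" follows the <bucket_name>/<subdirectory> form from the
# docstring; the container name is hypothetical.
board = pins.board_azure("my-container/pins")

board.pin_write(pd.DataFrame({"x": [1, 2, 3]}), "example", type="csv")
board.pin_read("example")
```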

pins/rsconnect/api.py

Lines changed: 5 additions & 7 deletions
@@ -371,15 +371,13 @@ def post_content_bundle(self, guid, fname, gzip=True) -> Bundle:
         if p.is_dir() and gzip:
             import tarfile

-            with tempfile.NamedTemporaryFile(mode="wb", suffix=".tar.gz") as tmp:
-                with tarfile.open(fileobj=tmp.file, mode="w:gz") as tar:
-                    tar.add(str(p.absolute()), arcname="")
+            with tempfile.TemporaryDirectory() as tmp_dir:
+                p_archive = Path(tmp_dir) / "bundle.tar.gz"

-                # close the underlying file. note we don't call the top-level
-                # close method, since that would delete the temporary file
-                tmp.file.close()
+                with tarfile.open(p_archive, mode="w:gz") as tar:
+                    tar.add(str(p.absolute()), arcname="")

-                with open(tmp.name, "rb") as f:
+                with open(p_archive, "rb") as f:
                     result = f_request(data=f)
         else:
             with open(str(p.absolute()), "rb") as f:
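The rewrite avoids reopening a `NamedTemporaryFile` by name, which Python's docs note cannot be done on Windows while the file is still open (relevant now that the CI matrix gains `windows-latest`). A standalone sketch of the new pattern:

```python
import tarfile
import tempfile
from pathlib import Path


def bundle_dir(src_dir: str) -> bytes:
    """Gzip a directory inside a TemporaryDirectory and read it back by path."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        p_archive = Path(tmp_dir) / "bundle.tar.gz"
        with tarfile.open(p_archive, mode="w:gz") as tar:
            tar.add(src_dir, arcname="")
        # reopening by path is safe on every OS once the tarfile is closed
        return p_archive.read_bytes()
```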

pins/tests/conftest.py

Lines changed: 3 additions & 2 deletions
@@ -14,13 +14,14 @@


 # Based on https://github.com/machow/siuba/blob/main/siuba/tests/helpers.py
-BACKEND_MARKS = ["fs_s3", "fs_file", "fs_gcs", "fs_rsc"]
+BACKEND_MARKS = ["fs_s3", "fs_file", "fs_gcs", "fs_abfs", "fs_rsc"]

 # parameters that can be used more than once per session
 params_safe = [
     pytest.param(lambda: BoardBuilder("file"), id="file", marks=m.fs_file),
     pytest.param(lambda: BoardBuilder("s3"), id="s3", marks=m.fs_s3),
-    pytest.param(lambda: BoardBuilder("gcs"), id="s3", marks=m.fs_gcs),
+    pytest.param(lambda: BoardBuilder("gcs"), id="gcs", marks=m.fs_gcs),
+    pytest.param(lambda: BoardBuilder("abfs"), id="abfs", marks=m.fs_abfs),
 ]

 # rsc should only be used once, because users are created at docker setup time
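The new parametrization references `m.fs_abfs`. As a hedged sketch (the actual registration is not shown in this diff), custom marks like this are typically declared in a conftest.py hook so pytest does not warn about unknown markers:

```python
# Assumed registration, not taken from this commit.
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "fs_abfs: tests that exercise the Azure (abfs) backend"
    )
```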

pins/tests/helpers.py

Lines changed: 2 additions & 5 deletions
@@ -25,11 +25,8 @@
     "file": {"path": ["PINS_TEST_FILE__PATH", None]},
     "s3": {"path": ["PINS_TEST_S3__PATH", "ci-pins"]},
     "gcs": {"path": ["PINS_TEST_GCS__PATH", "ci-pins"]},
+    "abfs": {"path": ["PINS_TEST_AZURE__PATH", "ci-pins"]},
     "rsc": {"path": ["PINS_TEST_RSC__PATH", RSC_SERVER_URL]},
-    # TODO(question): R pins has the whole server a board
-    # but it's a bit easier to test by (optionally) allowing a user
-    # or something else to be a board
-    # "rsc": {"path": ["PINS_TEST_RSC__PATH", ""]}
 }

@@ -121,7 +118,7 @@ def create_tmp_board(self, src_board=None) -> BaseBoard:
         if self.fs_name == "gcs":
             opts = {"cache_timeout": 0}
         else:
-            opts = {"listings_expiry_time": 0}
+            opts = {"use_listings_cache": False}

         fs = filesystem(self.fs_name, **opts)
         temp_name = str(uuid.uuid4())
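The storage-option swap mirrors the `board_azure` constructor above: `use_listings_cache=False` disables fsspec's directory-listing cache outright rather than expiring entries immediately, so a test board still sees files written by other processes right away. A small sketch of the effect, using fsspec's in-memory filesystem as a stand-in backend:

```python
from fsspec import filesystem

# use_listings_cache=False is passed through to fsspec's DirCache, so ls()
# results are never cached ("memory" stands in for a real abfs/s3 backend).
fs = filesystem("memory", use_listings_cache=False)
fs.pipe_file("/demo/pin.txt", b"hello")
print(fs.ls("/demo"))
```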
