
Commit 7c2c81f

Merge branch 'main' into feat-rsc-public
2 parents: d3027eb + 6a40966

File tree: 12 files changed, +106 −33 lines


.github/workflows/ci.yml

Lines changed: 11 additions & 2 deletions
@@ -22,7 +22,7 @@ jobs:
       matrix:
         python: ["3.7", "3.8", "3.9", "3.10"]
         os: ["ubuntu-latest"]
-        pytest_opts: [""]
+        pytest_opts: ["--workers 4 --tests-per-worker 1"]
         requirements: [""]
         include:
           - os: "ubuntu-latest"
@@ -32,13 +32,18 @@ jobs:
             python: "3.10"
             # ignore doctests, as they involve calls to github, and all mac machines
             # use the same IP address
+            pytest_opts: "--workers 4 --tests-per-worker 1 -k pins/tests"
+          - os: "windows-latest"
+            python: "3.10"
+            # ignore doctests
             pytest_opts: "-k pins/tests"
     steps:
       - uses: actions/checkout@v2
       - uses: actions/setup-python@v2
         with:
           python-version: ${{ matrix.python }}
       - name: Install dependencies
+        shell: bash
         run: |
           python -m pip install --upgrade pip
@@ -57,14 +62,18 @@ jobs:
           export_default_credentials: true

       - name: Run tests
+        shell: bash
         run: |
-          pytest pins -m 'not fs_rsc and not skip_on_github' --workers 4 --tests-per-worker 1 $PYTEST_OPTS
+          pytest pins -m 'not fs_rsc and not skip_on_github' $PYTEST_OPTS
         env:
           AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
           AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
           AWS_REGION: "us-east-1"
+          AZURE_STORAGE_ACCOUNT_NAME: ${{ secrets.AZURE_STORAGE_ACCOUNT_NAME }}
+          AZURE_STORAGE_ACCOUNT_KEY: ${{ secrets.AZURE_STORAGE_ACCOUNT_KEY }}
           PYTEST_OPTS: ${{ matrix.pytest_opts }}
           REQUIREMENTS: ${{ matrix.requirements }}
+          ACTION_OS: ${{ matrix.os }}
           # fixes error on macosx virtual machine with pytest-parallel
           # https://github.com/browsertron/pytest-parallel/issues/93
           no_proxy: "*"
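The env block now also exports `ACTION_OS`, though nothing in this diff shows how it is consumed. A hypothetical sketch (not from this commit; the marker name is invented) of the kind of OS-conditional skip it enables in the test suite:

```python
import os

import pytest

# ACTION_OS is set from matrix.os in the workflow's env block above.
skip_on_windows = pytest.mark.skipif(
    os.environ.get("ACTION_OS") == "windows-latest",
    reason="not supported on the windows CI runner",
)


@skip_on_windows
def test_something_unix_only():
    assert os.name == "posix"
```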

README.md

Lines changed: 1 addition & 1 deletion
@@ -55,7 +55,7 @@ board.pin_write(mtcars.head(), "mtcars", type="csv")

 Above, we saved the data as a CSV, but depending on
 what you’re saving and who else you want to read it, you might use the
-`type` argument to instead save it as a `joblib` or `arrow` file (NOTE: arrow is not yet supported).
+`type` argument to instead save it as a `joblib` or `arrow` file.

 You can later retrieve the pinned data with `.pin_read()`:

docs/api/constructors.rst

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ Board Constructors
    ~board_temp
    ~board_s3
    ~board_gcs
+   ~board_azure
    ~board_rsconnect
    ~board_url
    ~board

docs/getting_started.Rmd

Lines changed: 2 additions & 2 deletions
@@ -22,7 +22,7 @@ Getting Started

 The pins package helps you publish data sets, models, and other Python objects, making it easy to share them across projects and with your colleagues.
 You can pin objects to a variety of "boards", including local folders (to share on a networked drive or with DropBox), RStudio Connect, Amazon S3,
-Google Cloud Storage, and more.
+Google Cloud Storage, Azure Datalake, and more.
 This vignette will introduce you to the basics of pins.

 ```{python}
@@ -70,10 +70,10 @@ But you can choose another option depending on your goals:

 - `type = "csv"` uses `to_csv()` from pandas to create a `.csv` file. CSVs can be read by any application, but only support simple columns (e.g. numbers, strings, dates), can take up a lot of disk space, and can be slow to read.
 - `type = "joblib"` uses `joblib.dump()` to create a binary python data file. See the [joblib docs](https://joblib.readthedocs.io/en/latest/) for more information.
+- `type = "arrow"` uses `pyarrow` to create an arrow/feather file. [Arrow](https://arrow.apache.org) is a modern, language-independent, high-performance file format designed for data science. Not every tool can read arrow files, but support is growing rapidly.

 🚧 Data formats TODO 🚧

-- `type = "arrow"` uses `arrow::write_feather()` to create an arrow/feather file. [Arrow](https://arrow.apache.org) is a modern, language-independent, high-performance file format designed for data science. Not every tool can read arrow files, but support is growing rapidly.
 - `type = "json"` uses `jsonlite::write_json()` to create a `.json` file. Pretty much every programming language can read json files, but they only work well for nested lists.

 After you've pinned an object, you can read it back with `pin_read()`:
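With `type = "arrow"` moved out of the TODO list, a round trip in that format looks roughly like this (a sketch assuming `pandas` and `pyarrow` are installed; `board_temp()` is one of the constructors listed in docs/api/constructors.rst):

```python
import pandas as pd
import pins

board = pins.board_temp()
df = pd.DataFrame({"mpg": [21.0, 22.8], "hp": [110, 93]})

# type="arrow" writes an arrow/feather file via pyarrow, per the bullet above
board.pin_write(df, "mtcars", type="arrow")
board.pin_read("mtcars")
```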

docs/intro.md

Lines changed: 3 additions & 2 deletions
@@ -21,7 +21,7 @@ kernelspec:
 ```

 The pins package publishes data, models, and other Python objects, making it easy to share them across projects and with your colleagues.
-You can pin objects to a variety of pin *boards*, including folders (to share on a networked drive or with services like DropBox), RStudio Connect, Amazon S3, and Google Cloud Storage.
+You can pin objects to a variety of pin *boards*, including folders (to share on a networked drive or with services like DropBox), RStudio Connect, Amazon S3, Google Cloud Storage, and Azure Datalake.
 Pins can be automatically versioned, making it straightforward to track changes, re-run analyses on historical data, and undo mistakes.

 ## Installation
@@ -96,5 +96,6 @@ board.pin_read("hadley/sales-summary")

 You can easily control who gets to access the data using the RStudio Connect permissions pane.

-The pins package also includes boards that allow you to share data on services like Amazon's S3 (`board_s3()`) and Google Cloud Storage (`board_gcs()`).
+The pins package also includes boards that allow you to share data on services like
+Amazon's S3 (`board_s3()`), Google Cloud Storage (`board_gcs()`), and Azure Datalake (`board_azure()`).
 Learn more in [getting started](getting_started.Rmd).

pins/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@
     board_urls,  # DEPRECATED
     board_url,
     board_rsconnect,
+    board_azure,
     board_s3,
     board_gcs,
     board,

pins/constructors.py

Lines changed: 35 additions & 8 deletions
@@ -111,8 +111,8 @@ def board(
     board_factory:
         An optional board class to use as the constructor.

-    Note
-    ----
+    Notes
+    -----
     Many fsspec implementations of filesystems cache the searching of files, which may
     cause you to not see pins saved by other people. Disable this on these file systems
     with `storage_options = {"listings_expiry_time": 0}` on s3, or `{"cache_timeout": 0}`
@@ -256,8 +256,8 @@ def board_github(
     **kwargs:
         Passed to the pins.board function.

-    Note
-    ----
+    Notes
+    -----
     This board is read only.

@@ -410,12 +410,14 @@ def board_s3(path, versioned=True, cache=DEFAULT, allow_pickle_read=None):
     **kwargs:
         Passed to the pins.board function.

-    Note
-    ----
+    Notes
+    -----
     The s3 board uses the fsspec library (s3fs) to handle interacting with s3.
     In order to authenticate, set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,
     and (optionally) AWS_REGION environment variables.

+    See https://github.com/fsspec/s3fs
+
     """
     # TODO: user should be able to specify storage options here?
@@ -433,8 +435,8 @@ def board_gcs(path, versioned=True, cache=DEFAULT, allow_pickle_read=None):
     **kwargs:
         Passed to the pins.board function.

-    Note
-    ----
+    Notes
+    -----
     The gcs board uses the fsspec library (gcsfs) to handle interacting with
     google cloud storage. Currently, its default mode of authentication
     is supported.
@@ -446,3 +448,28 @@ def board_gcs(path, versioned=True, cache=DEFAULT, allow_pickle_read=None):
     # fixes it under the hood
     opts = {"cache_timeout": 0}
     return board("gcs", path, versioned, cache, allow_pickle_read, storage_options=opts)
+
+
+def board_azure(path, versioned=True, cache=DEFAULT, allow_pickle_read=None):
+    """Create a board to read and write pins from an Azure Datalake folder.
+
+    Parameters
+    ----------
+    path:
+        Path of form <bucket_name>/<optional>/<subdirectory>.
+    **kwargs:
+        Passed to the pins.board function.
+
+    Notes
+    -----
+    The azure board uses the fsspec library (adlfs) to handle interacting with
+    Azure Datalake Filesystem (abfs). Currently, its default mode of authentication
+    is supported.
+
+    See https://github.com/fsspec/adlfs
+    """
+
+    opts = {"use_listings_cache": False}
+    return board(
+        "abfs", path, versioned, cache, allow_pickle_read, storage_options=opts
+    )
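A usage sketch for the new constructor, assuming `adlfs` is installed and the `AZURE_STORAGE_ACCOUNT_NAME`/`AZURE_STORAGE_ACCOUNT_KEY` variables exported in the CI workflow above are set; the container path here is made up for illustration:

```python
import pandas as pd
import pins

# "my-container/pins" follows the <bucket_name>/<subdirectory> form from the
# docstring; the container name is hypothetical.
board = pins.board_azure("my-container/pins")

board.pin_write(pd.DataFrame({"x": [1, 2, 3]}), "example", type="csv")
board.pin_read("example")
```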

pins/rsconnect/api.py

Lines changed: 5 additions & 7 deletions
@@ -371,15 +371,13 @@ def post_content_bundle(self, guid, fname, gzip=True) -> Bundle:
         if p.is_dir() and gzip:
             import tarfile

-            with tempfile.NamedTemporaryFile(mode="wb", suffix=".tar.gz") as tmp:
-                with tarfile.open(fileobj=tmp.file, mode="w:gz") as tar:
-                    tar.add(str(p.absolute()), arcname="")
+            with tempfile.TemporaryDirectory() as tmp_dir:
+                p_archive = Path(tmp_dir) / "bundle.tar.gz"

-                # close the underlying file. note we don't call the top-level
-                # close method, since that would delete the temporary file
-                tmp.file.close()
+                with tarfile.open(p_archive, mode="w:gz") as tar:
+                    tar.add(str(p.absolute()), arcname="")

-                with open(tmp.name, "rb") as f:
+                with open(p_archive, "rb") as f:
                     result = f_request(data=f)
         else:
             with open(str(p.absolute()), "rb") as f:
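The rewrite avoids reopening a `NamedTemporaryFile` by name, which Python's docs note cannot be done on Windows while the file is still open (relevant now that the CI matrix gains `windows-latest`). A standalone sketch of the new pattern:

```python
import tarfile
import tempfile
from pathlib import Path


def bundle_dir(src_dir: str) -> bytes:
    """Gzip a directory inside a TemporaryDirectory and read it back by path."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        p_archive = Path(tmp_dir) / "bundle.tar.gz"
        with tarfile.open(p_archive, mode="w:gz") as tar:
            tar.add(src_dir, arcname="")
        # reopening by path is safe on every OS once the tarfile is closed
        return p_archive.read_bytes()
```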

pins/tests/conftest.py

Lines changed: 3 additions & 2 deletions
@@ -14,13 +14,14 @@


 # Based on https://github.com/machow/siuba/blob/main/siuba/tests/helpers.py
-BACKEND_MARKS = ["fs_s3", "fs_file", "fs_gcs", "fs_rsc"]
+BACKEND_MARKS = ["fs_s3", "fs_file", "fs_gcs", "fs_abfs", "fs_rsc"]

 # parameters that can be used more than once per session
 params_safe = [
     pytest.param(lambda: BoardBuilder("file"), id="file", marks=m.fs_file),
     pytest.param(lambda: BoardBuilder("s3"), id="s3", marks=m.fs_s3),
-    pytest.param(lambda: BoardBuilder("gcs"), id="s3", marks=m.fs_gcs),
+    pytest.param(lambda: BoardBuilder("gcs"), id="gcs", marks=m.fs_gcs),
+    pytest.param(lambda: BoardBuilder("abfs"), id="abfs", marks=m.fs_abfs),
 ]

 # rsc should only be used once, because users are created at docker setup time
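The new parametrization references `m.fs_abfs`. As a hedged sketch (the actual registration is not shown in this diff), custom marks like this are typically declared in a conftest.py hook so pytest does not warn about unknown markers:

```python
# Assumed registration, not taken from this commit.
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "fs_abfs: tests that exercise the Azure (abfs) backend"
    )
```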

pins/tests/helpers.py

Lines changed: 2 additions & 5 deletions
@@ -25,11 +25,8 @@
     "file": {"path": ["PINS_TEST_FILE__PATH", None]},
     "s3": {"path": ["PINS_TEST_S3__PATH", "ci-pins"]},
     "gcs": {"path": ["PINS_TEST_GCS__PATH", "ci-pins"]},
+    "abfs": {"path": ["PINS_TEST_AZURE__PATH", "ci-pins"]},
     "rsc": {"path": ["PINS_TEST_RSC__PATH", RSC_SERVER_URL]},
-    # TODO(question): R pins has the whole server a board
-    # but it's a bit easier to test by (optionally) allowing a user
-    # or something else to be a board
-    # "rsc": {"path": ["PINS_TEST_RSC__PATH", ""]}
 }

@@ -121,7 +118,7 @@ def create_tmp_board(self, src_board=None) -> BaseBoard:
         if self.fs_name == "gcs":
             opts = {"cache_timeout": 0}
         else:
-            opts = {"listings_expiry_time": 0}
+            opts = {"use_listings_cache": False}

         fs = filesystem(self.fs_name, **opts)
         temp_name = str(uuid.uuid4())
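The storage-option swap mirrors the `board_azure` constructor above: `use_listings_cache=False` disables fsspec's directory-listing cache outright rather than expiring entries immediately, so a test board still sees files written by other processes right away. A small sketch of the effect, using fsspec's in-memory filesystem as a stand-in backend:

```python
from fsspec import filesystem

# use_listings_cache=False is passed through to fsspec's DirCache, so ls()
# results are never cached ("memory" stands in for a real abfs/s3 backend).
fs = filesystem("memory", use_listings_cache=False)
fs.pipe_file("/demo/pin.txt", b"hello")
print(fs.ls("/demo"))
```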
