@dbkegley dbkegley commented Feb 14, 2025

Closes #382

These changes make the separation between the different credential strategies more explicit and try to better detect the difference between local and Workbench content.

It also combines the Content and Viewer strategy into a single new class: ConnectStrategy. If user-session-token is provided then we use the viewer implementation. If not then we should fall back to the content credentials (service account) implementation.

I also modified some of the naming in the hope of making the choice of which strategy to use more obvious to end users of these helpers.

This example shows how to construct a Databricks SDK Config that is compatible with:

  • Databricks CLI authentication for local development
  • Workbench-managed Databricks Credentials in Posit Workbench
  • Viewer OAuth integration authentication in Posit Connect
import os

from databricks.sdk.core import ApiClient
from databricks.sdk.core.credentials_provider import databricks_cli
from databricks.sdk.service.iam import CurrentUserAPI
from shiny import reactive
from shiny.express import render, session

from posit.workbench.external.databricks import WorkbenchStrategy
from posit.connect.external.databricks import (
    ConnectStrategy,
    databricks_config,
)


@reactive.calc
def cfg():
    session_token = session.http_conn.headers.get("Posit-Connect-User-Session-Token")
    return databricks_config(
        posit_default_strategy=databricks_cli,
        posit_workbench_strategy=WorkbenchStrategy(),
        posit_connect_strategy=ConnectStrategy(user_session_token=session_token),
        host=os.getenv("DATABRICKS_HOST"),
    )


@render.text
def text():
    databricks_user_info = CurrentUserAPI(ApiClient(cfg())).me()
    return f"Hello, {databricks_user_info.display_name}!"

Testing

This needs QA in Connect and Workbench if possible.

Workbench should test using the WorkbenchStrategy() with Workbench-managed credentials.

Connect should test with ConnectStrategy() with both a Viewer and a Service Account OAuth integration.

It would be nice to test the azure_service_principal and oauth_service_principal strategies as a replacement for the deprecated PositLocalContentCredentialsProvider helper that was removed.

@dbkegley dbkegley changed the title Implement posit workbench credentials strategy and make credentials strategy fallback options more explicit [feat] Implement posit workbench credentials strategy and make credentials strategy fallback options more explicit Feb 14, 2025
Comment on lines 260 to 177
"""
def __init__(self,
client: Optional[Client] = None,
user_session_token: Optional[str] = None,
):

Suggested change
-    """
-    def __init__(self,
-        client: Optional[Client] = None,
-        user_session_token: Optional[str] = None,
-    ):
+    """
+    def __init__(self,
+        client: Optional[Client] = None,
+        *,
+        user_session_token: Optional[str] = None,
+        content_session_token: Optional[str] = None,
+    ):

I would keep these as explicit keyword-only arguments, which then allows the dev to pass the token with an appropriately named param.
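As a minimal sketch of what the keyword-only suggestion buys, the snippet below uses a hypothetical stand-in class (`ExampleConnectStrategy` is not the class in this PR, and `client` is typed loosely for illustration). It shows that a token passed positionally is rejected, so callers have to name which token they mean.

```python
from typing import Optional


class ExampleConnectStrategy:
    """Hypothetical stand-in for the strategy class under review."""

    def __init__(
        self,
        client: Optional[object] = None,
        *,  # everything after the bare * must be passed by keyword
        user_session_token: Optional[str] = None,
        content_session_token: Optional[str] = None,
    ) -> None:
        self.client = client
        self.user_session_token = user_session_token
        self.content_session_token = content_session_token


# Naming the token makes the intent explicit at the call site.
strategy = ExampleConnectStrategy(user_session_token="viewer-token")

# Passing a token positionally raises TypeError instead of being
# silently bound to the wrong parameter.
positional_error = None
try:
    ExampleConnectStrategy(None, "viewer-token")
except TypeError as exc:
    positional_error = exc
```

The design choice here is that adding a second token parameter later cannot silently change the meaning of existing positional calls.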

@kmasiello

My review is not of the code itself, but from the perspective of the developer/publisher trying to understand the correct usage of this helper. Some of my comments may reflect naivety in Python, but this may be representative of many of our users and we want them to be able to use this helper without significant mental overhead.

Here's a markup of the questions and confusion I had in trying to understand the shiny example. I will summarize each main point (numbered) separately below for discussion. (and no, i don't usually do PR comments with images 😂)
image

@kmasiello

(1). "Posit", "Credentials", "Strategy", "posit_strategy", "credentials_strategy", "connect_strategy", "PositCredentials", "PositConnectCredentials", "PositWorkbenchCredentials" ...
This is very distracting and overwhelming, making it difficult to follow the logic path for what we're doing here.

@kmasiello

(2). session_token = session.http_conn.headers.get("Posit-Connect-User-Session-Token") or session_token = flask.request.headers.get("Posit-Connect-User-Session-Token"). The viewer's session token is always going to be retrieved at this header, so don't make me have to write the code to go get it.

@kmasiello

(3). workbench_strategy=PositWorkbenchCredentialsStrategy(Config(profile="workbench")), I don't love this because it's a lot to type out. Why not just workbench_strategy=Config(profile="workbench") or set this by default so I can just do workbench_strategy=PositWorkbenchCredentialsStrategy() but if I happen to have a different profile defined (maybe because I'm using a service account and need M2M) then I can change to a different profile.

@kmasiello

(4) connect_strategy=PositConnectCredentialsStrategy(user_session_token=session_token). Again, this is a lot to type out, and it would align better with my mental model of credential methods on Connect if I could instead specify connect_strategy=PositConnectCredentialsStrategy(type=[viewer_oauth | service_account_oauth | envvars])

@kmasiello

(5). PositWorkbenchCredentialsStrategy and PositConnectCredentialsStrategy - again, word soup. Maybe we call these credential methods or credential types? I think the helper should have one "Strategy", not a PositStrategy and a Strategy

@kmasiello

(6). posit_strategy isn't descriptive of what it actually is. It's a strategy for handling credentials, not for handling Posit.

@kmasiello

(7) here's where I completely lost the logic trail.
Part of my confusion was not realizing that credentials_provider is a valid argument to sql.connect. I had only used access_token= before. So among the word soup from posit_sdk and now the databricks sql connector adding a related term, it was hard to follow. I've stared at this for an hour and I still don't understand the flow here.

Shiny for Python example application that shows user information and
the first few rows from a table hosted in Databricks.
"""
session_token = session.http_conn.headers.get("Posit-Connect-User-Session-Token")

A nit, and this applies to all the examples:
From the developer’s POV, I am in local/workbench first. It feels out of order to have the Connect-specific session token defined so early. My mental model is to get through the parts about defining how to handle creds in development vs deployment, then I’d start putting Connect-specific info.

This could be a constraint given the custom nature of this integration. The way it is shown here, this code can be written once and then adapt to the environment it is running in. But that means there is a lowest common denominator in terms of DX as a tradeoff. Ideally, devs are thinking about how their code would work in production as well if that is where they plan to deploy to.

Referencing your other comments though, if we had a way to grab the user session token for the dev if/when it is needed then that could be done behind the scenes at the appropriate time simplifying this immensely.

@mconflitti-pbc mconflitti-pbc Feb 18, 2025

posit_strategy = PositCredentialsStrategy(
    local_strategy=databricks_cli,
    workbench_strategy=PositWorkbenchCredentialsStrategy(Config(profile="workbench")),
    connect_strategy=PositConnectCredentialsStrategy(user_session_token=session_token),
)

could become:

posit_strategy = PositCredentialsStrategy()

allowing for overrides to defaults, but otherwise it just grabs what it needs under the hood.

@kmasiello

🪱 🥫
Databricks SQL Connector is great. But what about authenticating to compute clusters using Spark and databricks-connect ?

from posit.connect.external.databricks import (
PositCredentialsStrategy,
PositConnectCredentialsStrategy,
PositWorkbenchCredentialsStrategy,

Just a passing observation: it's odd to see PositWorkbenchCredentialsStrategy inside the posit.connect module. They aren't really relevant for Connect, right? Is it time to create posit.workbench to contain these?

I'll move this before the PR becomes final. For now it's easier to iterate with everything inside a single module.

@mconflitti-pbc

(2). session_token = session.http_conn.headers.get("Posit-Connect-User-Session-Token") or session_token = flask.request.headers.get("Posit-Connect-User-Session-Token"). The viewer's session token is always going to be retrieved at this header, so don't make me have to write the code to go get it.

Great point! I think the trouble with doing this for people though is that its location is dependent on the app/framework they are using. Would be neat to automatically find it though, I agree.

@dbkegley

Databricks SQL Connector is great. But what about authenticating to compute clusters using Spark and databricks-connect ?

Any of the various CredentialsStrategy implementations from this PR or from the databricks-sdk should also work with databricks-connect

@dbkegley

dbkegley commented Feb 18, 2025

(1). "Posit", "Credentials", "Strategy", "posit_strategy", "credentials_strategy", "connect_strategy", "PositCredentials", "PositConnectCredentials", "PositWorkbenchCredentials" ...
This is very distracting and overwhelming, making it difficult to follow the logic path for what we're doing here.

We can do our best to hide some of this from the user, but this is an artifact of the Databricks SDK, not something we are imposing in our client.

https://github.com/databricks/databricks-sdk-py/blob/998a117c43a7bc901710d263a7a7ab0d66ae8b8c/databricks/sdk/config.py#L103-L121

(2). session_token = session.http_conn.headers.get("Posit-Connect-User-Session-Token") or session_token = flask.request.headers.get("Posit-Connect-User-Session-Token"). The viewer's session token is always going to be retrieved at this header, so don't make me have to write the code to go get it.

Right. As Matt said, this is framework-dependent. We could consider adding a helper for each framework but that feels like it will lead to even more confusion like we see with (1)
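For illustration only, one framework-agnostic shape such a helper could take is a function that accepts whatever headers object the framework exposes. The name `get_user_session_token` is hypothetical and not part of this PR; the sketch only assumes the headers object provides `items()` yielding name/value pairs.

```python
from typing import Iterable, Optional, Protocol, Tuple

USER_SESSION_TOKEN_HEADER = "Posit-Connect-User-Session-Token"


class HasItems(Protocol):
    """Anything exposing items() that yields (name, value) pairs."""

    def items(self) -> Iterable[Tuple[str, str]]: ...


def get_user_session_token(headers: HasItems) -> Optional[str]:
    """Return the viewer's session token from a headers object, if present.

    Header names are compared case-insensitively, since HTTP header
    names are not case-sensitive.
    """
    wanted = USER_SESSION_TOKEN_HEADER.lower()
    for name, value in headers.items():
        if name.lower() == wanted:
            return value
    return None


# A plain dict stands in for a framework's headers object here.
token = get_user_session_token({"posit-connect-user-session-token": "abc"})
missing = get_user_session_token({"Accept": "text/html"})
```

The caller would still pass in the framework-specific object (for example `session.http_conn.headers` in Shiny or `flask.request.headers` in Flask, as quoted above), so this removes only the header-name lookup and not the framework dependency.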

(3). workbench_strategy=PositWorkbenchCredentialsStrategy(Config(profile="workbench")), I don't love this because it's a lot to type out. Why not just workbench_strategy=Config(profile="workbench") or set this by default so I can just do workbench_strategy=PositWorkbenchCredentialsStrategy() but if I happen to have a different profile defined (maybe because I'm using a service account and need M2M) then I can change to a different profile.

Good call. I think we can do workbench_strategy=PositWorkbenchCredentialsStrategy() pretty easily.

(4) connect_strategy=PositConnectCredentialsStrategy(user_session_token=session_token). Again, this is a lot to type out, and it would align better with my mental model of credential methods on Connect if I could instead specify connect_strategy=PositConnectCredentialsStrategy(type=[viewer_oauth | service_account_oauth | envvars])

We can revisit this but if the choice is viewer_oauth then someone has to pass the token from the header into the PositConnectCredentialsStrategy.

The way this is implemented at the moment, if user_session_token is not provided then we attempt to default to Service Account auth, so the presence of user_session_token is what drives the type of auth used. This way if the publisher changes the oauth integration association in Connect from a Viewer integration to a Service Account integration, then they don't need to update any of their content code. If we want to make this an explicit choice in the code then we can definitely do that instead. I tend to prefer explicit options but this was one area where I tried to make the user do less typing.
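The token-driven selection described above can be sketched roughly as follows. `AuthMode` and `select_auth_mode` are hypothetical names used only for illustration, and treating an empty string the same as a missing token is an assumption of this sketch rather than something the PR states.

```python
from enum import Enum
from typing import Optional


class AuthMode(Enum):
    VIEWER = "viewer"
    SERVICE_ACCOUNT = "service_account"


def select_auth_mode(user_session_token: Optional[str]) -> AuthMode:
    """Pick the auth mode from the presence of a viewer session token.

    ASSUMPTION: an empty string is treated like a missing token.
    """
    if user_session_token:
        return AuthMode.VIEWER
    return AuthMode.SERVICE_ACCOUNT


with_token = select_auth_mode("viewer-token")
without_token = select_auth_mode(None)
```

Because the decision depends only on the argument, content code stays the same when the integration association changes; only the presence of the token differs.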

(5). PositWorkbenchCredentialsStrategy and PositConnectCredentialsStrategy - again, word soup. Maybe we call these credential methods or credential types? I think the helper should have one "Strategy", not a PositStrategy and a Strategy

If you want to write content that only works on workbench then you don't need the PositStrategy, simply pass a PositWorkbenchStrategy when constructing the Config(credentials_strategy=PositWorkbenchStrategy()). The PositStrategy is useful when you want to author content that works in all 3 environments without making any code changes.

More broadly, the word choices "CredentialsStrategy" and "CredentialsProvider" are constructs from the databricks-sdk, not something we came up with here. We tried to be consistent with their naming in the SDK to avoid confusion but we can call these things whatever we want.

(6). posit_strategy isn't descriptive of what it actually is. It's a strategy for handling credentials, not for handling Posit.

This one is just a var name so an easy fix. How about we just call this creds or credentials? edit: Although I do think it's a little confusing to call it "posit_credentials" - These aren't posit's credentials, they are using posit's strategy when obtaining the user's databricks credentials.

(7) here's where I completely lost the logic trail.
Part of my confusion was not realizing that credentials_provider is a valid argument to sql.connect. I had only used access_token= before. So among the word soup from posit_sdk and now the databricks sql connector adding a related term, it was hard to follow. I've stared at this for an hour and I still don't understand the flow here.

This is a major part of the friction with implementing these helpers. We are trying to make something easy to do for our users when some of these libraries (databricks-sdk and databricks-sql) aren't even compatible inside of Databricks' own ecosystem of tools.

databricks/databricks-sql-python#148 (comment)

@dbkegley

Oh, and regarding the circular reference mentioned in (7): yes, it's another reason this is so challenging. Config.credentials_strategy is a Callable which is called with Config as an argument.
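A toy model of that circular shape, with no Databricks dependency, might look like the following. `ToyConfig` and `toy_strategy` are illustrative stand-ins and not SDK classes; the only thing borrowed from the discussion is the idea that the config holds a strategy which is itself called with the config.

```python
from typing import Callable, Dict

# A provider is a zero-argument callable returning auth headers.
HeaderProvider = Callable[[], Dict[str, str]]


class ToyConfig:
    """Toy stand-in for a config object; not the Databricks SDK class."""

    def __init__(
        self,
        host: str,
        credentials_strategy: Callable[["ToyConfig"], HeaderProvider],
    ) -> None:
        self.host = host
        # The strategy receives the very object that holds it, which is
        # the circular reference described above.
        self._provider = credentials_strategy(self)

    def authenticate(self) -> Dict[str, str]:
        return self._provider()


def toy_strategy(cfg: ToyConfig) -> HeaderProvider:
    """Build a provider that reads settings back off the config."""

    def provider() -> Dict[str, str]:
        return {"Authorization": f"Bearer token-for-{cfg.host}"}

    return provider


cfg = ToyConfig("example.cloud.databricks.com", toy_strategy)
headers = cfg.authenticate()
```

In this shape the strategy cannot be fully set up before the config exists, and the config cannot authenticate before the strategy has been called, which is why a single constructor helper is attractive.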

@dbkegley dbkegley force-pushed the kegs-databricks-workbench branch from d695abf to 3d0e55b Compare February 18, 2025 20:53
@dbkegley

dbkegley commented Feb 18, 2025

@kmasiello could you take another look at this when you have a minute?

I've done some refactoring based on your feedback. One of the main issues we had before was that there was this circular dependency between a databricks.Config and the credentials_strategy. I've tried to remove this by adding a new_config helper for constructing a databricks config that is all set up with the right strategy. The defaults should now work in most environments but can be overridden explicitly if desired. I still need to test all the different combinations but the basic gist is:

import streamlit as st
from databricks.sdk.core import ApiClient
from databricks.sdk.service.iam import CurrentUserAPI

from posit.connect.external.databricks import (
    new_config,
    ConnectStrategy,
)

session_token = st.context.headers.get("Posit-Connect-User-Session-Token")
cfg = new_config(
    posit_connect_strategy=ConnectStrategy(
        user_session_token=session_token
    ),
)

databricks_user = CurrentUserAPI(ApiClient(cfg)).me()
st.write(f"Hello, {databricks_user.display_name}!")

Unfortunately we can't really get rid of the need to pass in the session_token arg when specifying the connect strategy using viewer auth but hopefully this is an improvement.

If you want to use a Service Account oauth integration when running on Connect then the empty default configuration should be sufficient. This code should work locally, on workbench, and on Connect:

import streamlit as st
from databricks.sdk.core import ApiClient
from databricks.sdk.service.iam import CurrentUserAPI
from posit.connect.external.databricks import new_config

cfg = new_config()

databricks_user = CurrentUserAPI(ApiClient(cfg)).me()
st.write(f"Hello, {databricks_user.display_name}!")

@dbkegley dbkegley force-pushed the kegs-databricks-workbench branch from d747ea0 to 349a888 Compare March 4, 2025 23:04
@dbkegley dbkegley marked this pull request as ready for review March 4, 2025 23:04
@dbkegley dbkegley requested a review from tdstein as a code owner March 4, 2025 23:04
@dbkegley dbkegley force-pushed the kegs-databricks-workbench branch 2 times, most recently from 47c15ca to a291455 Compare March 4, 2025 23:17
@github-actions

github-actions bot commented Mar 4, 2025

☂️ Python Coverage

current status: ✅

Overall Coverage

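| Lines | Covered | Coverage | Threshold | Status |
|-------|---------|----------|-----------|--------|
| 1860 | 1751 | 94% | 0% | 🟢 |

New Files

| File | Coverage | Status |
|------|----------|--------|
| src/posit/workbench/__init__.py | 100% | 🟢 |
| src/posit/workbench/external/__init__.py | 100% | 🟢 |
| src/posit/workbench/external/databricks.py | 84% | 🟢 |
| TOTAL | 95% | 🟢 |

Modified Files

| File | Coverage | Status |
|------|----------|--------|
| src/posit/__init__.py | 100% | 🟢 |
| src/posit/connect/_utils.py | 100% | 🟢 |
| src/posit/connect/client.py | 99% | 🟢 |
| src/posit/connect/external/databricks.py | 94% | 🟢 |
| TOTAL | 98% | 🟢 |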

updated for commit: 6d5cfe5 by action🐍

@dbkegley dbkegley force-pushed the kegs-databricks-workbench branch 3 times, most recently from 9cccf77 to 072472d Compare March 4, 2025 23:26
@dbkegley dbkegley changed the title [feat] Implement posit workbench credentials strategy and make credentials strategy fallback options more explicit feat: implement posit workbench credentials strategy and make credentials strategy fallback options more explicit Mar 4, 2025

@tdstein tdstein left a comment

Looking good! Just a few minor questions.

Comment on lines 133 to 134
databricks_user_info = CurrentUserAPI(ApiClient(cfg())).me()
return f"Hello, {databricks_user_info.display_name}!"

nit; extract variables to improve readability.

Suggested change
-    databricks_user_info = CurrentUserAPI(ApiClient(cfg())).me()
-    return f"Hello, {databricks_user_info.display_name}!"
+    current_user_api = CurrentUserAPI(ApiClient(cfg()))
+    databricks_user_info = current_user_api.me()
+    return f"Hello, {databricks_user_info.display_name}!"



class PositContentCredentialsProvider:
class _PositConnectContentCredentialsProvider:

Should this inherit from CredentialsProvider?

CredentialsProvider is a type not a base class

CredentialsProvider = Callable[[], Dict[str, str]]
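Since the alias above is a plain `Callable` type, any object with a matching `__call__` satisfies it without inheriting from anything. A small sketch, where `StaticTokenProvider` and `build_headers` are hypothetical names:

```python
from typing import Callable, Dict

# Same alias shape as quoted above.
CredentialsProvider = Callable[[], Dict[str, str]]


class StaticTokenProvider:
    """Satisfies CredentialsProvider structurally via __call__."""

    def __init__(self, token: str) -> None:
        self._token = token

    def __call__(self) -> Dict[str, str]:
        return {"Authorization": f"Bearer {self._token}"}


def build_headers(provider: CredentialsProvider) -> Dict[str, str]:
    """Accept any zero-argument callable that returns a header dict."""
    return provider()


headers = build_headers(StaticTokenProvider("abc123"))
```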

user_session_token: Optional[str] = None,
):
self._local_strategy = local_strategy
self._cp = None

nit; add a type signature for clarity on the non-null state.

Suggested change
-    self._cp = None
+    self._cp: CredentialsProvider | None = None

@abc.abstractmethod
def __call__(self, *args, **kwargs) -> CredentialsProvider:
raise NotImplementedError
logger = logging.getLogger("posit.sdk")

I'm pretty sure we can use logger = logging.getLogger(__name__) here, which will be equivalent to getLogger("posit.sdk.external.databricks").

https://docs.python.org/3/howto/logging.html#advanced-logging-tutorial
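To make the naming behaviour concrete, here is a small sketch using made-up logger names (`example_pkg` is hypothetical). Inside a real module, `logging.getLogger(__name__)` would produce a logger named after that module's dotted import path, so the exact name depends on where the file lives in the package.

```python
import logging

# Stand-in for a package-level logger such as the one named in the diff.
package_logger = logging.getLogger("example_pkg")

# Stand-in for what getLogger(__name__) would return inside a module
# imported as example_pkg.external.databricks.
module_logger = logging.getLogger("example_pkg.external.databricks")

# Dotted names form a hierarchy: the module logger's nearest existing
# ancestor is the package logger.
package_logger.setLevel(logging.DEBUG)
effective_level = module_logger.getEffectiveLevel()
```

Because the module logger has no level of its own, it inherits DEBUG from the package logger, so users can still configure everything through the package-level name.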

@nealrichardson

This still feels like a lot of complexity for an app developer to have to wrestle with (or, more likely, just cargo cult):

    return databricks_config(
        posit_default_strategy=databricks_cli,
        posit_workbench_strategy=WorkbenchStrategy(),
        posit_connect_strategy=ConnectStrategy(user_session_token=session_token),
        host=os.getenv("DATABRICKS_HOST"),
        warehouse_id=os.getenv("DATABRICKS_WAREHOUSE_ID"),
    )

What if instead it were

    return databricks_config(user_session_token = session_token)

and everything else were set by defaults in the function signature? Seems totally reasonable to use conventions for env vars for defaults, and as for posit_default_strategy and posit_workbench_strategy, would anyone ever want something other than the default you provided? (If so, they could still pass it as an argument.)
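One possible shape for the env-var convention suggested here is sketched below. `resolve_setting` is a hypothetical helper, not part of this PR, and raising a descriptive error when nothing is set is a design choice of the sketch rather than documented behaviour.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    explicit: Optional[str],
    env_var: str,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Prefer an explicit argument, then fall back to an environment variable.

    ASSUMPTION: a missing value raises a descriptive error up front.
    """
    value = explicit or env.get(env_var)
    if not value:
        raise ValueError(
            f"{env_var} is not set; pass a value explicitly or export {env_var}."
        )
    return value


fake_env = {"DATABRICKS_HOST": "example.cloud.databricks.com"}
from_env = resolve_setting(None, "DATABRICKS_HOST", fake_env)
overridden = resolve_setting("other.example.com", "DATABRICKS_HOST", fake_env)
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.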

@tdstein

tdstein commented Mar 5, 2025

This still feels like a lot of complexity for an app developer to have to wrestle with

Agreed. You have all the building blocks now. Applying sensible defaults to databricks_config would be a nice UX.

@dbkegley

dbkegley commented Mar 5, 2025

This still feels like a lot of complexity for an app developer to have to wrestle with (or, more likely, just cargo cult):

    return databricks_config(
        posit_default_strategy=databricks_cli,
        posit_workbench_strategy=WorkbenchStrategy(),
        posit_connect_strategy=ConnectStrategy(user_session_token=session_token),
        host=os.getenv("DATABRICKS_HOST"),
        warehouse_id=os.getenv("DATABRICKS_WAREHOUSE_ID"),
    )

What if instead it were

    return databricks_config(user_session_token = session_token)

and everything else were set by defaults in the function signature? Seems totally reasonable to use conventions for env vars for defaults, and as for posit_default_strategy and posit_workbench_strategy, would anyone ever want something other than the default you provided? (If so, they could still pass it as an argument.)

That's basically how this is implemented, with a minor difference. The example in the PR description shows how to override the default values. The env vars are also already handled the way that you describe, but it's useful to show them in our docs because if they aren't set then you get a rather opaque error message from Databricks SDK/SQL.

The minimal viewer auth example would be:

return databricks_config(
  posit_connect_strategy=ConnectStrategy(user_session_token=session_token)
)

And the minimal service account auth example would be:

return databricks_config()

However, note that the minimal examples would not work when Workbench-managed Databricks credentials are used. You have to enable it by passing WorkbenchStrategy(). There's some awkwardness with how the Databricks Config object is initialized.

Databricks SDK helpers now include a helper `databricks_config()`
which allows for more fine-grained control over how the
credentials_strategy is selected in content that needs to run in
multiple environments (Workbench, Connect, local laptop). We now do our
best to "discover" the user's environment before selecting a strategy.

If no strategy was provided for the current environment then we use the
Databricks SDK's `DefaultCredentials()` class as a fallback option.

This is a breaking change but should simplify the API considerably for
users.
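The discover-then-fall-back behaviour described in the commit message could be sketched along these lines. Everything here is illustrative: the `EXAMPLE_PRODUCT` variable, `detect_environment`, and `select_strategy` are hypothetical, strategies are represented by plain strings, and the real detection logic in the helper may differ.

```python
from typing import Mapping, Optional

# ASSUMPTION: a made-up variable name; the real detection signal may differ.
PRODUCT_VAR = "EXAMPLE_PRODUCT"


def detect_environment(env: Mapping[str, str]) -> str:
    """Classify the runtime as 'connect', 'workbench', or 'local'."""
    product = env.get(PRODUCT_VAR, "").upper()
    if product == "CONNECT":
        return "connect"
    if product == "WORKBENCH":
        return "workbench"
    return "local"


def select_strategy(
    env: Mapping[str, str],
    *,
    fallback: str,
    default: Optional[str] = None,
    workbench: Optional[str] = None,
    connect: Optional[str] = None,
) -> str:
    """Return the strategy configured for the detected environment.

    If none was provided for that environment, use the fallback.
    """
    configured = {"local": default, "workbench": workbench, "connect": connect}
    chosen = configured[detect_environment(env)]
    return fallback if chosen is None else chosen


on_workbench = select_strategy(
    {PRODUCT_VAR: "WORKBENCH"}, fallback="sdk-default", workbench="workbench-managed"
)
on_laptop = select_strategy({}, fallback="sdk-default", connect="viewer")
```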
@dbkegley dbkegley force-pushed the kegs-databricks-workbench branch from 3a5027a to 6d5cfe5 Compare March 18, 2025 16:11
@dbkegley

@tdstein FYI - I've moved the workbench credentials helper into a new module posit.workbench.external

Planning to merge this later today and update the release notes for the next release.

@dbkegley dbkegley changed the base branch from main to staging/0.9.0 March 18, 2025 17:25
@dbkegley dbkegley merged commit 37b58ab into staging/0.9.0 Mar 18, 2025
39 checks passed
@dbkegley dbkegley deleted the kegs-databricks-workbench branch March 18, 2025 17:28
@tdstein tdstein mentioned this pull request Mar 27, 2025
tdstein added a commit that referenced this pull request Mar 27, 2025
These changes have been reviewed at
#384

---------

Co-authored-by: David <[email protected]>

Development

Successfully merging this pull request may close these issues.

Implement WorkbenchManagedCredentialsStrategy for Databricks helpers

7 participants