Addition of a new SSH-based backend for the origin #3077

Draft
bbockelm wants to merge 16 commits into PelicanPlatform:main from bbockelm:ssh_origin

Conversation

@bbockelm
Collaborator

@bbockelm bbockelm commented Feb 5, 2026

Permits the origin to launch a helper over SSH which connects back and allows the origin to serve out the helper's filesystem.

@bbockelm bbockelm added origin Issue relating to the origin component enhancement New feature or request labels Feb 7, 2026
@bbockelm bbockelm linked an issue Feb 7, 2026 that may be closed by this pull request
Permits the origin to launch a helper over SSH which connects back
and allows the origin to serve out the helper's filesystem.
No real stress test of the code.  Still need to try password auth
via separate login.
@brianaydemir brianaydemir self-assigned this Feb 12, 2026
@brianaydemir brianaydemir self-requested a review February 12, 2026 00:30
@brianaydemir brianaydemir added this to the v7.25 milestone Feb 12, 2026
Comment on lines 181 to 190
Contributor

@brianaydemir brianaydemir Feb 16, 2026

This logic is inconsistent with origin_serve/handlers.go.

Here, we always advertise the data URL as /api/v1.0/origin/data. In origin_serve, we register handlers for that route only if the director is also enabled for the server.

If the federation's director and the origin are running as separate services, while a naïve curl to the origin works as expected

[root@dev app]# curl https://origin-0:8444/public/data/0.0
0.0.28177

the director sends the client to a non-working endpoint:

[root@dev app]# ./pelican object get --debug --direct pelican://director:8444/public/data/0.0 asdf
...
DEBUG[2026-02-16T21:13:53Z] Trying the object servers: [https://origin-0:8444/api/v1.0/origin/data/public/data/0.0] 
...
DEBUG[2026-02-16T21:13:53Z] Failed to download from https://origin-0:8444/api/v1.0/origin/data/public/data/0.0 : request failed (HTTP status 404): 404 page not found: Specification.FileNotFound Error: Error code 5011: server returned 404 Not Found  job=019c684d-782f-764b-a861-9c2bee9ec718 url="https://origin-0:8444/api/v1.0/origin/data/public/data/0.0"

Comment on lines +118 to +119
// getOriginURL returns the origin URL from the flag, address file, or config
func getOriginURL() (string, error) {
Contributor

Both runSSHAuthLogin and runSSHAuthStatus rely on this function to return the origin's URL, but there is no guarantee that Viper has been configured by the time it runs, and the function depends on that configuration (indirectly).

Or in the words of Copilot:

The origin ssh-auth login command is calling config.ReadAddressFile() which uses getServerRuntimeDir().
  This function reads from viper.GetString(param.RuntimeDir.GetName()), but viper is not initialized when running the CLI
  command.
  The problem is:
   1. The ssh-auth login command runs as a standalone CLI command
   2. It calls config.ReadAddressFile() at line 126 of /Users/baydemir/Ivalice/GitHub/pelican/cmd/origin_ssh_auth.go
   3. ReadAddressFile() uses getServerRuntimeDir() which relies on viper configuration
   4. But the CLI command hasn't initialized the configuration, so RuntimeDir is empty
   5. This causes ReadAddressFile() to fail with "runtime directory is not configured"
  The fix: The CLI command needs to initialize the configuration before trying to read the address file. You need to call 
  config.InitClient() or similar configuration initialization in the runSSHAuthLogin and runSSHAuthStatus functions before
  calling getOriginURL().

Empirically, Copilot is not entirely wrong. Where I disagree with it: I think InitServer is more appropriate.

Comment on lines 73 to 75
// The websocket is under /api/v1.0/origin/ssh/auth for admin access
router.GET("/api/v1.0/origin/ssh/auth", handleWebSocket(ctx))
router.GET("/api/v1.0/origin/ssh/status", handleSSHStatus(ctx))
Contributor

I'm concerned here that anyone can run pelican(-server) origin ssh-auth login --origin {url} without the origin authenticating the caller or otherwise enforcing some sort of constraint. (I don't see anything currently that enforces "admin access".)

I'd feel better if there was an obvious, defined policy for who or what can interact with SSH.

Or in the words of Copilot, while I experimented with what it could come up with:

I've updated /Users/baydemir/Ivalice/GitHub/pelican/ssh_posixv2/websocket.go to allow connections from the server's own IP address:
  Changes made:
   1. Added net import - needed for net.LookupHost()
   2. Added param import - needed to access param.Server_Hostname
   3. Created isLocalConnection() helper function that:
    - Checks standard localhost addresses (127.0.0.1, ::1, localhost)
    - Looks up the server's hostname using param.Server_Hostname.GetString()
    - Resolves that hostname to IP addresses using net.LookupHost()
    - Returns true if the client IP matches any of the server's IPs
   4. Updated localhostOnlyMiddleware() to use the new helper function
  Now administrators can connect to the SSH auth endpoints from:
   - Standard localhost addresses (127.0.0.1, ::1, localhost)
   - The server's own IP address(es) based on its configured hostname
  This is useful when the origin is accessed via its actual IP address or hostname rather than localhost, while still maintaining security by not allowing arbitrary remote connections.
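
For readers following along, here is a minimal sketch of the kind of check Copilot describes above, using only the standard library. The function name mirrors the description, but the signature is an assumption: this version takes the server's already-resolved IPs as an argument (for example, the result of `net.LookupHost` on the configured hostname) rather than reading `param.Server_Hostname` itself. Note that an address check like this only restricts where a connection comes from; it does not authenticate the caller, so it may not fully address the policy concern raised above.

```go
package main

import "net"

// isLocalConnection reports whether clientAddr (either "host:port" or a bare
// IP) is a loopback address or matches one of serverIPs. The caller is assumed
// to have resolved serverIPs beforehand, e.g. via net.LookupHost.
// Hypothetical sketch; not the code in ssh_posixv2/websocket.go.
func isLocalConnection(clientAddr string, serverIPs []string) bool {
	host, _, err := net.SplitHostPort(clientAddr)
	if err != nil {
		// No port present; treat the whole string as the host.
		host = clientAddr
	}
	ip := net.ParseIP(host)
	if ip == nil {
		// Not an IP literal; refuse rather than guess.
		return false
	}
	if ip.IsLoopback() {
		return true
	}
	for _, s := range serverIPs {
		if sip := net.ParseIP(s); sip != nil && sip.Equal(ip) {
			return true
		}
	}
	return false
}
```

A middleware could call this with the request's remote address and reject anything for which it returns false.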

- Make sure we advertise correct data URLs when run separately.
- Make sure SSH auth websocket requires admin access
u.Path = "/api/v1.0/origin/ssh/auth"
}

log.Infof("Connecting to WebSocket: %s", u.String())

Check failure

Code scanning / CodeQL

Clear-text logging of sensitive information High

Sensitive data returned by an access to S3SecretKeyfile
flows to a logging call.
Sensitive data returned by an access to PasswordFile
flows to a logging call.
Sensitive data returned by an access to PrivateKeyPassphraseFile
flows to a logging call.
Sensitive data returned by an access to UIPasswordFile
flows to a logging call.
Sensitive data returned by an access to PasswordLocation
flows to a logging call.
@bbockelm
Collaborator Author

@brianaydemir - can you take another look at this?

@patrickbrophy patrickbrophy self-requested a review February 25, 2026 22:14
Contributor

@patrickbrophy patrickbrophy left a comment

Using @brianaydemir's Pelican test framework, I was able to set up an SSH-backed origin and pull a file. It should be noted that I ran into issues with the helper installation due to Pelican's reliance on glibc. This was fixed when I switched my SSH storage server from an Alpine-based image to an Alma-based image.

I am now going to focus on the code itself. Given that the PR is quite large, I will pay closest attention to where it intersects with existing code.

@bbockelm
Collaborator Author

Interesting! There should be no glibc dependency, right?

Contributor

@patrickbrophy patrickbrophy left a comment

While running the docker compose setup, I noticed after a little while that the Origin panicked with the following:

origin-ssh-1  | panic: runtime error: invalid memory address or nil pointer dereference
origin-ssh-1  | [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x123926c]
origin-ssh-1  |
origin-ssh-1  | goroutine 2975 [running]:
origin-ssh-1  | github.com/pelicanplatform/pelican/ssh_posixv2.(*SSHConnection).readHelperStdout(0x4000fecf70, {0x2996660, 0x4001332d70})
origin-ssh-1  |         /pelican-build/ssh_posixv2/helper.go:231 +0xcc
origin-ssh-1  | github.com/pelicanplatform/pelican/ssh_posixv2.(*SSHConnection).StartHelper.func2()
origin-ssh-1  |         /pelican-build/ssh_posixv2/helper.go:185 +0x24
origin-ssh-1  | golang.org/x/sync/errgroup.(*Group).Go.func1()
origin-ssh-1  |         /root/go/pkg/mod/golang.org/x/sync@v0.18.0/errgroup/errgroup.go:93 +0x4c
origin-ssh-1  | created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 369
origin-ssh-1  |         /root/go/pkg/mod/golang.org/x/sync@v0.18.0/errgroup/errgroup.go:78 +0x90

The linked comments describe how this likely happened.

Comment on lines 269 to 275
sessionCtx, sessionCancel := context.WithTimeout(ctx, sessionEstablishTimeout)
conn := NewSSHConnection(sshConfig)
backend.AddConnection(sshConfig.Host, conn)

// Try to establish the connection
err := runConnection(sessionCtx, conn, exports, authCookie)
sessionCancel() // Cancel the session context when done
Contributor

The sessionEstablishTimeout is meant to bound only the connection setup phase (connect, detect platform, transfer binary, start helper), but the timeout context is passed to the entire runConnection lifecycle, including the indefinite "wait for helper to exit" phase. This causes the timeout to fire every 5 minutes, killing a healthy, actively serving helper process. The retry loop then treats this as a failure and increments the failure counter toward the max retry limit. Instead, we should cancel the session establishment timeout once the helper is confirmed ready and use the parent context for the long-running wait phase.


// StopHelper stops the remote helper process.
// It first tries a clean shutdown via stdin message, then falls back to signals.
func (c *SSHConnection) StopHelper(ctx context.Context) error {
Contributor

The bug described in ssh_posixv2/backend.go triggers a race condition in StopHelper.

  1. When runConnection passes ctx (the expired session context) to StopHelper, the cleanShutdownCtx derived from it is already expired, so the 3-second grace period for clean shutdown never actually happens.
  2. After the SIGKILL path, StopHelper sets c.helperIO = nil without waiting for the errgroup goroutines to finish. The readHelperStdout goroutine, still running between its ctx.Done() check and the c.helperIO dereference, hits a nil pointer.

Labels

broker
enhancement (New feature or request)
origin (Issue relating to the origin component)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add new SSH backend for origin

3 participants