
feat: single entrypoint #223

Open
KAJdev wants to merge 15 commits into main from zeke/single-entrypoint

Conversation

@KAJdev (Contributor) commented Feb 25, 2026

Unified Endpoint API

Replaces 8 resource config classes (LiveServerless, CpuLiveServerless, LiveLoadBalancer, CpuLiveLoadBalancer, ServerlessEndpoint, CpuServerlessEndpoint, LoadBalancerSlsResource, CpuLoadBalancerSlsResource) and the @remote decorator with a single Endpoint class.

Fixes AE-2259

Queue-based

  @Endpoint(name="worker", gpu=GpuType.ANY, dependencies=["torch"])
  async def predict(input_data: dict) -> dict:
      ...

Load-balanced

  api = Endpoint(name="service", cpu="cpu3c-1-2", workers=(1, 3))

  @api.post("/predict")
  async def predict(data: dict) -> dict:
      ...

Client mode

  ep = Endpoint(id="ep-abc123")
  job = await ep.run({"prompt": "hello"})
  await job.wait()
  print(job.output)

What changed

  • Endpoint is a facade that internally creates the old resource config objects, so the existing deployment/provisioning/handler pipeline continues working unchanged
  • QB vs LB is inferred from usage pattern (decorator vs route registration)
  • GPU vs CPU is a parameter (gpu= / cpu=), not a class choice
  • EndpointJob wraps job responses with status(), wait(), cancel(), and property access (job.id, job.output, job.error, job.done)
  • Scanner, manifest builder, and resource discovery all recognize Endpoint patterns
  • Legacy classes and @Remote emit DeprecationWarning on import/use
  • Skeleton templates (flash init) generate the new API
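The QB-vs-LB inference in the second bullet can be sketched roughly as follows. This is an illustrative stand-in for the behavior described, not the actual implementation in endpoint.py; all internals here are assumptions.

```python
# Hypothetical sketch of usage-based mode inference: bare decoration
# implies a queue-based worker, route registration implies load-balanced.
class Endpoint:
    def __init__(self, name=None, id=None):
        if name is None and id is None:
            raise ValueError("name or id is required")
        self.name, self.id = name, id
        self.mode = None          # resolved lazily from how the object is used
        self.routes = {}

    def __call__(self, func):
        self.mode = "queue"       # @ep on a function => queue-based
        self.handler = func
        return func

    def post(self, path):
        def register(func):
            self.mode = "load_balancer"   # @ep.post(...) => load-balanced
            self.routes[("POST", path)] = func
            return func
        return register

worker = Endpoint(name="worker")

@worker
def predict(data):
    return data

api = Endpoint(name="service")

@api.post("/predict")
def serve(data):
    return data

assert worker.mode == "queue" and api.mode == "load_balancer"
```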

@KAJdev KAJdev requested a review from deanq February 25, 2026 23:15
@KAJdev KAJdev marked this pull request as ready for review February 25, 2026 23:46
@runpod-Henrik

Pulled this down to verify with my examples. A few notes:

  1. ServerlessScalerType not exposed

04_scaling_performance/01_autoscaling/gpu_worker.py configures scaling strategies:

  scale_to_zero_config = LiveServerless(
      name="04_01_scale_to_zero",
      gpus=[GpuGroup.ANY],
      workersMin=0, workersMax=3, idleTimeout=5,
      scalerType=ServerlessScalerType.QUEUE_DELAY,
      scalerValue=4,
  )

This controls how autoscaling decides to add workers — QUEUE_DELAY scales based on how long jobs wait in queue,
REQUEST_COUNT scales based on pending request volume. The example shows three strategies side by side (scale-to-zero,
always-on, high-throughput) with different scalerType/scalerValue combos.

Endpoint() doesn't have these params, so there's no way to express this.

What we'd want:

  @endpoint(name="worker", gpu=GpuGroup.ANY, workers=(0, 3),
            scaler_type=ServerlessScalerType.QUEUE_DELAY, scaler_value=4)
  async def scale_to_zero_inference(payload: dict) -> dict: ...

Could we add scaler_type / scaler_value (or a combined scaler= param)?


  2. PodTemplate features not surfaced
    (new example not checked in yet)
    03_advanced_workers/04_custom_images/gpu_worker.py uses PodTemplate for full Docker control:

  template = PodTemplate(
      name="03_04_custom_template",
      imageName="runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04",
      containerDiskInGb=30,
      dockerArgs="--shm-size=2g",
      startScript="echo 'Worker starting with custom image'",
      ports="8080/http",
      # containerRegistryAuthId="your-auth-id",  # for private registries
  )

  gpu_config = ServerlessEndpoint(
      name="03_04_custom_images",
      gpus=[GpuGroup.ADA_24],
      template=template,
      workersMin=0, workersMax=2, idleTimeout=5,
  )

Endpoint(image=) only takes the image name string. The other template features — dockerArgs (e.g. shared memory size),
startScript (pre-run setup), ports, containerDiskInGb, and containerRegistryAuthId (private registries) — have no
equivalent. These are important for real-world deployments where the default Flash image doesn't work (custom CUDA
versions, private model servers, etc.).

Could we either add a template= param that accepts a PodTemplate, or surface these as top-level kwargs on Endpoint?


  3. Class-based @endpoint?

Two examples use @Remote on a class for stateful workers. Here's the pattern from
05_data_workflows/01_network_volumes/gpu_worker.py:

  @Remote(resource_config=gpu_config, dependencies=["diffusers", "torch", "transformers"])
  class SimpleSD:
      def __init__(self):
          # Runs once at worker startup — loads 4GB model into GPU memory
          self.pipe = StableDiffusionPipeline.from_pretrained(...)
          self.pipe = self.pipe.to("cuda")

      async def generate_image(self, prompt: str) -> dict:
          # Uses self.pipe — already warm in GPU memory
          image = self.pipe(prompt=prompt, ...).images[0]
          return {"image_path": image_path}

The class is instantiated once when the worker boots. The model stays in GPU memory via self.pipe and every request
calls methods on the same instance — no re-loading a 4GB model per request.

With function-based @endpoint, there's no self to hold state:

  @endpoint(name="worker", gpu=GpuGroup.ANY, dependencies=["diffusers", "torch"])
  async def generate_image(prompt: str) -> dict:
      # Re-loading a 4GB model on every request — 30+ seconds of overhead each time
      pipe = StableDiffusionPipeline.from_pretrained(...)
      pipe = pipe.to("cuda")
      image = pipe(prompt=prompt, ...).images[0]
      return {"image_path": image_path}

Does @endpoint(...) support decorating classes the same way @Remote does? If not, we'd need a workaround (module-level
global with lazy init) or keep these on the legacy API.


  4. GpuGroup vs GpuType

The PR's skeleton templates use GpuType:

  @endpoint(name="gpu_worker", gpu=GpuType.ANY, dependencies=["torch"])
  async def gpu_hello(input_data: dict) -> dict: ...

But existing examples all use GpuGroup:

  @endpoint(name="worker", gpu=GpuGroup.ADA_24)
  async def my_func(payload: dict) -> dict: ...

Both work — Endpoint(gpu=) accepts either. But they mean different things: GpuType is a specific GPU model (e.g. RTX
4090), GpuGroup is a family (e.g. all Ada 24GB cards: 4090, L4, etc.). For examples, which should we standardize on?
Current thinking:

  • GpuGroup for "give me any GPU in this tier" (most examples)
  • GpuType only for the GPU selection example that targets a specific card

@KAJdev (Contributor, Author) commented Feb 26, 2026

> This controls how autoscaling decides to add workers — QUEUE_DELAY scales based on how long jobs wait in queue, REQUEST_COUNT scales based on pending request volume. The example shows three strategies side by side (scale-to-zero, always-on, high-throughput) with different scalerType/scalerValue combos.
>
> Endpoint() doesn't have these params, so there's no way to express this.

Will work on adding those parameters.

> Endpoint(image=) only takes the image name string. The other template features — dockerArgs (e.g. shared memory size), startScript (pre-run setup), ports, containerDiskInGb, and containerRegistryAuthId (private registries) — have no equivalent. These are important for real-world deployments where the default Flash image doesn't work (custom CUDA versions, private model servers, etc.).
>
> Could we either add a template= param that accepts a PodTemplate, or surface these as top-level kwargs on Endpoint?

👍

> Does @endpoint(...) support decorating classes the same way @Remote does? If not, we'd need a workaround (module-level global with lazy init) or keep these on the legacy API.

Endpoint does support classes.

> Both work — Endpoint(gpu=) accepts either. But they mean different things: GpuType is a specific GPU model (e.g. RTX 4090), GpuGroup is a family (e.g. all Ada 24GB cards: 4090, L4, etc.). For examples, which should we standardize on?

We should prefer GpuType in simpler examples, since it is easier to understand, but expand to GpuGroup when more scale is important.

@runpod-Henrik

QA Report

Status: WARN
PR: #223 — feat: single entrypoint
Agent: flash-qa (PR mode)

CI Status

All 6 Quality Gates pass (Python 3.10–3.14 + Build Package). No CI regressions detected.

Note: Unable to run local tests — worktree branch checkout was blocked by sandbox policy. All analysis below is from static diff review and CI results.

PR Scope

  • 13 source files changed/added (695-line endpoint.py is new)
  • 9 test files added with 161 test methods
  • Key changes: new Endpoint class, deprecation warnings on legacy classes/remote, scanner + discovery + manifest + provisioner updates for Endpoint patterns

Test File Summary

Test File                     Tests  Coverage Area
test_endpoint.py              ~55    Endpoint construction, init params, QB/LB decorators, resource config type matrix (2x2x2), caching
test_endpoint_client.py       ~40    EndpointJob lifecycle, run/runsync/cancel, _ensure_endpoint_ready (id + image modes), LB client requests, end-to-end flows
test_deprecations.py          ~20    Deprecation warnings for 8 legacy classes + remote decorator, non-deprecated names verified
test_discovery_endpoint.py    ~10    ResourceDiscovery with Endpoint LB patterns, resolve, directory scan, mixed legacy+Endpoint
test_skeleton_endpoint.py     4      Skeleton templates use Endpoint API (gpu/cpu/lb workers + README)
test_scanner_endpoint.py      ~20    Scanner: QB functions/classes, LB routes, all HTTP methods, mixed patterns, edge cases
test_manifest_endpoint.py     ~6     Manifest building with Endpoint QB/LB metadata, deployment config extraction with unwrapping
test_run_endpoint.py          ~7     flash run: scan + server generation for QB/LB/mixed Endpoint patterns
test_resource_provisioner.py  +4     Endpoint resource_type resolution to correct internal classes (4 combinations)

PR Diff Analysis

  • No bare exceptions
  • No hardcoded secrets (RUNPOD_API_KEY properly popped from env dict)
  • No print() in library source (print() only in skeleton if __name__ == "__main__" blocks and README examples — acceptable)
  • Public API surface changes documented: Endpoint and EndpointJob added to __all__, TYPE_CHECKING imports updated
  • Deprecation warnings added with stacklevel=2 for correct caller attribution
  • _internal=True flag on remote() suppresses double-warnings when called from Endpoint internals
  • Resource config caching prevents redundant provisioning
  • remote() deprecation is a breaking behavioral change — all existing users importing remote will get DeprecationWarning. This is intentional but should be documented in release notes.
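The stacklevel=2 convention noted above can be illustrated with a small standalone sketch; the helper name here is made up and is not the library's actual deprecation shim:

```python
import warnings

def legacy_resource(name):
    # stacklevel=2 attributes the warning to the caller's line,
    # not to this internal frame (illustrative helper, not the real API)
    warnings.warn(f"{name} is deprecated; use Endpoint instead",
                  DeprecationWarning, stacklevel=2)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_resource("LiveServerless")   # warning is attributed here
assert caught[0].category is DeprecationWarning
assert "use Endpoint" in str(caught[0].message)
```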

Observations & Issues

1. Dual-purpose methods create subtle API surface
The .get()/.post()/.put()/.delete()/.patch() methods return either a decorator (no data arg, non-client mode) or a coroutine (client mode). This is determined by self.is_client. While tested, this design could confuse users:

  ep = Endpoint(name="my-api")
  ep.post("/compute")          # returns a decorator
  ep_client = Endpoint(id="x")
  ep_client.post("/compute")   # returns a coroutine

The distinction is tested but the boundary between "no data arg = decorator" vs "data=None = client call" is not explicitly tested. A user calling ep.post("/compute", None) in decorator mode would get a coroutine instead of a decorator.
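A minimal reduction of that ambiguity, assuming a simplified stand-in for Endpoint.post (the real internals may differ), shows why call sites must know which mode they are in:

```python
import asyncio
import inspect

class Endpoint:
    """Hypothetical simplification of the dual-purpose .post() described above."""
    def __init__(self, name=None, id=None):
        self.is_client = id is not None

    def post(self, path, data=None):
        if self.is_client:
            return self._client_request("POST", path, data)  # coroutine
        def decorator(func):        # non-client mode: route decorator
            return func
        return decorator

    async def _client_request(self, method, path, data):
        return {"method": method, "path": path, "data": data}

ep = Endpoint(name="my-api")
assert callable(ep.post("/compute"))            # decorator, not awaitable

client = Endpoint(id="ep-x")
coro = client.post("/compute", {"n": 1})
assert inspect.iscoroutine(coro)                # must be awaited
assert asyncio.run(coro)["method"] == "POST"
```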

2. _is_live_provisioning() default heuristic
When FLASH_IS_LIVE_PROVISIONING is unset, the function defaults to live mode unless RUNPOD_ENDPOINT_ID or RUNPOD_POD_ID is set. This heuristic is reasonable but not tested — no test verifies the fallback behavior when the env var is missing.
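The fallback heuristic as described can be sketched in isolation; this is an assumed reconstruction of _is_live_provisioning from the prose above, and the real helper may differ in detail:

```python
import os

def is_live_provisioning(env=None):
    # Hedged sketch of the heuristic described above (not the real helper).
    env = os.environ if env is None else env
    flag = env.get("FLASH_IS_LIVE_PROVISIONING")
    if flag is not None:
        return flag.lower() == "true"
    # Unset: default to live mode unless we appear to be inside a worker
    return not (env.get("RUNPOD_ENDPOINT_ID") or env.get("RUNPOD_POD_ID"))

assert is_live_provisioning({}) is True
assert is_live_provisioning({"RUNPOD_POD_ID": "pod-1"}) is False
assert is_live_provisioning({"FLASH_IS_LIVE_PROVISIONING": "true",
                             "RUNPOD_POD_ID": "pod-1"}) is True
```

A table-driven test over exactly these env combinations would close the gap the report identifies.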

3. Endpoint with name=None and routes
Endpoint(name=None, id=None) raises ValueError, and a bare Endpoint() call triggers the same check. A scanner edge-case test nevertheless declares my_api = Endpoint() without name=: it only passes because scanning is AST-only and never executes the module. If that file were actually executed — e.g. during manifest extraction — it would raise ValueError("name or id is required") at import time.

Test Quality Assessment

Strengths:

  • Full 2x2x2 resource config type matrix tested (qb/lb x gpu/cpu x live/deploy = 8 combinations)
  • Client mode end-to-end flows well covered (run, wait, cancel, timeout)
  • Edge cases: FastAPI @app.get() not falsely matched, unregistered variable routes ignored, nested directories, cross-call detection
  • Mixed legacy + Endpoint coexistence tested at scanner, discovery, and manifest levels
  • Assertion quality is good — specific field checks, not just len() assertions

Missing Coverage:

  • _is_live_provisioning() standalone tests — no test verifies the fallback heuristic when env var is unset
  • _normalize_gpu() / _normalize_cpu() error paths — invalid types (e.g., gpu="string") not tested
  • Endpoint.__call__ with invalid func — what happens if you @ep decorate a non-callable?
  • Client mode PUT/DELETE/PATCH calls — only GET and POST client calls tested in TestClientRequest; PUT, DELETE, PATCH use the same _client_request path but are not explicitly verified
  • EndpointJob.wait() backoff intervals — the exponential backoff logic (_POLL_INITIAL_INTERVAL, _POLL_BACKOFF_FACTOR, _POLL_MAX_INTERVAL) is not verified; tests only check correctness, not timing behavior
  • Thread safety of _cached_resource_config — no concurrent access test (low risk for typical usage)
  • Deprecation warning stacklevel — no test verifies the warning points to the caller's frame, not the internal frame

Suggested Improvements:

  1. Add a parametrized test for _normalize_gpu and _normalize_cpu with invalid inputs
  2. Add a test for _is_live_provisioning() with various env combinations (unset, "true", "false", RUNPOD_ENDPOINT_ID set)
  3. Consider adding a PUT/DELETE/PATCH client call test for completeness (even if trivially same path)
  4. The _mock_httpx_client helper is well-designed but could be moved to conftest for reuse
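Suggestion 1 could start from plain assertions before being lifted into pytest.mark.parametrize. The diff doesn't show _normalize_gpu's signature, so this stand-in (name, accepted shapes, and error type) is entirely hypothetical:

```python
from enum import Enum

class GpuType(Enum):
    ANY = "any"

def normalize_gpu(gpu):
    # Hypothetical stand-in for _normalize_gpu; the real helper's
    # signature and accepted input shapes may differ.
    if isinstance(gpu, GpuType):
        return [gpu.value]
    if isinstance(gpu, list) and gpu and all(isinstance(g, GpuType) for g in gpu):
        return [g.value for g in gpu]
    raise TypeError(f"gpu must be GpuType or list[GpuType], got {gpu!r}")

# Error paths the report flags as untested
for bad in ("string", 42, None, [GpuType.ANY, "mixed"], []):
    try:
        normalize_gpu(bad)
    except TypeError:
        pass
    else:
        raise AssertionError(f"expected TypeError for {bad!r}")

assert normalize_gpu(GpuType.ANY) == ["any"]
```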

Review Comments Integration

The PR already addresses the 4 review items from @runpod-Henrik:

  1. scaler_type/scaler_value — added in commit 3787b4b (params on Endpoint + manifest extraction + provisioner support)
  2. PodTemplate — template= param added in the same commit
  3. Class-based @endpoint — confirmed supported, tested in TestEndpointQBClass and TestScanEndpointWorkers.test_endpoint_class_discovered_as_qb
  4. GpuGroup vs GpuType — skeleton templates use GpuType.ANY, examples use GpuGroup — per author's preference

Recommendation

MERGE WITH NOTES

The PR is solid — 161 tests, CI green on all Python versions, comprehensive coverage of the new Endpoint API. The dual-purpose method design and _is_live_provisioning() heuristic warrant documentation but are not blockers. Two suggestions before merge:

  1. Add release notes documenting the remote() deprecation warning (all existing code will emit warnings)
  2. Consider adding 2-3 tests for the missing _is_live_provisioning() fallback and _normalize_gpu/_normalize_cpu error paths

Generated by flash-qa agent
