
feat: single entrypoint #223

Open
KAJdev wants to merge 15 commits into main from zeke/single-entrypoint

Conversation

@KAJdev (Contributor) commented Feb 25, 2026

Unified Endpoint API

Replaces 8 resource config classes (LiveServerless, CpuLiveServerless, LiveLoadBalancer, CpuLiveLoadBalancer, ServerlessEndpoint, CpuServerlessEndpoint, LoadBalancerSlsResource, CpuLoadBalancerSlsResource) and the @remote decorator with a single Endpoint class.

Fixes AE-2259

Queue-based

  @Endpoint(name="worker", gpu=GpuType.ANY, dependencies=["torch"])
  async def predict(input_data: dict) -> dict:
      ...

Load-balanced

  api = Endpoint(name="service", cpu="cpu3c-1-2", workers=(1, 3))

  @api.post("/predict")
  async def predict(data: dict) -> dict:
      ...

Client mode

  ep = Endpoint(id="ep-abc123")
  job = await ep.run({"prompt": "hello"})
  await job.wait()
  print(job.output)

What changed

  • Endpoint is a facade that internally creates the old resource config objects, so the existing deployment/provisioning/handler pipeline continues working unchanged
  • QB vs LB is inferred from usage pattern (decorator vs route registration)
  • GPU vs CPU is a parameter (gpu= / cpu=), not a class choice
  • EndpointJob wraps job responses with status(), wait(), cancel(), and property access (job.id, job.output, job.error, job.done)
  • Scanner, manifest builder, and resource discovery all recognize Endpoint patterns
  • Legacy classes and @Remote emit DeprecationWarning on import/use
  • Skeleton templates (flash init) generate the new API
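The QB-vs-LB inference in the second bullet can be sketched roughly as follows. This is an illustrative stand-in for the behavior described, not the actual implementation in endpoint.py; all internals here are assumptions.

```python
# Hypothetical sketch of usage-based mode inference: bare decoration
# implies a queue-based worker, route registration implies load-balanced.
class Endpoint:
    def __init__(self, name=None, id=None):
        if name is None and id is None:
            raise ValueError("name or id is required")
        self.name, self.id = name, id
        self.mode = None          # resolved lazily from how the object is used
        self.routes = {}

    def __call__(self, func):
        self.mode = "queue"       # @ep on a function => queue-based
        self.handler = func
        return func

    def post(self, path):
        def register(func):
            self.mode = "load_balancer"   # @ep.post(...) => load-balanced
            self.routes[("POST", path)] = func
            return func
        return register

worker = Endpoint(name="worker")

@worker
def predict(data):
    return data

api = Endpoint(name="service")

@api.post("/predict")
def serve(data):
    return data

assert worker.mode == "queue" and api.mode == "load_balancer"
```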

@KAJdev KAJdev requested a review from deanq February 25, 2026 23:15
@KAJdev KAJdev marked this pull request as ready for review February 25, 2026 23:46
@runpod-Henrik

Pulled this down to verify with my examples. A few notes:

  1. ServerlessScalerType not exposed

04_scaling_performance/01_autoscaling/gpu_worker.py configures scaling strategies:

  scale_to_zero_config = LiveServerless(
      name="04_01_scale_to_zero",
      gpus=[GpuGroup.ANY],
      workersMin=0, workersMax=3, idleTimeout=5,
      scalerType=ServerlessScalerType.QUEUE_DELAY,
      scalerValue=4,
  )

This controls how autoscaling decides to add workers — QUEUE_DELAY scales based on how long jobs wait in queue,
REQUEST_COUNT scales based on pending request volume. The example shows three strategies side by side (scale-to-zero,
always-on, high-throughput) with different scalerType/scalerValue combos.

Endpoint() doesn't have these params, so there's no way to express this.

What we'd want:

  @endpoint(name="worker", gpu=GpuGroup.ANY, workers=(0, 3),
            scaler_type=ServerlessScalerType.QUEUE_DELAY, scaler_value=4)
  async def scale_to_zero_inference(payload: dict) -> dict: ...

Could we add scaler_type / scaler_value (or a combined scaler= param)?


  2. PodTemplate features not surfaced
    (new example not checked in yet)
    03_advanced_workers/04_custom_images/gpu_worker.py uses PodTemplate for full Docker control:

  template = PodTemplate(
      name="03_04_custom_template",
      imageName="runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04",
      containerDiskInGb=30,
      dockerArgs="--shm-size=2g",
      startScript="echo 'Worker starting with custom image'",
      ports="8080/http",
      # containerRegistryAuthId="your-auth-id",  # for private registries
  )

  gpu_config = ServerlessEndpoint(
      name="03_04_custom_images",
      gpus=[GpuGroup.ADA_24],
      template=template,
      workersMin=0, workersMax=2, idleTimeout=5,
  )

Endpoint(image=) only takes the image name string. The other template features — dockerArgs (e.g. shared memory size),
startScript (pre-run setup), ports, containerDiskInGb, and containerRegistryAuthId (private registries) — have no
equivalent. These are important for real-world deployments where the default Flash image doesn't work (custom CUDA
versions, private model servers, etc.).

Could we either add a template= param that accepts a PodTemplate, or surface these as top-level kwargs on Endpoint?


  3. Class-based @endpoint?

Two examples use @Remote on a class for stateful workers. Here's the pattern from
05_data_workflows/01_network_volumes/gpu_worker.py:

  @Remote(resource_config=gpu_config, dependencies=["diffusers", "torch", "transformers"])
  class SimpleSD:
      def __init__(self):
          # Runs once at worker startup — loads 4GB model into GPU memory
          self.pipe = StableDiffusionPipeline.from_pretrained(...)
          self.pipe = self.pipe.to("cuda")

      async def generate_image(self, prompt: str) -> dict:
          # Uses self.pipe — already warm in GPU memory
          image = self.pipe(prompt=prompt, ...).images[0]
          return {"image_path": image_path}

The class is instantiated once when the worker boots. The model stays in GPU memory via self.pipe and every request
calls methods on the same instance — no re-loading a 4GB model per request.

With function-based @endpoint, there's no self to hold state:

  @endpoint(name="worker", gpu=GpuGroup.ANY, dependencies=["diffusers", "torch"])
  async def generate_image(prompt: str) -> dict:
      # Re-loading a 4GB model on every request — 30+ seconds of overhead each time
      pipe = StableDiffusionPipeline.from_pretrained(...)
      pipe = pipe.to("cuda")
      image = pipe(prompt=prompt, ...).images[0]
      return {"image_path": image_path}

Does @endpoint(...) support decorating classes the same way @Remote does? If not, we'd need a workaround (module-level
global with lazy init) or keep these on the legacy API.


  4. GpuGroup vs GpuType

The PR's skeleton templates use GpuType:

  @endpoint(name="gpu_worker", gpu=GpuType.ANY, dependencies=["torch"])
  async def gpu_hello(input_data: dict) -> dict: ...

But existing examples all use GpuGroup:

  @endpoint(name="worker", gpu=GpuGroup.ADA_24)
  async def my_func(payload: dict) -> dict: ...

Both work — Endpoint(gpu=) accepts either. But they mean different things: GpuType is a specific GPU model (e.g. RTX
4090), GpuGroup is a family (e.g. all Ada 24GB cards: 4090, L4, etc.). For examples, which should we standardize on?
Current thinking:

  • GpuGroup for "give me any GPU in this tier" (most examples)
  • GpuType only for the GPU selection example that targets a specific card

@KAJdev (Contributor, Author) commented Feb 26, 2026

> This controls how autoscaling decides to add workers — QUEUE_DELAY scales based on how long jobs wait in queue, REQUEST_COUNT scales based on pending request volume. The example shows three strategies side by side (scale-to-zero, always-on, high-throughput) with different scalerType/scalerValue combos.
>
> Endpoint() doesn't have these params, so there's no way to express this.

Will work on adding those parameters.

> Endpoint(image=) only takes the image name string. The other template features — dockerArgs (e.g. shared memory size), startScript (pre-run setup), ports, containerDiskInGb, and containerRegistryAuthId (private registries) — have no equivalent. These are important for real-world deployments where the default Flash image doesn't work (custom CUDA versions, private model servers, etc.).
>
> Could we either add a template= param that accepts a PodTemplate, or surface these as top-level kwargs on Endpoint?

👍

> Does @endpoint(...) support decorating classes the same way @Remote does? If not, we'd need a workaround (module-level global with lazy init) or keep these on the legacy API.

Endpoint does support classes.

> Both work — Endpoint(gpu=) accepts either. But they mean different things: GpuType is a specific GPU model (e.g. RTX 4090), GpuGroup is a family (e.g. all Ada 24GB cards: 4090, L4, etc.). For examples, which should we standardize on?

We should prefer GpuType in simpler examples, since it is easier to understand, but expand to GpuGroup when more scale is important.

@runpod-Henrik

QA Report

Status: WARN
PR: #223 — feat: single entrypoint
Agent: flash-qa (PR mode)

CI Status

All 6 Quality Gates pass (Python 3.10–3.14 + Build Package). No CI regressions detected.

Note: Unable to run local tests — worktree branch checkout was blocked by sandbox policy. All analysis below is from static diff review and CI results.

PR Scope

  • 13 source files changed/added (695-line endpoint.py is new)
  • 9 test files added with 161 test methods
  • Key changes: new Endpoint class, deprecation warnings on legacy classes/remote, scanner + discovery + manifest + provisioner updates for Endpoint patterns

Test File Summary

Test File                     Tests  Coverage Area
test_endpoint.py              ~55    Endpoint construction, init params, QB/LB decorators, resource config type matrix (2x2x2), caching
test_endpoint_client.py       ~40    EndpointJob lifecycle, run/runsync/cancel, _ensure_endpoint_ready (id + image modes), LB client requests, end-to-end flows
test_deprecations.py          ~20    Deprecation warnings for 8 legacy classes + remote decorator, non-deprecated names verified
test_discovery_endpoint.py    ~10    ResourceDiscovery with Endpoint LB patterns, resolve, directory scan, mixed legacy+Endpoint
test_skeleton_endpoint.py     4      Skeleton templates use Endpoint API (gpu/cpu/lb workers + README)
test_scanner_endpoint.py      ~20    Scanner: QB functions/classes, LB routes, all HTTP methods, mixed patterns, edge cases
test_manifest_endpoint.py     ~6     Manifest building with Endpoint QB/LB metadata, deployment config extraction with unwrapping
test_run_endpoint.py          ~7     flash run: scan + server generation for QB/LB/mixed Endpoint patterns
test_resource_provisioner.py  +4     Endpoint resource_type resolution to correct internal classes (4 combinations)

PR Diff Analysis

  • No bare exceptions
  • No hardcoded secrets (RUNPOD_API_KEY properly popped from env dict)
  • No print() in library source (print() only in skeleton if __name__ == "__main__" blocks and README examples — acceptable)
  • Public API surface changes documented: Endpoint and EndpointJob added to __all__, TYPE_CHECKING imports updated
  • Deprecation warnings added with stacklevel=2 for correct caller attribution
  • _internal=True flag on remote() suppresses double-warnings when called from Endpoint internals
  • Resource config caching prevents redundant provisioning
  • remote() deprecation is a breaking behavioral change — all existing users importing remote will get DeprecationWarning. This is intentional but should be documented in release notes.
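The stacklevel=2 convention noted above can be illustrated with a small standalone sketch; the helper name here is made up and is not the library's actual deprecation shim:

```python
import warnings

def legacy_resource(name):
    # stacklevel=2 attributes the warning to the caller's line,
    # not to this internal frame (illustrative helper, not the real API)
    warnings.warn(f"{name} is deprecated; use Endpoint instead",
                  DeprecationWarning, stacklevel=2)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_resource("LiveServerless")   # warning is attributed here
assert caught[0].category is DeprecationWarning
assert "use Endpoint" in str(caught[0].message)
```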

Observations & Issues

1. Dual-purpose methods create subtle API surface
The .get()/.post()/.put()/.delete()/.patch() methods return either a decorator (no data arg, non-client mode) or a coroutine (client mode). This is determined by self.is_client. While tested, this design could confuse users:

  ep = Endpoint(name="my-api")
  ep.post("/compute")          # returns a decorator
  ep_client = Endpoint(id="x")
  ep_client.post("/compute")   # returns a coroutine

The distinction is tested but the boundary between "no data arg = decorator" vs "data=None = client call" is not explicitly tested. A user calling ep.post("/compute", None) in decorator mode would get a coroutine instead of a decorator.
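A minimal reduction of that ambiguity, assuming a simplified stand-in for Endpoint.post (the real internals may differ), shows why call sites must know which mode they are in:

```python
import asyncio
import inspect

class Endpoint:
    """Hypothetical simplification of the dual-purpose .post() described above."""
    def __init__(self, name=None, id=None):
        self.is_client = id is not None

    def post(self, path, data=None):
        if self.is_client:
            return self._client_request("POST", path, data)  # coroutine
        def decorator(func):        # non-client mode: route decorator
            return func
        return decorator

    async def _client_request(self, method, path, data):
        return {"method": method, "path": path, "data": data}

ep = Endpoint(name="my-api")
assert callable(ep.post("/compute"))            # decorator, not awaitable

client = Endpoint(id="ep-x")
coro = client.post("/compute", {"n": 1})
assert inspect.iscoroutine(coro)                # must be awaited
assert asyncio.run(coro)["method"] == "POST"
```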

2. _is_live_provisioning() default heuristic
When FLASH_IS_LIVE_PROVISIONING is unset, the function defaults to live mode unless RUNPOD_ENDPOINT_ID or RUNPOD_POD_ID is set. This heuristic is reasonable but not tested — no test verifies the fallback behavior when the env var is missing.
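The fallback heuristic as described can be sketched in isolation; this is an assumed reconstruction of _is_live_provisioning from the prose above, and the real helper may differ in detail:

```python
import os

def is_live_provisioning(env=None):
    # Hedged sketch of the heuristic described above (not the real helper).
    env = os.environ if env is None else env
    flag = env.get("FLASH_IS_LIVE_PROVISIONING")
    if flag is not None:
        return flag.lower() == "true"
    # Unset: default to live mode unless we appear to be inside a worker
    return not (env.get("RUNPOD_ENDPOINT_ID") or env.get("RUNPOD_POD_ID"))

assert is_live_provisioning({}) is True
assert is_live_provisioning({"RUNPOD_POD_ID": "pod-1"}) is False
assert is_live_provisioning({"FLASH_IS_LIVE_PROVISIONING": "true",
                             "RUNPOD_POD_ID": "pod-1"}) is True
```

A table-driven test over exactly these env combinations would close the gap the report identifies.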

3. Endpoint with name=None and routes
Endpoint(name=None, id=None) raises ValueError, and a bare Endpoint() call triggers the same check. A scanner edge-case test nevertheless declares my_api = Endpoint() without name=: it only passes because scanning is AST-only and never executes the module. If that file were actually executed — e.g. during manifest extraction — it would raise ValueError("name or id is required") at import time.

Test Quality Assessment

Strengths:

  • Full 2x2x2 resource config type matrix tested (qb/lb x gpu/cpu x live/deploy = 8 combinations)
  • Client mode end-to-end flows well covered (run, wait, cancel, timeout)
  • Edge cases: FastAPI @app.get() not falsely matched, unregistered variable routes ignored, nested directories, cross-call detection
  • Mixed legacy + Endpoint coexistence tested at scanner, discovery, and manifest levels
  • Assertion quality is good — specific field checks, not just len() assertions

Missing Coverage:

  • _is_live_provisioning() standalone tests — no test verifies the fallback heuristic when env var is unset
  • _normalize_gpu() / _normalize_cpu() error paths — invalid types (e.g., gpu="string") not tested
  • Endpoint.__call__ with invalid func — what happens if you @ep decorate a non-callable?
  • Client mode PUT/DELETE/PATCH calls — only GET and POST client calls tested in TestClientRequest; PUT, DELETE, PATCH use the same _client_request path but are not explicitly verified
  • EndpointJob.wait() backoff intervals — the exponential backoff logic (_POLL_INITIAL_INTERVAL, _POLL_BACKOFF_FACTOR, _POLL_MAX_INTERVAL) is not verified; tests only check correctness, not timing behavior
  • Thread safety of _cached_resource_config — no concurrent access test (low risk for typical usage)
  • Deprecation warning stacklevel — no test verifies the warning points to the caller's frame, not the internal frame

Suggested Improvements:

  1. Add a parametrized test for _normalize_gpu and _normalize_cpu with invalid inputs
  2. Add a test for _is_live_provisioning() with various env combinations (unset, "true", "false", RUNPOD_ENDPOINT_ID set)
  3. Consider adding a PUT/DELETE/PATCH client call test for completeness (even if trivially same path)
  4. The _mock_httpx_client helper is well-designed but could be moved to conftest for reuse
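Suggestion 1 could start from plain assertions before being lifted into pytest.mark.parametrize. The diff doesn't show _normalize_gpu's signature, so this stand-in (name, accepted shapes, and error type) is entirely hypothetical:

```python
from enum import Enum

class GpuType(Enum):
    ANY = "any"

def normalize_gpu(gpu):
    # Hypothetical stand-in for _normalize_gpu; the real helper's
    # signature and accepted input shapes may differ.
    if isinstance(gpu, GpuType):
        return [gpu.value]
    if isinstance(gpu, list) and gpu and all(isinstance(g, GpuType) for g in gpu):
        return [g.value for g in gpu]
    raise TypeError(f"gpu must be GpuType or list[GpuType], got {gpu!r}")

# Error paths the report flags as untested
for bad in ("string", 42, None, [GpuType.ANY, "mixed"], []):
    try:
        normalize_gpu(bad)
    except TypeError:
        pass
    else:
        raise AssertionError(f"expected TypeError for {bad!r}")

assert normalize_gpu(GpuType.ANY) == ["any"]
```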

Review Comments Integration

The PR already addresses the 4 review items from @runpod-Henrik:

  1. scaler_type/scaler_value — added in commit 3787b4b (params on Endpoint + manifest extraction + provisioner support)
  2. PodTemplate — template= param added in the same commit
  3. Class-based @endpoint — confirmed supported, tested in TestEndpointQBClass and TestScanEndpointWorkers.test_endpoint_class_discovered_as_qb
  4. GpuGroup vs GpuType — skeleton templates use GpuType.ANY, examples use GpuGroup — per author's preference

Recommendation

MERGE WITH NOTES

The PR is solid — 161 tests, CI green on all Python versions, comprehensive coverage of the new Endpoint API. The dual-purpose method design and _is_live_provisioning() heuristic warrant documentation but are not blockers. Two suggestions before merge:

  1. Add release notes documenting the remote() deprecation warning (all existing code will emit warnings)
  2. Consider adding 2-3 tests for the missing _is_live_provisioning() fallback and _normalize_gpu/_normalize_cpu error paths

Generated by flash-qa agent
