jroth1111/litellm-coolify-nvidia-multi-key

LiteLLM on Coolify for NVIDIA Multi-Key Inference

Deploy LiteLLM behind Coolify with a pooled set of NVIDIA API keys and a single OpenAI-compatible model alias: glm-5-nvidia.

This pack was validated end-to-end against a real Coolify host, including a duplicate second-instance deployment, health checks, and live inference requests.

What it includes

  • docker-compose.coolify.yml
    • Static reference compose for the current verified deployment shape.
    • Pins LiteLLM to a known working image digest.
    • Generates /app/config.yaml at container start.
  • render-compose.sh
    • Renders the deployable compose from .env.
    • Supports any number of NVIDIA_API_KEY_POOL_N entries.
  • upsert-via-api.sh
    • Creates or updates the Coolify service via the public API.
    • Preflights host-port collisions before mutation.
    • Waits for running:healthy before returning success.
  • smoke-test.sh
    • Verifies /v1/models and /v1/chat/completions.
  • .env.example
    • Required runtime variables with placeholder values.
  • factory-custom-model.example.json
    • Example Factory Droid custom model entry for the deployed LiteLLM service.
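The runtime variables in .env.example follow this general shape. Every value below is a placeholder assumption for illustration, including the NVIDIA_API_BASE URL; use the actual values from your copy of .env.example.

```dotenv
# Placeholder values -- substitute your own before deploying.
LITELLM_MASTER_KEY=sk-change-me
NVIDIA_API_BASE=https://integrate.api.nvidia.com/v1
NVIDIA_API_KEY_POOL_1=nvapi-first-key
NVIDIA_API_KEY_POOL_2=nvapi-second-key
```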

Runtime shape

  • Upstream provider: NVIDIA OpenAI-compatible API
  • Upstream model: z-ai/glm5
  • Exposed LiteLLM alias: glm-5-nvidia
  • Routing strategy: simple-shuffle
  • LiteLLM image: ghcr.io/berriai/litellm@sha256:d6580beba82a69e4cfb6598c300b7c524d9ea6f67592226fdec7f6a9aba34eb2
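Since the compose file generates /app/config.yaml at container start, the rendered config plausibly looks like the sketch below, assuming the standard LiteLLM proxy schema (one model_list entry per pooled key, all sharing the glm-5-nvidia alias, with simple-shuffle routing). The exact generated file may differ; this is illustrative only.

```yaml
model_list:
  - model_name: glm-5-nvidia
    litellm_params:
      model: openai/z-ai/glm5
      api_base: os.environ/NVIDIA_API_BASE
      api_key: os.environ/NVIDIA_API_KEY_POOL_1
  - model_name: glm-5-nvidia
    litellm_params:
      model: openai/z-ai/glm5
      api_base: os.environ/NVIDIA_API_BASE
      api_key: os.environ/NVIDIA_API_KEY_POOL_2
router_settings:
  routing_strategy: simple-shuffle
```

With duplicate model_name entries, LiteLLM treats the pooled keys as interchangeable deployments of one alias and shuffles requests across them.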

Deploy

  1. Copy .env.example to .env.
  2. Fill in LITELLM_MASTER_KEY, NVIDIA_API_BASE, and one or more NVIDIA_API_KEY_POOL_N values.
  3. Export the Coolify control-plane variables:
    • COOLIFY_BASE_URL
    • COOLIFY_API_TOKEN
    • PROJECT_UUID
    • ENVIRONMENT_UUID
    • SERVER_UUID
  4. Run:
sh ./upsert-via-api.sh

If SERVICE_UUID is set, the script updates that service. If SERVICE_UUID is unset, it searches by SERVICE_NAME and updates the match if one exists; if no match is found, it creates a new service.
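The resolution order can be sketched as plain shell. Here lookup_by_name is a hypothetical stand-in for the script's Coolify API name search, not a real helper from this repo:

```shell
#!/bin/sh
# Illustrative upsert resolution order (not the actual script).
# lookup_by_name: hypothetical stand-in for the Coolify name search.
lookup_by_name() { printf '%s' ""; }  # simulate: no existing service matches

SERVICE_NAME=litellm-demo
if [ -n "${SERVICE_UUID:-}" ]; then
  action="update-by-uuid $SERVICE_UUID"
elif match=$(lookup_by_name "$SERVICE_NAME") && [ -n "$match" ]; then
  action="update-by-name $match"
else
  action="create $SERVICE_NAME"
fi
echo "$action"
```

With SERVICE_UUID unset and no name match, the sketch falls through to the create branch.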

For a second LiteLLM instance on the same server, use a unique service name and host port:

SERVICE_NAME=litellm-glm5-nvidia-pool-duplicate \
HOST_PORT=4010 \
sh ./upsert-via-api.sh

Verify

BASE_URL=http://127.0.0.1:4000/v1 \
API_KEY="$LITELLM_MASTER_KEY" \
sh ./smoke-test.sh

Expected result:

  • /v1/models returns only glm-5-nvidia
  • /v1/chat/completions returns smoke test ok
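For a manual check outside smoke-test.sh, the request body presumably follows the standard OpenAI chat-completions shape; the prompt text below is an assumption, not taken from the repo's scripts:

```shell
#!/bin/sh
# Standard OpenAI-compatible chat request body for the deployed alias.
# The prompt text is an illustrative assumption.
BODY='{"model":"glm-5-nvidia","messages":[{"role":"user","content":"reply with: smoke test ok"}]}'
printf '%s\n' "$BODY"
# To send it against a running instance:
#   curl -sS "$BASE_URL/chat/completions" \
#     -H "Authorization: Bearer $API_KEY" \
#     -H "Content-Type: application/json" \
#     -d "$BODY"
```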

Operational notes

  • The scripts discover NVIDIA_API_KEY_POOL_N entries from the env file itself, not from ambient shell variables.
  • The deploy script fails loudly if multiple Coolify services match the same SERVICE_NAME.
  • The deploy script fails fast if HOST_PORT is already bound by another Coolify service.
  • The deploy script tolerates transient Coolify exited states during startup and only fails fast on explicit :unhealthy status.
  • Coolify can momentarily report stale health around start and restart transitions; the script includes a short post-action settle delay before health polling.
  • The smoke test defaults to a 120-second timeout because even valid NVIDIA keys can have variable latency.
  • Do not treat the generated runtime compose under /data/coolify/services/* as source of truth. It contains instance-specific labels, UUIDs, network names, and container names.
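Discovering NVIDIA_API_KEY_POOL_N entries from the env file itself (rather than ambient shell variables) can be sketched like this; the repo's scripts may parse differently, and the demo file contents here are placeholders:

```shell
#!/bin/sh
# Sketch: read pooled keys straight from the env file, ignoring the shell
# environment. Demo file with placeholder values, written to /tmp.
cat > /tmp/demo.env <<'EOF'
LITELLM_MASTER_KEY=sk-change-me
NVIDIA_API_KEY_POOL_1=nvapi-aaa
NVIDIA_API_KEY_POOL_2=nvapi-bbb
EOF
# Match only NVIDIA_API_KEY_POOL_<digits>= lines, then take the value part.
keys=$(grep -E '^NVIDIA_API_KEY_POOL_[0-9]+=' /tmp/demo.env | cut -d= -f2-)
echo "$keys"
```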
