A sample showing how to configure and deploy a serverless LLM inference service using AWS SageMaker and Lambda.
It consists of a few core pieces:
- Amazon SageMaker as the service providing easy, API-based access to the Solar LLM model.
- Lambda as the serverless compute solution, handling the inference requests and streaming responses.
- Function URL for direct HTTP access to the Lambda function.
This project deploys a serverless LLM inference service using AWS SageMaker and Lambda. It provides a streaming API endpoint that accepts chat completion requests and returns responses in real-time.
This project implements Lambda as a compatibility layer to match OpenAI's Chat Completion API interface. This means you can use existing OpenAI client libraries by simply changing the base URL to your Lambda Function URL.
Components:
- LLM Inference Endpoint: Amazon SageMaker endpoint that hosts the Solar Pro model
- API Gateway Lambda: Lambda function with a Function URL that handles API requests and responses

Features:
- Lambda function with streaming support
- Function URL for direct invocation
- SageMaker endpoint integration
- API key authentication
- OpenAI Chat Completion API compatible interface
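The API key authentication above boils down to a Bearer-token comparison on each request. A minimal sketch of that kind of check follows; the function name and header handling are illustrative assumptions, not the repository's actual Lambda code:

```python
# Illustrative sketch of a Bearer-token check; the function name and exact
# header handling are assumptions, not the repository's actual code.

def is_authorized(headers: dict, expected_key: str) -> bool:
    """Return True if the Authorization header carries the expected API key."""
    # HTTP header names are case-insensitive; normalize before lookup.
    auth = next((v for k, v in headers.items() if k.lower() == "authorization"), "")
    scheme, _, token = auth.partition(" ")
    return scheme == "Bearer" and token == expected_key

# A request with the correct key is accepted, any other is rejected.
print(is_authorized({"Authorization": "Bearer my-20-char-secret-key"},
                    "my-20-char-secret-key"))  # True
print(is_authorized({"Authorization": "Bearer wrong"},
                    "my-20-char-secret-key"))  # False
```

Requests that fail this check should be rejected before ever reaching the SageMaker endpoint, which keeps unauthorized traffic off the paid inference path.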
- SageMaker Endpoint Timeout: 60 seconds per request (can be increased with custom containers)
- SageMaker Endpoint Scaling: Initial scaling depends on instance type and model size
- AWS CLI configured with appropriate credentials
- Node.js (>=14.x) and npm installed
- AWS CDK CLI installed (`npm install -g aws-cdk`)
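Before starting, you can confirm these prerequisites resolve on your PATH. This is a small convenience sketch, not part of the project:

```python
# Convenience sketch: report which prerequisite CLIs resolve on the PATH.
import shutil

def check_prereqs(tools=("aws", "node", "npm", "cdk")) -> dict:
    """Map each required CLI name to its resolved path, or None if missing."""
    return {tool: shutil.which(tool) for tool in tools}

for tool, path in check_prereqs().items():
    print(f"{tool}: {path or 'NOT FOUND'}")
```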
- Clone the repository and install dependencies:

```bash
git clone https://github.com/UpstageAI/cookbook
cd cookbook/aws/use_cases/solar-sagemaker-lambda
npm ci
```

- Configure environment variables:

```bash
cp .env.sample .env
```

Edit the `.env` file and set the required environment variables:

- `CDK_ACCOUNT_ID`: Your AWS account ID
- `CDK_DEFAULT_REGION`: AWS region code (e.g., ap-northeast-2, us-east-1)
- `API_KEY_VALUE`: API key for authentication (at least 20 characters)
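These values can be sanity-checked before deploying. The constraints in this sketch come from the requirements above (plus the fact that AWS account IDs are 12 digits); the helper itself is illustrative and not part of the repo:

```python
# Standalone sketch: validate the three required .env values before deploying.
# The helper is illustrative; only the constraints come from this guide.

def validate_env(account_id: str, region: str, api_key: str) -> list:
    """Return a list of problems; an empty list means the values look usable."""
    problems = []
    if not (account_id.isdigit() and len(account_id) == 12):
        problems.append("CDK_ACCOUNT_ID must be a 12-digit AWS account ID")
    if not region:
        problems.append("CDK_DEFAULT_REGION must be set (e.g. ap-northeast-2)")
    if len(api_key) < 20:
        problems.append("API_KEY_VALUE must be at least 20 characters")
    return problems

print(validate_env("123456789012", "us-east-1", "x" * 20))  # [] -> ready to deploy
print(validate_env("123", "us-east-1", "short"))            # two problems reported
```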
Deploy the stack:

```bash
cdk deploy SetupResourceStage/SolarLambdaStack
```

The deployment process will:
- Create a SageMaker endpoint for the Solar LLM model
- Create a Lambda function with streaming support
- Configure necessary IAM roles and permissions
- Create a Function URL for direct invocation
You'll be prompted to confirm the changes twice:
- First prompt: Confirm SageMaker stack deployment
  - IAM role for SageMaker with full access
  - Type `y` to confirm
- Second prompt: Confirm Lambda stack deployment
  - IAM role for Lambda with basic execution and SageMaker invoke permissions
  - Function URL configuration for public access
  - Type `y` to confirm
After successful deployment, you'll see the following outputs:
- SageMaker endpoint name (e.g., `solar-pro-1743643334446`)
- Lambda function URL (e.g., `https://xxxxx.lambda-url.us-west-2.on.aws/`)
Save these values for making inference requests.
Make an inference request using curl:
```bash
curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer [API_KEY_VALUE]" \
  -d '{
    "model": "solar-pro",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }' \
  https://[FUNCTION_URL]
```

The response will be streamed in real-time.
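Because the interface is OpenAI-compatible, the stream arrives as server-sent-event lines of the form `data: {...}`. A sketch of decoding one such line; the chunk schema shown is OpenAI's standard streaming format, which this service is assumed to follow:

```python
# Sketch: decode one server-sent-event line from an OpenAI-style stream.
# The "data: {...}" framing and chunk schema are OpenAI's standard streaming
# format, which this service is assumed to follow.
import json

def delta_from_sse_line(line: str):
    """Extract the text delta from one 'data: ...' line, or None."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload.strip() == "[DONE]":  # end-of-stream sentinel
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")

sample = 'data: {"choices": [{"delta": {"content": "Hello"}}]}'
print(delta_from_sse_line(sample))  # Hello
```

In practice the OpenAI client below does this decoding for you; the sketch is only useful if you consume the Function URL with a raw HTTP client.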
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",            # API_KEY_VALUE from .env
    base_url="https://[FUNCTION_URL]"  # Your Lambda Function URL
)

response = client.chat.completions.create(
    model="solar-pro",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is AWS Lambda?"}
    ],
    stream=True  # Supports streaming
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```

- The API follows OpenAI's Chat Completion API specification
- Supported parameters:
  - `model`: Always use `"solar-pro"`
  - `messages`: Array of message objects with `role` and `content`
  - `stream`: Boolean for streaming responses (recommended)
  - `temperature`: Controls randomness (0.0 to 2.0)
  - `max_tokens`: Maximum number of tokens to generate
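Put together, a request body using the supported parameters might look like the following. The values are illustrative and no request is sent here:

```python
# Example request body built from the supported parameters above.
# Values are illustrative; nothing is sent over the network.
import json

payload = {
    "model": "solar-pro",  # always "solar-pro"
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize AWS Lambda in one sentence."},
    ],
    "stream": True,        # streaming is recommended
    "temperature": 0.7,    # 0.0 to 2.0
    "max_tokens": 256,     # cap on generated tokens
}
print(json.dumps(payload, indent=2))
```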
- Authentication uses the same header format: `Authorization: Bearer YOUR_API_KEY`
- `npm run build`: compile TypeScript to JS
- `npm run watch`: watch for changes and compile
- `npm run test`: perform the jest unit tests
- `cdk deploy`: deploy this stack to your default AWS account/region
- `cdk diff`: compare deployed stack with current state
- `cdk synth`: emit the synthesized CloudFormation template
